The answers presented so far are very nice, but I was also expecting an emphasis on a particular difference between parallel and distributed processing: the code executed. Considering parallel processes, the code executed is the same, regardless of the level of parallelism (instruction, data, task). You write a single code, and it will be executed by different threads/processors, e.g., while computing matrices products, or generating permutations.
On the other hand, distributed computing involves the execution of different algorithms/programs at the same time in different processors (from one or more machines). Such computations are later merged into a intermediate/final results by using the available means of data communication/synchronization (shared memory, network). Further, distributed computing is very appealing for BigData processing, as it allows for exploiting disk parallelism (usually the bottleneck for large databases).
Finally, for the level of parallelism, it may be taken rather as a constraint on the synchronization. For example, in GPGPU, which is single-instruction multiple-data (SIMD), the parallelism occurs by having different inputs for a single instruction, each pair (data_i, instruction) being executed by a different thread. Such is the restraint that, in case of divergent branches, it is necessary to discard lots of unnecessary computations, until the threads reconverge. For CPU threads, though, they commonly diverge; yet, one may use synchronization structures to grant concurrent execution of specific sections of the code.