I've assumed, in the example shown below, that the gene duplication event happens by recombination between sister chromosomes when they are aligned during meiosis. That's not the only possibility, but it's easy to understand.
These sorts of gene duplication events appear to be quite common, judging from the frequency of copy number variations in complex genomes (Redon et al., 2006; MacDonald et al., 2014).
In most cases, two copies of a gene are not needed, so a mutation in one of them will not be detrimental. The initial mutation could inactivate the promoter so that the gene is no longer transcribed. Or it could affect splicing, modification, RNA stability, or translation. All of these inactivating mutations will be neutral as long as the other copy remains active.
Over time, the pseudogene will acquire more and more mutations because there's no selection for function. This "birth" and "death" of new genes is the standard scenario for understanding the evolution of gene families [The birth and death of salmon genes] [Birth-and-Death Evolution in Mammalian Gene Families] [The Evolution of Gene Families]. In rare cases, the extra copy can acquire a different function or be expressed preferentially in different cells, leading to a situation where both copies become necessary and both are preserved by negative selection.
Young pseudogenes that form in this way from parental protein-coding genes might still be transcribed and might even produce an inactive protein. They are still junk DNA even if they are transcribed efficiently. This distinguishes these kinds of pseudogenes from processed pseudogenes, which are dead on arrival. Somewhere between 9% and 15% of the pseudogenes in mammalian genomes are transcribed (Pei et al., 2012; Sisu et al., 2014), but there are significant differences between the two types. About 16% of duplicated pseudogenes are transcribed while only 6% of processed pseudogenes are transcribed (Pei et al., 2012).
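As a rough consistency check on those percentages, the overall fraction can be treated as a weighted average of the two per-type rates. The sketch below is only back-of-the-envelope arithmetic, not a result from the cited papers. It assumes every pseudogene is either processed or duplicated, and it borrows the roughly 70% processed share quoted for mammals elsewhere in this post. The function name and the constants are illustrative.

```python
def overall_transcribed_fraction(processed_share, processed_rate, duplicated_rate):
    """Weighted average of the per-type transcription rates.

    Assumes pseudogenes are either processed or duplicated (no other types).
    """
    duplicated_share = 1.0 - processed_share
    return processed_share * processed_rate + duplicated_share * duplicated_rate


# Per-type rates quoted in the text (Pei et al., 2012).
PROCESSED_RATE = 0.06
DUPLICATED_RATE = 0.16

# Assumption: 70% of pseudogenes are processed (the text says ">70%" for mammals).
overall = overall_transcribed_fraction(0.70, PROCESSED_RATE, DUPLICATED_RATE)
print(f"Overall transcribed fraction: {overall:.1%}")
```

Under these assumptions the weighted average comes out at about 9%, the low end of the quoted range. A larger processed share would push it lower still, so the higher estimates presumably reflect different data sets or criteria.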
Transcription of duplicated pseudogenes is most likely to be an indication of dying genes that are still transcribed while transcription of processed pseudogenes is most likely associated with insertion into a transcribed region of the chromosome. In some cases, the insertion occurs in the intron of another gene. (Introns make up about 25% of the genome so many random insertions will occur in introns.)
About 90% of all pseudogenes are NOT associated with transcription factor binding sites, active chromatin markers, or RNA polymerase binding sites, consistent with the idea that they are inactive DNA sequences (Pei et al., 2012). This is an important point: the evidence shows that the vast majority of pseudogenes are, in fact, pseudogenes. Only a small fraction of the remaining 10% appear to have a biological function.[1]
The other distinguishing characteristic is that pseudogenes arising from gene duplication events are almost always located near the parent gene. Recall that processed pseudogenes can integrate at any site in the genome and are almost never found near their parent.
The classic examples of these sorts of pseudogenes are the single pseudogene at the β-globin locus in humans on chromosome 11 and the three pseudogenes at the α-globin locus on chromosome 16. These were among the very first protein-coding pseudogenes to be recognized. (The pseudogenes are identified by ψ, the Greek letter psi.)
Most pseudogenes in mammals are processed pseudogenes (>70%) whereas in nematode, Drosophila, and zebrafish genomes the dominant type of pseudogene derives from duplication events followed by inactivation of one copy (Sisu et al., 2014). This is broadly consistent with the size of the genome and the number of transposons. The more transposons there are, the more reverse transcriptase will be produced in the nucleus, and this favors production of processed pseudogenes.
Support for this idea comes from examining the age of processed pseudogenes and the age of transposons as inferred from their sequences. In human, macaque, and mouse genomes there is a burst of processed pseudogene formation at the same time as bursts of retrotransposon insertions (Sisu et al., 2014).
Duplicated pseudogenes seem to arise at a much more constant rate, consistent with the idea that they are the result of accidental recombination events.
1. If there are 14,000 pseudogenes derived from protein-coding parents then about 12,600 of these show no evidence of biological function. Even if all 1,400 of the remaining pseudogenes were functional, they would account for less than 0.2% of the human genome.
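The arithmetic in this footnote can be sketched as follows. The counts and the 0.2% bound come from the footnote. The genome size of about 3.2 billion base pairs is an assumption added here for illustration, as are the variable names. Rather than guessing an average pseudogene length, the sketch works out how long the average pseudogene could be before 1,400 of them would exceed the 0.2% bound.

```python
# Figures taken from the footnote.
TOTAL_PSEUDOGENES = 14_000
PERCENT_WITHOUT_ACTIVITY = 90      # about 90% show no signs of activity
GENOME_BOUND_PER_THOUSAND = 2      # "less than 0.2%" of the genome

# Assumption (not from the text): haploid human genome of ~3.2 billion bp.
GENOME_SIZE_BP = 3_200_000_000

# Integer arithmetic keeps the counts exact.
no_evidence = TOTAL_PSEUDOGENES * PERCENT_WITHOUT_ACTIVITY // 100
remaining = TOTAL_PSEUDOGENES - no_evidence

# Total sequence that 0.2% of the assumed genome represents.
max_total_bp = GENOME_SIZE_BP * GENOME_BOUND_PER_THOUSAND // 1000

# Largest average length per pseudogene that stays under the bound.
max_mean_length_bp = max_total_bp / remaining

print(f"No evidence of function: {no_evidence}")
print(f"Remaining candidates:    {remaining}")
print(f"Sequence budget at 0.2%: {max_total_bp} bp")
print(f"Max mean length allowed: {max_mean_length_bp:.0f} bp")
```

Under the assumed genome size, the 0.2% figure holds as long as the remaining pseudogenes average no more than a few thousand base pairs each.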
Redon, R., Ishikawa, S., Fitch, K.R., Feuk, L., Perry, G.H., Andrews, T.D., Fiegler, H., Shapero, M.H., Carson, A.R., and Chen, W. (2006) Global variation in copy number in the human genome. Nature, 444:444-454. [doi: 10.1038/nature05329]
MacDonald, J.R., Ziman, R., Yuen, R.K., Feuk, L., and Scherer, S.W. (2014) The Database of Genomic Variants: a curated collection of structural variation in the human genome. Nucleic Acids Research, 42(D1):D986-D992. [doi: 10.1093/nar/gkt958]
Pei, B., Sisu, C., Frankish, A., Howald, C., Habegger, L., Mu, X.J., Harte, R., Balasubramanian, S., Tanzer, A., and Diekhans, M. (2012) The GENCODE pseudogene resource. Genome Biol, 13(9):R51. [doi: 10.1186/gb-2012-13-9-r51]
Sisu, C., Pei, B., Leng, J., Frankish, A., Zhang, Y., Balasubramanian, S., Harte, R., Wang, D., Rutenberg-Schoenberg, M., and Clark, W. (2014) Comparative analysis of pseudogenes across three phyla. Proceedings of the National Academy of Sciences, 111(37):13361-13366. [doi: 10.1073/pnas.1407293111]