New Delhi: Scientists have decoded the entire human genome. In the early 2000s, about 92 per cent of the genome had been decoded under the Human Genome Project. Now, the remaining eight per cent has been sequenced. The findings are described in six papers in the journal Science.
What Is A Genome?
A genome is the complete set of genetic information in an organism, and provides all of the information the organism requires to function. The genome of living organisms is stored in long molecules of DNA (deoxyribonucleic acid) called chromosomes.
Genes are small sections of DNA which code for the RNA (ribonucleic acid) and protein molecules required by the organism.
The human genome consists of 23 pairs of chromosomes. Between 1990 and 2003, the Human Genome Project sequenced most, but not all, of this genome.
In the project, DNA from a region called euchromatin was sequenced while that from a region called heterochromatin was left out. Euchromatin is a lightly packed form of chromatin (a complex of DNA and proteins that forms chromosomes within the nucleus of eukaryotic cells) that is enriched in genes, and easily transcribed. In other words, the genes in euchromatin are readily read by the cell to make proteins.
Heterochromatin is highly condensed, gene-poor, and transcriptionally silent, which means its DNA is largely not used to make proteins.
Why Was Heterochromatin Left Out?
Initially, scientists deprioritised heterochromatin. The euchromatic regions contained more genes and were simpler to sequence, and the genomic tools available at that time could parse euchromatic DNA far more easily than the highly repetitive heterochromatic DNA.
In simple words, advanced tools to decode heterochromatin were not available. Those tools were developed over the years.
Even after 20 years of upgrades, eight per cent of the genome remained unsequenced and unstudied. Roughly 151 million base pairs of sequence data scattered throughout the genome, derided by some as "junk DNA" with no clear function, were still a black box, The Rockefeller University in New York said in a statement.
Now, researchers led by Adam Phillippy at the United States National Institutes of Health have revealed the final eight per cent of the human genome, and found that these long-missing pieces of our genome contain more than mere junk.
Heterochromatin is largely non-coding DNA. Because it does not make protein, it was long thought to be junk DNA, but it still plays crucial roles in many cellular functions.
Erich D Jarvis, a researcher at The Rockefeller University, said that from the missing eight per cent, researchers are now gaining an entirely new understanding of how cells divide, allowing them to study a number of diseases they had not been able to get at before, according to the statement.
The heterochromatic sequences at centromeres, which coordinate the separation of chromosomes during cell division, were all marked with long runs of the letter N, for "unknown base", in the human reference genome. A centromere is a constricted region of a chromosome that separates it into a short arm (p) and a long arm (q). Jarvis said that not even all of the euchromatic genome had been sequenced properly and that errors, such as false duplications, needed to be fixed.
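To make the idea of "long runs of N" concrete, here is a minimal sketch of how such placeholder stretches could be located in a sequence stored as plain text. The function name find_unknown_runs and the toy sequence are invented for illustration; real reference genomes are far larger and are normally handled with dedicated bioinformatics tools.

```python
import re

def find_unknown_runs(sequence, min_length=1):
    """Return (start, end) positions of each run of 'N' in the sequence.

    Positions are 0-based and the end is exclusive. Runs shorter than
    min_length are ignored.
    """
    pattern = re.compile(r"N+", re.IGNORECASE)
    return [
        (match.start(), match.end())
        for match in pattern.finditer(sequence)
        if match.end() - match.start() >= min_length
    ]

# Toy sequence made up for this example; it is not real genomic data.
toy_reference = "ACGTNNNNNNACGTTNNACG"

# Report only gaps of at least three unknown bases.
gaps = find_unknown_runs(toy_reference, min_length=3)
print(gaps)  # [(4, 10)]
```

Each pair marks where a gap starts and ends, which is the kind of region the new assembly replaces with actual bases.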
Jarvis and other scientists were able to finish what the Human Genome Project started, with updated tools and renewed resolve. The researchers, at long last, could successfully describe a true complete human genome. The euchromatic regions of the human genome have been revised, and the heterochromatic regions are on full display.
Jarvis said that "every single base pair of a human genome is now complete".
Giulio Formenti, another researcher at The Rockefeller University, said "we are finally digging into what we once called junk DNA, because we could not understand it or look at it accurately".
The work is the result of the efforts of a global collaboration called the Telomere-to-Telomere (T2T) consortium, led by researchers at the National Human Genome Research Institute (NHGRI); University of California Santa Cruz; and University of Washington, Seattle.
What Is In The 8% Of The Human Genome?
With laboratory techniques, computational biology approaches and other essential research resources, scientists have decoded the DNA in heterochromatin.
The new reference genome, called T2T-CHM13, adds nearly 200 million base pairs of novel sequence, according to the main Science paper, "The complete sequence of a human genome". The added sequence includes 99 genes likely to code for proteins and nearly 2,000 candidate genes which need to be studied further.
The new reference genome also corrects thousands of structural errors in the current reference sequence.
The gaps filled by the new sequence include the entire short arms of five human chromosomes, and cover some of the most complex regions of the genome, which include highly repetitive DNA sequences found in and around chromosomal structures such as the telomeres and centromeres. A telomere is the end of a chromosome, and is made of repetitive sequences of non-coding DNA that protect the chromosome from damage. The telomeres become shorter each time a cell divides. The centromeres coordinate the separation of replicated chromosomes during cell division.
The new sequence also reveals previously undetected segmental duplications. These are long stretches of DNA that are copied elsewhere in the genome, and are known to play important roles in evolution and disease, the University of Washington School of Medicine, Seattle, said in a statement.
The segmental duplications are critical to understanding human evolution and genetic diversity, as well as resistance or susceptibility to many diseases. There are about 20,000 genes in the human genome. Of these, about 950 originate in segmental duplications.
Karen Miga, a researcher at University of California Santa Cruz, said in a university statement that these parts of the human genome that scientists have not been able to study for 20-plus years are important to our understanding of how the genome works, genetic diseases, and human diversity and evolution.
Even if some of the newly revealed regions do not include active genes, they have important functions.
Why Is The New Reference Genome Important?
The T2T Consortium is now collaborating with the Human Pangenome Reference Consortium, which aims to create a new "human pangenome reference" based on the complete genome sequences of 350 individuals, according to the University of California, Santa Cruz. The pangenome represents the entire set of genes within a species, consisting of a core genome and the 'dispensable' genome. The core genome contains sequences shared between all individuals of the species, while the dispensable genome contains sequences found in only some individuals.
The standard human reference genome is known as Genome Reference Consortium build 38 (GRCh38). It is derived from the sequence produced under the Human Genome Project, and has been continually updated since the first draft in 2000. The new T2T reference will complement the standard reference genome (GRCh38).
The standard reference genome (GRCh38) does not represent any one individual, but was assembled from multiple donors. The Human Pangenome Project will enable the comparison of newly sequenced genomes to multiple complete genomes representing a range of human ancestries.
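The split between a core and a dispensable genome can be pictured with ordinary set operations. The sketch below assumes each individual's genome has already been reduced to a set of sequence labels; the function name split_pangenome, the individuals and the labels are all hypothetical, and building a real pangenome is a far more involved process.

```python
def split_pangenome(genomes):
    """Split sequences into pangenome, core and dispensable sets.

    genomes maps an individual's name to the set of sequence labels
    found in that individual.
    """
    all_sets = list(genomes.values())
    if not all_sets:
        return set(), set(), set()
    pangenome = set().union(*all_sets)   # everything seen in anyone
    core = set.intersection(*all_sets)   # shared by every individual
    dispensable = pangenome - core       # present in only some individuals
    return pangenome, core, dispensable

# Made-up individuals and labels, for illustration only.
individuals = {
    "person_a": {"seq1", "seq2", "seq3"},
    "person_b": {"seq1", "seq2", "seq4"},
    "person_c": {"seq1", "seq2"},
}

pangenome, core, dispensable = split_pangenome(individuals)
print(sorted(core))         # ['seq1', 'seq2']
print(sorted(dispensable))  # ['seq3', 'seq4']
```

Adding more individuals can only shrink the core and grow the pangenome, which is one reason a reference built from many genomes captures more human diversity than a single sequence.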
If you have a reference human genome that is complete, the genome of any other individual can be compared with it. If there are genetic variants, they may give clues to possible causes of genetic disease.
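As a rough illustration of comparing an individual's sequence with a reference, the sketch below lists the positions where two sequences differ. It assumes the two sequences are already aligned and of equal length, and it only detects single-letter substitutions; the function name find_substitutions and both toy sequences are invented for this example. Real variant calling also has to handle insertions, deletions and sequencing errors.

```python
def find_substitutions(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch.

    Positions are 0-based. Both sequences must have the same length,
    because this sketch assumes they are already aligned.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]

# Toy sequences made up for this example; not real genomic data.
toy_reference = "ACGTACGTAC"
toy_sample = "ACGTTCGTAA"

variants = find_substitutions(toy_reference, toy_sample)
print(variants)  # [(4, 'A', 'T'), (9, 'C', 'A')]
```

A gap in the reference means there is nothing to compare against at that location, so variants there go unseen.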
A significant amount of human genetic material turns out to be long, repetitive sections that occur over and over, and although every human has some repeats, not everyone has the same number of them. Most human genetic variation is found in these differences in the number of repeats.
Genetic variations, then, are stretches of DNA that differ from person to person. The T2T reference genome revealed millions of genetic variations.
Miga said that many of these new variants are in genes known to contribute to disease, and that they can now be spotted because we have a more complete and accurate genome.
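The point about differing repeat counts can be shown with a small sketch that counts how many back-to-back copies of a short motif appear in a sequence. The function name longest_tandem_run, the motif "CAG" and the two toy sequences are chosen purely for illustration and do not come from the studies described here.

```python
import re

def longest_tandem_run(sequence, motif):
    """Return the largest number of back-to-back copies of motif in sequence."""
    if not motif:
        raise ValueError("motif must not be empty")
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    copies_per_run = [
        len(match.group(0)) // len(motif)
        for match in pattern.finditer(sequence)
    ]
    return max(copies_per_run, default=0)

# Two made-up sequences that differ only in how often the motif repeats.
person_a = "GGT" + "CAG" * 5 + "TTA"
person_b = "GGT" + "CAG" * 9 + "TTA"

print(longest_tandem_run(person_a, "CAG"))  # 5
print(longest_tandem_run(person_b, "CAG"))  # 9
```

Both sequences contain the same motif, yet they differ in length and repeat count, which is the kind of person-to-person difference that is hard to measure when the repetitive region is missing from the reference.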
According to a companion paper published in Science, the new reference genome "reveals unprecedented levels of human genetic variation in genes important for neurodevelopment and human diseases."
Evan E Eichler, a Professor of Genome Sciences at the University of Washington School of Medicine, Seattle, said in a statement that a lot is not understood about disease and evolution. "95% of the puzzle being solved is good enough for some people. But I guess for me, getting that last 5% was so important because I believe so much of what we don't understand about disease, or we don't understand about evolution is disproportionately represented in that 5% of the genome that we didn't sequence first off," he said.
Eichler also said that the complete blueprint is going to revolutionise the way we think about human genomic variation, disease, and evolution.
Therefore, having a complete, gap-free sequence of the roughly three billion bases in our DNA is critical for understanding the full spectrum of human genomic variation and the genetic contributions to certain diseases, according to researchers. It will also help scientists understand genetic evolution.