New Study Reveals Hidden Genes in Human Genome, Potentially Transforming Treatments

Edited by: TashaS Samsonova

Thousands of new genes are hidden inside the "dark matter" of our genome. A recent study indicates that some of these tiny DNA snippets can produce miniproteins, which may lead to new treatments, including vaccines and immunotherapies for severe brain cancers.

The preprint, which has yet to undergo peer review, comes from a global consortium dedicated to discovering potential new genes. Since the Human Genome Project completed its initial draft at the turn of the century, scientists have been working to decode the genetic book of life. Within sequences of the four genetic letters (A, T, C, and G) lies crucial information that could help combat significant medical challenges, such as cancer.

Initially, the Human Genome Project revealed fewer than 30,000 genes responsible for building and maintaining human bodies, approximately a third of earlier predictions. Now, nearly two decades later, advancements in DNA sequencing technologies prompt scientists to ask, "What have we missed?"

The new study addresses this gap by investigating relatively unexplored genome regions known as "non-coding." These segments have not been associated with any proteins yet. By integrating multiple existing datasets, the researchers identified thousands of potential new genes responsible for producing approximately 3,000 miniproteins.

The functionality of these proteins remains to be determined, but preliminary studies suggest involvement in a lethal childhood brain cancer. The research team is making their tools and findings available to the broader scientific community for further investigation. Their platform extends beyond human genetics, allowing exploration of the genetic blueprints of other organisms.

Despite remaining mysteries, the results "help provide a more complete picture of the coding portion of the genome," stated Ami Bhatt from Stanford University.

Sequencing a genome resembles reading a book without punctuation. While sequencing has become more accessible due to reduced costs and improved efficiency, interpreting the data is complex. Since the Human Genome Project, researchers have sought to identify the "words" or genes that produce proteins. These DNA sequences are further divided into three-letter codons, each encoding a specific amino acid, the fundamental unit of proteins.
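To make the idea of three-letter codons concrete, here is a minimal Python sketch. It is an illustration only, not anything taken from the study: the codon table is deliberately partial (the full genetic code has 64 codons), and the function names and the example sequence are made up for this purpose.

```python
# Partial codon table for illustration only; the full genetic code has 64 codons.
CODON_TABLE = {
    "ATG": "Met",   # methionine, also the usual start signal
    "GCC": "Ala",   # alanine
    "AAA": "Lys",   # lysine
    "TGG": "Trp",   # tryptophan
    "TAA": "Stop",
    "TAG": "Stop",
    "TGA": "Stop",
}


def split_codons(dna):
    """Cut a DNA string into consecutive three-letter codons, dropping leftover letters."""
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def translate(dna):
    """Translate codons into amino acid names, stopping at the first stop codon."""
    amino_acids = []
    for codon in split_codons(dna.upper()):
        residue = CODON_TABLE.get(codon, "???")  # "???" = not in this partial table
        if residue == "Stop":
            break
        amino_acids.append(residue)
    return amino_acids


# Hypothetical example sequence: ATG GCC AAA TGG TAA
print(translate("ATGGCCAAATGGTAA"))  # ['Met', 'Ala', 'Lys', 'Trp']
```

Reading the same letters starting one position later would produce entirely different codons, which is part of why finding the right "words" in an unpunctuated genome is hard.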

When activated, a gene is transcribed into messenger RNA, which conveys genetic information from DNA to the ribosome, the cell's protein synthesis factory. The ribosome can be visualized as a sliced bun with an RNA molecule running through it.

Scientists initially defined genes by focusing on open reading frames, stretches of DNA bounded by specific sequences that mark where a gene begins and ends. Using this framework, they scan the genome for potential genes, which are then validated through laboratory experiments against various criteria, including the ability to produce proteins exceeding 100 amino acids in size. Sequences meeting these criteria are compiled in GENCODE, an international gene database.
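The open-reading-frame scan described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure used by GENCODE or the consortium: it looks only at the forward strand, treats ATG as the sole start signal, ignores nested start codons, and uses a hypothetical function name and made-up test sequences. The 100-amino-acid cutoff is the size criterion mentioned in the article.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_amino_acids=100):
    """Return sorted (start, end, n_amino_acids) tuples for forward-strand ORFs.

    start and end are 0-based indices into dna; end is exclusive and
    includes the stop codon. n_amino_acids counts the start codon's
    methionine but not the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three possible reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                n_amino_acids = (i - start) // 3
                if n_amino_acids >= min_amino_acids:
                    orfs.append((start, i + 3, n_amino_acids))
                start = None                    # close it and keep scanning
    return sorted(orfs)


# Made-up sequences: one short ORF (6 amino acids) and one long ORF (121).
short_orf = "ATG" + "GCC" * 5 + "TAA"
long_orf = "ATG" + "GCC" * 120 + "TAA"
dna = "CC" + short_orf + "C" + long_orf

print(find_orfs(dna))                     # only the long ORF passes the cutoff
print(find_orfs(dna, min_amino_acids=1))  # lowering the cutoff reveals the short one
```

With the default cutoff, the short reading frame is silently discarded, which illustrates how a size threshold alone can hide small candidate genes from a scan like this.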

Genes encoding proteins have garnered significant attention due to their relevance in understanding diseases and inspiring treatment approaches. However, a substantial portion of our genome is "non-coding," meaning large segments do not produce any recognized proteins. For years, these DNA regions were deemed junk, remnants of evolutionary history. Recent research, however, has begun revealing their hidden significance.

Some non-coding sequences regulate gene activation, while others, like telomeres, protect DNA from degradation during replication and mitigate aging effects. Despite the prevailing belief that non-coding regions do not produce proteins, emerging evidence suggests otherwise.

One study identified a missing section in non-coding regions that led to inherited bowel issues in infants. In genetically modified mice mimicking this condition, restoring the DNA snippet, which has not yet been defined as a gene, alleviated symptoms. The authors emphasized the necessity of exploring beyond known protein-coding genes to explain clinical observations.

Referred to as non-canonical open reading frames (ncORFs) or "maybe-genes," these sequences have been detected across various human cell types and diseases, indicating potential physiological functions.

In 2022, the consortium began investigating possible functions, aiming to expand the genetic vocabulary. Instead of sequencing the genome, they analyzed datasets that tracked RNA being converted into proteins in the ribosome, capturing the genome's actual output, including short amino acid chains typically considered too small to count as proteins. This search yielded a catalog of over 7,000 human "maybe-genes," some of which produced microproteins identified in cancer and heart cells.

However, at that stage, the team noted, "we did not focus on the questions of protein expression or functionality." They expanded collaboration in the new study, incorporating protein science specialists from over 20 institutions worldwide to interpret the "maybe-genes."

The team also utilized various resources providing protein databases, such as the Human Proteome Organization and PeptideAtlas, and incorporated data from experiments utilizing the human immune system to identify protein fragments.

In total, the researchers analyzed over 7,000 "maybe-genes" from diverse cell types: healthy, cancerous, and immortalized lab-grown lines. At least a quarter of these "maybe-genes" translated into over 3,000 miniproteins, significantly smaller than typical proteins, with unique amino acid compositions. They appear more aligned with immune system components, suggesting potential applications in developing vaccines, autoimmune therapies, or immunotherapies.

Some of these newly identified miniproteins may lack biological roles. Nevertheless, the study presents a novel approach for scientists to interpret potential functions. For quality assurance, the team classified each miniprotein into tiers based on experimental evidence and integrated them into an existing database for further exploration.

Research into the genome's dark matter is just beginning, with numerous questions still unanswered. The authors noted, "A unique capacity of our multi-consortium collaboration is the ability to develop consensus on the key challenges that we feel need answers." For instance, certain experiments used cancer cells, so specific "maybe-genes" may be active only in those cells, raising the question of whether they should be classified as genes.

Future analysis may benefit from deep learning and AI methods, expediting the identification process. Although gene annotation has traditionally relied on manual data inspection, the authors assert that AI can rapidly process multiple datasets, serving as an initial screening for new gene discovery.

How many new genes might scientists uncover? According to study author Thomas Martinez, "50,000 is in the realm of possibility."
