Human Genome Far More Active Than Thought
GENCODE Consortium discovered far more genes than previously thought
The GENCODE Consortium expects that the human genome has twice as many genes as previously thought, many of which might have a role in cellular control and could be important in human disease. This remarkable discovery comes from the GENCODE Consortium, which has carried out a painstaking and skilled review of available data on gene activity.
Among their discoveries, the team describe more than 10,000 novel genes, identify genes that have 'died' and others that are being resurrected.
The GENCODE Consortium reference gene catalogue has been one of the underpinnings of the larger ENCODE Project and will be essential for the full understanding of the role of our genes in disease.
The GENCODE Consortium is part of the ENCODE Project that, today, publishes 30 research papers describing findings from their nearly decade-long effort to describe comprehensively all the active regions of our human genome. ENCODE was launched in 2003 after the completion of the Human Genome Project, and brought together an international group of scientists tasked with identifying and describing all functional regions of the human genome sequence.
"We have uncovered a staggering array of genes in our genome, simply because we can examine many genomes in a detail that was not possible a decade ago," says Dr Jennifer Harrow, GENCODE principal investigator from the Wellcome Trust Sanger Institute. "As sequencing technology improves, so we have much more data to explore.
"But our work remains a skilled effort to annotate correctly our human genome or, more precisely, our human genomes, for each of us differ. These vast texts of genetic information will not give up their secrets easily. GENCODE has made amazing strides to enable immediate access of its reference gene set by other researchers."
The team more accurately described the genes that contain the genetic code to make proteins: they found 20,687 such protein-coding genes, a value that has not changed greatly from previous work. The new set captures far more of the alternative forms of these genes found in different cell types.
More significant are their findings on genes that do not contain genetic code to make proteins (non-coding genes) and on the graveyard of supposedly 'dead' genes, the catalogue of pseudogenes, from which some are emerging, resurrected.
They mapped and described 9,277 long non-coding genes, a relatively new type that acts, not through producing a protein, but directly through its RNA messenger. Long non-coding RNAs derived from these genes can play a significant part in human biology and disease, but they remain only poorly understood.
The new map of such genetic components gives researchers more avenues to explore in their quest to understand human biology and human disease.
Remarkably, the team think their job is not complete and believe that there may be another 10,000 of these genes yet to be uncovered.
"Our initial work from the Human Genome Project suggested there were around 20,000 protein-coding genes and that value has not changed greatly," says Professor Roderic Guigo, GENCODE principal investigator from the Centre for Genomic Regulation, Barcelona. "However, GENCODE has shown that long non-coding RNAs are far more numerous and important than previously thought."
"The limited knowledge we have of the class of long non-coding RNAs suggests they might play a major role in regulating the activity of other genes. If this is generally true of this group, we have much more to explore than we imagined."
Equally dramatic, GENCODE has catalogued for the first time a set of more than 11,000 pseudogenes by examining the entire human genome. There is some emerging evidence that many of these genes, too, might have some biological activity.
The GENCODE team predict that at least 9% of pseudogenes may be active, with some controlling the activity of other genes. Pseudogenes have been implicated in many biological activities, such as inhibiting certain elements known to be involved in the development of cancer.
"At the announcement of the Human Genome Project draft sequence, we emphasized this was the end of the beginning, that 'at present most genes - probably tens of thousands - remain a mystery'", says Dr Tim Hubbard, lead principal investigator of GENCODE from the Wellcome Trust Sanger Institute. "Today, we describe many thousands of genes for the first time."
"If the Human Genome Project was the baseline for genetics, ENCODE is the baseline for biology, and GENCODE are the parts that make the human biological machine work. Our list is essential to all those who would fix the human machine."
The GENCODE human reference set will be updated every three months to ensure that models are continually refined and assessed based on new experimental data deposited in the public databases.
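For readers who work with the reference set directly, the per-category gene counts quoted in this article (protein-coding, long non-coding, pseudogene) are the kind of tally that can be made from an annotation file. The following is a minimal sketch, not the consortium's own method. It assumes the annotation is in tab-separated GTF format with a `gene_type` attribute in the ninth column. The function name `count_gene_types`, the helper `_row`, and the sample records (gene IDs, coordinates, category labels) are all invented for illustration.

```python
import re
from collections import Counter

# Assumption: each record carries an attribute of the form gene_type "<category>"
# in the ninth (attributes) column of a GTF-style line.
GENE_TYPE = re.compile(r'gene_type "([^"]+)"')


def count_gene_types(gtf_lines):
    """Count 'gene' features per gene_type in an iterable of GTF-style lines."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Keep only whole-gene records; transcripts and exons would double count.
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_TYPE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


def _row(feature, gene_id, gene_type):
    """Build one made-up GTF-style line (hypothetical helper for the sample)."""
    attrs = f'gene_id "{gene_id}"; gene_type "{gene_type}";'
    return "\t".join(["chr1", "EXAMPLE", feature, "100", "900", ".", "+", ".", attrs])


# Invented sample data, not real GENCODE records.
SAMPLE = [
    "##description: made-up sample",
    _row("gene", "GENE_A", "protein_coding"),
    _row("transcript", "GENE_A", "protein_coding"),  # ignored: not a gene record
    _row("gene", "GENE_B", "lincRNA"),
    _row("gene", "GENE_C", "pseudogene"),
    _row("gene", "GENE_D", "protein_coding"),
]

if __name__ == "__main__":
    for gene_type, n in sorted(count_gene_types(SAMPLE).items()):
        print(gene_type, n)
```

To run this against a real release, the lines of a downloaded annotation file would be passed in place of `SAMPLE`; the category labels in that file may differ from the ones used here.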
Original article: http://www.sanger.ac.uk/about/press/2012/120905.html
GENCODE, "The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression"
Thomas Derrien1,11, Rory Johnson1,11, Giovanni Bussotti1, Andrea Tanzer1, Sarah Djebali1, Hagen Tilgner1, Gregory Guernec2, David Martin1, Angelika Merkel1, David G. Knowles1, Julien Lagarde1, Lavanya Veeravalli3, Xiaoan Ruan3, Yijun Ruan3, Timo Lassmann4, Piero Carninci4, James B. Brown5, Leonard Lipovich6, Jose M. Gonzalez7, Mark Thomas7, Carrie A. Davis8, Ramin Shiekhattar9, Thomas R. Gingeras8, Tim J. Hubbard7, Cedric Notredame1, Jennifer Harrow7 and Roderic Guigó1,10,12
1Bioinformatics and Genomics, Centre for Genomic Regulation (CRG) and UPF, 08003 Barcelona, Catalonia, Spain;
2INRA, UR1012 SCRIBE, IFR140, GenOuest, 35000 Rennes, France;
3Genome Institute of Singapore, Agency for Science, Technology and Research, Genome 138672, Singapore;
4Riken Omics Science Center, Riken Yokohama Institute, Yokohama, Kanagawa 351-0198, Japan;
5Department of Statistics, University of California, Berkeley, California 94720, USA;
6Center for Molecular Medicine and Genetics, Wayne State University, Detroit, Michigan 48201, USA;
7Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1HH, United Kingdom;
8Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA;
9The Wistar Institute, Philadelphia, Pennsylvania 19104, USA;
10Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, 08002 Barcelona, Catalonia, Spain
11 These authors contributed equally to this work.
The human genome contains many thousands of long noncoding RNAs (lncRNAs). While several studies have demonstrated compelling biological and disease roles for individual examples, analytical and experimental approaches to investigate these genes have been hampered by the lack of comprehensive lncRNA annotation. Here, we present and analyze the most complete human lncRNA annotation to date, produced by the GENCODE consortium within the framework of the ENCODE project and comprising 9277 manually annotated genes producing 14,880 transcripts. Our analyses indicate that lncRNAs are generated through pathways similar to that of protein-coding genes, with similar histone-modification profiles, splicing signals, and exon/intron lengths. In contrast to protein-coding genes, however, lncRNAs display a striking bias toward two-exon transcripts, they are predominantly localized in the chromatin and nucleus, and a fraction appear to be preferentially processed into small RNAs. They are under stronger selective pressure than neutrally evolving sequences, particularly in their promoter regions, which display levels of selection comparable to protein-coding genes. Importantly, about one-third seem to have arisen within the primate lineage. Comprehensive analysis of their expression in multiple human organs and brain regions shows that lncRNAs are generally lower expressed than protein-coding genes, and display more tissue-specific expression patterns, with a large fraction of tissue-specific lncRNAs expressed in the brain. Expression correlation analysis indicates that lncRNAs show particularly striking positive correlation with the expression of antisense coding genes. This GENCODE annotation represents a valuable resource for future studies of lncRNAs.
ENCODE; ENCODE Pilot Project: Overview
The National Human Genome Research Institute (NHGRI) launched a public research consortium named ENCODE, the Encyclopedia Of DNA Elements, in September 2003, to carry out a project to identify all functional elements in the human genome sequence.
The pilot phase tested and compared existing methods to rigorously analyze a defined portion of the human genome sequence. It was organized as an open consortium (See: ENCODE Pilot Project: Participants and Projects) and brought together investigators with diverse backgrounds and expertise to evaluate the relative merits of each of a diverse set of techniques, technologies and strategies. The concurrent technology development phase of the project aimed to develop new high throughput methods to identify functional elements. The goal of these efforts was to identify a suite of approaches that would allow the comprehensive identification of all the functional elements in the human genome. Through the ENCODE pilot, NHGRI assessed the abilities of different approaches to be scaled up for an effort to analyze the entire human genome and to find gaps in the ability to identify functional elements in genomic sequence.
The ENCODE Pilot Project process involved close interactions between computational and experimental scientists to evaluate a number of methods for annotating the human genome. A set of regions (See: ENCODE Pilot Project: Target Selection) representing approximately 1 percent (30 Mb) of the human genome was selected as the target for the pilot project and was analyzed by all ENCODE Pilot Project investigators. All data generated by ENCODE participants on these regions was rapidly released into public databases. The ENCODE Pilot Project Consortium was open to all academic, government and private sector scientists interested in participating in an open process to facilitate the comprehensive interpretation of the human genome sequence and who agreed to the criteria for participation for the project.