
METHODS

The methodology used for each data point differs. My original intention was to use the same methodology for every data point, but this has not been possible because the databases and search engines have kept changing. In addition, as time has gone on, other people have begun tracking the progress of the human genome project. Since these people work at the databases themselves and have written automated scripts, I use their figures to spare myself the tedium of recomputing them. One drawback to using other people's data is that I have yet to find any of it accompanied by a rigorous description of the methodology (i.e., one that would allow the effort to be replicated). I have occasionally checked these other figures against my own, and they tend to be quite similar, so I judge it reasonable to include them with my data. Back when less than 1% of the genome had been sequenced, the relative error between estimates was much larger; now it is insignificant.

Description of Recent Datapoints

Data points from 2001 onward are all obtained from the Human Genome Sequencing Progress Report.

Description of the 3/15/00 Datapoint

All finished human Genbank contigs of at least 10 kb were downloaded on 3/15/00. The specific batch Entrez command was:

biomol genomic [PROP] AND 10000:600000 [SLEN] AND human [ORGN] NOT gbdiv htg [PROP]
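
(For anyone wanting to repeat a search like this today, an equivalent query can be submitted programmatically through the NCBI E-utilities. The Python sketch below is purely illustrative, not what I actually ran; it assumes the esearch endpoint will accept the same term syntax, and of course the counts returned now will differ.)

# Illustrative only: roughly the same search issued through NCBI E-utilities (esearch).
# Not the method used for this datapoint.
import json
import urllib.parse
import urllib.request

TERM = ('biomol genomic [PROP] AND 10000:600000 [SLEN] '
        'AND human [ORGN] NOT gbdiv htg [PROP]')

url = ('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?'
       + urllib.parse.urlencode({'db': 'nucleotide', 'term': TERM,
                                 'retmax': 100000, 'retmode': 'json'}))

with urllib.request.urlopen(url) as response:
    result = json.load(response)['esearchresult']

print('matching records:', result['count'])
print('first few ids:', result['idlist'][:5])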

This download yielded 6559 sequences with an average length of 94,732 bp and a total length of 621,349,094 bp. Uncorrected, this would be 20.7% of the human genome. The figure is an overcount because portions of the genome have been sequenced redundantly. I estimated the overcount by doing 1,679,350 random pairwise comparisons among the 6559 sequences (out of 21,506,961 possible pairwise comparisons). The comparisons were done with CrossMatch:

cross_match -minmatch 99 -minscore 970 -maxgap 10 -penalty -200

Only matches compatible with a contig-building overlap were counted (in other words, the two sequences may not overlap outside of the matched region). This random sampling yielded 203 overlaps averaging 19,254 bp, or about 1 overlap per 8,273 pairwise comparisons. Extrapolated to all possible comparisons, that is about 2600 overlaps in the human genome data, redundantly sequencing about 50,056 kb, or about 1.7% of the genome. Since I ignore higher-order overlaps, this is an overcount; since I use extremely high stringency, it is an undercount; since I did not screen for repeats, it is an overcount. Combining these factors, I judge my figure to be a slight overcount. Subtracting this estimate of redundant sequencing from the total, I arrive at my figure of 571,290,000 bp uniquely sequenced.
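
The correction itself is simple extrapolation. Here is a sketch of the arithmetic in Python; the 3.0 billion bp genome size is an assumption (it is the figure implied by the 20.7% uncorrected value above):

# Back-of-the-envelope version of the redundancy correction described above.
sampled_pairs   = 1_679_350     # random pairwise comparisons actually run
possible_pairs  = 21_506_961    # all pairs among the 6559 sequences
overlaps_found  = 203           # contig-compatible overlaps in the sample
mean_overlap_bp = 19_254        # average length of those overlaps
total_bp        = 621_349_094   # summed length of the downloaded sequences
genome_bp       = 3.0e9         # assumed genome size

overlap_rate = overlaps_found / sampled_pairs      # about 1 in 8,273
est_overlaps = overlap_rate * possible_pairs       # about 2,600
redundant_bp = est_overlaps * mean_overlap_bp      # about 50 Mb
unique_bp    = total_bp - redundant_bp             # about 571,290,000 bp

print(f"estimated overlaps: {est_overlaps:,.0f}")
print(f"redundant bp:       {redundant_bp:,.0f}")
print(f"unique bp:          {unique_bp:,.0f} ({unique_bp / genome_bp:.1%} of the genome)")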

It would clearly be possible to estimate the overlap by near-exact enumeration of all pairwise comparisons. That would take somewhat more effort, but it is doable. At present the extra effort would not materially change the estimated percentage of the human genome sequenced for this datapoint, but it will need to be undertaken when greater precision is required.

For reference, Greg Schuler's count on 2/25/00 was 535 Mb. One can conclude that Greg's and my methodologies are nearly equivalent.

Description of the 5/28/99 through 1/14/00 Datapoints

The data points of 5/28/99 through 1/14/00 were taken from Greg Schuler's NCBI site.

Description of the 8/10/98 Datapoint

The calculations for this datapoint were performed on September 15, 1998, as follows. I downloaded the Genbank flatfiles by ftp from ncbi.nlm.nih.gov. This was version 108.0 of Genbank, which includes all sequences submitted through 8/10/98. The only flatfiles relevant for my purposes were gbpri1.seq and gbpri2.seq. I simplified their format with the following command:

grep -E '^LOCUS|ORGANISM' input > output

The resulting files were then processed with a combination of Microsoft WORD and EXCEL to eliminate non-human sequences and RNA sequences. The resulting sequences were sorted by size and the statistics on the web page were tabulated.

To summarize: Genbank contained 86,028 primate sequences, of which 81,470 were human; 2,576 of those were greater than 10 kb, and 2,507 of those were DNA. These 2,507 sequences were used in the calculations.
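
The Word/Excel step above amounts to pairing each LOCUS line with its entry's ORGANISM line and filtering. Here is a minimal Python sketch of that filtering, assuming the grep output above and the whitespace-separated fields of the LOCUS line (this is not the procedure I actually used):

# Keep human genomic DNA entries of 10 kb or more from the grep output above.
# Assumes each LOCUS line precedes its entry's ORGANISM line, and that the
# LOCUS line carries "<name> <length> bp <molecule type>" as separate fields.
import sys

MIN_LEN = 10_000
current = None                      # (name, length, molecule) from the last LOCUS line
kept = []

for line in open(sys.argv[1]):
    fields = line.split()
    if line.startswith("LOCUS") and len(fields) >= 5:
        current = (fields[1], int(fields[2]), fields[4])
    elif fields and fields[0] == "ORGANISM" and current is not None:
        organism = " ".join(fields[1:])
        name, length, molecule = current
        if organism == "Homo sapiens" and molecule == "DNA" and length >= MIN_LEN:
            kept.append((name, length))
        current = None

kept.sort(key=lambda item: item[1], reverse=True)
print(len(kept), "entries,", sum(length for _, length in kept), "bp total")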

Description of the 11/18/97 Datapoint

The methodology used for the November 18, 1997 calculations was as follows. I used the following query for the "query" email server at NCBI:

DB n
TERM 200000 : 249999 [SLEN] & Homo sapiens [ORGN]
DOPT d
DISPMAX 4000 

Sequence length ranges were altered as appropriate. The resulting list of 2000 gis was then submitted in batches to the ENTREZ batch server at NCBI to obtain the Genbank flatfiles. These flatfiles were then processed with a combination of grep and Microsoft EXCEL to eliminate unfinished HTG sequences and RNA sequences. The resulting sequences were sorted by size and the statistics on the web page were tabulated. A useful grep command is:

grep -E '^LOCUS|^KEY|^DEF' input > output

Note that batch ENTREZ returned more than 2000 records for the 2000 gis that I submitted. I assume this is due to updates and gi cross-references. Most of these additional sequences were less than 10,000 bp or were unfinished HTG entries and were thus eliminated during my downstream processing; perhaps they all were. I do not fully understand the gi cross-referencing system or the various NCBI search engines. In any case, I feel that my methodology was adequate for my purposes, although I have a wish list of improvements to the search engines.
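
As a quick sanity check on those extra records, one can count how many of the returned flatfiles share a locus name. A sketch (illustrative only; it assumes the grep output above as input):

# Count how many locus names appear more than once in the returned records.
import collections
import sys

counts = collections.Counter(
    line.split()[1]
    for line in open(sys.argv[1])
    if line.startswith("LOCUS")
)
duplicates = {name: n for name, n in counts.items() if n > 1}
print(sum(counts.values()), "records,", len(counts), "distinct locus names,",
      len(duplicates), "names returned more than once")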

Description of the 5/1/97 Datapoint

In short, this is the data point in which I have the least confidence. I had to stop using GSDB because of changes to that database and its search engine. This forced me back to Genbank, which had an inferior search engine as of May 1997. I used the batch Entrez server to download all gis for human records, then submitted these in batches of 20,000 back to the batch Entrez server. However, at times I received fewer than 20,000 documents back (e.g., 19,973), and despite multiple emails back and forth with NCBI I was never able to get a satisfactory explanation. I then sorted all the human records by size and tabulated the relevant data. All of this took a lot of local disk space and a tremendous amount of download time.

Note that by November 1997, I was able to use a slightly more efficient search methodology, due to my ability to retrieve records by size ranges through the query email server. This still had some problems but nowhere near those of the May 1997 data acquisition.

Description of the 11/12/96 Datapoint

The following SQL query was submitted to GSDB to get the necessary data:

select 	distinct 
	substring(Taxon.scientific_name, 1,15),
	substring(Source.molecule_sequenced,1,7),
	Sequence.name, Sequence.length
from 	Sequence, Source, Taxon
where 	Sequence.id = Source.sequence_id
	and Source.taxon_id = Taxon.id
	and Taxon.scientific_name like 'Homo %'
	and Source.molecule_isolated in ('DNA', 'ds-DNA')
	and Sequence.length >= 40000
	and Sequence.length < 50000

The sequence length ranges were changed as appropriate. Note that care should be used when submitting queries like this as the amount of information returned can be tremendous and the computing resources consumed are probably not negligible.
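
Since only the length window changes from run to run, the query text can be generated mechanically rather than edited by hand. A small Python sketch (the 10 kb windows below are merely illustrative; only the 40-50 kb window is shown above):

# Generate the GSDB query for a series of sequence-length windows.
TEMPLATE = """select distinct
    substring(Taxon.scientific_name, 1, 15),
    substring(Source.molecule_sequenced, 1, 7),
    Sequence.name, Sequence.length
from Sequence, Source, Taxon
where Sequence.id = Source.sequence_id
    and Source.taxon_id = Taxon.id
    and Taxon.scientific_name like 'Homo %'
    and Source.molecule_isolated in ('DNA', 'ds-DNA')
    and Sequence.length >= {lo}
    and Sequence.length < {hi}
"""

for lo in range(10_000, 100_000, 10_000):   # illustrative windows: 10-20 kb, ..., 90-100 kb
    print(TEMPLATE.format(lo=lo, hi=lo + 10_000))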

Redundancies were eliminated from previous versions of these calculations on an ad hoc basis as I became aware of them, but they are so few and small as not to affect the data significantly over 10 kb. I believe I know all the redundancies greater than 40 kb, but please let me know of any that you are aware of that I may have missed.

There were some differences between Genbank and the Genome Sequence Database (GSDB), but they are not terribly important to these calculations. GSDB, now defunct, was a database originally administered by the DOE; Genbank and GSDB mirrored each other, and GSDB had a useful SQL query engine.

Description of the 6/25/96 Datapoint

The following SQL query was submitted to GSDB to get the necessary data (note that this query became out of date shortly after this datapoint was generated due to changes in GSDB):

SELECT distinct
         EntryName  = Entry.locus_name,
         Length     = str(Sequence.length, 7),
         AccNum     = substring(Entry.accession_number, 1, 6),
         Definition = substring(Entry.description, 1, 80),
         Type       = Entry.molecule_type
    FROM Sequence, EntrySequenceLink ESL, Entry,
         EntryFeatureLink EFLTN, TaxonomyNames TN(2)
   WHERE TN.node_name LIKE "homo sapiens"
     AND TN.feature_id = EFLTN.feature_id
     AND EFLTN.entry_id = Entry.id
     AND ESL.entry_id = Entry.id
     AND ESL.sequence_id = Sequence.id
     AND Entry.is_current = 1
     AND Sequence.length >= 40000
     AND Sequence.length < 50000
     AND Entry.division = "PRI"
     AND Entry.molecule_type <> "mRNA"
     AND Entry.molecule_type <> "RNA"
     AND Entry.molecule_type <> "ss-RNA"
     AND Entry.molecule_type <> "ss-mRNA"
   ORDER BY Length

The sequence length ranges were changed as appropriate. Note that care should be used when submitting queries like this as the amount of information returned can be tremendous and the computing resources consumed are probably not negligible.

Description of the 4/15/96 Datapoint

In short, I used grep to pull out the LOCUS and SOURCE fields from NCBI's Genbank Release 94.0 primate division. Everything under 1000 bp was eliminated, as were all mRNA, RNA, and ssDNA records; in other words, only DNA and circular DNA were retained. I alphabetized the records by source and eliminated all non-human records. This was done by eye, to catch entries such as HUMHBB221, whose source field contains the misspelling "Huamn"; most human records contain either "Human" or "Homo sapiens" in their source field. This procedure still leaves some cDNA records in. I did not feel qualified to discriminate cDNA records from genomic records based on often incomplete or enigmatic SOURCE fields, so I left them in. Many of these are less than 1000 bp, and most are less than 10,000 bp, so they do not really affect my needs for this page.

I then sorted my remaining fields by size and summed the appropriate ranges to get my Large Contigs chart. Redundancies were originally eliminated on an ad hoc basis as I became aware of them, but they are so few and small as not to affect the data significantly over 10 kb. As of 4/96, I believe I knew all the redundancies greater than 40 kb.
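
The "summed the appropriate ranges" step is just binning the sorted lengths. A small sketch (the bin edges and example lengths here are illustrative, not necessarily the ones used for the chart):

# Bin sequence lengths into size ranges and sum the bp in each bin.
def large_contig_table(lengths, edges=(10_000, 20_000, 40_000, 100_000, 1_000_000)):
    """Return (low, high, count, total_bp) for each [low, high) bin; edges are illustrative."""
    rows = []
    for low, high in zip(edges, edges[1:]):
        in_bin = [n for n in lengths if low <= n < high]
        rows.append((low, high, len(in_bin), sum(in_bin)))
    return rows

if __name__ == "__main__":
    example = [12_345, 18_000, 41_000, 73_308, 185_000]   # made-up lengths
    for low, high, count, total in large_contig_table(example):
        print(f"{low:>9}-{high:<9} {count:3d} contigs {total:>10,} bp")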

Also, many thanks to Dennis Benson at NCBI, who was of great help in getting me started with the early datapoints.

Additional Comments

For the purposes of this report, "sequenced" means completed, edited, annotated, and submitted to Genbank. Eric Lynch has graciously pointed out that the basic units of many repeat structures, such as telomeres, centromeres, and ribosomal repeats, have been sequenced. For example, I have counted the length of HSU13369 (the ribosomal DNA repeating unit) only once, even though the repeat occurs many times in the genome. If you feel so inclined, please add a few percent to the totals to account for the telomeres and such. Mitochondrial DNA is counted; in any case, it shouldn't affect the totals much. Note that the complete human mitochondrial chromosome has been sequenced, all 16,569 bp of it (HUMMTCG).

Sequences in the HTG section of Genbank are not included as I consider them (properly) to be unfinished.

Redundancies were eliminated from the original versions (i.e., 1995 and early 1996) of these calculations on an ad hoc basis as I became aware of them, but they were so few and small as not to affect the data significantly over 10 kb. From then until (but not including) 3/15/00, no correction for redundancy was made.

