News
The Curated Archaeal Genomes Data Mart of the Integrated Microbial Genomes (IMG) system contains 22 archaeal genomes curated at JGI, together with the 22 original genomes collected from the public domain.
Microbial genomes have been sequenced by many different sequencing centers, and these centers have used different programs and parameters to identify protein-coding genes (CDSs). To standardize gene models across publicly available genomes, the Genome Biology Program (GBP) at JGI has started to curate these genomes. Curation is carried out with a pipeline developed at JGI, followed by manual revision. The pipeline identifies three types of problems:
- Missing genes. Intergenic regions and regions containing unique genes (no hits in databases) are searched against the GenBank nonredundant protein database using blastx to identify potential new coding regions.
- Overlapping genes. Genes in prokaryotic genomes rarely overlap by more than a few base pairs. If predicted genes overlap, the start sites may be incorrect, or one of the genes may not be real.
- 5' end problems. The pipeline identifies genes whose 5' ends are shorter or longer than those of homologous genes. A curator determines whether the gene should be extended or truncated.
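The overlap check in the list above can be illustrated with a minimal sketch. It is not the JGI pipeline: the `Gene` record, the `find_overlaps` function, and the 10 bp threshold are hypothetical choices made for illustration, because the text only says "more than a few base pairs".

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    locus_tag: str
    start: int
    end: int


def find_overlaps(genes: List[Gene], max_overlap_bp: int = 10) -> List[Tuple[str, str, int]]:
    """Return (tag_a, tag_b, overlap_bp) for pairs overlapping by more than max_overlap_bp.

    The threshold is an assumed value, not one taken from the curation pipeline.
    """
    ordered = sorted(genes, key=lambda g: (g.start, g.end))
    flagged = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start > a.end:
                break  # genes are sorted by start, so no later gene can overlap a
            overlap = min(a.end, b.end) - b.start + 1
            if overlap > max_overlap_bp:
                flagged.append((a.locus_tag, b.locus_tag, overlap))
    return flagged


# Made-up coordinates and locus tags, for illustration only.
example = [Gene("g1", 1, 300), Gene("g2", 296, 600), Gene("g3", 500, 900)]
flagged_pairs = find_overlaps(example)
```

Pairs flagged this way would still go to a curator, who decides whether a start site is wrong or one of the genes is not real.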
In many genomes, no pseudogenes were identified, and truncated genes were treated as real genes. During manual curation, pseudogenes are identified and labeled. If a gene covers less than 30% of a COG hit, its IMG ORF type is set to img_pseudo. Likewise, if a gene has more than one internal stop codon or frameshift, its ORF type is set to img_pseudo. Pseudogenes identified in the original annotation are labeled pseudo. If a gene has a single frameshift, it is assumed that this may be a sequencing artifact, so the frameshift fragments are merged to produce a full-length translation. Such genes have their IMG ORF type set to frameshift. If two or more genes are combined into a gene with a frameshift or into a pseudogene, the original genes are labeled obsolete and the new gene is given a unique locus tag (for example, APE0258n). The Gene Details pages for the obsolete genes link to the new gene and vice versa.
Not all CDSs identified by gene-finding programs are real. A CDS that is considered not real is labeled dubious. If a gene without hits to any other proteins overlaps a real gene and cannot be truncated to remove the overlap, it is labeled dubious. A gene is also labeled dubious if it is shorter than 70 amino acids and has no hits to other proteins.
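The labeling rules in this paragraph can be summarized as a small decision function. This is a sketch under stated assumptions, not the curators' actual procedure: the function name `classify_orf`, its parameters, the order in which the rules are checked, and the fallback label `"CDS"` are all hypothetical. The rule "more than one internal stop codon or frameshift" is read here as "more than one internal stop codon, or more than one frameshift".

```python
from typing import Optional


def classify_orf(author_pseudo: bool,
                 cog_coverage: Optional[float],
                 internal_stops: int,
                 frameshifts: int) -> str:
    """Return an IMG ORF type for one gene (illustrative reading of the rules).

    cog_coverage is the fraction of the COG hit covered by the gene,
    or None when the gene has no COG hit.
    """
    if author_pseudo:
        return "pseudo"          # pseudogene already flagged by the original authors
    if cog_coverage is not None and cog_coverage < 0.30:
        return "img_pseudo"      # covers less than 30% of its COG hit
    if internal_stops > 1 or frameshifts > 1:
        return "img_pseudo"      # too many disruptions to be a sequencing artifact
    if frameshifts == 1:
        return "frameshift"      # single frameshift: fragments are merged
    return "CDS"                 # assumed label for an ordinary gene


labels = [
    classify_orf(True, 0.90, 0, 0),
    classify_orf(False, 0.25, 0, 0),
    classify_orf(False, 0.95, 0, 2),
    classify_orf(False, 0.95, 0, 1),
    classify_orf(False, None, 0, 0),
]
```

A gene with exactly one internal stop codon and no frameshift is not covered by the text; the sketch leaves it as an ordinary CDS, which is an assumption.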
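The two conditions for the dubious label can be sketched as a predicate. The function `is_dubious` and its parameter names are hypothetical, and in practice a curator decides whether an overlap can be removed by truncation; here that decision is simply passed in as a flag.

```python
def is_dubious(length_aa: int,
               has_protein_hits: bool,
               overlaps_real_gene: bool = False,
               can_truncate: bool = True) -> bool:
    """Illustrative check for the 'dubious' label described above."""
    if has_protein_hits:
        return False  # both rules apply only to genes with no hits to other proteins
    if length_aa < 70:
        return True   # shorter than 70 amino acids and no hits
    # No hits, overlaps a real gene, and the overlap cannot be removed by truncation.
    return overlaps_real_gene and not can_truncate


short_orphan = is_dubious(length_aa=55, has_protein_hits=False)
unfixable_overlap = is_dubious(length_aa=120, has_protein_hits=False,
                               overlaps_real_gene=True, can_truncate=False)
```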
| Genome | Accession Number | Authors’ CDSs | Authors’ pseudo | IMG CDSs | IMG pseudo | IMG frameshift | IMG dubious | CDSs added | CDSs modified | Modified % of authors’ |
|---|---|---|---|---|---|---|---|---|---|---|
| Aeropyrum pernix K1 | BA000002 | 2694 | 0 | 1652 | 24 | 10 | 1087 | 77 | 1586 | 58.9 |
| Sulfolobus acidocaldarius DSM639 | CP000077 | 2223 | 36 | 2211 | 59 | 0 | 15 | 34 | 160 | 7.1 |
| Sulfolobus solfataricus P2 | AE006641 | 2995 | 0 | 2887 | 154 | 25 | 7 | 121 | 410 | 13.4 |
| Sulfolobus tokodaii 7 | BA000023 | 2826 | 0 | 2692 | 98 | 15 | 125 | 114 | 461 | 16.3 |
| Pyrobaculum aerophilum IM2 | AE009441 | 2605 | 0 | 2473 | 91 | 11 | 11 | 10 | 188 | 7.2 |
| Archaeoglobus fulgidus DSM 4304 | AE000782 | 2407 | 30 | 2348 | 64 | 45 | 41 | 32 | 285 | 11.7 |
| Haloarcula marismortui ATCC 43049 chromosome I | AY596297 | 3131 | 0 | 3163 | 24 | 3 | 26 | 90 | 362 | 11.5 |
| Haloarcula marismortui ATCC 43049 chromosome II | AY596298 | 281 | 0 | 273 | 10 | 0 | 1 | 7 | 34 | 11.9 |
| Halobacterium salinarum NRC-1 chromosome | AE004437 | 2058 | 0 | 2075 | 30 | 0 | 22 | 70 | 326 | 15.3 |
| Methanothermobacter thermoautotrophicus Delta H | AE000666 | 1869 | 0 | 1787 | 25 | 42 | 34 | 4 | 183 | 9.8 |
| Methanocaldococcus jannaschii DSM 2661 | NC000909 | 1729 | 1 | 1702 | 29 | 14 | 1 | 17 | 118 | 6.8 |
| Methanococcus maripaludis S2 | BX950229 | 1722 | 0 | 1705 | 8 | 5 | 14 | 11 | 49 | 3.4 |
| Methanosarcina acetivorans C2A | AE010299 | 4540 | 112 | 4380 | 278 | 44 | 63 | 174 | 744 | 16 |
| Methanosarcina mazei Go1 | AE008384 | 3371 | 0 | 3269 | 94 | 12 | 35 | 72 | 330 | 9.8 |
| Methanopyrus kandleri AV19 | AE009439 | 1687 | 4 | 1710 | 13 | 16 | 0 | 53 | 102 | 6 |
| Pyrococcus abyssi GE5 | NC000868 | 1896 | 0 | 1855 | 35 | 10 | 12 | 23 | 137 | 7.2 |
| Pyrococcus furiosus DSM 3638 | AE009950 | 2065 | 0 | 1971 | 86 | 32 | 12 | 44 | 263 | 12.7 |
| Pyrococcus horikoshii OT3 | BA000001 | 2061 | 0 | 1803 | 38 | 9 | 355 | 140 | 609 | 29.5 |
| Thermococcus kodakaraensis KOD1 | NC006624 | 2306 | 0 | 2270 | 26 | 8 | 1 | 2 | 58 | 2.5 |
| Picrophilus torridus DMS9790 | NC005877 | 1535 | 0 | 1530 | 21 | 8 | 0 | 24 | 80 | 5.2 |
| Thermoplasma acidophilum DSM 1728 | AL139299 | 1478 | 0 | 1482 | 24 | 16 | 21 | 63 | 214 | 14.5 |
| Thermoplasma volcanium GSS1 | BA000011 | 1526 | 0 | 1483 | 64 | 13 | 14 | 53 | 185 | 12.1 |
| Nanoarchaeum equitans Kin4-M | AE017199 | 534 | 2 | 540 | 16 | 0 | 0 | 20 | 54 | 10.7 |
