IMG: Integrated Microbial Genomes

News

The Curated Archaeal Genomes Data Mart of the Integrated Microbial Genomes (IMG) contains 22 archaeal genomes curated at JGI together with the 22 original genomes collected from the public domain.

Microbial genomes have been sequenced by many different sequencing centers, and these centers used different programs and parameters for identifying protein-coding genes (CDSs). To standardize gene models across publicly available genomes, the Genome Biology Program (GBP) at JGI has started to curate these genomes. Curation is carried out with a pipeline developed at JGI, followed by manual revision. The pipeline identifies three types of problems:

  1. Missing genes. Intergenic regions and regions containing unique genes (no hits in databases) are searched against the GenBank nonredundant protein database using blastx to identify potential new coding regions.
  2. Overlapping genes. Genes in prokaryotic genomes do not often overlap by more than a few base pairs. If predicted genes overlap, the start sites may not be correct, or one of the genes may not be real.
  3. 5' end problems. The pipeline identifies genes with 5' ends that are either shorter or longer than homologous genes. A curator determines whether the gene should be extended or truncated.
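
As an illustration of the overlap check in item 2, here is a minimal sketch. It is not the JGI pipeline: the `Gene` record, the function names, and the 4 bp threshold are assumptions made for this example (the text only says "more than a few base pairs"), and strand is ignored for simplicity.

```python
from dataclasses import dataclass

# Hypothetical threshold; the source only says "more than a few base pairs".
MAX_ALLOWED_OVERLAP_BP = 4


@dataclass(frozen=True)
class Gene:
    locus_tag: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlap_bp(a: Gene, b: Gene) -> int:
    """Number of base pairs shared by two genes (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def flag_overlaps(genes, max_overlap=MAX_ALLOWED_OVERLAP_BP):
    """Return (tag, tag, overlap) for gene pairs overlapping by more than max_overlap bp."""
    ordered = sorted(genes, key=lambda g: (g.start, g.end))
    flagged = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start > a.end:
                break  # genes are sorted by start, so no later gene can overlap a
            shared = overlap_bp(a, b)
            if shared > max_overlap:
                flagged.append((a.locus_tag, b.locus_tag, shared))
    return flagged


if __name__ == "__main__":
    # Made-up coordinates, for demonstration only.
    demo = [Gene("g1", 1, 100), Gene("g2", 98, 200), Gene("g3", 150, 400)]
    print(flag_overlaps(demo))
```

Flagged pairs would then go to a curator, who decides whether a start site should be moved or one of the genes is not real.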

In many genomes, no pseudogenes were identified, and truncated genes were treated as real genes. During the manual curation process, pseudogenes are identified and labeled. If a gene corresponds to less than 30% of a COG hit, its IMG ORF type is set to img_pseudo. If a gene has more than one internal stop codon or frameshift, its ORF type is also set to img_pseudo. Pseudogenes identified in the original annotation are labeled pseudo. If a gene has a single frameshift, it is assumed that this might be the result of a sequencing artifact, so the frameshift fragments are merged to produce a full-length translation; such genes have their IMG ORF type set to frameshift. If two or more genes are combined to make a gene with a frameshift or a pseudogene, the original genes are labeled obsolete and the new gene is given a unique locus tag (for example, APE0258n). The Gene Details pages for the obsolete genes link to the new gene, and vice versa.
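
The labeling rules above can be summarized as a small decision function. This is only a sketch under stated assumptions: the function name, its parameters, the order in which the rules are tested, and the default label "cds" are hypothetical. The source does not say how a gene with exactly one internal stop codon is handled, so this sketch leaves such a gene at the default label, and it reads "more than one internal stop codon or frameshift" as "more than one stop codon, or more than one frameshift".

```python
from typing import Optional

COG_COVERAGE_CUTOFF = 0.30  # "less than 30% of a COG hit"


def classify_orf(
    cog_coverage: Optional[float],
    internal_stops: int,
    frameshifts: int,
    author_pseudo: bool = False,
) -> str:
    """Return an IMG ORF type label for one gene.

    cog_coverage is the fraction (0.0 to 1.0) of the COG hit covered by the
    gene, or None when the gene has no COG hit.
    """
    if author_pseudo:
        return "pseudo"  # pseudogene already called in the original annotation
    if cog_coverage is not None and cog_coverage < COG_COVERAGE_CUTOFF:
        return "img_pseudo"
    if internal_stops > 1 or frameshifts > 1:
        return "img_pseudo"
    if frameshifts == 1:
        return "frameshift"  # fragments merged into a full-length translation
    return "cds"  # hypothetical default label, not named in the source


if __name__ == "__main__":
    print(classify_orf(cog_coverage=0.25, internal_stops=0, frameshifts=0))
    print(classify_orf(cog_coverage=0.95, internal_stops=0, frameshifts=1))
```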

Not all CDSs identified by gene-finding programs are real. A CDS that is considered not real is labeled dubious. If a gene without hits to any other proteins overlaps with a real gene and cannot be truncated to remove the overlap, it is labeled dubious. A gene is also labeled dubious if it is shorter than 70 amino acids and has no hits to other proteins.
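
The two criteria for the dubious label can be written as a single predicate. The sketch below is illustrative only; the function and parameter names are assumptions, and in practice the decision involves a curator's judgment rather than a fixed rule.

```python
MIN_LENGTH_AA = 70  # genes shorter than this with no protein hits are dubious


def is_dubious(
    length_aa: int,
    has_protein_hits: bool,
    overlaps_real_gene: bool = False,
    can_truncate: bool = True,
) -> bool:
    """Apply the two 'dubious' criteria described in the text."""
    if has_protein_hits:
        return False  # both criteria require a gene with no hits
    if length_aa < MIN_LENGTH_AA:
        return True
    # A gene with no hits that overlaps a real gene and cannot be shortened.
    return overlaps_real_gene and not can_truncate


if __name__ == "__main__":
    print(is_dubious(length_aa=55, has_protein_hits=False))
    print(is_dubious(length_aa=300, has_protein_hits=False,
                     overlaps_real_gene=True, can_truncate=False))
```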

| Genome | Accession Number | Authors’ CDSs | Authors’ pseudo | IMG CDSs | IMG pseudo | IMG frameshift | IMG dubious | CDSs added | CDSs modified | Modified % of authors' |
|---|---|---|---|---|---|---|---|---|---|---|
| Aeropyrum pernix K1 | BA000002 | 2694 | 0 | 1652 | 24 | 10 | 1087 | 77 | 1586 | 58.9 |
| Sulfolobus acidocaldarius DSM639 | CP000077 | 2223 | 36 | 2211 | 59 | 0 | 15 | 34 | 160 | 7.1 |
| Sulfolobus solfataricus P2 | AE006641 | 2995 | 0 | 2887 | 154 | 25 | 7 | 121 | 410 | 13.4 |
| Sulfolobus tokodaii 7 | BA000023 | 2826 | 0 | 2692 | 98 | 15 | 125 | 114 | 461 | 16.3 |
| Pyrobaculum aerophilum IM2 | AE009441 | 2605 | 0 | 2473 | 91 | 11 | 11 | 10 | 188 | 7.2 |
| Archaeoglobus fulgidus DSM 4304 | AE000782 | 2407 | 30 | 2348 | 64 | 45 | 41 | 32 | 285 | 11.7 |
| Haloarcula marismortui ATCC 43049 chromosome I | AY596297 | 3131 | 0 | 3163 | 24 | 3 | 26 | 90 | 362 | 11.5 |
| Haloarcula marismortui ATCC 43049 chromosome II | AY596298 | 281 | 0 | 273 | 10 | 0 | 1 | 7 | 34 | 11.9 |
| Halobacterium salinarum NRC-1 chromosome | AE004437 | 2058 | 0 | 2075 | 30 | 0 | 22 | 70 | 326 | 15.3 |
| Methanothermobacter thermoautotrophicus Delta H | AE000666 | 1869 | 0 | 1787 | 25 | 42 | 34 | 4 | 183 | 9.8 |
| Methanocaldococcus jannaschii DSM 2661 | NC000909 | 1729 | 1 | 1702 | 29 | 14 | 1 | 17 | 118 | 6.8 |
| Methanococcus maripaludis S2 | BX950229 | 1722 | 0 | 1705 | 8 | 5 | 14 | 11 | 49 | 3.4 |
| Methanosarcina acetivorans C2A | AE010299 | 4540 | 112 | 4380 | 278 | 44 | 63 | 174 | 744 | 16 |
| Methanosarcina mazei Go1 | AE008384 | 3371 | 0 | 3269 | 94 | 12 | 35 | 72 | 330 | 9.8 |
| Methanopyrus kandleri AV19 | AE009439 | 1687 | 4 | 1710 | 13 | 16 | 0 | 53 | 102 | 6 |
| Pyrococcus abyssi GE5 | NC000868 | 1896 | 0 | 1855 | 35 | 10 | 12 | 23 | 137 | 7.2 |
| Pyrococcus furiosus DSM 3638 | AE009950 | 2065 | 0 | 1971 | 86 | 32 | 12 | 44 | 263 | 12.7 |
| Pyrococcus horikoshii OT3 | BA000001 | 2061 | 0 | 1803 | 38 | 9 | 355 | 140 | 609 | 29.5 |
| Thermococcus kodakaraensis KOD1 | NC006624 | 2306 | 0 | 2270 | 26 | 8 | 1 | 2 | 58 | 2.5 |
| Picrophilus torridus DMS9790 | NC005877 | 1535 | 0 | 1530 | 21 | 8 | 0 | 24 | 80 | 5.2 |
| Thermoplasma acidophilum DSM 1728 | AL139299 | 1478 | 0 | 1482 | 24 | 16 | 21 | 63 | 214 | 14.5 |
| Thermoplasma volcanium GSS1 | BA000011 | 1526 | 0 | 1483 | 64 | 13 | 14 | 53 | 185 | 12.1 |
| Nanoarchaeum equitans Kin4-M | AE017199 | 534 | 2 | 540 | 16 | 0 | 0 | 20 | 54 | 10.7 |

Version 1.5, June 1, 2006

This is the sixth release of the Integrated Microbial Genomes (IMG) genomic data analysis system, IMG 1.5. IMG 1.5 contains a total of 741 genomes consisting of 435 bacterial, 32 archaeal, 15 eukaryotic genomes and 259 bacterial phages. Among these genomes, 602 are finished and 139 are draft genomes.

IMG 1.5 contains 162 microbial genomes sequenced at JGI. The JGI and its collaborators have recently released the sequences of 38 genomes (17 finished genomes, which replace previously released drafts, and 21 new drafts), bringing the total to 62 finished and 100 draft genomes sequenced by JGI. See IMG Data Evolution History for details. JGI genomes are also available through individual microbial portals at Microbial Genomics.

The finished genomes in IMG include 234 (5 new) bacterial and 1 (1 new) archaeal genomes from EBI Genome Reviews (version 48.0, April 17, 2006), 9 eukaryotic genomes from EMBL (as of January 17, 2005), 2 eukaryotic genomes from RefSeq (as of March 21, 2005), and 4 eukaryotic genomes from GenBank (as of July 27, 2005). Compared to IMG 1.4, IMG 1.5 contains 21 (15 RefSeq, 6 EBIGR) new public microbial genomes.

IMG 1.5 has been extended with a number of data analysis features, especially in the area of functional and comparative genome analysis.

IMG continues to be updated on a quarterly basis with new public and JGI genomes. The next update is scheduled for September 1st, 2006.
