IMG: Integrated Microbial Genomes

Data Analysis

Microbial genome data analysis is set in the comparative context of multiple microbial genomes. Comparative analysis is essential for the identification of similar or unique genes among different, potentially phylogenetically related, genomes, which provides the foundation for microbial genome functional characterization [1, 3].

IMG allows navigating the microbial genome data space along several dimensions: genomes, functional annotations (pathways and terms), and genes. Microbial genome data analysis in IMG usually starts with the definition of an analysis context in terms of genomes, functional annotations, and/or genes, followed by the individual or comparative analysis of genomes, functional annotations, or genes.

Genomes

Genome selections help focus the analysis on a subset of genomes of interest, especially in terms of their phylogenetic relationships. For example, a set of interest may include the genomes for all the strains within a specified genus.

Genomes can be selected and examined individually. For an individual genome, details of interest include scaffolds and contigs; gene-specific summaries, such as the total number of genes, genes with or without function prediction, and genes assigned to enzymes; a summary of the pathways associated with the genome; and a summary of the functional terms (COG, Pfam, InterPro) associated with the genome. Gene, pathway, and term summaries can be further expanded into lists of genes or groups of genes per pathway and term, respectively. These lists can then be used to further examine individual genes.

Genomes can be compared in terms of parameters such as GC content, number of genes, or specific COG or KEGG categories. Conservation for selected genomes can be explored using the VISTA comparative genome analysis tools. VISTA is based on a glocal (global/local) alignment technique, Shuffle-LAGAN [2], that detects rearrangements and inversions in sequences while producing a global end-to-end map. VISTA Browser provides a curve-based visualization for examining conservation.

Functions

Functional roles for genes in IMG are characterized using terms and/or in the context of pathways. In the current version of IMG, pathways consist of KEGG maps, while terms include COG categories and functions, Pfam domains, and InterPro protein families.

Pathways and terms can be selected and examined individually. For an individual pathway, details of interest include the list of enzymes and the pathway map. Each enzyme is associated with a list of genomes that have genes associated with that enzyme. Individual genomes can be further examined as discussed above; the list of genes associated with a specific enzyme in the context of a genome can also be further explored.

The enzyme profile for an enzyme of interest, e, shows the pattern of this enzyme across selected (by default, all) genomes O1 to On as a vector (k1,...,kn), where ki is the number of genes of Oi that are associated with e.

The enzyme profile for a set of enzymes of interest, e1,...,em, across genomes O1 to On consists of the individual enzyme profiles across O1 to On for e1,...,em. For example, consider the profile for the enzymes in a pathway P in the context of genomes O1, O2,...,O8. The profile for P is a matrix such as the one shown below, in which each row is the enzyme profile vector of one enzyme in P.

P     O1   O2   O3   O4   O5   O6   O7   O8
e1    2    3    0    1    2    3    2    0
...   ...  ...  ...  ...  ...  ...  ...  ...
em    0    4    0    2    1    3    1    0

An enzyme profile provides an estimate of similarity between genomes in terms of association with specific enzymes or pathways.
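The construction of such a profile can be sketched as a simple count of genes per (enzyme, genome) pair. This is only an illustration: the function name, the input layout, and the sample assignments are hypothetical and are not part of IMG.

```python
from collections import Counter


def enzyme_profile(gene_annotations, enzymes, genomes):
    """Return {enzyme: [k1, ..., kn]}, where ki is the number of genes of
    the i-th genome that are associated with the enzyme.

    gene_annotations is an iterable of (gene, genome, enzyme) tuples; each
    gene/enzyme association is assumed to be listed exactly once.
    """
    counts = Counter((enzyme, genome) for _gene, genome, enzyme in gene_annotations)
    return {e: [counts[(e, g)] for g in genomes] for e in enzymes}


# Hypothetical (gene, genome, enzyme) assignments, for illustration only.
annotations = [
    ("gA", "O1", "e1"), ("gB", "O1", "e1"),
    ("gC", "O2", "e1"), ("gD", "O2", "e2"),
    ("gE", "O3", "e2"),
]

profile = enzyme_profile(annotations, ["e1", "e2"], ["O1", "O2", "O3"])
for enzyme, row in profile.items():
    print(enzyme, row)
```

Each row of the result corresponds to one enzyme profile vector, and stacking the rows gives the matrix for a pathway.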

Each term is associated with a list of genomes that have genes associated with the term; these genomes can be further examined as discussed above. For COG-based terms, individual COGs are organized in COG categories.

The term profile for a term of interest, t, shows the pattern of this term across selected (by default, all) genomes O1 to On as a vector (k1,...,kn), where ki is the number of genes of Oi that are associated with t.

The term profile for a set of terms of interest, t1,...,tm, across genomes O1 to On consists of the term profiles across O1 to On for t1,..., tm. An example of such a profile across genomes O1, O2,...,O8, is shown below.

T     O1   O2   O3   O4   O5   O6   O7   O8
t1    7    9    3    11   0    13   2    0
...   ...  ...  ...  ...  ...  ...  ...  ...
tm    0    4    21   12   0    13   15   0

A term profile provides an estimate of similarity between genomes in terms of association with a specific functional characterization.
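One way to turn a term profile into a numeric similarity estimate between two genomes is to compare their columns of the matrix. The sketch below uses cosine similarity, which is an assumption made for illustration; the text does not say which measure, if any, IMG applies. Only the two rows shown in the example profile are used.

```python
import math


def column(profile, j):
    """Return the counts for genome j (0-based) across all term rows."""
    return [row[j] for row in profile]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# The two rows shown in the example term profile (genomes O1..O8).
term_profile = [
    [7, 9, 3, 11, 0, 13, 2, 0],    # first term row (t1)
    [0, 4, 21, 12, 0, 13, 15, 0],  # last term row
]

sim_O4_O6 = cosine_similarity(column(term_profile, 3), column(term_profile, 5))
sim_O1_O3 = cosine_similarity(column(term_profile, 0), column(term_profile, 2))
print(f"O4 vs O6: {sim_O4_O6:.3f}")
print(f"O1 vs O3: {sim_O1_O3:.3f}")
```

With these two rows, O4 and O6 have nearly proportional counts and score close to 1, while O1 and O3 score much lower. Genomes with no associated genes for any selected term (O5 and O8 here) yield 0.0 by convention in this sketch.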

Genes

Selecting Genes

Gene selections help focus the analysis on genes with certain properties, such as genes sharing a common gene symbol, function (e.g., COG, GO), or pathway (KEGG). Gene selection can be carried out through keyword or similarity searches or using the phylogenetic profiler.

The phylogenetic profiler allows selecting predicted genes of a genome of interest, say O, that share a common property in the context of other phylogenetically related genomes, such as O1, O2, ..., On. This operation is based on pre-computed (uni-directional) similarities between the genes of O and the genes of O1, O2, ..., On, respectively.

For example, consider genome O in the context of genomes O1, O2, ..., O8. For specific similarity parameters (i.e., cutoff value and percent identity), the gene similarity matrix for the genes g of O contains either "p" or "a" for every pair (g, Oi), where "p" indicates the presence and "a" the absence of a gene in Oi that is similar to gene g of O under the selected parameters, as shown below.

O     O1   O2   O3   O4   O5   O6   O7   O8
g1    p    p    p    p    a    a    p    a
g2    p    p    p    p    p    a    p    p
g3    a    p    p    a    a    a    a    a
g4    p    p    a    p    p    p    a    p

The phylogenetic profiler gene selection can be used for analyzing biological phenomena of interest in terms of a specific genome (e.g., O) in the context of other genomes (e.g., O1, ..., O8), such as finding genes that are preserved or gained in O with respect to O1, ..., O8. For the example shown above, gene g1 of O seems to be preserved across O1 to O4 and in O7, while gene g3 of O seems to be gained with respect to O1 and O4 to O8.
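The selection described above amounts to filtering the rows of the similarity matrix by required "p" and "a" entries. The following is a minimal sketch over the example matrix; the function name and its parameters are hypothetical and do not reflect how IMG implements the profiler.

```python
GENOMES = ["O1", "O2", "O3", "O4", "O5", "O6", "O7", "O8"]

# Example gene similarity matrix: one "p"/"a" character per genome O1..O8.
MATRIX = {
    "g1": "ppppaapa",
    "g2": "pppppapp",
    "g3": "appaaaaa",
    "g4": "ppapppap",
}


def select_genes(matrix, genomes, present_in=(), absent_in=()):
    """Return the genes of O that have a similar gene in every genome listed
    in present_in and no similar gene in any genome listed in absent_in."""
    index = {name: i for i, name in enumerate(genomes)}
    selected = []
    for gene, row in matrix.items():
        has_all = all(row[index[g]] == "p" for g in present_in)
        has_none = all(row[index[g]] == "a" for g in absent_in)
        if has_all and has_none:
            selected.append(gene)
    return selected


# Genes of O with similar genes in O1 to O4 and in O7.
preserved = select_genes(MATRIX, GENOMES, present_in=["O1", "O2", "O3", "O4", "O7"])

# Genes of O without a similar gene in O1 and in O4 to O8.
gained = select_genes(MATRIX, GENOMES, absent_in=["O1", "O4", "O5", "O6", "O7", "O8"])

print(preserved)
print(gained)
```

Note that the first query also returns g2, since g2 is present in every genome that the query requires; excluding it would need additional absent_in constraints.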

Exploring Genes

Individual genes can be further analyzed by examining additional gene information, such as details about annotations and associated pathways. These details include evidence for the functional characterization (prediction), such as COG, InterPro and Pfam characterization, associated KEGG map, orthologous or homologous genes in other genomes (organized as a list or projected over the phylogenetic tree of genomes), and gene neighborhood.

A gene neighborhood displays the gene of interest in its location on the chromosome, together with other genes located in the same area of the chromosome. Visual exploration of a gene in the context of other genes helps determine the accuracy of its functional annotation and its participation in positional clusters of genes that may represent an operon. One can also examine a gene of interest in the context of related (for example, orthologous) genes in other related genomes, across multiple neighborhoods. This allows examining concordance with biological evolutionary phenomena. For example, one can determine whether pairs of genes with related functions are collocated on chromosomes in multiple genomes or seem to be part of an operon.

The phylogenetic occurrence profile for a gene of interest, g, shows the pattern of this gene across selected (by default, all) genomes O1 to On as a vector (o1, ..., on), where oi is "p" if Oi has a gene similar to g and "a" if it does not. The occurrence profile is based on gene similarities computed using best bidirectional hits (BBH).

Multiple genes selected from the gene cart can be examined using ClustalW alignment or using multiple gene neighborhoods.

The phylogenetic occurrence profile for multiple genes shows the pattern of these genes across selected (by default, all) genomes O1 to On by listing the individual occurrence profile vectors (o1, ..., on), as described above.

Occurrence profiles are usually examined for multiple genes associated with the same genome. For a gene of interest, g, of a given genome, O, a phylogenetic occurrence profile similarity search finds other genes of O whose occurrence profiles are similar to that of g. Genes with similar occurrence profiles have a similar evolutionary history and may be functionally linked or co-regulated in a pathway [1, 3].

Occurrence profile similarity search involves a threshold called percent occurrence match defined as follows:

Consider the example above, with g1 selected as the gene of interest and the percent occurrence match threshold set at 80%. Then

and therefore only g2 qualifies as a gene that has occurrence profiles similar to g1 across genomes O1 to O8.
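The exact definition of percent occurrence match is not reproduced in the text, so the sketch below rests on an assumption: the match is taken as the percentage of genomes containing the gene of interest that also contain the candidate gene. This reading is chosen only because it reproduces the stated outcome for the example matrix (a plain position-by-position comparison would give g2 only 6 matches out of 8, or 75%, which is below the 80% threshold). The function names are hypothetical.

```python
# Example matrix from the text: one "p"/"a" character per genome O1..O8.
PROFILES = {
    "g1": "ppppaapa",
    "g2": "pppppapp",
    "g3": "appaaaaa",
    "g4": "ppapppap",
}


def percent_occurrence_match(reference, candidate):
    """ASSUMED definition: share (in percent) of the genomes where the
    reference gene is present in which the candidate gene is present too."""
    shared = sum(1 for r, c in zip(reference, candidate) if r == "p" and c == "p")
    present = reference.count("p")
    return 100.0 * shared / present if present else 0.0


def similar_genes(profiles, gene, threshold):
    """Return the other genes whose match with `gene` reaches the threshold."""
    reference = profiles[gene]
    return [
        other
        for other, profile in profiles.items()
        if other != gene and percent_occurrence_match(reference, profile) >= threshold
    ]


for other in ("g2", "g3", "g4"):
    print(other, percent_occurrence_match(PROFILES["g1"], PROFILES[other]))

hits = similar_genes(PROFILES, "g1", 80)
print(hits)
```

Under this assumed definition, g2, g3, and g4 score 100%, 40%, and 60% against g1, so only g2 passes the 80% threshold.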

A P-value, calculated using combinatorial math, indicates the probability that the match occurred by chance [4].
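The text does not spell out the combinatorial calculation. One common choice for comparing two presence/absence profiles is the upper tail of the hypergeometric distribution, and the sketch below assumes that choice; it should not be read as the exact formula used by IMG. The function name and parameter names are hypothetical.

```python
from math import comb


def cooccurrence_p_value(n_genomes, n_x, n_y, n_both):
    """Probability of seeing at least n_both co-occurrences by chance.

    Assumed model (hypergeometric): gene x is present in n_x of n_genomes
    genomes, and the n_y genomes containing gene y are drawn at random.
    """
    total = comb(n_genomes, n_y)
    upper = min(n_x, n_y)  # co-occurrences cannot exceed either count
    tail = sum(
        comb(n_x, k) * comb(n_genomes - n_x, n_y - k)
        for k in range(n_both, upper + 1)
    )
    return tail / total


# Example matrix: g1 is present in 5 of 8 genomes, g2 in 7, both in 5.
p_value = cooccurrence_p_value(n_genomes=8, n_x=5, n_y=7, n_both=5)
print(p_value)
```

With only eight genomes, even a complete overlap such as that of g1 and g2 is not statistically striking under this model, which illustrates why such P-values become informative mainly when many genomes are selected.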

References

1. Bowers, P.M., Pellegrini, M., Thompson, M.J., Fierro, J., Yeates, T.O., and Eisenberg, D. 2004. Prolinks: A Database of Protein Functional Linkages Derived from Coevolution. Genome Biology 5.

2. Brudno et al. 2003. Glocal alignment: finding rearrangements during alignment. Bioinformatics, Suppl 1: i54.

3. Osterman, A., and Overbeek, R. 2003. Missing Genes in Metabolic Pathways: A Comparative Genomic Approach. Current Opinion in Chemical Biology 7: 238-251.

4. Wu, J., Kasif, S., and DeLisi, C. 2003. Identification of functional links between genes using phylogenetic profiles. Bioinformatics 19: 1525-1526.