IMG: Integrated Microbial Genomes

Identifying Fusions

A fused gene (fusion) is defined as a gene that is formed from the composition (fusion) of two or more previously separate genes (component genes). The identification of such genes is based on computing BLAST similarities between genes.

We have considered only genes from finished genomes as putative components, in order to avoid false predictions caused by fragmented genes in draft genomes. Furthermore, genes that frequently appear fragmented in finished genomes, such as transposases and integrases, as well as pseudogenes, are excluded from fusion calculations.

Fusions are identified as follows:

  1. Starting from a candidate fused gene, x, in a given genome, G, check the similarities to all other genes in other genomes, Gi:

    For each genome Gi, candidate component genes are the Gi genes whose alignment to gene x spans more than 80% of their own length. Candidate component genes are kept only if they overlap one another by less than 10% of the length of the shortest candidate component gene. Additionally, candidate components should not be paralogs.

  2. For each candidate fused gene, x, in a given genome, G, that has candidate component genes in genomes, Gi, x is accepted as a valid fusion only if
    1. The candidate components found in each genome Gi cover more than 80% of x (see Figures 1.1 and 1.3).
    2. The same combination of component genes is found in at least two genomes (see Figures 1.1 and 1.2), and in at least one of these genomes the components are not in tandem, i.e. in at least one genome one or more genes are found between the components (see Figure 2). This non-tandem requirement eliminates cases of consistent frameshifts in a group of genomes.
  3. For each accepted fused gene, x, component genes from multiple genomes that hit the same region of x are grouped together. In each group, the gene with the maximum bitscore/alignment length value is used as the representative component (exemplar) for the group, as illustrated in Figure 3.
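
The steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the IMG implementation. The Hit record, its field names, and all function names are hypothetical. Overlapping candidates are resolved greedily in favor of the higher bitscore, and hits are grouped into regions of x by alignment midpoint; neither choice is specified in the text. The paralog test is not modeled, and the "same combination of component genes" test is simplified: every genome whose components pass the coverage test is treated as carrying the same combination.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:  # hypothetical record: one BLAST alignment of a component gene to x
    gene: str        # component gene identifier
    genome: str      # genome Gi that contains the gene
    gene_len: int    # length of the component gene
    aln_len: int     # aligned length on the component gene
    x_start: int     # alignment start on x (1-based, inclusive)
    x_end: int       # alignment end on x (inclusive)
    bitscore: float
    order: int       # index of the gene along its genome

def overlap(a, b):
    """Number of positions of x covered by both alignments."""
    return max(0, min(a.x_end, b.x_end) - max(a.x_start, b.x_start) + 1)

def candidate_components(hits, min_frac=0.8, max_overlap=0.1):
    """Step 1 for one genome: length filter, then overlap filter."""
    long_enough = [h for h in hits if h.aln_len > min_frac * h.gene_len]
    limit = max_overlap * min((h.gene_len for h in long_enough), default=0)
    kept = []
    # Assumption: when two candidates overlap too much, keep the higher bitscore.
    for h in sorted(long_enough, key=lambda h: h.bitscore, reverse=True):
        if all(overlap(h, k) < limit for k in kept):
            kept.append(h)
    return sorted(kept, key=lambda h: h.x_start)

def covered_fraction(components, x_len):
    """Fraction of x covered by the union of the component alignments."""
    covered, reach = 0, 0
    for h in sorted(components, key=lambda h: h.x_start):
        start = max(h.x_start, reach + 1)
        if h.x_end >= start:
            covered += h.x_end - start + 1
            reach = h.x_end
    return covered / x_len

def not_in_tandem(components):
    """True if at least one other gene lies between two components."""
    orders = sorted(h.order for h in components)
    return any(b - a > 1 for a, b in zip(orders, orders[1:]))

def is_valid_fusion(hits_by_genome, x_len, min_cov=0.8):
    """Step 2 (simplified): coverage, two genomes, one non-tandem genome."""
    supporting = []
    for hits in hits_by_genome.values():
        comps = candidate_components(hits)
        if len(comps) >= 2 and covered_fraction(comps, x_len) > min_cov:
            supporting.append(comps)
    return len(supporting) >= 2 and any(not_in_tandem(c) for c in supporting)

def exemplars(components):
    """Step 3: per region of x, pick the hit with maximum bitscore/aln_len."""
    groups = []
    for h in sorted(components, key=lambda h: (h.x_start, h.x_end)):
        mid = (h.x_start + h.x_end) / 2
        # Assumption: a hit joins the current region if its midpoint falls in it.
        if groups and mid <= groups[-1]["end"]:
            groups[-1]["hits"].append(h)
            groups[-1]["end"] = max(groups[-1]["end"], h.x_end)
        else:
            groups.append({"end": h.x_end, "hits": [h]})
    return [max(g["hits"], key=lambda h: h.bitscore / h.aln_len) for g in groups]

# Made-up example: x is 200 long; genome A has tandem components, genome B does not.
hits_by_genome = {
    "A": [Hit("a1", "A", 100, 95, 1, 100, 180.0, 10),
          Hit("a2", "A", 110, 100, 101, 200, 190.0, 11)],
    "B": [Hit("b1", "B", 100, 98, 1, 98, 200.0, 5),
          Hit("b2", "B", 105, 99, 102, 200, 150.0, 40)],
}
all_hits = [h for hits in hits_by_genome.values() for h in hits]
print(is_valid_fusion(hits_by_genome, x_len=200))   # True
print([h.gene for h in exemplars(all_hits)])        # ['b1', 'a2']
```

In this made-up example, x is accepted because two genomes supply components covering more than 80% of x and the components in genome B are separated by other genes. With genome A alone, or with only tandem arrangements, the same call would return False.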

Figure 1. Fused genes and their components.


Figure 2. In at least one genome, component genes should not appear in tandem.


Figure 3. Component and exemplar genes for a fused gene.