Angela at Rice: About Orthologous Groups

InParanoid focuses on pairwise ortholog relationships. OrthoDB recognizes that orthology is relative to a given speciation point, and therefore provides a hierarchy of orthologs along the species tree. Other databases that provide eukaryotic orthologs include OrthoMaM for mammals, and OrthologID and GreenPhylDB for plants. OrthoMCL covers bacteria, but its data are old and incomplete.

Tree-based phylogenetic approaches aim to distinguish speciation from gene duplication events by comparing gene trees with species trees, as implemented in resources such as TreeFam and LOFT. A third category of hybrid approaches uses both heuristic and phylogenetic methods to construct clusters and determine trees, for example Ortholuge, EnsemblCompara GeneTrees and HomoloGene.

Orthology and paralogy, as originally defined by Fitch, are both evolutionary concepts. That is, orthologous genes are homologous sequences that started to diverge through a speciation event (likewise, paralogs started to diverge through a duplication event). Consequently, the better you can approximate the evolution of such sequences, the better your orthology predictions will be.
In this respect, phylogenetic reconstruction is expected to provide you with the best evolutionary view. Therefore, by analyzing the phylogenetic trees (i.e. using tree reconciliation algorithms) it is possible to derive a collection of fine-grained predictions of all orthology relationships among sequences.
However, reconstructing gene phylogenies using the most modern and accurate methods is computationally very intensive (and the results are not free of artifacts). As a consequence, this approach cannot be used at large scale unless you have enough computational power. Generally speaking, if precomputed predictions for your species of interest are available in any phylogeny-based database, they are worth trying first. Otherwise, you can move to alternative methods based on pairwise sequence comparisons. These methods are faster and can usually cope with larger amounts of data.
There is also a third, independent alternative that consists of inferring the evolution of genes (and therefore their relationships) from genomic features other than their coding sequence. For instance, the YGOB database can be used to obtain orthology and paralogy predictions based on the conservation of gene order among several species. This approach is usually considered very reliable, and it is sometimes used as a gold standard for benchmarks.
Phylogeny-based analysis will be the better choice if (among other reasons):

you are trying to predict orthology for a very intricate gene family, including many duplications, gene losses, etc.
you need a fine-grained distinction among many-to-many, one-to-many and one-to-one relationships.
you need orthology and paralogy predictions among many species at the same time.
you want to know about gene losses.

-Note that phylogenetic trees are not perfect. They are not free of artifacts and, like other methods, they can lead to wrong predictions in cases of lineage sorting or horizontal gene transfer.-
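The idea of reading speciation and duplication events off a gene tree can be sketched with the simple species-overlap rule: an internal node is called a duplication if its two child clades share at least one species, and a speciation otherwise. This is only a minimal illustration, not a full reconciliation against a species tree and not the method of any particular database. The nested-tuple tree, the `Species_gene` leaf naming and the gene names are all hypothetical.

```python
# Minimal species-overlap sketch (assumption: binary gene tree given as
# nested tuples, leaves named "Species_gene"; all names are made up).

def species_of(leaf):
    """Return the species prefix of a leaf label such as 'Hsa_A1'."""
    return leaf.split("_", 1)[0]

def leaves(node):
    """Collect all leaf labels below a node."""
    if isinstance(node, str):
        return [node]
    collected = []
    for child in node:
        collected.extend(leaves(child))
    return collected

def classify_nodes(node, events=None):
    """Label every internal node as 'duplication' or 'speciation'."""
    if events is None:
        events = []
    if isinstance(node, str):
        return events
    left, right = node
    left_species = {species_of(name) for name in leaves(left)}
    right_species = {species_of(name) for name in leaves(right)}
    # Shared species between the two child clades imply a duplication.
    kind = "duplication" if left_species & right_species else "speciation"
    events.append((kind, sorted(leaves(left)), sorted(leaves(right))))
    classify_nodes(left, events)
    classify_nodes(right, events)
    return events

def ortholog_pairs(tree):
    """Pairs of genes whose last common ancestor is a speciation node."""
    pairs = set()
    for kind, left_leaves, right_leaves in classify_nodes(tree):
        if kind == "speciation":
            for a in left_leaves:
                for b in right_leaves:
                    pairs.add(tuple(sorted((a, b))))
    return pairs

# Hypothetical family: an old duplication (A1/A2) before the Hsa/Mmu split,
# a recent mouse-specific duplication (A2a/A2b), and a fly outgroup.
tree = ((("Hsa_A1", "Mmu_A1"), ("Hsa_A2", ("Mmu_A2a", "Mmu_A2b"))), "Dme_A")
pairs = ortholog_pairs(tree)
```

In this toy tree, `Hsa_A1` and `Mmu_A1` come out as one-to-one orthologs, `Hsa_A2` is a one-to-many ortholog of the two mouse in-paralogs, and `Hsa_A1` versus `Mmu_A2a` is correctly left out because those genes diverged at the duplication node.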
Blast-based methods are much faster and provide good results. There are many tools that you can use to generate your own predictions. You will need to decide among them by considering their limitations and specific scope. For instance,

Do you need a very fast approach to find pairs of orthologs in many species? (Best Reciprocal Hits)
Is it crucial to differentiate one-to-one orthologs from sequences with in-paralogs? (InParanoid, COG, etc.)
Do you need cross relationships among more than two species? (MultiParanoid, OrthoMCL)

-Note that many of these tools also provide precomputed data.-

An incomplete summary of resources:

(with special focus on phylogeny-based predictions)

Phylogeny based methods

MetaPhOrs (precomputed data): It combines predictions from many different databases and provides a consistency score for each orthology relationship. Useful to find highly reliable predictions. Data can be browsed interactively or downloaded from an FTP server.
EnsemblCompara (precomputed data): Phylogeny-based orthology and paralogy predictions. Ensembl bases its predictions on the analysis of gene family trees reconstructed using TreeBest (PhyML with a fixed evolutionary model, DNA and protein analysis, slightly guided trees for better tree reconciliation).
PhylomeDB (precomputed data): It bases its predictions on a per-gene phylogenetic analysis (PhyML, testing several evolutionary models, with alignment trimming and optimization). Note that, while Ensembl is a general-purpose database, PhylomeDB is organized in "phylomes", which are genome-wide collections of trees whose taxon sampling and analysis design are usually hypothesis driven. According to the MetaPhOrs publication, PhylomeDB uses MetaPhOrs to measure the reliability of its phylome-based predictions.
In general terms, both Ensembl and PhylomeDB tend to benchmark very similarly (with good results), and they provide convenient API access to the DB and FTP downloads.
TreeFam (precomputed data): Similar to EnsemblCompara but it includes a set of manually curated trees. It seems to be discontinued; the latest release dates from Feb 2009.
PHOG: analysis of precomputed phylogenies using a slightly different method.
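To give a feel for what a consistency score across resources might look like, here is a minimal sketch that computes, for one gene pair, the fraction of sources that call the pair orthologous among the sources that contain both genes. This is a simplified illustration and not necessarily the formula MetaPhOrs actually uses. The source names, gene names and the `(covered_genes, ortholog_pairs)` data layout are hypothetical.

```python
# Hypothetical agreement fraction across several orthology sources.
# Each source is (set of genes it covers, set of frozenset ortholog pairs).

def consistency_score(sources, gene_a, gene_b):
    """Fraction of informative sources that call gene_a/gene_b orthologs.

    A source is informative only if it covers both genes. Returns None
    when no source is informative, so "no data" is not read as "0".
    """
    pair = frozenset((gene_a, gene_b))
    informative = 0
    supporting = 0
    for covered_genes, ortholog_pairs in sources.values():
        if gene_a in covered_genes and gene_b in covered_genes:
            informative += 1
            if pair in ortholog_pairs:
                supporting += 1
    if informative == 0:
        return None
    return supporting / informative

# Made-up example data for three genes across four sources.
sources = {
    "db1": ({"a", "b", "c"}, {frozenset(("a", "b"))}),
    "db2": ({"a", "b"}, {frozenset(("a", "b"))}),
    "db3": ({"a", "b", "c"}, {frozenset(("a", "c"))}),
    "db4": ({"a", "c"}, set()),
}

score_ab = consistency_score(sources, "a", "b")  # 2 of 3 informative sources
score_ac = consistency_score(sources, "a", "c")  # 1 of 3 informative sources
```

Filtering on a threshold of such a score is one way to keep only the predictions that most sources agree on.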

Blast-based approaches

InParanoid (precomputed data and standalone application): Predictions between pairs of species. It accounts for one-to-many and many-to-many relationships.
EggNOG (~COG) (precomputed data): Comprehensive catalog (630 species, including bacteria and archaea) of functionally annotated orthologous groups. An all-against-all BLAST comparison is used to build the orthologous groups. It accounts for in-paralogs.
OrthoMCL, MultiParanoid: Extensions of the previous methods. They add the possibility of generating predictions for several species at the same time.
Best Reciprocal Hits (BRH): The simplest method. Still very useful when only the best orthologous pairs between two species are required.
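As an illustration of the simplest approach in the list above, the following sketch computes Best Reciprocal Hits from two tables of similarity scores, one per search direction. It assumes the scores (for example BLAST bit scores) have already been parsed into dictionaries; the gene names and values are made up, and ties are resolved by keeping the first hit encountered.

```python
# Minimal Best Reciprocal Hits sketch. Assumption: scores were already
# parsed into dicts mapping (query, subject) -> score (higher is better).

def best_hits(scores):
    """Return the top-scoring subject for every query."""
    best = {}
    for (query, subject), score in scores.items():
        # Strict '>' keeps the first hit seen when two scores tie.
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Hypothetical scores for species A against species B and the reverse.
a_vs_b = {
    ("a1", "b1"): 250.0,
    ("a1", "b2"): 90.0,
    ("a2", "b2"): 180.0,
    ("a3", "b2"): 200.0,
}
b_vs_a = {
    ("b1", "a1"): 245.0,
    ("b2", "a3"): 205.0,
    ("b2", "a2"): 175.0,
}

brh = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a2` is dropped because its best hit `b2` prefers `a3`. This is exactly the situation (possible in-paralogs) where BRH stays silent and tools such as InParanoid try to recover the extra relationships.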

Some benchmarks (among others)

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0018755
http://genomebiology.com/2007/8/6/R109 (figure 4)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1838432/
http://nar.oxfordjournals.org/content/37/suppl_2/W84.full (Figure 1)
http://nar.oxfordjournals.org/content/early/2010/12/11/nar.gkq953.full (Figure 3)

Some comments
1) Definition of orthologs. Fitch's definition is the most widely accepted. IMO, it is also more precise and evolutionarily meaningful than the several alternatives. If you want to find orthologs, go for databases using such a definition (e.g. Ensembl, TreeFam and InParanoid).
2) In general, I prefer tree-based methods, especially for mammals and perhaps also vertebrates. With a tree you can visually tell if the inference makes sense, which is a huge advantage. Another advantage of tree-based methods over pairwise methods is that tree-based methods produce consistent results across species. For example, say A is a 1:1 ortholog of B and B is a 1:1 ortholog of C. In principle, A is a 1:1 ortholog of C (not true if not 1:1), but a pairwise method cannot always guarantee this.
3) However, tree-based methods are not necessarily better than other methods. Reconstructing trees is very difficult. It is quite possible to come up with a purely heuristic method to achieve better results.
4) For tree-based methods, it is important to build gene trees while considering the species tree, or to try to fix the tree topology with the species tree as a sort of prior. Blindly building a gene tree (even using the best algorithm) and then doing the standard reconciliation will give very bad inference.
5) Tree-based methods do not work well for bacteria due to the lack of a good species tree and to LGT/HGT. LGT very rarely, if ever, happens in mammals.
6) For mammals, nucleotide trees tend to reflect the true evolution better than protein trees. One paper argues that a protein-guided nucleotide alignment is the best for building trees. This is also my experience. Ensembl/TreeFam are using that.
7) For primates and rodents, EnsemblCompara is probably the best choice. It may not be the most accurate, but it should be good enough for most purposes. I usually do not like results obtained by combining predictions. Combining is good for method comparison, but it leads to various artifacts that are hard to understand.
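The consistency point in comment 2 can be checked mechanically. The sketch below takes a set of 1:1 ortholog calls produced by independent pairwise comparisons and reports every triple where A-B and B-C are called but A-C is not. It assumes the input contains only 1:1 calls between genes of different species; the gene names are hypothetical.

```python
from itertools import combinations

# Sketch: find transitivity violations among pairwise 1:1 ortholog calls.
# Assumption: each pair links two distinct genes from different species.

def transitivity_violations(pairs):
    """Return triples (a, b, c) with a-b and b-c called but a-c missing."""
    called = {frozenset(pair) for pair in pairs}
    partners = {}
    for pair in called:
        x, y = tuple(pair)
        partners.setdefault(x, set()).add(y)
        partners.setdefault(y, set()).add(x)
    violations = []
    for b, linked in partners.items():
        # Every two partners of b should also be partners of each other.
        for a, c in combinations(sorted(linked), 2):
            if frozenset((a, c)) not in called:
                violations.append((a, b, c))
    return sorted(violations)

# Hypothetical calls from three separate pairwise runs
# (human-mouse, mouse-rat, human-rat).
calls = [("hs_G", "mm_G"), ("mm_G", "rn_G"), ("hs_G", "rn_H")]
violations = transitivity_violations(calls)

# A consistent set of calls yields no violations.
consistent = transitivity_violations([("a", "b"), ("b", "c"), ("a", "c")])
```

In the example the human-rat run disagrees with what the other two runs imply, so the same conflict is reported once from each gene that links two unconnected partners. A single gene tree covering all three species cannot produce this kind of contradiction.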

cited from http://biostar.stackexchange.com/questions/7591/what-is-the-best-method-to-find-orthologous-genes-of-a-species

Angela at Rice

Wednesday, December 14, 2011
