RAxML-NG tutorial

IMPORTANT NOTE: This tutorial describes the functionality of the latest RAxML-NG v. 1.1.0 and is based on the tutorial provided by its authors.

Intro

RAxML-NG replaces standard RAxML as well as the corresponding supercomputer version ExaML. RAxML-NG does not (yet) support all options of standard RAxML, so you may need to use the previous implementation for some tasks.

Before you start:

  • I already installed RAxML-NG in the class directory
  • You can still update the course repository to download the data files

Check that you have the RAxML-NG version 1.1.0 or later:

$ raxml-ng -v

RAxML-NG v. 1.1.0 released on 29.11.2021 by The Exelixis Lab.
...

Getting help

--help option displays the help menu (you can also run raxml-ng without parameters).

More comprehensive documentation is available in GitHub wiki.

Data and Model

Remember that likelihood is the probability of the data given the model (P(D|M)).
So we have to specify our data and our model in all raxml-ng commands.

--msa <FILE> option (required) specifies alignment

RAxML-NG supports alignments in FASTA, PHYLIP and CATG formats. By default, RAxML-NG will try to automatically detect alignment format based on the file contents. You can also specify the alignment format explicitly with the --msa-format option.

--model <MODEL> option (required) specifies the model of evolution.

Examples of possible models are GTR+G for nucleotide data, JTT+G for aa data, BIN for binary data, MULTI8_MK multistate data with 8 categories, among many others.

Output

--prefix <NAME> option (optional) addes prefix to output files.

RAxML-NG will generate several output files. Use the --prefix <NAME> option to avoid over-writing them.

Checking/optimizing your data

Use the --check command to check/optimize your dataset.

--check command checks that the MSA can actually be read and doesn’t contain sites with only undetermined characters or sequences with undetermined characters or duplicate taxon names, etc. etc. If the problems are found, a “cleaned” file is written.

raxml-ng --check --msa <DATAFILE> --model <MODEL> --prefix T1

Use the --parse command for large alignments

In addition to MSA check, --parse command will perform two useful operations:

  1. Compress alignment patterns and store MSA in the binary format
  2. Estimate memory requirements and optimal number of CPUs/threads
raxml-ng --parse --msa <DATAFILE> --model <MODEL> --prefix T2

--search conducts a heurstic search for the ML tree

By default, --search will perform 20 tree searches using 10 random and 10 parsimony-based starting trees, and pick the best-scoring topology. You can add the --seed <N> option for reproducibility.

 raxml-ng --msa <DATAFILE> --model <MODEL> --prefix T3 --threads 2 --seed <N>

--tree pars{N},rand{M} option specifies the number of starting trees

Although, the default settings is a reasonable choice for many practical cases, computational resources permitting, we might want to increase the number of starting trees to explore the tree space more thoroughly:

 raxml-ng --msa <DATAFILE> --model GTR+G --prefix T4 --threads 2 --seed 2 --tree pars{25},rand{25}

Conversely, we can perform a quick-and-dirty search from a single random starting tree using the --search1 command:

 raxml-ng --search1 --msa <DATAFILE> --model GTR+G --prefix T5 --threads 2 --seed 2

--rfdist computes topological Robinson-Foulds (RF) distances among the trees

You may wonder if 20 (or even 50) starting trees were enough for the analysis. The program’s ouput showed some variation in the Final logLikelihood values, but was it due to different topologies? We can check this by using the --rfdist command to compute the topological Robinson-Foulds (RF) distance among all trees:

raxml-ng --rfdist --tree mltrees --prefix RF

Use cat to examine the .rfDistances file.

Bootstrapping

NOTE: As of v. 0.8.0b, RAxML-NG supports only standard/slow bootstrap algorithm (-b option in RAxML8). It is much slower than rapid bootstrapping implemented in RAxML8 (-x or -f a options), but gives more accurate support values estimates.

--bootstrap performs non-parametric bootstrap analysis.

raxml-ng --bootstrap --msa <DATAFILE> --model GTR+G --prefix T7 --seed 2 --threads 2

By default, RAxML-NG will perform MRE-based bootstopping test after every 50 replicates, and terminate once the diagnostic statistics drops below the cutoff value.

However, you can manually specify the number of replicates with the --bs-trees <N> option, where N is the number of replicates.

Bootstrap convergence can also be checked post-hoc using the --bsconverge command, we can also change the cutoff value to make the test more or less stringent:

raxml-ng --bsconverge --bs-trees <BOOTSTRAPS> --prefix T9 --seed 2 --threads 2 --bs-cutoff 0.01

--support maps bootstrap values on the ML tree

We can map bootstrap values onto the best-scoring/best-known ML tree. We will use the ML tree obtained in the run T3:

raxml-ng --support --tree T3.raxml.bestTree --bs-trees allbootstraps --prefix T13 --threads 2

We can use a tree viewer (I use FigTree a lot) to visualize the ML tree with mapped bootstrap values (you have to choose “label” option under the “display node support”.

Tip: To view a phylogenetic trees on the terminal, use the nw_display program from the newick_utils package.

Alternatively, we can compute so-called Transfer Bootstrap Expectation support metric (Lemoine et al. 2018), which is supposedly more appropriate for very large trees:

raxml-ng --support --tree T3.raxml.bestTree --bs-trees allbootstraps --prefix T14 --threads 2 --bs-metric tbe

Running ML search and bootstrap analysis in one step

--all command runs both ML search and bootstrap analysis;

  • --bs-metric option maps these bootstrap values on the ML tree.
raxml-ng --all --msa <DATFILE> --model <MODEL> --prefix T15 --seed 2 --threads 2 --bs-metric fbp,tbe

Tree likelihood evaluation

Was it worth doing the ML analysis? Wasn’t the parsimony/distance tree good enough?

One way to answer these question is to compute the likelihoods of MP, NJ, and ML trees by just optimizing model and/or branch length parameters and not changing the topology itself.

--evaluate command calculates likelihood of a tree

  • --opt-model on/off you can enable/disable model parameter optimization.
  • --opt-branches on/off you can enable/disable branch length optimization.

The commands looks like:

raxml-ng --evaluate --msa <ALIGNMNET> --threads <N> --model <MODEL> --tree <TREE TO EVALUATE> --prefix <PREFIX>

Partitioned analysis

Use --model <file_name> to specifies partitions.

So far, we used a single evolutionary model for all sites in our alignment. This is rather unrealistic biologically, because different genes and/or codon positions typically exhibit distinct substitution patterns. Therefore, it is common to divide alignment sites into subsets or partitions, which can be assigned individual evolutionary models. In the simplest case, we can assign identical models for all partitions, but allow independent model parameter estimates. For example, the dataset we’ll use today can be partitioned by individual genes as encoded in the following file:

GTR+G+FO, cob=1-1248
GTR+G+FO, cox1=1249-2808
GTR+G+FO, cox2=2809-3543
GTR+G+FO, cox3=3544-4482

A more sophisticated paritioning may involve different substitution matrices and rate heterogeneity models, and also split genes by codon position:

GTR+G+FO, COB=1-1248
GTR+G+FO, COXp1_2=1249-4482/3,1250-4482/3
HKY, COXp3=1251-4482/3

Beware of branch lengths

-brlen option sets the branch linkage model

In partitioned analyses, there are three common ways to estimate branch lengths (sometimes called branch linkage modes):

  • linked: all partitions share a common set of (global) branch lengths. This is the simplest model with fewest parameters (#branches). However, it is often considered too unrealistic, since it is known that genes (or genome regions) evolve at different speeds.
  • unlinked: each partition has its own, independent set of branch lengths. This model allows for the highest flexibility, but it also introduces a huge number of free parameters (#branches * #partitions), which makes it prone to overfitting.
  • scaled (proportional): a global set of branch lengths is estimated like in linked mode, but each partition has an individual scaling factor; per-partition branch lengths are obtained by multiplying the global branch lengths with these individual scalers. This approach is a compromise that allows to model distinct evolutionary rates across partitions while introducing only a moderate number of free parameters (#branches + #partitions).

Topological constraint

--constraint-tree FILE option specifies a constraint tree

If the constraint tree is comprehensive (i.e., it includes all taxa found in the MSA), then RAxML will simply resolve polytomies in the way that maximizes the likelihood. Conversely, if some taxa are missing from the constraint, they will be placed freely in the resulting ML tree.

Outgroup rooting

--outgroup o1,o2,..,oN option specifies outgroups when drawing a tree

The outgroup can be a single taxon (--outgroup Human) or a list of taxa which form a monophyletic group (--outgroup Human,Chimp,Gorilla).

Please note that outgroup rooting is just a drawing option and will not affect tree inference process or score in any way!

Now it is your turn.

1) Find the ML tree for the provided dataset using the GTR + gamma model of sequence evolution and calculate bootstrap support using 50 bs replicates.

2) Calculate the likelihood score for the following models of DNA evolution: Jukes-Cantor (JC), JC with rate heterogeneity (JC+G), GTR (GTR), GTR with the Gamma model of rate heterogeneity, but empirical base frequencies (GTR+G+FC), same buth with ML estimate of the base frequencies (GTR+G+FO), as previously by with 4 free rates instead of GAMMA-distributed rates (GTR+R4+FO). Use the best tree from above and E1-E6 prefixes for these analyses.

Check the results with:

$  grep logLikelihood E*.raxml.log

What did you find?

Check AIC/BIC scores with

grep "AIC score" E*.raxml.log

What model had the best (smallest) AIC/BIC score?

3) Does it make more sense to partition by gene or by codon position? (use your best tree to evaluate). Use the best partitioning scheme to infer the ML tree. Does it have a different topology from your original tree.

4) Which of the branch linkage modes has the highest likelihood? Which one should we choose (use your best tree to evaluate). Use the best model to search for the ML tree. Did it change the resulting likelihoods and/or topology?

What’s next?

There is more info in the Advanced Tutorial!



Content created by ISU-MolPhyl faculty at Iowa State University.
Hosted by GitHub Pages.
Jekyll theme based on Millidocs.
Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 4.0 International License.