Lab #7 Phylogenomics

The objective of this activity is to introduce you to main steps and approaches in phylogenomic analysis. We will use several programs for this exercise. A few of them will be familiar, but many will be novel. So be ready!

Introduction

What is phylogenomics?

The term “phylogenomics” refers to several areas of research at the interplay between genomics and evolution. Here we will use it in a sense of inferring species’ relationships using genomic scale datasets.

To concatenate or not to concatenate…

When dealing with the sequence data in phylogenomic analysis, one of the main decision we have to make is whether to concatenate the data. If we do, we can use any of the previously discussed phylogenetic methods to infer a species tree. It will just take longer. If we don’t, we can use supertree or coalescent methods. We will try both approaches here.

The dataset

We will use one of the dataset from the study by McCormack et al. (2012). The directory for this lab contains both the sequences and alignments.

Getting started

We will use several new programs in this excercise. All of them are already installed on Nova, but we’ll do a few tricks to simplify their usage:

  1. Login to Nova and request 4 cores for 2 hours:
    salloc -p class-long -N 1 -n 4 -t 2:00:00 -A s2023.eeob.563.1
    
  2. Edit your .bashrc file by running the following commands from your home (~) directory:
    echo "alias amas='python3 /work/class-faculty/dlavrov/eeob563/bin/AMAS.py'" >> ~/.bashrc
    echo "alias astral='java -jar /work/class-faculty/dlavrov/eeob563/src/ASTRAL/Astral/astral.5.7.8.jar'" >> ~/.bashrc
    echo "alias pargenes='/work/class-faculty/dlavrov/eeob563/src/ParGenes/pargenes/pargenes.py'"  >> ~/.bashrc
    echo "alias pargenes-export='python3 /work/class-faculty/dlavrov/eeob563/src/ParGenes/pargenes/pargenes_src/export.py'"  >> ~/.bashrc
    source ~/.bashrc
    

Part I: Concatenation-based analysis

We will use AMAS program to concatenate alignments present in the data/org directory. To run AMAS type amas <command> [<args>]. Running amas without a command provides this help guide:

usage: AMAS <command> [<args>]

The AMAS commands are:
  concat      Concatenate input alignments
  convert     Convert to other file format
  replicate   Create replicate data sets for phylogenetic jackknife
  split       Split alignment according to a partitions file
  summary     Write alignment summary
  remove      Remove taxa from alignment
  translate   Translate DNA alignment into protein alignment
  trim        Remove columns from alignment

Use AMAS <command> -h for help with arguments of the command of interest
  1. Run the following commands from the lab7 directory:
mkdir -p analysis/{concatenation,coalescence};
cd analysis/concatenation;
amas concat -f nexus -d dna --out-format nexus --part-format nexus -i ../../data/org/183_aln/nexus/*.nex

The code above creates a concatenated alignment (concatenated.out) as well as a list of individual loci locations in this alignment (partitions.txt)

  1. Run your favorite program (e.g., MrBayes) to estimate a phylogenetic tree based on the alignment. Note, use out-format phylip option if you want to create a phylip-formatted file (e.g., for RAxML)

Part II: Coalescence-based analysis using ASTRAL

Making gene trees with PARGENE and RAXML-NG

Before running ASTRAL we need to build inidividual trees. We’ll do this with ParGenes.

ParGenes is a parallel tool that takes as input a set of multiple sequence alignments (typically from different genes) and infers their corresponding phylogenetic trees.

Pargenes has many options that you can check here or by running pargenes --help. We won’t use most of them for this lab.

  1. Run ParGenes with the following command:
    cd ../coalescence
    pargenes -a ../../data/org/183_aln/phylip -o ./pargene_out -c 16 -d nt -R "--model GTR"
    
  2. Run an export script that extracts ML trees from the output directory:
    pargenes-export --best-ml-tree -i pargene_out -o pargene_export.out
    
  3. Concatenate all trees in one file:
    cat pargene_export.out/* > gene_trees.tre
    

ASTRAL

ASTRAL is a java program for estimating a species tree given a set of unrooted gene trees. ASTRAL is statistically consistent under multi-species coalescent model (and thus is useful for handling ILS). The optimization problem solved by ASTRAL seeks to find the tree that maximizes the number of induced quartet trees in gene trees that are shared by the species tree. The optimization problem is solved exactly for a constrained version of the problem that restricts the search space. An exact solution to the unconstrained version is also implemented and can run on small datasets (less than 18 taxa).

Installation

The program is already installed in /work/class-faculty/dlavrov/eeob563, but you can dowload it from its github repository or use the module module load astral (older version).

EXECUTION:

  • ASTRAL is a java-based application, and should run in any environment (Windows, Linux, Mac, etc.) but you need java to run it.

If you created an alias above, run ASTRAL by just typing astral. Typing astral --help will show you a list of options available in ASTRAL:

================== ASTRAL ===================== 

This is ASTRAL version 5.7.8

Usage:
  ASTRAL (version5.7.8) [--help] (-i|--input) <input file> [(-o|--output) <output
  file>] [(-q|--score-tree) <score species trees>] [(-t|--branch-annotate) <branch
  annotation level>] [(-b|--bootstraps) <bootstraps>] [(-r|--reps) <replicates>]
  [(-s|--seed) <seed>] [-g|--gene-resampling] [--gene-only] [(-k|--keep) <keep>]
...
Input:
  • The input gene trees are in the Newick format
  • The input trees can have missing taxa, polytomies (unresolved branches), and multiple individuals per species.
  • Taxon names cannot have quotation marks (' or ") or special characters (like |, =, ?, /, and \) in the name.
Output:

The output in is Newick format and gives:

  • the species tree topology,
  • branch lengths in coalescent units (only for internal branches or for terminal branches if that species has multiple individuals),
  • branch supports measured as local posterior probabilities.
  • It can also annotate branches with other quantities, such as quartet support, as described in the tutorial.
  1. We can run ASTRAL as
    astral -i gene_trees.tre -o astral.tre
    # or
    astral -i gene_trees.tre -o astral.tre 2> astral.log
    

Viewing results of ASTRAL:

  1. The output of ASTRAL is a tree in Newick format. Let’s use this server to look at it.

Using either this applications open the astral.tre file.

Reroot the tree at the correct node, which is always necessary, since the rooting of the ASTRAL trees is arbitrary and meaningless.

Branch length and support

  • ASTRAL only estimates branch lengths for internal branches and those terminal branches that correspond to species with more than one individuals sampled.
  • Branch lengths are in coalescent units and are a direct measure of the amount of discordance in the gene trees. As such, they are prone to underestimation because of statistical noise in gene tree estimation.
  • Branch support values measure the support for a quadripartition (the four clusters around a branch) and not the bipartition, as is commonly done.


Content created by ISU-MolPhyl faculty at Iowa State University.
Hosted by GitHub Pages.
Jekyll theme based on Millidocs.
Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution 4.0 International License.