Exercise 2 - Creating a self sustained bioinformatic pipeline

Phylogenomics tutorial based on BUSCO genes

Github URL (latest version): https://github.com/chrishah/phylogenomics_intro_vertebrata

Disclaimer

To follow the demo and make the most of it, it helps if you have some basic skills with running software tools and manipulating files using the Unix shell command line. It assumes you have Singularity installed on your computer (tested with Singularity version 3.6.3 on Ubuntu 18.04).

Introduction

We will be reconstructing the phylogenetic relationships of some (iconic) vertebrates based on previously published whole genome data. The list of species we will be including in the analyses, and the URL for the data download can be found in this table.

All software used in the demo is deposited as Docker images on Dockerhub and all data is freely and publicly available.

The workflow we will demonstrate is as follows:

  • Download genomes from Genbank

  • Identifying complete BUSCO genes in each of the genomes

  • pre-filtering of orthology/BUSCO groups

  • For each BUSCO group:

    • build alignment

    • trim alignment

    • identify model of protein evolution

    • infer phylogenetic tree (ML)

  • construct supermatrix from individual gene alignments

  • infer phylogenomic tree with paritions corresponding to the original gene alignments using ML

Let’s begin

Before you get going I suggest you download this repository, so have all scripts that you’ll need. Ideally you’d do it through git.

(user@host)-$ git clone https://github.com/chrishah/phylogenomics_intro_vertebrata.git

Then move into the newly cloned directory, and get ready.

(user@host)-$ cd phylogenomics_intro_vertebrata

1.) Download data from Genbank

What’s the first species of vertebrate that pops into your head? Latimeria chalumnae perhaps? Let’s see if someone has already attempted to sequence its genome. NCBI Genbank is usually a good place to start. Surf to the webpage and have a look. And indeed we are lucky.

Let’s get it downloaded. Note that the (user@host)-$ part of the code below just mimics a command line prompt. This will look differently on each computer. The command you actually need to execute is the part after that, so only, e.g. mkdir assemblies:

#First make a directory and enter it
(user@host)-$ mkdir assemblies
(user@host)-$ mkdir assemblies/Latimeria_chalumnae
(user@host)-$ cd assemblies/Latimeria_chalumnae


#use the wget program to download the genome
(user@host)-$ wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/225/785/GCF_000225785.1_LatCha1/GCF_000225785.1_LatCha1_genomic.fna.gz

#decompress for future use
(user@host)-$ gunzip GCF_000225785.1_LatCha1_genomic.fna.gz

#leave the directory for now
(user@host)-$ cd ../..

We have compiled a list of published genomes that we will be including in our analyses here - it also ships with the repository and can be found here: data/samples.csv. You don’t have to download them all now, but if you want you can do another few just as practice. You can do one by one or use your scripting skills (for loop) to get them all in one go.

To keep things clean I’d suggest to download each into a separate directory that should be named according to the binomial (connected with underscores, rather than spaces) - see example for L. chalumnae above.

If you want to try out writing a for loop I have a reduced list of samples here: data/samples.exercise.csv. A possible solution also ships with the repo here: data/download_loop.sh.

2.) Run BUSCO on each assembly

Warning

Since these genomes are relatively large BUSCO takes quite a while to run so this step has been already done for you. For a separate session detailing common steps for evaluating draft genome assemblies, including an example on how to run BUSCO, we refer you to this repo.

A reduced representation of the BUSCO results for each species ships with our repository in the directory results/orthology/busco/busco_runs.

Take a few minutes to explore the reports.

3.) Prefiltering of BUSCO groups

Important Information

In this tutorial we’ll be using Docker containers through Singularity. When calling singularity exec docker://<containername:version> as below the corresponding container will be downloaded from Dockerhub automatically if it is not yet present locally. This is very convenient, but might in some instances take a bit of time.

If you are doing this exercise as part of a course you might be provided with local copies of the images to save some time.

In some courses for example you’ll find local *sif files in ~/Share/Singularity_images/ - Please doublecheck with your instructor(s) if this is the case. If it is the case you are encouraged to use the local files instead of the images from the cloud, so whenever there is a singularity call you can replace the cloud id with the path to the local *sif file. Filenames should correspond to the docker ids, like e.g. the following:

(user@host)-$ singularity exec docker://reslp/biopython_plus:1.77 \
              <rest of the command>

could be replaced with:

(user@host)-$ singularity exec ~/Share/Singularity_images/biopython_plus_1.77.sif \
              <rest of the command>

FYI, the following command would download the image and safe it to a local *.sif file.

(user@host)-$ singularity pull docker://reslp/biopython_plus:1.77
(user@host)-$ ls -hrlt #see what happened

Now, assuming that we ran BUSCO across a number of genomes, we’re going to select us a bunch of BUSCO genes to be included in our phylogenomic analyses. Let’s get an overview.

We have a script to produce a matrix of presence/absence of BUSCO genes across multiple species. Let’s try it out.

(user@host)-$ singularity exec docker://reslp/biopython_plus:1.77 \
              bin/extract_busco_table.py \
              --hmm results/orthology/busco/busco_set/vertebrata_odb10/hmms \
              --busco_results results/orthology/busco/busco_runs/ \
              -o busco_table.tsv

The resulting file busco_table.tsv can be found in your current directory.

We’d want for example to identify all genes that are present in at least 20 of our 25 taxa and concatenate the sequences from each species into a single fasta file.

(user@host)-$ mkdir -p by_gene/raw
(user@host)-$ singularity exec docker://reslp/biopython_plus:1.77 \
              bin/create_sequence_files.py \
              --busco_table busco_table.tsv \
              --busco_results results/orthology/busco/busco_runs \
              --cutoff 0.5 \
              --outdir by_gene/raw \
              --minsp 20 \
              --type aa \
              --gene_statistics gene_stats.txt \
              --genome_statistics genome_statistics.txt

A bunch of files have been created in your current directory (gene_stats.txt) and also in the directory by_gene/raw (per gene fasta files).

4.) For each BUSCO group

For each of the BUSCOs that passed we want to:

  • do multiple sequence alignment

  • filter the alignment, i.e. remove ambiguous/problematic positions

  • build a phylogenetic tree

Let’s go over a possible solution step by step for gene: 409625at7742.

Perform multiple sequence alignment with clustalo.

#alignment with clustalo
(user@host)-$ mkdir by_gene/aligned
(user@host)-$ singularity exec docker://reslp/clustalo:1.2.4 \
              clustalo \
              -i by_gene/raw/409625at7742_all.fas \
              -o by_gene/aligned/409625at7742.clustalo.fasta \
              --threads=2

We can then look at the alignment result. There is a number of programs available to do that, e.g. MEGA, Jalview, Aliview, or you can do it online. A link to the upload client for the NCBI Multiple Sequence Alignment Viewer is here (I suggest to open in new tab). Upload (by_gene/aligned/409625at7742.clustalo.fasta), press ‘Close’ button, and have a look.

What do you think? It’s actually quite messy..

Let’s move on to score and filter the alignment, using TrimAl.

#alignment trimming with trimal
(user@host)-$ mkdir by_gene/trimmed
(user@host)-$ singularity exec docker://reslp/trimal:1.4.1 \
              trimal \
              -in by_gene/aligned/409625at7742.clustalo.fasta \
              -out by_gene/trimmed/409625at7742.clustalo.trimal.fasta \
              -gappyout

Try open the upload dialog for the Alignment viewer in a new tab and upload the new file (by_gene/trimmed/409625at7742.clustalo.trimal.fasta). What do you think? The algorithm has removed quite a bit at the ends of the original alignment, reducing it to only ~100, but these look mostly ok, at first glance.

Now, let’s infer a ML tree with IQtree.

#ML inference with IQTree
(user@host)-$ mkdir -p by_gene/phylogeny/409625at7742
(user@host)-$ singularity exec docker://reslp/iqtree:2.0.7 \
              iqtree \
              -s by_gene/trimmed/409625at7742.clustalo.trimal.fasta \
              --prefix by_gene/phylogeny/409625at7742/409625at7742 \
              -m MFP --seqtype AA -T 2 -bb 1000

The best scoring Maximum Likelihood tree can be found in the file: by_gene/phylogeny/409625at7742/409625at7742.treefile.

The tree is in the Newick tree format. There is a bunch of programs that allow you to view and manipulate trees in this format. You can only do it online, for example through iTOL, embl’s online tree viewer. There is others, e.g. ETE3, icytree, or trex. You can try it out, but first let’s have a quick look at the terminal.

(user@host)-$ cat by_gene/phylogeny/409625at7742/409625at7742.treefile

Well done!

5.) Run the process for multiple genes

Now, let’s say we want to go over these steps for multiple genes, say these:

  • 359032at7742

  • 413149at7742

  • 409719at7742

  • 406935at7742

Exercise

For loop would do the job right? See the below code. Do you manage to add the tree inference step in, too? It’s not in there yet.

(user@host)-$ for gene in $(echo "359032at7742 413149at7742 409719at7742 406935at7742")
do
        echo -e "\n$(date)\t$gene"
        echo -e "$(date)\taligning"
        singularity exec docker://reslp/clustalo:1.2.4 clustalo -i by_gene/raw/${gene}_all.fas -o by_gene/aligned/${gene}.clustalo.fasta --threads=2
        echo -e "$(date)\ttrimming"
        singularity exec docker://reslp/trimal:1.4.1 trimal -in by_gene/aligned/${gene}.clustalo.fasta -out by_gene/trimmed/${gene}.clustalo.trimal.fasta -gappyout
        echo -e "$(date)\tDone"
done

Possible solutions using docker images or local *sif files (make sure to change the path to the *sif files) can be found here and here, respectively. See scripts backup/bygene.sh and backup/bygene_local.sh in this repository.

If you want to the script ``backup/bygene_local.sh`` please make sure that the paths to the Shared folder where the ``*sif`` files are is correct.

If you want to skip this step alltogehter you can fetch the files that would be produced by this step from the backup directory, like so:

(user@host)-$ rsync -avpuzP backup/by_gene .

The results of the above processes (aligning, trimming and tree inference for 5 genes) can be found in the directory by_gene/. Check it out and take note how we have neatly organised our results with meaningful directory names, etc.

Now, let’s infer a ML tree using a supermatrix of all 5 genes that we have processed so far.

(user@host)-$ singularity exec docker://reslp/iqtree:2.0.7 \
              iqtree \
              -s by_gene/trimmed/ \
              --prefix five_genes \
              -m MFP --seqtype AA -T 2 -bb 1000

This will run for about 10 Minutes. You can check out the result five_genes.treefile, once it’s done. A backup ships with the repository in backup/five_genes.treefile.

(user@host)-$ cat five_genes.treefile

Now, we can also try to build a speciestree from the 5 individual gene trees using ASTRAL.

First bring the individual gene trees together into one file. Let’s call the file trees.txt, then run ASTRAL:

(user@host)-$ cat by_gene/phylogeny/*/*.treefile > trees.txt
(user@host)-$ singularity exec docker://reslp/astral:5.7.1 \
              java -jar /ASTRAL-5.7.1/Astral/astral.5.7.1.jar \
              -i trees.txt -o species_tree.astral.tre

Have a look at the result.

(user@host)-$ cat species_tree.astral.tre #or try backup/species_tree.astral.tre instead if you had trouble

Instead of looking at the plain text representation you can also explore the trees e.g. via ITOL.

Congratulations, you’ve just built your first phylogenomic tree(s)!!!

5.) Automate the workflow with Snakemake

A very neat way of handling this kind of thing is Snakemake.

Important Information

Snakemake needs to be installed on your system to continue with the exercise. An easy way to get it set up is through conda.

If you do this as part of a course it might be that the environment is already set up for you. Run the command: conda env list to list all conda envirnments. If you see something called snakemake or serpentesmake they are probably the right ones (check back with instructors in case of doubt). Please activate the right environment, e.g. like so: conda activate serpentesmake.

If the environment isn’t yet set up for you, we provide some instructions here.

The very minimum you’ll need to create Snakemake workflow is a so called Snakefile. The repository ships with a file called Snakemake_intro/Snakefile_cloud. This file contains the instructions for running a basic workflow with Snakemake. Let’s have a look. Note that we have prepared also a file Snakemake_intro/Snakefile_local in case you’ve been instructed to use local *sif files.

(user@host)-$ less Snakemake_intro/Snakefile_cloud #exit less with 'q'
(user@host)-$ less Snakemake_intro/Snakefile_local #exit less with 'q'

In the Snakefile you’ll see rules (that’s what individual steps in the analyses are called in the Snakemake world). Some of which should look familiar, because we just ran them manually, and then from within a simple bash script. Filenames etc. are replaced with variables but other than that..

Assuming you’ve set up a conda environment called snakemake (either you just did or it’s already set up for you), in order to run Snakemake you first need to enter this environment.

(user@host)-$ conda activate snakemake # maybe conda activate serpentesmake - depends on the course
(snakemake) (user@host)-$ snakemake -h

Now, let’s try to do a Snakemake ‘dry-run’ (flag -n), providing a specific target file and see what happens. Remember that snakemake expects a file called Snakefile in the working directory per default, so you could copy the file Snakemake_intro/Snakefile_local (or if instructed to use images from Dockerhub instead, Snakemake_intro/Snakefile_cloud) here, renamed Snakefile (this is what will be assumed) or find another way (you could add the flag -s <path/to/Snakefile> to the snakemake commands below).

Review the Snakefile you’ve just copied and make sure that the paths to the locally provided *sif files are correct. Change from ~/Shared_folder to ~/Share for example. Note that each rule uses a different one.

(user@host)-$ cp Snakemake_intro/Snakefile_local Snakefile

## now let's do a 'dry-run' with snakemake
## you provide a specific target file and given the rules
## in the Snakefile snakemake will figure out what it
## needs to do to get to this file
(user@host)-$ snakemake -n -rp auto/trimmed/193525at7742.clustalo.trimal.fasta

Now, you could extend the analyses to further genes.

(user@host)-$ snakemake -n -rp \
              auto/trimmed/193525at7742.clustalo.trimal.fasta \
              auto/trimmed/406935at7742.clustalo.trimal.fasta

Actually running would happen if you remove the -n flag. Note that I’ve added another flag (--use-singularity) which tells snakemake to use containers for certain rules if so indicated in the Snakefile.

(user@host)-$ snakemake -rp --use-singularity --jobs 4 \
              auto/trimmed/193525at7742.clustalo.trimal.fasta \
              auto/trimmed/406935at7742.clustalo.trimal.fasta

Check out the results of these analyses in the directory auto/.

Well Done!

Challenge

Add two rules to the Snakefile:

  • rule gene_tree - infer a gene tree for each alignment

  • rule supermatrix - infer the final tree (target filename: super.treefile) based the supermatrix created from the individual gene alignments

Further, make sure your workflow includes the following genes:

  • 409625at7742

  • 409719at7742

  • 413149at7742

  • 42971at7742

  • 97645at7742

A possible solution can be found here. It also ships with the repository backup/Snakefile_with_ml or backup/Snakefile_with_ml_local.

The snakemake call could look something like this (assuming you’ve first put a file called Snakefile in place):

(snakemake) (user@host)-$ snakemake -nrp --use-singularity --jobs 4 super.treefile

Also check out snakemakes functionality to generate dags (directed acyclic graphs).

Notice that I am using a backup Snakefile backup/Snakefile_with_ml_from_dir_local below. This file has built in a python function at the beginning that will cause the workflow to consider all fasta files in the directory by_gene/raw/ as input rather than only specific ones.

(snakemake) (user@host)-$ snakemake -n --dag \
            -s backup/Snakefile_with_ml_from_dir_local | dot -Tpdf > dag.from.dir.pdf

In case you actually ran the above commands the final tree can be found in super.treefile (also in backup/super.treefile). The directected acyclid graph can be viewed in the pdf dag.from.dir.pdf (or alternatively backup/dag.with_ml_from_dir.pdf).

Well Done!!!

All that you need now is to practice .. ;-)

6.) Full automation

We have developed a pipeline for automating the entire process of phylogenomic analyses from BUSCO genes (for now). You can find it here.

The current repository is actually a snapshot of phylociraptor. In the base directory of this repository you could resume an analysis as shown below. If there is time we’ll talk about the setup a little bit.

The main things you need are:

  • config file data/config.vertebrata_minimal.yaml

  • sample file data/vertebrata_minimal.csv

A few steps were already run for you - see the file data/preparation.md, and you could resume and finish the analyses with the following commands.

#get table
./phylociraptor orthology -t serial=2 --config-file data/config.vertebrata_minimal.yaml

#filter-orthology
./phylociraptor filter-orthology -t serial=2 --config-file data/config.vertebrata_minimal.yaml --verbose

#align
./phylociraptor align -t serial=2 --config-file data/config.vertebrata_minimal.yaml --verbose

#filter align
./phylociraptor filter-align -t serial=2 --config-file data/config.vertebrata_minimal.yaml --verbose

#modeltest
./phylociraptor modeltest -t serial=2 --config-file data/config.vertebrata_minimal.yaml

#ml tree
./phylociraptor mltree -t serial=2 --config-file data/config.vertebrata_minimal.yaml --verbose

#speciestree
./phylociraptor speciestree -t serial=2 --config-file data/config.vertebrata_minimal.yaml --verbose

#figure
./phylociraptor report --config-file data/config.vertebrata_minimal.yaml
./phylociraptor report --figure --config-file data/config.vertebrata_minimal.yaml