Large-scale Machine Learning for Metagenomics Sequence Classification

Kevin Vervier, Pierre Mahé, Maud Tournoud, Jean-Baptiste Veyrieras and Jean-Philippe Vert

Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Due to the large volume of metagenomics datasets, binning requires fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art performance, compositional approaches that assign a taxonomic class to a DNA read based on the k-mers it contains have the potential to provide faster solutions.

In this work, we investigate the potential of modern, large-scale machine learning implementations for taxonomic assignment of next-generation sequencing reads based on their k-mer profiles. We show that machine learning-based compositional approaches benefit from increasing the number of fragments sampled from the reference genomes to tune their parameters, up to a coverage of about 10, and from increasing the k-mer size to about 12. Tuning these models involves training a machine learning model on about 10^8 samples in 10^7 dimensions, which is out of reach of standard software but can be done efficiently with modern implementations for large-scale machine learning. The resulting models are competitive in terms of accuracy with well-established alignment tools for problems involving a small to moderate number of candidate species, and for reasonable amounts of sequencing errors. We show, however, that compositional approaches are still limited in their ability to deal with problems involving a greater number of species, and are more sensitive to sequencing errors. We finally confirm that compositional approaches achieve faster prediction times, with a gain of 3 to 15 times with respect to the BWA-MEM short-read mapper, depending on the number of candidate species and the level of sequencing noise.

Data and code used in this project

The following file contains the data used in the paper (the small and large databases), as well as the programs needed to reproduce its results.

This archive is structured as follows:

We now detail the contents of the three directories present in the archive.

data directory

src directory

Each of these directories contains three sub-directories (input, output and src), which allow reproducing the results presented in our paper (see the Tutorial section below):

tools directory

This directory contains the source code of two C programs involved in the Vowpal Wabbit (VW) based read classification pipeline:

Third-party softwares

To run the experiments, you additionally need to install the following third-party software:

Optionally, you may also want to install the following software packages, which we used in our paper for comparison with existing methods (you do not need them if you just want to run our method):

Our implementation uses the following libraries, which you do not need to install since they are provided with our software for convenience:

Tutorial: reproducing the results of the paper

After downloading the project archive, first untar it:

$ tar zxvf large-scale-metagenomics-1.0.tar.gz

Then, run the BASH script INSTALL.sh found in the tools directory. This script installs the GDL library and creates the binary executables:

$ cd large-scale-metagenomics-1.0/tools
$ sh INSTALL.sh

To check that everything went well during the installation, please run test.sh in the tools/test directory. It uses the installed tools to simulate a small dataset:

$ cd test/
$ sh test.sh
$ ls output/
    frags.fasta    frags.gi2taxid    frags.taxid    frags.vw    vw-dico.txt
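For reference, each line of the generated frags.vw file describes one simulated fragment in Vowpal Wabbit input format: a numerical class label followed by the fragment's k-mer profile (vw-dico.txt presumably stores the correspondence between these labels and taxonomic identifiers). The line below is a purely hypothetical illustration using k=2 for readability; the actual feature encoding and k-mer size are handled by the conversion tools built above:

    3 | AA:5 AC:2 CA:3 GT:1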

The following instructions must be executed in the given order. For each step, we also point out the tunable parameters.

  1. $ cd src/1-generate-test-datasets/src
  2. $ sh 01.generate-dataset-fragments.sh
    Generates the fragments test dataset based on the following parameters:
  3. $ sh 02.generate-dataset-reads-homo.sh
    Generates the homopolymer test datasets based on the following parameters:
  4. $ sh 03.generate-dataset-reads-mutation.sh
    Generates the mutations test datasets based on the following parameters:
  5. $ cd ../../2-build-models/src
  6. $ sh 01.main.sh
    Uses VW for an iterative learning (WARNING: can take time!) based on the following parameters:

  7. WARNING: this process takes some time, generates large files (~12 and 25 GB for the small and large databases, respectively) and has a comparable memory footprint.
  8. $ cd ../../3-make-predictions/src
  9. $ sh 01.make-predictions.sh
    Uses VW to make predictions on the validation sets, based on the following parameters (see the sketch after this list):
  10. $ Rscript 02.generate-graphs.R
    Uses the previous results to plot the performance indicator considered (median species-level accuracy), based on the following parameters:
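To give an idea of what steps 6 and 9 do under the hood, here is a minimal, purely illustrative sketch of the kind of Vowpal Wabbit invocations involved; the actual commands, file names and parameter values are set inside 01.main.sh and 01.make-predictions.sh:

$ # train a one-against-all linear model on the k-mer profiles (hypothetical file names and values)
$ vw -d train.vw --oaa <number-of-species> -b 31 --passes 3 -c -f model.vw
$ # apply the trained model to a validation set
$ vw -t -i model.vw -d test.vw -p predictions.txt

Here --oaa sets up one class per candidate species, -b controls the size of the hashed feature space, and -f / -i save and load the learned model.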

Classifying your own metagenomics sequences

You can train your own classifier by using the same scripts but changing some of the parameters.

In particular, you should be able to easily adapt the script 2-build-models/src/01.main.sh.

Please remember that some parameters are mandatory:

You may also want to directly use the already-built models to make predictions for your own sequencing data.

To do this, edit the scripts in 3-make-predictions/src/ so that they point to your FASTA file.
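As a rough, hypothetical sketch of the two steps involved (the tool and file names below are placeholders; the real ones are defined in the 3-make-predictions/src/ scripts), your reads first need to be converted into VW-format k-mer profiles, and can then be classified with a pre-built model:

$ # convert your reads into VW-format k-mer profiles (placeholder tool name)
$ ./fasta2vw my-reads.fasta > my-reads.vw
$ # classify them with an already-built model
$ vw -t -i model.vw -d my-reads.vw -p my-reads.predictions

The numerical labels in the prediction file then need to be mapped back to taxonomic identifiers, presumably using the dictionary file produced at training time.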

References

K. Vervier, P. Mahé, M. Tournoud, J.-B. Veyrieras, and J.-P. Vert. Large-scale machine learning for metagenomics sequence classification, Bioinformatics, 32(7):1023-1032, 2016.

Contacts

If you have any questions or suggestions, please email pierre dot mahe at biomerieux dot fr and/or jean-philippe dot vert at mines dot org