BBSplit Tutorial

Overview of BBSplit

BBSplit is a module from the BBMap suite, a collection of bioinformatics tools designed for fast and accurate DNA/RNA sequence analysis. BBSplit, in particular, is a specialized tool used for sorting reads based on their alignment to multiple reference genomes. This is especially useful in metagenomics studies where samples may contain a mixture of organisms, and researchers need to separate host DNA from microbial DNA or differentiate between multiple microbial species.

The tool is known for its efficiency in handling high-throughput sequencing data and its ability to deal with short reads, which are common in next-generation sequencing (NGS) technologies. BBSplit can be integrated into various bioinformatics pipelines, such as ATLAS and Sunbeam, for host decontamination or other purposes where specific read sorting is required.

BBSplit relies on the BBMap aligner, which uses short k-mers to align reads directly to the genome. This allows it to span introns and find novel isoforms. BBMap is one of the few aligners that explicitly claim support for both Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) long reads, making BBSplit versatile across sequencing platforms.


Installation

To install BBSplit, download the BBMap package, which includes BBSplit among other tools. The BBMap suite is available on SourceForge and can be downloaded from the official BBMap project page. Once downloaded, extract the package. The tools ship as precompiled Java classes with shell wrapper scripts such as bbsplit.sh, so a separate compile step is normally not needed. BBSplit does not require any installation steps beyond those needed for the BBMap suite.

Make sure that Java is installed on your system, as BBMap tools are Java applications. The minimum Java version depends on the BBMap release, so check the documentation that accompanies the version you download.
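
Before running anything, it can help to confirm that both prerequisites are reachable from your shell. The following is a minimal Python sketch, not part of BBMap itself. It assumes that the extracted BBMap directory has been added to PATH so that bbsplit.sh can be found by name; the function names are illustrative.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


def missing_prerequisites(tools=("java", "bbsplit.sh")):
    """List the required executables that cannot be found on PATH.

    Assumption: the BBMap directory is on PATH, so bbsplit.sh resolves by name.
    """
    return [tool for tool in tools if find_tool(tool) is None]


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("java and bbsplit.sh are both available")
```

If bbsplit.sh is reported as missing, you can either add the BBMap directory to PATH or call the script by its full path.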

Quick Start

To get started with BBSplit, you first need to have your reference genomes indexed. BBSplit uses these indexes to align reads and sort them based on the reference they match. Indexing is handled by the bbsplit.sh script included in the BBMap suite, which builds the index from the files passed to ref=.

Once the references are indexed, you can run BBSplit with a basic command structure like this: bbsplit.sh in=reads.fq ref=genome1.fa,genome2.fa basename=out_%.fq

In this command, reads.fq is the input file containing your reads, genome1.fa and genome2.fa are the reference genome files, and basename=out_%.fq is the pattern for the output files, where % will be replaced by the reference names.
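
To make the % substitution concrete, here is a small Python sketch that predicts the output file names for a given basename pattern. It is an illustration only: it assumes that the reference name is the reference file name with its directory and extension removed (for example, genome1.fa becomes genome1), and the helper names are hypothetical.

```python
from pathlib import Path


def reference_name(ref_path):
    """Derive a reference name from a file path.

    Assumption: the name substituted for % is the file name without
    its directory or extension (a trailing .gz or .bz2 is dropped first).
    """
    name = Path(ref_path).name
    for suffix in (".gz", ".bz2"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return Path(name).stem


def expected_outputs(basename, refs):
    """Map each reference name to the output file the pattern would produce."""
    if "%" not in basename:
        raise ValueError("basename must contain a % placeholder")
    return {
        reference_name(ref): basename.replace("%", reference_name(ref))
        for ref in refs
    }


if __name__ == "__main__":
    for name, path in expected_outputs("out_%.fq", ["genome1.fa", "genome2.fa"]).items():
        print(name, "->", path)
```

Under that assumption, the quick-start command would write out_genome1.fq and out_genome2.fq.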

Code Examples Of Popular Commands

Here are five popular commands that you might use with BBSplit:

  1. Basic Read Sorting: bbsplit.sh in=reads.fq ref=genome1.fa,genome2.fa basename=out_%.fq

    This command sorts reads into separate files based on which reference genome they align to.

  2. Ambiguous Read Handling: bbsplit.sh in=reads.fq ref=genome1.fa,genome2.fa basename=out_%.fq ambiguous2=all

    With ambiguous2=all, reads that align to multiple references with equal best hits are written to all matching output files.

  3. Setting Minimum Identity: bbsplit.sh in=reads.fq ref=genome1.fa,genome2.fa basename=out_%.fq minid=0.95

    The minid=0.95 parameter sets the approximate minimum alignment identity, so only reads with roughly 95% identity or more to a reference are treated as aligned.

  4. Capturing Unmatched Reads: bbsplit.sh in=reads.fq ref=genome1.fa,genome2.fa basename=out_%.fq outu=unmatched.fq

    The outu=unmatched.fq parameter specifies a file to write reads that do not align to any reference.

  5. Limiting Memory Usage: bbsplit.sh in=reads.fq ref=genome1.fa,genome2.fa basename=out_%.fq -Xmx20g

    The -Xmx20g option caps the Java heap that BBSplit runs in at 20 gigabytes of RAM, which is useful for controlling resource usage on shared systems.
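
After a run, a quick sanity check is to count how many reads ended up in each output file, including the unmatched file. The sketch below is a generic FASTQ counter, not a BBMap feature. It assumes the common four-lines-per-record FASTQ layout and would miscount files with wrapped sequence lines; the function names are illustrative.

```python
import gzip


def count_fastq_records(lines):
    """Count records in an iterable of FASTQ lines.

    Assumption: each record occupies exactly four lines
    (header, sequence, '+', qualities).
    """
    total = sum(1 for _ in lines)
    if total % 4 != 0:
        raise ValueError("line count is not a multiple of 4; file may be truncated")
    return total // 4


def count_fastq_reads(path):
    """Count reads in a plain or gzip-compressed FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_fastq_records(handle)
```

Calling count_fastq_reads on each output file gives a per-reference tally that can be compared against the number of reads in the input.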

These commands showcase the flexibility of BBSplit in handling different scenarios encountered in sequence analysis. Users can combine these options to tailor the behavior of BBSplit to their specific needs.
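
As an illustration of combining options, the following Python sketch assembles a bbsplit.sh argument list from the parameters shown in the list above. It only builds the command and does not run it. The function name and its keyword arguments are hypothetical; the flags themselves are the ones used in the examples.

```python
import shlex


def build_bbsplit_command(reads, refs, basename, ambiguous2=None,
                          minid=None, outu=None, max_heap=None):
    """Assemble a bbsplit.sh argument list from the options in this tutorial.

    Optional settings are appended only when a value is supplied.
    """
    cmd = [
        "bbsplit.sh",
        f"in={reads}",
        "ref=" + ",".join(refs),
        f"basename={basename}",
    ]
    if ambiguous2 is not None:
        cmd.append(f"ambiguous2={ambiguous2}")
    if minid is not None:
        cmd.append(f"minid={minid}")
    if outu is not None:
        cmd.append(f"outu={outu}")
    if max_heap is not None:
        # Passed through as the Java heap limit, e.g. -Xmx20g
        cmd.append(f"-Xmx{max_heap}")
    return cmd


if __name__ == "__main__":
    cmd = build_bbsplit_command(
        "reads.fq",
        ["genome1.fa", "genome2.fa"],
        "out_%.fq",
        ambiguous2="all",
        minid=0.95,
        outu="unmatched.fq",
        max_heap="20g",
    )
    # Print a shell-ready version of the command for inspection
    print(shlex.join(cmd))
```

To execute the command rather than print it, the list can be handed to subprocess.run(cmd, check=True), provided bbsplit.sh is on PATH.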

In conclusion, BBSplit is a powerful tool for sorting sequencing reads based on alignment to multiple references. Its ability to handle different types of sequencing data and integrate into various bioinformatics workflows makes it a valuable asset in the field of genomics and metagenomics. Whether you are dealing with host contamination or exploring the diversity of microbial communities, BBSplit provides a reliable way to manage and analyze your sequencing data.