STAR Tutorial

Overview of STAR Bioinformatics Tool

What is STAR?

STAR, which stands for Spliced Transcripts Alignment to a Reference, is a bioinformatics tool designed for aligning high-throughput RNA-sequencing (RNA-seq) data. RNA-seq is a technique used to study the transcriptome, which is the set of all RNA molecules, including mRNA, rRNA, tRNA, and other non-coding RNA, produced in a single cell or a population of cells.

STAR is specifically designed to handle spliced transcripts. In the cell, introns are removed from a transcript and the remaining exons are joined, a process known as splicing, so a read from a mature RNA can map to regions that lie far apart in the genome. STAR aligns such reads across the splice junctions.

Why is STAR Important?

RNA-seq data analysis is challenging because transcripts are non-contiguous with respect to the genome: a single read can span one or more introns. STAR addresses this with an algorithm designed for fast and precise alignment of RNA-seq reads to a reference genome. Its original publication reported aligning 550 million paired-end reads per hour on a 12-core server, substantially faster than the other aligners available at the time.

Moreover, STAR is not just fast; it also improves alignment sensitivity and precision. It can detect canonical junctions, non-canonical splices, and chimeric (fusion) transcripts, and it is capable of mapping full-length RNA sequences. This makes it an invaluable tool for researchers looking to analyze complex RNA-seq data.

Availability

STAR is implemented as standalone C++ code and is distributed as free, open-source software. It was originally released under the GPLv3 license; check the LICENSE file in the official repository for the terms that apply to the version you download.

Installation

Before we dive into the installation process, it's important to note that STAR is designed to run on Unix-based systems, including Linux and macOS. The installation process involves compiling the source code, which requires a C++ compiler.

Downloading STAR

To install STAR, you need to download the latest version of the source code from the official repository. You can use the following command to clone the repository using Git:

git clone https://github.com/alexdobin/STAR.git

Compiling STAR

Once you have downloaded the source code, navigate to the source directory and compile the STAR executable using the make command:

cd STAR/source
make STAR

After the compilation process is complete, you will have the STAR executable ready to use.
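
A quick way to confirm that the build worked is to ask the executable for its version. The sketch below assumes STAR is on your PATH or that you point the variable STAR_BIN (a name invented for this example) at the compiled binary, for instance STAR/source/STAR.

```shell
# Report whether a STAR executable is reachable and print its version.
# STAR_BIN is a hypothetical variable used only in this sketch.
STAR_BIN="${STAR_BIN:-STAR}"

check_star() {
    if command -v "$STAR_BIN" >/dev/null 2>&1; then
        echo "found: $(command -v "$STAR_BIN")"
        "$STAR_BIN" --version
    else
        echo "not found: $STAR_BIN (add STAR/source to PATH or set STAR_BIN)"
        return 1
    fi
}

check_star || true
```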

Quick Start

To get started with STAR, you need your RNA-seq data as FASTQ files and a reference genome to which the reads will be aligned. The reference genome must be indexed with STAR before alignment. This is done once per genome and annotation, using the following command:

STAR --runThreadN NumberOfThreads \
     --runMode genomeGenerate \
     --genomeDir /path/to/genomeDir \
     --genomeFastaFiles /path/to/genome/fasta/files \
     --sjdbGTFfile /path/to/annotations.gtf

Replace NumberOfThreads with the number of threads you wish to use, /path/to/genomeDir with the directory where you want to store the indexed genome, /path/to/genome/fasta/files with the path to your reference genome FASTA files, and /path/to/annotations.gtf with the path to the gene annotation file.
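
When annotations are supplied at this step, the STAR manual also describes the --sjdbOverhang option, whose ideal value is the read length minus 1 (the default is 100). The following is a minimal sketch of deriving that value from the first read of a FASTQ file. The tiny FASTQ written here is a made-up stand-in for real data, and the variable names are illustrative only.

```shell
# Derive a value for --sjdbOverhang (read length - 1) from a FASTQ file.
# The FASTQ content below is invented purely for illustration.
fastq="$(mktemp)"
cat > "$fastq" <<'EOF'
@read1
ACGTACGTACGTACGTACGT
+
IIIIIIIIIIIIIIIIIIII
EOF

# Line 2 of a FASTQ record holds the sequence; its length is the read length.
read_length="$(awk 'NR == 2 { print length($0); exit }' "$fastq")"
sjdb_overhang=$((read_length - 1))

echo "read length: $read_length"
echo "suggested flag: --sjdbOverhang $sjdb_overhang"
rm -f "$fastq"
```

If reads have varying lengths, the manual suggests using the maximum read length minus 1 instead of the length of the first read.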

Aligning Reads

After indexing the genome, you can align your RNA-seq reads using the following command:

STAR --runThreadN NumberOfThreads \
     --genomeDir /path/to/genomeDir \
     --readFilesIn /path/to/read1.fastq /path/to/read2.fastq \
     --outFileNamePrefix /path/to/output/prefix

This command aligns the paired-end reads in read1.fastq and read2.fastq to the genome index in /path/to/genomeDir. The value of --outFileNamePrefix is prepended to every output file name, so end it with a slash if you want the files written into a directory.
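
Among the output files, Log.final.out summarizes the mapping statistics as "label | value" lines. The sketch below shows one way to pull a single value out of such a file. The sample log is written inline with invented numbers, and star_log_value is a hypothetical helper, not part of STAR.

```shell
# Extract one statistic from a Log.final.out-style file.
# The sample content is invented; a real file has many more fields.
log="$(mktemp)"
printf '%s |\t%s\n' \
    '                          Number of input reads' '1000' \
    '                   Uniquely mapped reads number' '900' \
    '                        Uniquely mapped reads %' '90.00%' > "$log"

star_log_value() {
    # $1 = log file, $2 = exact field label
    awk -F'|' -v key="$2" '
        { label = $1; gsub(/^[ \t]+|[ \t]+$/, "", label) }
        label == key { value = $2; gsub(/^[ \t]+|[ \t]+$/, "", value); print value }
    ' "$1"
}

mapped_pct="$(star_log_value "$log" "Uniquely mapped reads %")"
echo "uniquely mapped: $mapped_pct"
rm -f "$log"
```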

Code Examples Of Popular Commands

In this section, we will look at five popular commands used with STAR to perform various tasks in RNA-seq data analysis.

1. Basic Alignment

To perform a basic alignment of paired-end reads to a reference genome, use the following command. Without --outFileNamePrefix, the output files are written to the current working directory:

STAR --genomeDir /path/to/genomeDir \
     --readFilesIn /path/to/read1.fastq /path/to/read2.fastq \
     --runThreadN NumberOfThreads
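
FASTQ files are often compressed. STAR's --readFilesCommand option names a command that writes the uncompressed reads to standard output. The sketch below picks that command from the file extension; read_files_flag is a hypothetical helper name used only here.

```shell
# Print the --readFilesCommand fragment appropriate for a read file.
# Prints nothing for uncompressed input, so the flag can simply be omitted.
read_files_flag() {
    case "$1" in
        *.gz)  echo "--readFilesCommand gunzip -c" ;;
        *.bz2) echo "--readFilesCommand bunzip2 -c" ;;
        *)     : ;;
    esac
}

read_files_flag reads_1.fastq.gz
read_files_flag reads_1.fastq.bz2
read_files_flag reads_1.fastq
```

The printed fragment would be appended to the STAR command line alongside --readFilesIn.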

2. Aligning with Gene Annotations

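If you have gene annotations available, you can use them to improve alignment accuracy around known splice junctions. Annotations are normally supplied when the genome index is generated; STAR 2.4.1a and later also accept --sjdbGTFfile at the mapping stage and insert the annotated junctions on the fly: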

STAR --genomeDir /path/to/genomeDir \
     --readFilesIn /path/to/read1.fastq /path/to/read2.fastq \
     --sjdbGTFfile /path/to/annotations.gtf \
     --runThreadN NumberOfThreads
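
Annotations only help if the chromosome names in the GTF file match those in the genome FASTA (for example "chr1" versus "1"). The following sketch lists GTF sequence names that are missing from the FASTA headers. The two small files are invented for illustration, and the sketch assumes that the sequence name is the header text up to the first whitespace.

```shell
# List sequence names used in a GTF that do not occur in the genome FASTA.
# Both input files below are made-up examples.
fasta="$(mktemp)"
gtf="$(mktemp)"
printf '>chr1 assembled\nACGT\n>chr2\nACGT\n' > "$fasta"
printf 'chr1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g1";\n' > "$gtf"
printf '1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id "g2";\n' >> "$gtf"

missing="$(awk '
    # First file (FASTA): remember each header name without the leading ">".
    FNR == NR { if ($0 ~ /^>/) seen[substr($1, 2)] = 1; next }
    # Second file (GTF): skip comments, report each unknown name once.
    /^#/ { next }
    !($1 in seen) && !($1 in reported) { reported[$1] = 1; print $1 }
' "$fasta" "$gtf")"

echo "GTF sequence names missing from FASTA: ${missing:-none}"
rm -f "$fasta" "$gtf"
```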

3. Outputting Aligned Reads in BAM Format

To write the aligned reads in BAM format, the compressed binary counterpart of SAM, add the --outSAMtype option. SortedByCoordinate sorts the output by genomic position, while Unsorted skips the sorting step:

STAR --genomeDir /path/to/genomeDir \
     --readFilesIn /path/to/read1.fastq /path/to/read2.fastq \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN NumberOfThreads
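
The name of the alignment file depends on the --outSAMtype value, which matters when a downstream script has to locate it. The sketch below maps each output type to the file name STAR writes; star_alignment_file is a hypothetical helper, not a STAR command.

```shell
# Print the alignment file name STAR produces for a prefix and output type.
star_alignment_file() {
    # $1 = --outFileNamePrefix value, $2 = SAM | Unsorted | SortedByCoordinate
    case "$2" in
        SAM)                echo "${1}Aligned.out.sam" ;;
        Unsorted)           echo "${1}Aligned.out.bam" ;;
        SortedByCoordinate) echo "${1}Aligned.sortedByCoord.out.bam" ;;
        *) echo "unknown output type: $2" >&2; return 1 ;;
    esac
}

star_alignment_file "sample1_" SortedByCoordinate
star_alignment_file "results/" Unsorted
```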

4. Using Two-Pass Mode

Two-pass mode makes the alignment more sensitive to novel splice junctions: junctions discovered in the first pass are supplied to the second pass, so that more reads can be aligned across them:

# First pass
STAR --genomeDir /path/to/genomeDir \
     --readFilesIn /path/to/read1.fastq /path/to/read2.fastq \
     --runThreadN NumberOfThreads \
     --outFileNamePrefix /path/to/firstPass/

# Second pass
STAR --genomeDir /path/to/genomeDir \
     --readFilesIn /path/to/read1.fastq /path/to/read2.fastq \
     --sjdbFileChrStartEnd /path/to/firstPass/SJ.out.tab \
     --runThreadN NumberOfThreads \
     --outFileNamePrefix /path/to/secondPass/
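
For a single sample, STAR 2.4.1a and later can also run both passes in one command with --twopassMode Basic. When running the passes manually, some users filter SJ.out.tab before the second pass to drop weakly supported novel junctions. The sketch below illustrates such a filter on invented rows. In SJ.out.tab, column 6 flags annotated junctions and column 7 counts uniquely mapping reads; the threshold of 3 is an arbitrary choice for this example, not a STAR recommendation.

```shell
# Keep junctions that are annotated or supported by enough unique reads.
# The three rows are invented; columns follow the SJ.out.tab layout:
# chr, start, end, strand, motif, annotated, unique, multi, max overhang.
sj="$(mktemp)"
filtered="$(mktemp)"
printf 'chr1\t100\t200\t1\t1\t1\t0\t0\t20\n' > "$sj"   # annotated: kept
printf 'chr1\t300\t400\t1\t1\t0\t5\t0\t30\n' >> "$sj"  # novel, 5 reads: kept
printf 'chr1\t500\t600\t2\t0\t0\t1\t2\t12\n' >> "$sj"  # novel, 1 read: dropped

awk -F'\t' -v min_unique=3 '$6 == 1 || $7 >= min_unique' "$sj" > "$filtered"

kept="$(wc -l < "$filtered" | tr -d ' ')"
echo "junctions kept: $kept"
rm -f "$sj" "$filtered"
```

The filtered file would then be passed to --sjdbFileChrStartEnd in place of the raw SJ.out.tab.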

5. Filtering Alignments by Quality

To filter alignments by alignment score and by the number of matched bases, use the --outFilterScoreMin and --outFilterMatchNmin options:

STAR --genomeDir /path/to/genomeDir \
     --readFilesIn /path/to/read1.fastq /path/to/read2.fastq \
     --outFilterScoreMin 30 \
     --outFilterMatchNmin 30 \
     --runThreadN NumberOfThreads

This command outputs only alignments with an alignment score of at least 30 and at least 30 matched bases. These thresholds are distinct from the mapping quality (MAPQ) value reported in the SAM output.
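
Mapping quality (MAPQ) is a separate field, stored in column 5 of each SAM record. With default settings STAR assigns 255 to uniquely mapped reads and low values (3, 1, or 0) to multi-mapped reads. The following sketch keeps only unique alignments from a SAM file after alignment; the SAM records are invented for illustration.

```shell
# Keep header lines and alignments whose MAPQ (column 5) is 255.
# The SAM content below is a made-up example.
sam="$(mktemp)"
printf '@SQ\tSN:chr1\tLN:1000\n' > "$sam"
printf 'r1\t0\tchr1\t10\t255\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"
printf 'r2\t0\tchr1\t50\t3\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"
printf 'r2\t256\tchr1\t90\t3\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$sam"

unique="$(awk -F'\t' '/^@/ { print; next } $5 == 255' "$sam")"

# Count the alignment records (non-header lines) that survived the filter.
unique_reads="$(printf '%s\n' "$unique" | grep -vc '^@')"
echo "unique alignments: $unique_reads"
rm -f "$sam"
```

For BAM output, the same selection is commonly done with samtools view and its -q option, assuming samtools is installed.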

In conclusion, STAR is a versatile and efficient tool for RNA-seq data analysis, offering fast and accurate alignment of RNA sequences. With its ability to handle large datasets and complex splicing patterns, STAR is an essential tool for researchers in the field of genomics and transcriptomics.