Picard Tutorial

Overview of Picard

Picard is a set of command-line tools for working with high-throughput sequencing (HTS) data. Written in Java, Picard reads and writes formats such as SAM (Sequence Alignment/Map), BAM (Binary Alignment/Map, the compressed binary form of SAM), CRAM (a reference-based compressed alternative to BAM), and VCF (Variant Call Format). These formats are the standard containers for aligned reads and variant calls, and Picard provides tools to create, inspect, and transform them.

The toolkit covers tasks such as quality control, sorting, marking duplicates, and indexing files. It is most often used to prepare aligned reads for downstream analysis such as variant discovery.

Picard is maintained by the Broad Institute and released as open source under the MIT license, so it is free for both academic and commercial use. It is widely used in sequencing analysis pipelines.

Installation

To install Picard, download the executable jar file, picard.jar, from the releases page of the Picard GitHub repository. Once downloaded, place the jar file in a convenient directory on your machine or server.

Because Picard is distributed as a jar file rather than a native executable, putting it on your system's PATH does not make it runnable the way a compiled program such as Samtools is. Instead, define an environment variable that holds the jar's location, or create a shell alias, function, or wrapper script that runs it through Java.
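One way to set up such a shortcut is sketched below. The path $HOME/tools/picard.jar and the function name picard are assumptions chosen for illustration; adjust them to match where you saved the jar. Adding these lines to a shell startup file such as ~/.bashrc would make the shortcut available in every session.

```shell
# Assumed location of the downloaded jar; change this to your own path.
export PICARD_JAR="$HOME/tools/picard.jar"

# Wrapper function (hypothetical name): forwards all arguments to the jar,
# so `picard SortSam I=in.bam ...` runs `java -jar "$PICARD_JAR" SortSam I=in.bam ...`.
picard() {
    java -jar "$PICARD_JAR" "$@"
}
```

With this in place, a call such as picard SortSam I=input.bam O=sorted.bam SORT_ORDER=coordinate is equivalent to the full java -jar form.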

Quick Start

To run Picard, you need a Java runtime installed on your system; check the release notes for the Java version your Picard release requires. With Java in place and the picard.jar file downloaded, you can execute Picard commands using the following syntax:

java -jar picard.jar <PicardCommand> [OPTIONS]

Replace <PicardCommand> with the specific Picard tool you wish to use, and [OPTIONS] with the appropriate options for that tool.
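Before running a tool, it can help to confirm that both prerequisites are in place. The check below is a minimal sketch; the function name picard_preflight is hypothetical, and the default jar path assumes picard.jar sits in the current directory.

```shell
# Report whether Java is on the PATH and whether the jar file exists.
# $1 = path to picard.jar (defaults to ./picard.jar, an assumption).
# Returns 0 if both checks pass, 1 otherwise.
picard_preflight() {
    jar="${1:-picard.jar}"
    failed=0
    if ! command -v java >/dev/null 2>&1; then
        echo "java not found on PATH" >&2
        failed=1
    fi
    if [ ! -f "$jar" ]; then
        echo "jar not found: $jar" >&2
        failed=1
    fi
    return "$failed"
}
```

Calling picard_preflight /path/to/picard.jar prints a message for each missing prerequisite and returns a non-zero status if any check fails.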

Code Examples Of Popular Commands

The five commands below cover common tasks. Each example passes options in KEY=value form:

1. SortSam

Sorts a SAM or BAM file by genomic coordinate (SORT_ORDER=coordinate) or by query name, QNAME (SORT_ORDER=queryname).

java -jar picard.jar SortSam \
      I=input.bam \
      O=sorted.bam \
      SORT_ORDER=coordinate

2. MarkDuplicates

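Identifies duplicate reads in a SAM or BAM file, flags them in the output file, and writes summary statistics to the metrics file given by M. By default, duplicates are marked rather than removed.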

java -jar picard.jar MarkDuplicates \
      I=sorted.bam \
      O=marked_duplicates.bam \
      M=marked_dup_metrics.txt
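The metrics file is tab-separated text: comment lines starting with #, then a header row of column names, then the data rows. As a rough sketch, a single value can be pulled out by column name with awk. The function name picard_metric is hypothetical, and the sketch reads only the first data row, which is an assumption that fits a single-library run.

```shell
# Print one named column from the first data row of a Picard metrics file.
# $1 = metrics file, $2 = column name (for example PERCENT_DUPLICATION).
# Exits with status 1 if the column is not in the header row.
picard_metric() {
    awk -F '\t' -v col="$2" '
        /^#/ || NF == 0 { next }      # skip comment lines and blank lines
        !idx {                        # first remaining line is the header row
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) exit 1
            next
        }
        { print $idx; exit }          # first data row: print the value and stop
    ' "$1"
}
```

For example, picard_metric marked_dup_metrics.txt PERCENT_DUPLICATION would print the duplication fraction reported for the first library in the file.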

3. CreateSequenceDictionary

Creates a sequence dictionary (.dict file) from a reference FASTA file. Many downstream tools require this dictionary.

java -jar picard.jar CreateSequenceDictionary \
      R=reference.fasta \
      O=reference.dict
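Tools that consume the dictionary, GATK among them, generally expect it to sit next to the FASTA file with the same base name, as in the reference.fasta and reference.dict pair above. A small sketch for deriving that name in a script follows; the function name dict_path is hypothetical, and it strips only the final extension, so a compressed name such as ref.fa.gz would need separate handling.

```shell
# Derive the dictionary path from a FASTA path by replacing the last
# extension with .dict (reference.fasta -> reference.dict).
dict_path() {
    printf '%s.dict\n' "${1%.*}"
}
```

The result can be passed as the O= value, for example O="$(dict_path reference.fasta)".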

4. CollectAlignmentSummaryMetrics

Collects metrics summarizing the quality of alignments in a SAM or BAM file.

java -jar picard.jar CollectAlignmentSummaryMetrics \
      R=reference.fasta \
      I=input.bam \
      O=alignment_metrics.txt

5. AddOrReplaceReadGroups

Adds or replaces read group information for all reads in a SAM or BAM file.

java -jar picard.jar AddOrReplaceReadGroups \
      I=input.bam \
      O=rg_added.bam \
      RGID=4 \
      RGLB=lib1 \
      RGPL=illumina \
      RGPU=unit1 \
      RGSM=20
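Each RG option corresponds to a tag on the @RG line of the SAM header: RGID to ID, RGLB to LB, RGPL to PL, RGPU to PU, and RGSM to SM. The sketch below prints the header line these options describe, purely to illustrate the mapping. The function name rg_header is hypothetical, and the order of tags in a file Picard actually writes may differ.

```shell
# Print an @RG header line from read group values.
# Arguments: id, library, platform, platform unit, sample.
rg_header() {
    printf '@RG\tID:%s\tLB:%s\tPL:%s\tPU:%s\tSM:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Values taken from the AddOrReplaceReadGroups example above.
rg_header 4 lib1 illumina unit1 20
```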

These five commands cover only a small part of Picard. The toolkit includes many other tools for tasks such as converting between file formats, validating files, and extracting specific data from files.
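The SortSam and MarkDuplicates examples above are often run back to back, with the sorted file feeding the duplicate-marking step. The script below is a minimal sketch of that chaining. The helper names run and sort_and_mark, the output file naming, and the DRY_RUN switch are all assumptions made for illustration; only the Picard commands themselves come from the examples above. By default it prints the commands instead of executing them.

```shell
PICARD_JAR="${PICARD_JAR:-picard.jar}"   # assumed jar location
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Coordinate-sort a BAM file, then mark duplicates in the sorted file.
# $1 = input BAM, $2 = prefix for the output files.
# Stops before MarkDuplicates if SortSam fails.
sort_and_mark() {
    run java -jar "$PICARD_JAR" SortSam \
        I="$1" O="$2.sorted.bam" SORT_ORDER=coordinate || return 1
    run java -jar "$PICARD_JAR" MarkDuplicates \
        I="$2.sorted.bam" O="$2.marked.bam" M="$2.dup_metrics.txt"
}
```

Running sort_and_mark input.bam sample1 in dry-run mode prints the two commands so they can be reviewed; setting DRY_RUN=0 beforehand executes them.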

Whether you are preparing data for analysis, running quality control, or converting between file formats, Picard covers most routine manipulation of high-throughput sequencing data.