proBAM specification

proBAM is one of the standards developed by members of the Proteomics Informatics working group of the PSI.

For general information on the activities and organization of this working group, see HERE.

Contents

  1. proBAM (version 1.0.0): Specification documents
  2. proBAM Tools and Implementations

proBAM (version 1.0.0): Specification documents

The proteomics BAM (proBAM) file format is designed for storing and analyzing peptide-spectrum matches (PSMs) within the context of the genome. proBAM is built upon the SAM format and its compressed binary version, BAM, with the modifications needed to accommodate information specific to proteomics data, such as PSM scores and confidence, charge states, and peptide-level modifications, both artefactual modifications and post-translational modifications (PTMs).
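Because proBAM builds on SAM, a record can be read as the 11 mandatory SAM columns followed by optional `TAG:TYPE:VALUE` fields, which is where proteomics-specific values would be carried. The following is a minimal sketch of that layout in Python, not a reference parser. The sample record, the tag names (`XC`, `XS`, `XP`) and their values are illustrative assumptions; the authoritative tag definitions are in the proBAM specification document.

```python
# Names of the 11 mandatory columns defined by the SAM format.
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Split one tab-separated SAM-style line into columns and typed tags."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_COLUMNS):
        raise ValueError("a SAM record needs 11 mandatory columns")

    record = dict(zip(SAM_COLUMNS, parts[:11]))
    # These mandatory columns are integers in SAM.
    for name in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[name] = int(record[name])

    # Optional fields have the form TAG:TYPE:VALUE; only the common
    # types i (integer) and f (float) are converted here, everything
    # else is kept as a string.
    tags = {}
    for field in parts[11:]:
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            value = int(value)
        elif type_code == "f":
            value = float(value)
        tags[tag] = value
    record["tags"] = tags
    return record


# Hypothetical record: tag names and values are placeholders for
# PSM-level information such as charge, score and peptide sequence.
example = "\t".join([
    "spectrum_1", "0", "chr1", "1000", "255", "24M", "*", "0", "0",
    "*", "*", "XC:i:2", "XS:f:35.2", "XP:Z:PEPTIDEK"])

rec = parse_sam_line(example)
print(rec["RNAME"], rec["POS"], rec["tags"])
```

Keeping the optional fields in a separate dictionary mirrors the SAM design: generic genomic tools can rely on the mandatory columns, while proteomics-aware tools read the additional tags.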

Direct links to deliverables:

  • proBAM specification document (docx|PDF)
  • proBAM example files:
    • First set of proBAM example and related files:
    • Second set of proBAM example and related files (large file):
      • PXD001390.pro.bam - 2nd example proBAM file, converted from the 2nd mzTab example file using the proBAMconvert tool. In this example, Ensembl protein identifiers (e.g. ENSP00000349878) were used for genome coordinate mapping.
      • PXD001390.pro.bam - 2nd example proBAM file, converted from the 2nd mzIdentML file using proBAMr. Ensembl v86 was used for annotating genomic coordinates.
      • PXD001390.mzid - 2nd example mzIdentML file.
      • PXD001390.mztab - 2nd example mzTab file, converted from the 2nd mzIdentML example file.
    • Third set of proBAM example and related files:
      • PXD000124.pro.bam - 3rd example SAM file, generated from the proBAM file converted from the 3rd mzTab example file using the proBAMconvert tool. In this example, Ensembl transcript identifiers (e.g. ENSMUST00000005017) were used for genome coordinate mapping. In addition, an extra three-reading-frame translation was performed on these transcript sequences, enabling mapping of 5' UTR translation products. The search database for this example was compiled from RIBO-seq-derived sequences in a proteogenomics approach.
      • PXD000124.mzid - 3rd example mzIdentML file.
    • Fourth set of proBAM example and related files:
      • CPTAC_CRC.pro.bam (link) - 4th example proBAM file, generated from the CPTAC_CRC dataset. RefSeq annotation (hg19) was used for genome coordinate mapping. In addition to standard peptide identification, customized databases based on RNA-Seq data were used to identify variant peptides. The location and the nucleotide-level change are also included in this proBAM file.
      • Original data link

proBAM Tools and Implementations

  • Commonly used genomic tools that can process proBAM files (not all functions may apply)
    • SAMtools: provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format. (http://www.htslib.org/doc/samtools.html)
    • BEDtools: a powerful toolset to intersect, merge, count, complement and shuffle genomic intervals from multiple files in widely used genomic file formats such as BAM, BED, GFF/GTF and VCF. (http://bedtools.readthedocs.io/en/latest)
    • IGV (Integrative Genomics Viewer): high-performance visualization tool for interactive exploration of large, integrated genomic datasets. (https://www.broadinstitute.org/igv/)
  • proBAMconvert: a Python implementation for generating proBAM files (link). A worked example is also available (example).
  • proBAMsuite: includes two R packages, proBAMr and proBAMtools, for generating and analyzing proBAM files, respectively (proBAMr, proBAMtools, PMID:26657539)
    • A computational pipeline to generate and analyze proBAM files in the R environment. (link)
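As a rough illustration of how the general-purpose genomic tools listed above might be applied to a proBAM file, the sketch below assembles three standard SAMtools invocations (`sort`, `index`, and `view -c` on a region) and runs them only if `samtools` is installed and the input file exists. The input file name and the region string are assumptions chosen for illustration; reference sequence names depend on the annotation used to build the file, and not every SAMtools function is guaranteed to be meaningful for proBAM content.

```python
import os
import shutil
import subprocess


def sorted_name(path):
    """Derive an output name such as x.pro.sorted.bam from x.pro.bam."""
    stem = path[:-4] if path.endswith(".bam") else path
    return stem + ".sorted.bam"


def samtools_commands(probam_path, region):
    """Build the argument lists for sorting, indexing and counting records."""
    out = sorted_name(probam_path)
    return [
        # Coordinate-sort the file; indexing requires sorted input.
        ["samtools", "sort", "-o", out, probam_path],
        # Create the index used for region queries.
        ["samtools", "index", out],
        # Count the records overlapping the requested region.
        ["samtools", "view", "-c", out, region],
    ]


# Assumed inputs: one of the example files and a hypothetical region.
PROBAM = "PXD001390.pro.bam"
REGION = "1:1-1000000"

if shutil.which("samtools") and os.path.exists(PROBAM):
    for cmd in samtools_commands(PROBAM, REGION):
        subprocess.run(cmd, check=True)
else:
    # Nothing is executed; just show what would be run.
    for cmd in samtools_commands(PROBAM, REGION):
        print(" ".join(cmd))
```

After sorting and indexing, the same file can typically be opened in a genome browser such as IGV alongside other genomic tracks.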