Proteomics Informatics

Proteomics Informatics Standards Working Group Charter

Submitted: 2017-06-13

Template Rev2016b

 

HUPO PSI Proteomics Informatics Standards Working Group Charter


 

1. Administrative Section

Status (New/Update): Update

Group Name:

A group name should be reasonably descriptive or identifiable.  Additionally, the group must define an acronym (maximum of 8 printable ASCII characters) to reference the group in the PSI directories, mailing lists, and general documents.  The name and acronym must not conflict with any other PSI name and acronym.

HUPO PSI Proteomics Informatics Standards Working Group (PSI-PI W

 

Chair (with affiliation and current email address):

Juan Antonio Vizcaíno – EMBL-European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridge, UK (juan@ebi.ac.uk)

Co-Chairs (1 or 2) (with affiliation and current email address):

 

Andy Jones, University of Liverpool, UK (andrew.jones@liverpool.ac.uk)

Martin Eisenacher – Medizinisches Proteom Center, Ruhr-Universität Bochum, Germany (martin.eisenacher@rub.de)

 

Secretary:

Vacant

 

Other officers (optional) (with affiliation and current email address):

Editor(s): Gerhard Mayer - Medizinisches Proteom Center, Ruhr-Universität Bochum, Germany (gerhard.mayer@rub.d

Minimal Reporting Requirements Coordinator(s): Pierre-Alain Binz, CHUV – Centre Hospitalier Universitaire Vaudois, Lausanne Switzerland (pierre-alain.binz@chuv.ch)

 Ontology Coordinator(s): Gerhard Mayer - Medizinisches Proteom Center, Ruhr-Universität Bochum, Germany (gerhard.mayer@rub.de)

Web site Maintainer(s): Da Qi, University of Liverpool, UK (ddq@liverpool.ac.uk)

 

Mailing list:

psi-pi-dev@lists.sourceforge.net

 

2. Description and objectives

Focus and Purpose

 

The PSI-PI working group is composed of academic, government, and industry researchers, software developers, journal representatives, and instrument manufacturers. The main goal of the PSI-PI working group is to define community data formats and associated controlled vocabulary (CV) terms, facilitating data exchange and archiving of the downstream results of proteomics analysis by mass spectrometry, including the identification and quantification of peptides and proteins by software, and the output of integrative analysis of proteomics data with other omics technologies (e.g. proteogenomics analysis).

 Current projects of the PSI-PI working group are:

  • Ongoing maintenance and enhancement of the mzIdentML, mzQuantML and mzTab data formats for proteomics workflows.
    • Development of software and adaptation of existing software to support the new mzIdentML version 1.2 (formally released in mid-2017).
    • Extension of mzTab to support the encoding of results coming from Data Independent Acquisition approaches (e.g. SWATH-MS and MSe).
    • Maintenance and further development of the PSI-MS CV, jointly with the PSI-MS WG (Mass Spectrometry Working Group), and of the new XLMOD CV, which covers cross-linker reagents.
    • Completion of formats for proteogenomics (proBed and proBAM). proBed has been formalized already (version 1.0), whereas proBAM is currently undergoing the first round of review of the PSI Document Process. A joint publication including both formats has just been submitted.
    • Work together with the metabolomics standards group to extend mzTab to improve the support for mass spectrometry metabolomics approaches (version 1.1).
    • Work together with the PSI-MS WG to define a set of common metadata and CV for spectral libraries and spectral archives.

 

Goals/Milestones

 

Goal 1: Adapt and further develop existing open source software to support the recently finalised mzIdentML 1.2, promoting its adoption.

Goal 2: Release and publish in a scientific journal the two complementary formats designed for proteogenomics approaches, called proBed (version 1.0 already formalized) and proBAM (still under review in the PSI Document Process).

Goal 3: Standards organizations in metabolomics (MSI; COSMOS) are tasked with developing similar standards to those in use in PSI, particularly for metabolite identification and quantification from mass spectrometry. PSI-PI WG is assisting in the development and adaptation of mzTab for metabolomics.

Goal 4: Extend mzTab to support Data Independent Acquisition approaches (e.g. SWATH-MS and MSe).

Goal 5: Further development and maintenance of the recently created XLMOD CV for cross-linker reagents.

Goal 6: Coordinate efforts in the maintenance and further development of open source software that supports the data standards developed by the WG.

Goal 7: Align the corresponding MIAPE documents with the updates to mzIdentML, mzQuantML, mzTab and the covered workflows, and make them compatible with external guidelines (for instance the Human Proteome Project guidelines).

Goal 8: Work together with the PSI-Mass Spectrometry Group to define a set of common metadata and controlled vocabulary for spectral libraries and archives.

 

 

Website: http://www.psidev.info/proteomics-informatics (and links therein)

Collaborative software development environment: http://github.com/HUPO-PSI


 


proBed Specification 1.0.0

proBed is one of the data standards developed by members of the Proteomics Informatics working group of the PSI.

For general information on the activities and the organization of this working group, see HERE.

The original BED format (Browser Extensible Data, https://genome.ucsc.edu/FAQ/FAQformat.html#format1), developed by the UCSC (University of California, Santa Cruz) team, is used to describe genome coordinate data across lines, for use on annotation tracks. In BED, data lines are defined as tab-separated plain text with up to 12 fields (columns). Of those, only the first three fields are required, and the other 9 are optional.

The proBed format builds upon this original structure by extending the 12 original BED fields to include a further 13 fields to describe information primarily on peptide-spectrum matches (PSMs). The format can also accommodate peptides (as groups of PSMs).
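The column layout described above can be illustrated with a minimal Python sketch. The 12 BED column names follow the UCSC BED documentation; the 13 additional proBed columns are kept as raw strings because their names are not listed here. `BedRecord` and `parse_line` are hypothetical names chosen for this example, and the sample values are placeholders rather than real data.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The 12 standard BED columns, in order (names as in the UCSC documentation).
BED_FIELDS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes",
    "blockStarts",
]
N_PROBED_EXTRA = 13  # proBed appends 13 further columns after the BED ones


@dataclass
class BedRecord:  # hypothetical container, not part of any proBed library
    chrom: str
    chrom_start: int  # 0-based start coordinate
    chrom_end: int    # end coordinate, exclusive
    optional: Dict[str, str] = field(default_factory=dict)  # BED columns 4-12
    probed_extra: List[str] = field(default_factory=list)   # columns 13-25, raw


def parse_line(line: str) -> BedRecord:
    """Parse one tab-separated BED or proBed data line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 3:
        raise ValueError("BED requires chrom, chromStart and chromEnd")
    # Pair whatever optional BED columns are present with their names.
    optional = dict(zip(BED_FIELDS[3:], cols[3:12]))
    extra = cols[12:]
    if extra and len(extra) != N_PROBED_EXTRA:
        raise ValueError("expected 13 proBed columns after the 12 BED columns")
    return BedRecord(cols[0], int(cols[1]), int(cols[2]), optional, extra)


# A plain BED line with only the three required columns.
minimal = parse_line("chr1\t100\t200")

# A 25-column line; every value here is a placeholder.
bed_part = ["chr1", "100", "200", "pep1", "900", "+", "100", "200",
            "0", "1", "100", "0"]
probed_part = [f"x{i}" for i in range(N_PROBED_EXTRA)]
full = parse_line("\t".join(bed_part + probed_part))
print(full.optional["strand"], len(full.probed_extra))
```

Keeping the proBed columns as an untyped list is a deliberate simplification; a real reader would map them to the named fields defined in the specification document.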

Contents

  1. proBed 1.0.0 (Final Version): Specification document and example files
  2. proBed Tools and Implementations

proBed 1.0.0 (Final Version): Specification document and example files

The proBed file format is designed for storing and analyzing peptide spectrum matches (PSMs) within the context of the genome.



proBed Tools and Implementations

proBed example viewed in the Ensembl Genome Browser

 


mzTab Specification 1.0.0

mzTab is one of the standards developed by members of the Proteomics Informatics working group of the PSI.

For general information on the activities and the organization of this working group, see HERE.

Contents

  1. mzTab 1.0.0 (Final Version): Specification documents
  2. mzTab Tools and Implementations

mzTab 1.0.0 (Final Version): Specification documents

Originally submitted to the PSI document process in May 2012. Final version 1.0.0 accepted in June 2014.

More documentation is available in the mzTab GitHub project at https://github.com/HUPO-PSI/mzTab



mzTab Tools and Implementations

  • jmzTab: A Java API to read, write and merge mzTab files
  • LipidDataAnalyzer: Tool to quantify lipids from LC-MS data
  • OpenMS: Open-source C++ software library for LC/MS data management and analyses
  • MSnBase: Bioconductor package. Basic plotting, data manipulation and processing of MS-based Proteomics data
  • PRIDE Converter 2: A redesign of the PRIDE Converter tool, for performing data submissions to the PRIDE database
  • MaxQuant: A quantitative proteomics software package designed for analyzing large mass-spectrometric data sets
  • PIA: A toolbox for MS-based protein inference and identification analysis, PMID:25938255


mzIdentML Development Timeline

1. Spring 2006, Meeting in San Francisco, USA – start of a UML model for AnalysisXML (universal standard for all types of proteome informatics)

Orchard et al. Proteomics 2006, 6, 4439–4443:

“PSI-Proteomics Informatics (PSI-PI) working group now has responsibility for the production of the mass spectrometry informatics standards, such as analysisXML, which will cover, among other things, protein identification reporting. The remit of the groups is to produce a UML data model with an XML implementation, example instance documentation, a validation tool, and an accompanying ontology. The use cases were reviewed and expanded upon and the existing version analysisXML reviewed in the light of these use cases. Migration to a UML model should be achieved in time for ASMS in order to generate an XML schema for public viewing.”

2. Fall 2006, Meeting in Washington DC, USA – continued work on the AnalysisXML schema

Orchard et al. Proteomics 2007, 7, 337–339:

“Work also continued on AnalysisXML, with the revision of a file containing a list of information available in output files from a majority of currently available search engines. A number of common elements have been mapped to the current model and have been associated to appropriate CV terms.”

3. Spring 2007, Meeting in Lyon, France – continued work on the AnalysisXML schema

Orchard et al.  Proteomics 2007, 7, 3436–3440:

“A draft analysisXML schema and example instance documents were produced at the PSI Autumn 2005 workshop in Washington [5]. In the last few months, feedback has been received from all the major search engine vendors on the parameter spreadsheet and a draft CV prepared as an .OBO file. The aims of the meeting were to further develop the schema, review the instance document and improve the general documentation. By the end of the workshop, the schema had been tested against all of the MIAPE-MSI requirements, with the exception of the requirements for quantification for which a structure has been discussed. SILAC and iTRAQ features have been added as a feature set and these sets can be combined to give a ratio. Instance documents were reviewed and modified with new use cases such as top-down, mixed MS and MS/MS, de Novo sequencing and error tolerant tag searches discussed. Protein inference analysis has been more clearly split from peptide identification. Finally, a decision was made to put the terms required by two or more search engines directly into the schema as attributes/elements rather than described in a CV.”

4. Spring 2008, Meeting in Toledo, Spain – agreed to switch to direct XSD development to speed completion

Orchard et al.  Proteomics 2008, 8, 4168–4172:

“The development of analysisXML has proven far from straightforward, partly because the scope of the project has changed often in a fast moving field. The main aim of this meeting was to readdress the goals of the project and produce a timeline for completing the first release. Fundamental questions such as whether it is practical to try to write a schema that can cover all scenarios, including quantitation support, in the first implementation were considered.

analysisXML has been developed as an extension to FuGE by creating a schema from UML. It was agreed to continue by developing the XML schema directly and extending a cut-down version of the FuGE xsd. Rather than use the FuGE format for the controlled vocabulary, it was agreed to use the same format as for mzML version 1.0. It was also agreed that quantitation will not be addressed until version 2.0. However, a scheme was developed that will ensure that version 1.0 documents will be backwards compatible with the 2.0 schema. Development of quantitation support will be carried out in parallel to the version 1.0 release.”

 

5. December 2008, submission of AnalysisXML to the PSI document process

 

6. Spring 2009, Meeting in Turku, Finland  – AnalysisXML split into identification format (mzIdentML) and quantitation (mzQuantML), minor changes to the schema

Orchard et al.  Proteomics 2009, 9, 4429–4432

“The scope of the current format is limited to protein identification and the format previously known as AnalysisXML has been renamed mzIdentML to reflect this. The resources now include semantic validation tools, specification document, tables of conformance to both the MIAPE and MCP guidelines and 12 example instance documents. A manuscript has been prepared and the format was submitted to the PSI document review process in December 2008. The feedback from this process has resulted in a minor set of changes to the schema, documentation and examples.”

...

“It is now planned to develop a separate schema, mzQuantML, with a structure broadly similar to mzIdentML to add the ability to handle quantitation data.”

 

7. August 2009 – completion of the PSI document process and formal release of version 1 of mzIdentML

 

8. 2010-2011 – Some minor issues were identified with verbosity in the files and with redundant information being captured, along with a few minor bugs. A decision was taken to fix all bugs in one go and release a new 1.1 version

 

9. August 2011 - Version 1.1 released from the PSI document process, and considered to be the stable development release of the format.

 

 


mzQuantML

 

Formal version 1.0 release (Specification 1.0.1)


find Example Instance Documents HERE

 

Background

The mzQuantML standard format is intended to store the systematic description of workflows quantifying molecules (principally peptides and proteins) by mass spectrometry. A large number of different software packages are available that produce output in a variety of different formats. It is intended that mzQuantML will provide a common format for the export of quantification results from any software package. The format was originally developed under the name AnalysisXML as a format for several types of computational analyses performed over mass spectra in the proteomics context. Development was then split into two formats: mzIdentML for peptide and protein identification, and mzQuantML (described here), covering quantitative proteomic data derived from MS.

The development of mzQuantML is driven by some general principles, specific use cases and the goal of supporting specific techniques, as listed below. These were discussed and agreed at the development meeting in Tübingen in July 2011.

 

General principles, the format SHOULD support: 

  • Journal requirement for the reporting of quantitative proteomic data from mass spectrometry.
  • Reporting according to MIAPE-MSI (and the emerging MIAPE-Quant document).
  • Submission of quantitative data to public databases.
  • Data exchange between software tools, where data are defined as values about features (defined here as regions on MS1 mass spectra that report on a single peptide or small molecule), feature matches across different spectra or within spectra, peptides, proteins and protein groups.
  • Import of data into statistical processing tools.
  • The ability to reprocess or recreate the analysis workflow using the same parameters, assuming no manual steps have taken place.

 

Use cases, the format SHOULD capture:

  • Final abundance values (relative or absolute) for peptides, proteins and protein groups where protein inference cannot be performed in an unambiguous manner.
  • Quantification values about peptide/protein modifications, such as post-translational modifications.
  • Abundance values at the level of a single run (called an assay in this context) and logical groupings of runs (called study variables in this context), which the user, for example, wishes to report relative values for.
  • The evidence trail for how final abundance values were calculated, such as the features used for quantifying peptides and proteins.
  • Relationships between features either on different regions of the same spectrum or on different spectra that report on the same peptide or small molecule. These are particularly required for relative quantification approaches.
  • Details about pre-fractionation sufficient to describe the combination of multiple input data files (e.g. raw files) into a single assay where this has been performed.
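The relationships between features, assays and study variables listed above can be sketched as a small in-memory model. This is a minimal Python illustration, not the mzQuantML schema: the class names, attribute names and sample values are hypothetical, and summing feature intensities per assay and averaging across assays are arbitrary choices made for the example rather than rules taken from the format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class Feature:  # hypothetical: a region on MS1 spectra reporting on one peptide
    feature_id: str
    assay_id: str   # the single run (assay) in which the feature was observed
    intensity: float


@dataclass
class PeptideQuant:  # hypothetical: a peptide plus its evidence trail
    sequence: str
    features: List[Feature]

    def abundance_by_assay(self) -> Dict[str, float]:
        """Sum feature intensities per assay (illustrative rule only)."""
        totals: Dict[str, float] = {}
        for f in self.features:
            totals[f.assay_id] = totals.get(f.assay_id, 0.0) + f.intensity
        return totals


def abundance_by_study_variable(
    peptide: PeptideQuant, study_variables: Dict[str, List[str]]
) -> Dict[str, float]:
    """Average assay-level abundances within each logical group of runs.

    Assumes every study variable has at least one assay with data.
    """
    per_assay = peptide.abundance_by_assay()
    return {
        name: mean(per_assay[a] for a in assays if a in per_assay)
        for name, assays in study_variables.items()
    }


# Placeholder data: two features in run1, one each in run2 and run3.
peptide = PeptideQuant(
    sequence="PEPTIDEK",
    features=[
        Feature("f1", "run1", 100.0),
        Feature("f2", "run1", 50.0),
        Feature("f3", "run2", 250.0),
        Feature("f4", "run3", 400.0),
    ],
)
groups = {"control": ["run1", "run2"], "treated": ["run3"]}
by_group = abundance_by_study_variable(peptide, groups)
ratio = by_group["treated"] / by_group["control"]
print(by_group, ratio)
```

Because each peptide keeps the list of features it was computed from, the evidence trail behind a final abundance value stays inspectable, which is the point of the corresponding use case.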

 

mzQuantML 1.0.1

Version 1.0.1 builds on version 1.0.0 by extending support to the SRM/MRM technique. The specification document for version 1.0.1 is available HERE.

mzQuantML 1.0.0

More documentation is available in the mzQuantML Google code project at http://code.google.com/p/mzquantml/.

The format supports the following specific techniques used in proteomics (as shown in examples files):

  • MS1 label-free intensity
  • MS1 label-based e.g. SILAC and metabolic labelling such as 15N
  • MS2 tag-based e.g. iTRAQ / TMT
  • MS2 spectral counting

We expect that the format MAY also be able to cover the following techniques adequately, although these have not been tested in great detail at this stage, and we encourage further input from users of these techniques: 

  • Quantification by selected reaction monitoring (SRM)
  • Absolute quantification based on averaging the intensities of features e.g. Waters Hi3 technique
  • Small molecule quantification (in metabolomics)
  • MS2 intensity-based approaches
  • MS2 label-based approaches

Change log

The standard was submitted to the PSI document process in August 2011. The specifications were subsequently updated through version 1-rc2 and version 1-rc3, leading to the release of version 1.0.0 in February 2013.

Major changes in versions (also see versioned schema documents on Google Code):

  • rc1 to rc2: Introduction of mapping rules/semantic rules for different techniques.
  • rc2 to rc3: Minor updates in response to reviewer comments from journal review and fixes for cardinalities/internal references etc.
  • rc3 release to version 1.0.0 release: no changes except update to version number.

The overall resource change log can also be consulted here: https://code.google.com/p/mzquantml/source/list

 
