
mzPeak: a next-generation MS run file format

mzPeak is a new binary file format that represents mass spectrometry instrument runs in a compact, fast, and cloud-or-local-friendly manner. It is built on top of Apache Parquet, a featureful, robust, and production-tested data format with implementations available in virtually all major programming languages today. You can read it from your hard drive, from an object store like Amazon’s S3, or from most HTTP servers. On file size it competes with, and sometimes beats, proprietary vendor file formats; on complex random-access queries it competes with, and sometimes beats, uncompressed mzML.

Specification Status: In Development

Current Location: https://github.com/mobiusklein/mzpeak_prototyping

Specification Draft: Markdown

Original White Paper: https://pubs.acs.org/doi/full/10.1021/acs.jproteome.5c00435

Active Reference Implementations

These are all independent re-implementations, not bindings to a single core library written in a lower-level language. Each uses the generally available Apache Parquet and Apache Arrow libraries for its language.

Rust: https://github.com/mobiusklein/mzpeak_prototyping

  • This is the ur-reference implementation, which serves as the basis for developing the initial prototypes and pushing their limits.
  • Read (Local, Cloud)
  • Write (Local), including direct conversion from Thermo RAW and Bruker .TDF

Python: https://github.com/mobiusklein/mzpeak_prototyping/python

  • Read (Local, Cloud)
  • Bonus: SQL interface via DataFusion

R: https://github.com/mobiusklein/mzpeak_prototyping/R

  • Read (Local)

C++: TODO

C#: TODO

Java: TODO

TypeScript/WASM: TODO