Technical Note

Consideration of HDF5 File Architecture for LCMS Data Acquisition Archival and Efficient Access

Jeff Jones, Ryan Benz



HDF5 is a well-developed and extensible big-data format.

Structures exist for metadata strings, single-dimension arrays, tables, and matrices.


Reduces disk space requirements by up to a factor of 2 relative to vendor formats, and by more than a factor of 5 relative to XML-based mzML.

Allows access to portions of the data without loading the entire file into memory.
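A minimal sketch of this partial-access property, using the h5py library; the file name, dataset name, and dimensions below are illustrative, not part of any proposed standard:

```python
import os
import tempfile

import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Write a chunked, compressed intensity matrix (scans x m/z bins);
# chunking is what makes partial reads efficient.
with h5py.File(path, "w") as f:
    data = np.random.rand(1000, 5000).astype(np.float32)
    f.create_dataset("intensity", data=data, chunks=(100, 500),
                     compression="gzip")

# Read back only a small hyperslab; the rest of the file stays on disk.
with h5py.File(path, "r") as f:
    window = f["intensity"][200:210, 1000:1100]

print(window.shape)  # (10, 100)
```

Only the chunks overlapping the requested slice are decompressed, so the memory footprint scales with the query, not the file.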


Recomputing the volume under an isotope peak across hundreds of experiments for a feature of interest at a known retention time and m/z.

Searching for similar MS2 spectra for a feature of interest at a known retention time and m/z, enabling spectral summation to improve peptide sequence identification.
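The first use case can be sketched as follows: because every experiment shares the same regular retention-time and m/z grid, the isotope volume reduces to summing the same hyperslab in each file. The file layout, dataset name, and window coordinates are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import h5py

# Build two tiny stand-in experiment files on a shared, regular grid
# (real files would hold full acquisitions).
tmp = tempfile.mkdtemp()
paths = []
for i in range(2):
    p = os.path.join(tmp, f"exp_{i}.h5")
    with h5py.File(p, "w") as f:
        f.create_dataset("intensity",
                         data=np.ones((100, 200), dtype=np.float32),
                         chunks=(10, 20))
    paths.append(p)

# Feature of interest: retention-time rows 40..50, m/z columns 120..130.
# Only this hyperslab is read from each file.
volumes = []
for p in paths:
    with h5py.File(p, "r") as f:
        window = f["intensity"][40:50, 120:130]
        volumes.append(float(window.sum()))

print(volumes)  # each 10 x 10 window of ones sums to 100.0
```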

Agilent QTOF
Format   Size      Relative to vendor
*.d      0.75 GB   100%
mzML     3.67 GB   886%
mz5      1.43 GB   210%
HDF5     0.47 GB    64%

Waters QTOF
Format   Size      Relative to vendor
*.wiff   0.22 GB   100%
mzML     1.95 GB   489%
mz5      0.46 GB   191%
HDF5     0.14 GB    63%


Open-access data efforts for LCMS are either severely space-inefficient (XML) or not fully optimized (mz5) for big-data analysis. While there have been deliberate efforts toward a unified, consistent, and accessible file architecture for LCMS, to date those efforts have resulted in disk-space-greedy formats that require reading entire files into memory for any computational task. Furthermore, the lack of vendor adoption is evident and likely stems from limitations in concurrent read/write ability and disk-space efficiency. In addition, as instruments have increased in speed, resolution, and dynamic range, file sizes have exploded. Consequently, computational tasks requiring input from multiple experiments have become ever more difficult, requiring access to resources beyond the reach of the average researcher.


Proposed here is a file architecture based on the HDF5 format, optimized for efficient extraction and archival of big data sets, utilizing a combination of SQL-like relational tables with numeric arrays and matrices. The main attraction of the HDF5 format is the ability to access and load only portions of the data into memory while maintaining concurrent read/write operations. While there have been efforts to port MS data to the HDF5 format, consideration for partial-access loading was missing. While there is a clear motivation to adhere to the HUPO Proteomics Standards for data integrity, there are only a few essential data constructs: measurement conditions, measurement time, and m/z ~ intensity. The main attempt here is to improve upon that last construct while maintaining the current metadata standards.
SoCal Bioinformatics Inc.


Currently, every instrument manufacturer records both m/z and intensity for every scan event, the former being the more redundant of the two. Although some experiments call for different m/z ranges to be collected, that need not constrain data recording. By standardizing the m/z axis of collection, for instance to 0.01 Da intervals for TOF/QQQ/QE proteomics and 0.001 Da for FTMS/Orbitrap proteomics, the m/z axis need only be defined once and contains sufficient resolution for complex samples such as blood plasma and cultured-cell proteomes. Regular-interval data collection establishes a matrix of data, which can be further refined by collecting MS1 scan events at regular intervals. Storing this matrix in the HDF5 format is more efficient than the currently proposed HDF5-based format (mz5), for both read and write operations. In addition, taking advantage of HDF5 concepts, data can be accessed in part (hyperslab selection) without the need to load an entire file into memory. This becomes particularly enticing when dealing with large, multi-experiment datasets.
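With a fixed-interval m/z axis, locating a value reduces to index arithmetic shared by every experiment on the grid; the grid origin and step below are assumed values for illustration:

```python
MZ_MIN = 100.0   # assumed lower bound of the shared m/z axis
MZ_STEP = 0.01   # assumed grid spacing (e.g. TOF/QQQ/QE proteomics)


def mz_to_column(mz):
    """Map an m/z value onto the shared fixed-interval axis."""
    return int(round((mz - MZ_MIN) / MZ_STEP))


# Every experiment sharing this grid stores a given m/z in the same
# column, so cross-experiment access needs no per-file m/z lookup.
col = mz_to_column(524.26)
print(col)  # 42426
```

This is the property that turns a collection of per-scan m/z arrays into a single matrix: the axis is defined once, and only intensities vary per scan.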


Vendor adoption of an HDF5 format would allow simultaneous read/write operations, enabling vendors to continue displaying real-time spectra and metrics while opening the possibility of near-real-time biological identification. The format also holds enormous potential for DIA data efficiencies, where both MS1 and MS1+n scans are captured in the same fashion. It is currently in use for large computations involving multi-sample and longitudinal studies.
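The simultaneous read/write pattern is supported in HDF5 as single-writer/multiple-reader (SWMR) mode; a minimal single-process sketch using h5py (file and dataset names are illustrative, and the reader would normally be a separate process):

```python
import os
import tempfile

import numpy as np
import h5py

path = os.path.join(tempfile.mkdtemp(), "acq.h5")

# Writer side: the acquisition process keeps the file open for writing.
# SWMR requires chunked, resizable datasets created before enabling it.
writer = h5py.File(path, "w", libver="latest")
dset = writer.create_dataset("intensity", shape=(0, 50),
                             maxshape=(None, 50), chunks=(10, 50),
                             dtype="f4")
writer.swmr_mode = True  # from here on, concurrent readers are safe

# Reader side: opened in SWMR mode while the writer holds the file.
reader = h5py.File(path, "r", libver="latest", swmr=True)
rset = reader["intensity"]

# The writer appends one scan and flushes it to disk...
dset.resize((1, 50))
dset[0] = np.arange(50, dtype="f4")
dset.flush()

# ...and the reader refreshes its view without reopening the file.
rset.refresh()
observed_shape = rset.shape
print(observed_shape)  # (1, 50)

reader.close()
writer.close()
```

This is the mechanism that would let an instrument keep appending scans while analysis software reads the same file in near real time.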

Find it on GitHub