Background: Rare genetic diseases affect around 1 in 10 Australians; however, current clinical pathways only result in diagnosis for approximately 50% of patients. Many are left with unresolved variants of uncertain significance (VUSs): genomic changes that could be reclassified as pathogenic, and so support a diagnosis, given supporting functional evidence. Mass spectrometry-based quantitative proteomics presents a promising solution to the growing need for high-throughput functional assays to resolve VUSs – particularly missense variants – and contribute to rare disease diagnoses. As a gene- and disease-agnostic test, proteomics can provide functional evidence for diagnostic hypotheses by comparing patient protein expression levels against controls. However, the utility and adoption of rare disease proteomics are currently limited by a lack of standardisation and by practical constraints on control group sizes.
Aim: We sought to create an automated, customisable and reproducible workflow for analysing mass spectrometry-based quantitative proteomics data for use in rare disease diagnosis, utilising peripheral blood mononuclear cell (PBMC) samples as a clinically accessible, easily isolated, and informative tissue in which more than 50% of all known monogenic disease genes are expressed.
Methods/Results: We first searched the literature to identify the optimal tools for each element of the proteomics workflow: peptide identification and protein inference, protein quantification and managing missing data, contaminant removal, normalisation, batch effect correction, and differential expression analysis. Leveraging a novel dataset of paediatric PBMC samples from 394 control individuals, we compared different methods on a high-quality subset of the data (n = 42 samples, 84 replicates). We demonstrate the advantages of a DIA-NN–limpa pipeline, harnessing DIA-NN’s command line interoperability and the recently published limpa R package’s approach to handling missing data. The pipeline produces complete protein matrices without imputing values or eliminating lowly detected proteins, yielding more informative data and more robust results in downstream analyses. When benchmarked against our group’s previous methods, this workflow produced more statistically significant results without any loss of accuracy, improving its potential to inform diagnostic investigations in challenging patient cases.
Conclusion: We present a standardised workflow for processing mass spectrometry-based label-free DIA proteomics data from paediatric PBMC samples and for analysing those data in a rare disease diagnostic context.
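Illustrative sketch: the Background describes comparing patient protein expression levels against controls. The Python snippet below is a deliberately simplified stand-in for that idea, not the DIA-NN–limpa workflow itself: it z-scores one patient's log2 protein intensities against a control cohort. The function name, the protein labels, the intensity values, and the cut-off of -3 are all hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def control_z_scores(patient, controls, min_controls=3):
    """Z-score each patient protein against the control distribution.

    patient:  {protein: log2 intensity}
    controls: {protein: [log2 intensities across control samples]}
    Proteins with fewer than `min_controls` control values, or with zero
    control spread, are skipped rather than scored.
    """
    scores = {}
    for protein, value in patient.items():
        ref = controls.get(protein, [])
        if len(ref) < min_controls:
            continue
        sd = stdev(ref)  # sample standard deviation of the controls
        if sd == 0:
            continue
        scores[protein] = (value - mean(ref)) / sd
    return scores


# Hypothetical log2 intensities, invented for illustration only.
controls = {
    "PROT_A": [20.0, 20.5, 19.5, 20.0],
    "PROT_B": [15.0, 15.2, 14.8, 15.0],
}
patient = {"PROT_A": 20.1, "PROT_B": 12.0}

z = control_z_scores(patient, controls)

# Flag proteins far below the control range (arbitrary cut-off of -3).
flagged = [protein for protein, score in z.items() if score < -3]
print(flagged)
```

A small control group makes the standard deviation estimate unstable, which is one reason control group size matters for this kind of comparison; a real analysis would also need to address missing values, normalisation, and batch effects, as listed in the Methods.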