MS2Tox Machine Learning Tool for Predicting the Ecotoxicity of Unidentified Chemicals in Water by Nontarget LC-HRMS
Pilleriin Peets, Wei-Chieh Wang, Matthew MacLeod, Magnus Breitholtz, Jonathan W. Martin, Anneli Kruve
ES&T 2022
To achieve water quality objectives of the zero pollution action plan in Europe, rapid methods are needed to identify the presence of toxic substances in complex water samples. However, only a small fraction of chemicals detected with nontarget high-resolution mass spectrometry can be identified, and fewer have ecotoxicological data available. We hypothesized that ecotoxicological data could be predicted for unknown molecular features in data-rich high-resolution mass spectrometry (HRMS) spectra, thereby circumventing time-consuming steps of molecular identification and rapidly flagging molecules of potentially high toxicity in complex samples. Here, we present MS2Tox, a machine learning method, to predict the toxicity of unidentified chemicals based on high-resolution accurate mass tandem mass spectra (MS2). The MS2Tox model for fish toxicity was trained and tested on 647 lethal concentration (LC50) values from the CompTox database and validated for 219 chemicals and 420 MS2 spectra from MassBank. The root mean square error (RMSE) of MS2Tox predictions was below 0.89 log-mM, while the experimental repeatability of LC50 values in CompTox was 0.44 log-mM. MS2Tox allowed accurate prediction of fish LC50 values for 22 chemicals detected in water samples, and empirical evidence suggested the right directionality for another 68 chemicals. Moreover, by incorporating structural information, e.g., the presence of carbonyl-benzene, amide moieties, or hydroxyl groups, MS2Tox outperforms baseline models that use only the exact mass or log KOW.

Protomer Formation Can Aid the Structural Identification of Caffeine Metabolites
Helen Sepman, Sofja Tshepelevitsh, Henrik Hupatz, Anneli Kruve
Anal Chem 2022
DOI: 10.1021/acs.analchem.2c00257
The structural annotation of isomeric metabolites remains a key challenge in untargeted electrospray ionization/high-resolution mass spectrometry (ESI/HRMS) metabolomic analysis. Many metabolites are polyfunctional compounds that may form protomers in electrospray ionization sources and therefore yield multiple peaks in ion mobility spectra. Protomer formation is strongly structure-specific. Here, we explore the possibility of using protomer formation for structural elucidation in metabolomics on the example of caffeine, its eight metabolites, and structurally related compounds. It is observed that two-thirds of the studied compounds formed high- and low-mobility species in high-resolution ion mobility. Structures in which proton hopping was hindered by a methyl group at the purine ring nitrogen (position 3) yielded structure-indicative fragments with collision-induced dissociation (CID) for high- and low-mobility ions. For compounds where such a methyl group was not present, a gas-phase equilibrium could be observed for tautomeric species with two-dimensional ion mobility. We show that the protomer formation and the gas-phase properties of the protomers can be related to the structure of caffeine metabolites and facilitate the identification of the structural isomers.

Uncertainty estimation strategies for quantitative non-targeted analysis
Louis C Groff, Jarod N Grossman, Anneli Kruve, Jeffrey M Minucci, Charles N Lowe, James P McCord, Dustin F Kapraun, Katherine A Phillips, S Thomas Purucker, Alex Chao, Caroline L Ring, Antony J Williams, Jon R Sobus
Anal Bioanal Chem 2022
DOI: 10.1007/s00216-022-04118-z
Non-targeted analysis (NTA) methods are widely used for chemical discovery but seldom employed for quantitation due to a lack of robust methods to estimate chemical concentrations with confidence limits. Herein, we present and evaluate new statistical methods for quantitative NTA (qNTA) using high-resolution mass spectrometry (HRMS) data from EPA’s Non-Targeted Analysis Collaborative Trial (ENTACT). Experimental intensities of ENTACT analytes were observed at multiple concentrations using a semi-automated NTA workflow. Chemical concentrations and corresponding confidence limits were first estimated using traditional calibration curves. Two qNTA estimation methods were then implemented using experimental response factor (RF) data (where RF = intensity/concentration). The bounded response factor method used a non-parametric bootstrap procedure to estimate select quantiles of training set RF distributions. Quantile estimates then were applied to test set HRMS intensities to inversely estimate concentrations with confidence limits. The ionization efficiency estimation method restricted the distribution of likely RFs for each analyte using ionization efficiency predictions. Given the intended future use for chemical risk characterization, predicted upper confidence limits (protective values) were compared to known chemical concentrations. Using traditional calibration curves, 95% of upper confidence limits were within ~tenfold of the true concentrations. The error increased to ~60-fold (ESI+) and ~120-fold (ESI−) for the ionization efficiency estimation method and to ~150-fold (ESI+) and ~130-fold (ESI−) for the bounded response factor method. This work demonstrates successful implementation of confidence limit estimation strategies to support qNTA studies and marks a crucial step towards translating NTA data in a risk-based context.
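The bounded response factor idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the training RF values, the 2.5%/97.5% quantile choice, and the use of the median of the bootstrap estimates as the final quantile estimate are all hypothetical. Only RF = intensity/concentration and the use of a non-parametric bootstrap come from the abstract.

```python
import random
import statistics


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def bootstrap_rf_quantiles(train_rfs, q_low=0.025, q_high=0.975,
                           n_boot=1000, seed=0):
    """Estimate a low and a high quantile of the training RF distribution.

    Each bootstrap round resamples the training RFs with replacement and
    records both quantiles; the median across rounds is returned
    (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n = len(train_rfs)
    lows, highs = [], []
    for _ in range(n_boot):
        resample = sorted(rng.choices(train_rfs, k=n))
        lows.append(quantile(resample, q_low))
        highs.append(quantile(resample, q_high))
    return statistics.median(lows), statistics.median(highs)


def concentration_bounds(intensity, rf_low, rf_high):
    """Invert RF = intensity / concentration.

    A low RF implies a high concentration, so the low RF quantile gives
    the upper (protective) concentration limit.
    """
    return intensity / rf_high, intensity / rf_low


# Hypothetical training response factors (intensity per unit concentration).
train_rfs = [2.0e4, 5.5e4, 9.0e4, 1.2e5, 3.0e5, 4.1e5, 8.0e5, 1.5e6, 2.2e6, 6.0e6]
rf_low, rf_high = bootstrap_rf_quantiles(train_rfs)
conc_low, conc_high = concentration_bounds(5.0e5, rf_low, rf_high)
print(f"RF bounds: {rf_low:.3g} to {rf_high:.3g}")
print(f"Concentration bounds: {conc_low:.3g} to {conc_high:.3g}")
```

Because the same RF bounds are applied to every test analyte, the resulting interval is wide. That is consistent with the larger fold errors the abstract reports for this method compared with traditional calibration curves.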

Estimation of the concentrations of hydroxylated polychlorinated biphenyls in human serum using ionization efficiency prediction for electrospray
Sara Khabazbashi, Josefin Engelhardt, Claudia Möckel, Jana Weiss, Anneli Kruve
Anal Bioanal Chem 2022
DOI: 10.1007/s00216-022-04096-2
Hydroxylated PCBs are an important class of metabolites of the widely distributed environmental contaminants polychlorinated biphenyls (PCBs). However, the absence of authentic standards often limits their detection, identification, and quantification. Recently, new strategies to quantify compounds detected with non-targeted LC/ESI/HRMS based on predicted ionization efficiency values have emerged. Here, we evaluate the impact of chemical space coverage and sample matrix on the accuracy of ionization efficiency-based quantification. We show that extending the chemical space of interest is crucial in improving the performance of quantification. Therefore, we extend the ionization efficiency-based quantification approach to hydroxylated PCBs in serum samples with a retraining approach that involves 14 OH-PCBs and validate it with an additional four OH-PCBs. The predicted and measured ionization efficiency values of the OH-PCBs agreed within a mean error of 2.1× and enabled quantification with a mean error of 4.4× or better. We observed that the error mostly arose from the ionization efficiency predictions and the impact of matrix effects was of less importance, varying from 37 to 165%. The results show that there is potential for predictive machine learning models for quantification even in very complex matrices such as serum. Further, retraining the already developed models provides a timely and cost-effective solution for extending the chemical space of the application area.
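Errors quoted as a multiplicative factor (for example 2.1×) are commonly computed as a fold error, the larger of predicted/measured and measured/predicted. The sketch below illustrates that metric with invented numbers. The function names are hypothetical, and aggregating with an arithmetic mean is an assumption, since the abstract does not state how the mean was taken.

```python
def fold_error(predicted, measured):
    """Symmetric fold error: always >= 1, where 1 means perfect agreement."""
    if predicted <= 0 or measured <= 0:
        raise ValueError("fold error is only defined for positive values")
    return max(predicted / measured, measured / predicted)


def mean_fold_error(pairs):
    """Arithmetic mean of fold errors over (predicted, measured) pairs.

    Assumption of this sketch: other aggregates (geometric mean, median)
    are equally plausible choices.
    """
    errors = [fold_error(p, m) for p, m in pairs]
    return sum(errors) / len(errors)


# Invented (predicted, measured) concentrations in arbitrary units.
pairs = [(2.0, 1.0), (1.0, 4.0), (3.0, 3.0)]
summary = mean_fold_error(pairs)
print(f"mean fold error: {summary:.2f}x")
```

Over- and under-prediction by the same factor give the same fold error, which suits quantities that span several orders of magnitude.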

MultiConditionRT: Predicting liquid chromatography retention time for emerging contaminants for a wide range of eluent compositions and stationary phases
Amina Souihi, Miklos Mohai, Emma Palm, Louise Malm, Anneli Kruve
Journal of Chromatography A 2022
DOI: 10.1016/j.chroma.2022.462867
Structural elucidation of compounds detected with liquid chromatography coupled to high resolution mass spectrometry is a challenging and time-consuming step in the workflow of non-targeted analysis and often requires manual validation of the results. Retention time, alongside exact mass, isotope pattern, fragmentation spectra, and collision cross-section, is valuable information for ruling out unlikely structures and increasing the confidence in others. Different approaches to predict retention times have been used previously for reversed phase chromatography and hydrophilic interaction liquid chromatography (HILIC), but application is limited to a small set of mobile phases and gradient profiles. Here, we expand the toolbox available for retention time predictions by developing a random forest regression model for predicting retention times for four column types and twenty mobile phase systems. MultiConditionRT was built using a dataset containing 78 compounds analyzed with C18 reversed phase, mixed mode, HILIC, and biphenyl columns. In addition, different eluent compositions were used: both methanol and acetonitrile were combined with different aqueous phases with pH from 2.1 to 10.0 (formic acid, acetic acid, trifluoroacetic acid, formate, acetate, bicarbonate, and ammonia). The root mean square error (RMSE) of the test set predictions was 1.55 min for C18 reversed phase, 1.79 min for mixed-mode, 1.93 min for HILIC, and 1.56 min for biphenyl column. Additionally, MultiConditionRT can be applied to different gradient profiles with a general additive model-based calibration approach. The approach of MultiConditionRT was validated externally and internally with 356 and 151 compounds, respectively, yielding RMSEs of 2.68 and 2.32 min. Of these compounds, 324 and 84, respectively, were not in the dataset used in the model development.

Machine Learning for Absolute Quantification of Unidentified Compounds in Non-Targeted LC/HRMS
Emma Palm, Anneli Kruve
Molecules 2022
DOI: 10.3390/molecules27031013
LC/ESI/HRMS is increasingly employed for monitoring chemical pollutants in water samples, with non-targeted analysis becoming more common. Unfortunately, due to the lack of analytical standards, non-targeted analysis is mostly qualitative. To remedy this, models have been developed to evaluate the response of compounds from their structure, which can then be used for quantification in non-targeted analysis. Still, these models rely on tentatively known structures while for most detected compounds, a list of structural candidates, or sometimes only exact mass and retention time are identified. In this study, a quantification approach was developed, where LC/ESI/HRMS descriptors are used for quantification of compounds even if the structure is unknown. The approach was developed based on 92 compounds analyzed in parallel in both positive and negative ESI mode with mobile phases at pH 2.7, 8.0, and 10.0. The developed approach was compared with two baseline approaches: one assuming equal response factors for all compounds and one using the response factor of the closest eluting standard. The former gave a mean prediction error of a factor of 29, while the latter gave a mean prediction error of a factor of 1300. In the machine learning-based quantification approach developed here, the corresponding prediction error was a factor of 10. Furthermore, the approach was validated by analyzing two blind samples containing 48 compounds spiked into tap water and ultrapure water. The obtained mean prediction error was lower than a factor of 6.0 for both samples. The errors were found to be comparable to approaches using structural information.
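The two baseline approaches named above can be illustrated with a short sketch. Everything specific here is hypothetical: the `Standard` record, the function names, the example values, and in particular the choice of the median calibrant response factor as the shared value for the equal-response baseline, which the abstract does not specify.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Standard:
    name: str
    retention_time: float   # minutes
    response_factor: float  # intensity per unit concentration


def concentration_equal_rf(intensity, standards):
    """Baseline 1: every compound is assigned the same response factor.

    Assumption of this sketch: the shared value is the median RF of the
    available standards.
    """
    shared_rf = statistics.median([s.response_factor for s in standards])
    return intensity / shared_rf


def concentration_closest_eluting(intensity, retention_time, standards):
    """Baseline 2: borrow the RF of the standard eluting closest in time."""
    closest = min(standards,
                  key=lambda s: abs(s.retention_time - retention_time))
    return intensity / closest.response_factor


# Invented standards and one unidentified feature.
standards = [
    Standard("std_a", 2.0, 1.0e6),
    Standard("std_b", 6.0, 1.0e7),
    Standard("std_c", 11.0, 1.0e8),
]
c_equal = concentration_equal_rf(5.0e7, standards)
c_closest = concentration_closest_eluting(5.0e7, 10.0, standards)
print(c_equal, c_closest)
```

In this invented example the two baselines disagree by a factor of ten for the same feature, which shows how strongly the result depends on which response factor is borrowed.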

Sodium adduct formation with graph-based machine learning can aid structural elucidation in non-targeted LC/ESI/HRMS
Riccardo Costalunga, Sofja Tshepelevitsh, Helen Sepman, Meelis Kull, Anneli Kruve
Analytica Chimica Acta 2021
DOI: 10.1016/j.aca.2021.339402
Non-targeted screening with LC/ESI/HRMS aims to identify the structure of the detected compounds using their retention time, exact mass, and fragmentation pattern. Challenges remain in differentiating between isomeric compounds. One untapped possibility to facilitate identification of isomers relies on different ionic species formed in electrospray. In positive ESI mode, both protonated molecules and adducts can be formed; however, not all isomeric structures form the same ionic species. The complicated mechanism of adduct formation has hindered the use of this molecular characteristic in the structural elucidation in non-targeted screening. Here, we have studied the adduct formation for 94 small molecules with ion mobility spectra and compared collision cross-sections of the respective ions. Based on the results we developed a fast support vector machine classifier with polynomial kernels for accurately predicting the sodium adduct formation in ESI/HRMS. The model is trained on five independent data sets from different laboratories and uses the graph-based connectivity of functional groups and PubChem fingerprints to predict the sodium adduct formation in ESI/HRMS. The validation of the model showed an accuracy of 74.7% (balanced accuracy 70.0%) on a dataset from an independent laboratory, which was not used in the training of the model. Lastly, we applied the classification algorithm to the SusDat database by NORMAN network to evaluate the proportion of isomeric compounds that could be distinguished based on predicted sodium adduct formation. It was observed that sodium adduct formation probability can provide additional selectivity for about one quarter of the exact masses and, therefore, shows practical utility for structural assignment in non-targeted screening.
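Accuracy and balanced accuracy differ when the classes are imbalanced, as is likely for adduct formation data. The sketch below shows how the two metrics are computed for a binary classifier. The labels are invented for illustration and are unrelated to the study's data, and the function name is hypothetical.

```python
def accuracy_and_balanced_accuracy(y_true, y_pred):
    """Return (accuracy, balanced accuracy) for binary labels 0/1.

    Balanced accuracy is the mean of sensitivity (recall on class 1)
    and specificity (recall on class 0).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and equally long")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, (sensitivity + specificity) / 2


# Invented labels: 1 = forms a sodium adduct, 0 = does not.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0]
acc, bal_acc = accuracy_and_balanced_accuracy(y_true, y_pred)
print(f"accuracy = {acc:.3f}, balanced accuracy = {bal_acc:.3f}")
```

With six positives and two negatives, the majority class dominates plain accuracy, while balanced accuracy weights both classes equally and so comes out lower here.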
