Ranking microbial metabolomic and genomic links in the NPLinker framework using complementary scoring functions. [Dataset]


Data Collector

Andrew Ramsay
Data Collector

Justin J.J. van der Hooft
Data Collector

Katherine R. Duncan
Data Collector

Sylvia Soldatou
Data Collector

Juho Rousu
Data Collector

Data Collector

Joe Wandy
Data Collector

Simon Rogers
Data Collector


In this article, we introduce NPLinker, a software framework for linking genomic and metabolomic data in order to connect microbial secondary metabolites to their producing genomic regions. Two of the major approaches for such linking are analysis of the correlation between sets of strains, and analysis of predicted features of the molecules. While these methods are usually used separately, we demonstrate that they are in fact complementary, and show a way to combine them to improve their performance. We begin by demonstrating a weakness in the most common method of strain correlation analysis, and suggest an improvement. We then introduce a new feature-based analysis method which, unlike most such methods, does not directly depend on the natural product compound class. Finally, we combine the two into a single scoring function for genomic and metabolomic links, which shows improved performance over either of the individual approaches. Verification is done using curated databases of genomic and metabolomic data, as well as public data sets of microbial data including validated links.
To further validate the IOKR (Input Output Kernel Regression) approach, we investigated whether it was possible, for high-scoring pairs of MS2 spectra and metabolites, to manually match relevant peaks in the MS2 spectra to possible fragments of the metabolites. Full validation would require additional wet lab analysis, which is not possible with these publicly available datasets. If a link is genuine, it ought to be possible to match MS2 peaks in the spectra to substructures of the relevant chemical structures. If such matches can be made, these fragment peaks ought to be particularly important in the IOKR model. We provide some examples to show that this is indeed the case.
To illustrate this process, we took validated links in the Crusemann data set (see Section 2.8.2 and Table 4 in the published article), as well as two high-scoring potential links chosen because their ranking had a strong contribution from the IOKR score.


ELDJÁRN, G.H., RAMSAY, A., VAN DER HOOFT, J.J.J., DUNCAN, K.R., SOLDATOU, S., ROUSU, J., DALY, J., WANDY, J. and ROGERS, S. 2021. Ranking microbial metabolomic and genomic links in the NPLinker framework using complementary scoring functions. [Dataset]. PLOS computational biology [online], 17(5), e1008920. Available from:

Acceptance Date Mar 26, 2021
Online Publication Date May 4, 2021
Publication Date May 31, 2021
Deposit Date May 31, 2021
Publicly Available Date May 31, 2021
Publisher Public Library of Science
Keywords Ecology; Modelling and simulation; Computational theory and mathematics; Genetics; Ecology, evolution, behavior and systematics; Molecular biology; Cellular and molecular neuroscience
Public URL
Publisher URL
Related Public URLs
Type of Data 5 PDF files, 4 XLSX files and a supporting text (.txt) file.
Collection Date Apr 28, 2021
Collection Method To match MS2 peaks to chemical substructures we made use of the MetFrag web interface [2]. For a given metabolite and spectrum, we found the accurate mass of the metabolite using the compound name search function within the NPAtlas database [3]. This mass was used as a search criterion on the neutral mass in the NPAtlas_Aug2019 database in MetFrag [2], to ensure that the relevant metabolite was in the candidate set. Because we wanted to match measured peaks in an actual MS2 spectrum to the predicted peaks for a particular metabolite, the MetFrag candidate set should ideally have one member. Where more than one result was returned, only the result where the candidate metabolite name matched the given metabolite was used, except in the case of griseochelin, which was considered equivalent to zincophorin, as it has been by others in the literature [4]. The relevant spectral data was extracted from the Metabolomics Spectrum Resolver [5] and the MetFrag in-silico fragmentation algorithm (with default settings) was used. Peaks that did match were then checked to see how their exclusion from the MS2 spectrum influenced the ranking of the metabolite, among the set of all metabolites, for that spectrum. The images for the spectra were generated by the Metabolomics Spectrum Resolver [5], while the images for the metabolites were generated by MetFrag [2], with the identified substructure highlighted in green.
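The peak-matching and peak-exclusion check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to produce the dataset. The function names, the 10 ppm tolerance, and the score dictionary are hypothetical. The actual matching was done in the MetFrag web interface with its default settings, and the actual ranking comes from the IOKR model, which is not reproduced here.

```python
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def match_peaks(peaks: List[Peak], fragment_mzs: List[float],
                ppm_tol: float = 10.0) -> List[Tuple[Peak, float]]:
    """Pair each measured peak with the first predicted fragment m/z
    that lies within ppm_tol (hypothetical tolerance) of it."""
    matches = []
    for mz, intensity in peaks:
        for frag in fragment_mzs:
            if abs(mz - frag) <= frag * ppm_tol * 1e-6:
                matches.append(((mz, intensity), frag))
                break  # one fragment per peak is enough for this sketch
    return matches


def exclude_peaks(peaks: List[Peak],
                  matched: List[Tuple[Peak, float]]) -> List[Peak]:
    """Return the spectrum with the matched peaks removed."""
    matched_set = {peak for peak, _ in matched}
    return [p for p in peaks if p not in matched_set]


def rank_of(target: str, scores: Dict[str, float]) -> int:
    """1-based rank of target among all metabolites, highest score first.
    The scores stand in for whatever model scores the spectrum."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target) + 1


# Made-up example values, for illustration only.
spectrum = [(100.0000, 10.0), (150.0010, 5.0), (200.5000, 1.0)]
predicted_fragments = [100.0005, 150.0000]

matched = match_peaks(spectrum, predicted_fragments)
reduced_spectrum = exclude_peaks(spectrum, matched)
```

To assess the influence of the matched peaks, one would score the metabolites against both `spectrum` and `reduced_spectrum` and compare the values returned by `rank_of` for the metabolite of interest.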