Abstract
Metabolites are small molecules (<1500 Da) that are used in or produced during chemical reactions in cells, tissues, or organs. Upon absorption or biosynthesis in humans (or other organisms), they can be excreted back into the environment either in their original form or as a pool of degradation products. The outcome and effects of such interactions are a function of many variables, including the structure of the starting metabolite and the genetic disposition of the host organism. For these reasons, it is usually very difficult to identify the transformation products as well as their long-term effects in humans and the environment. This can be explained by several factors: (1) the relevant knowledge and data are for the most part not available in a public electronic format; (2) when available, they are often represented using formats, vocabularies, or schemes that vary from one source (or repository) to another. Even if these issues were solved, detecting patterns that link the metabolome to a specific phenotype (e.g. a disease state) would still require that the metabolites from a biological sample be identified and quantified using metabolomic approaches. Unfortunately, the number of compounds with publicly available experimental data (~20,000) is still very small compared to the total number of expected compounds (up to a few million). For all these reasons, the development of cheminformatics tools for data organization and mapping, as well as for the prediction of biotransformation and spectra, is more crucial than ever. My PhD thesis focused on developing several cheminformatics tools that address these limitations.

First, I developed ClassyFire and ChemOnt. ClassyFire is a publicly available software tool and webserver that automatically and hierarchically classifies any given molecule based on its structure. It relies partly on ChemOnt, a comprehensive and comprehensible taxonomy that contains >4,800 chemical categories, as well as their textual descriptions and mappings to other ontologies. ClassyFire was used to classify and annotate >80 million compounds. The webserver also integrates a text-based search engine. These features make ClassyFire unique in the sphere of publicly available computational tools. ClassyFire and ChemOnt are available at http://classyfire.wishartlab.com.

Second, I developed BioTransformer and BioTransformerDB. BioTransformer is a software tool for the prediction of small molecule metabolism in mammals. It uses a hybrid approach that partly relies on BioTransformerDB, a unique database of biotransformations containing experimentally confirmed metabolic reactions that transform >1,000 drugs, pesticides, cosmetics, and food compounds, among others. The current version of BioTransformer, which is available at https://bitbucket.org/djoumbou/biotransformer, focuses on the human species, but is easily expandable to other species.

Third, I developed CFM-ID 3.0, an extension of CFM-ID (1.0 and 2.0), originally developed by Felicity Allen et al. CFM-ID 3.0 is a software tool and webserver for the prediction and annotation of MS spectra, as well as the identification of metabolites. With the integration of a rule-based fragmentation approach for spectra prediction, the development of new ranking functions, and the expansion of the spectral database, CFM-ID 3.0 showed a significant improvement in speed and accuracy compared to previous versions. CFM-ID 3.0 is currently available as a web server at http://cfmid-staging.wishartlab.com/.

ClassyFire, BioTransformer, and CFM-ID have found applications in various fields including chemical information management, metabolomics, and exposomics, among others. Together, they form a cheminformatics platform that can enable metabolomics and contribute to the understanding of our environment as well as the advancement of science.
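To make the idea of hierarchical, structure-based classification concrete, the following is a minimal Python sketch. It is not ClassyFire's implementation: the three-category taxonomy, the rule predicates, and the precomputed feature names (`carbon`, `benzene_ring`, `aromatic_hydroxyl`) are all hypothetical stand-ins chosen for illustration, and a real system would derive such features from the molecular structure itself. The sketch only shows the general pattern of matching rules against a compound and reporting the lineage of the most specific matching category.

```python
# Toy taxonomy: child category -> parent category.
# Illustrative only; this is NOT the ChemOnt taxonomy.
PARENT = {
    "Organic compounds": None,
    "Benzenoids": "Organic compounds",
    "Phenols": "Benzenoids",
}

# Hypothetical rules: category -> predicate over a set of precomputed
# structural features (assumed to come from some upstream structure parser).
RULES = {
    "Organic compounds": lambda f: "carbon" in f,
    "Benzenoids": lambda f: "benzene_ring" in f,
    "Phenols": lambda f: {"benzene_ring", "aromatic_hydroxyl"} <= f,
}


def lineage(category):
    """Return the path from the root of the taxonomy down to `category`."""
    path = []
    while category is not None:
        path.append(category)
        category = PARENT[category]
    return path[::-1]


def classify(features):
    """Return the lineage of the deepest category whose rule matches.

    Returns an empty list when no rule matches at all.
    """
    matches = [cat for cat, rule in RULES.items() if rule(features)]
    if not matches:
        return []
    deepest = max(matches, key=lambda cat: len(lineage(cat)))
    return lineage(deepest)


if __name__ == "__main__":
    phenol_like = {"carbon", "benzene_ring", "aromatic_hydroxyl"}
    print(" > ".join(classify(phenol_like)))
```

Choosing the deepest matching category and then walking up the parent map keeps the rules independent of one another: each rule only answers whether a compound belongs to its own category, and the hierarchy supplies the rest of the annotation.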
Publication Info
- Year
- 2017
- Type
- article
- Citations
- 5
- Access
- Closed
Identifiers
- DOI
- 10.7939/r3vd6pj8b