Introduction

Gempipe is a tool for drafting, curating and analyzing pan- and multi-strain genome-scale metabolic models (GSMMs or GEMs).

In brief

Gempipe can start from genomes or, if a reliable annotation is available, directly from proteomes. Genomes are filtered for quality using both technical and biological metrics. Genes are then annotated and grouped into clusters, and an extensive gene-recovery procedure is applied to counteract possible errors introduced during genome assembly or gene calling.

Gene clusters are used to build a reference-free reconstruction based on the CarveMe semi-curated universe. Different rules are applied to generate the GPRs (gene-protein-reaction associations), accounting for alternative isoforms and respecting the original enzyme complex definitions stored in BiGG.
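
The GPR logic described above can be illustrated with a minimal sketch in plain Python. This is not Gempipe's own code: it assumes that a GPR can be written as a list of alternatives (isoforms or alternative complexes, joined by "or"), where each alternative is a set of gene clusters that must all be present (the subunits of a complex, joined by "and"). The function name format_gpr and the cluster IDs are hypothetical.

```python
def format_gpr(alternatives):
    """Render a GPR rule from a list of alternatives.

    Each alternative is an iterable of gene (or cluster) IDs that must all
    be present (an enzyme complex, joined by 'and'). Alternatives are
    joined by 'or' (isoforms or alternative complexes).
    """
    parts = []
    for subunits in alternatives:
        genes = sorted(set(subunits))
        if not genes:
            continue  # skip empty alternatives
        term = " and ".join(genes)
        # Parenthesize a complex only when other alternatives exist.
        if len(genes) > 1 and len(alternatives) > 1:
            term = f"({term})"
        parts.append(term)
    return " or ".join(parts)


# Hypothetical cluster IDs: a two-subunit complex plus one isoform.
rule = format_gpr([{"Cluster_1", "Cluster_2"}, {"Cluster_3"}])
print(rule)  # (Cluster_1 and Cluster_2) or Cluster_3
```

Keeping the rule in this "or of ands" form makes it easy to check later whether a given strain carries at least one complete alternative.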

The reference-free reconstruction is used as a source of new reactions to expand an optional user-provided reference, thus capturing strain-specific metabolism that falls outside the scope of the reference itself. This expansion respects the design decisions of the reference in terms of metabolite formulas and charges and of reaction balance.

The resulting draft pan-GSMM is then annotated de novo with accessions from many databases, and duplicate metabolites and reactions are optionally removed.
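
As an illustration of what detecting duplicate reactions can involve, here is a small sketch in plain Python. It is an assumption about one possible approach, not Gempipe's documented procedure: two reactions are treated as duplicates when they share the same stoichiometry, even if one is written in the opposite direction. Bounds and reversibility are deliberately ignored. The function names and reaction IDs are hypothetical.

```python
from collections import defaultdict


def stoichiometry_key(stoich):
    """Return a direction-independent, hashable key for a reaction.

    `stoich` maps metabolite IDs to coefficients (negative for substrates,
    positive for products).
    """
    forward = tuple(sorted((met, float(c)) for met, c in stoich.items()))
    reverse = tuple(sorted((met, -float(c)) for met, c in stoich.items()))
    # Picking the smaller tuple gives the same key for both directions.
    return min(forward, reverse)


def find_duplicate_reactions(reactions):
    """Group reaction IDs that share the same stoichiometry."""
    groups = defaultdict(list)
    for rid, stoich in reactions.items():
        groups[stoichiometry_key(stoich)].append(rid)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical reactions: R2 is R1 written backwards.
reactions = {
    "R1": {"a_c": -1, "b_c": 1},
    "R2": {"b_c": -1, "a_c": 1},
    "R3": {"a_c": -1, "c_c": 1},
}
duplicates = find_duplicate_reactions(reactions)
print(duplicates)  # [['R1', 'R2']]
```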

Unlike tools such as CarveMe or gapseq, Gempipe strongly encourages manual curation, even though a fully automated reconstruction mode (gempipe autopilot) is provided. To facilitate manual curation, Gempipe provides an application programming interface (API) with dedicated functions.

Once the pan-GSMM is finalized, it is used to derive a strain-specific GSMM for each input genome or proteome, using the gene-cluster information and, if requested, ensuring biomass production on a set of user-defined growth media. At this point, auxotrophies and growth-enabling substrates can be predicted, and Biolog® screenings can be simulated.
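
The idea of deriving strain-specific content from gene-cluster information can be sketched as follows. This is a simplified illustration under stated assumptions, not Gempipe's implementation: the presence/absence information is modeled as a dictionary mapping each cluster to the set of strains carrying it, each GPR is a list of alternative cluster sets, and reactions without a GPR are always kept. Gap-filling on growth media is not covered. All names and IDs are hypothetical.

```python
def derive_strain_reactions(pan_gprs, pam, strain):
    """Return the pan-model reaction IDs retained for `strain`.

    pan_gprs: reaction ID -> list of alternatives, each a set of cluster
              IDs that must all be present (empty list = no GPR).
    pam:      cluster ID -> set of strains in which the cluster is present.
    """
    present = {cluster for cluster, strains in pam.items() if strain in strains}
    kept = set()
    for rid, alternatives in pan_gprs.items():
        if not alternatives:
            kept.add(rid)  # no GPR: kept by assumption
        elif any(set(alt) <= present for alt in alternatives):
            kept.add(rid)  # at least one complete alternative is present
    return kept


# Hypothetical presence/absence data and pan-model GPRs.
pam = {
    "Cluster_1": {"strainA", "strainB"},
    "Cluster_2": {"strainA"},
    "Cluster_3": {"strainB"},
}
pan_gprs = {
    "R1": [{"Cluster_1", "Cluster_2"}],   # complex
    "R2": [{"Cluster_2"}, {"Cluster_3"}],  # isoforms
    "R3": [],                              # no GPR
    "R4": [{"Cluster_1", "Cluster_3"}],   # complex
}
print(sorted(derive_strain_reactions(pan_gprs, pam, "strainA")))  # ['R1', 'R2', 'R3']
print(sorted(derive_strain_reactions(pan_gprs, pam, "strainB")))  # ['R2', 'R3', 'R4']
```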

Finally, specific functions of the Gempipe API can be used to analyze the deck of strain-specific GSMMs: phylometabolic trees can be created, strains can be divided into homogeneous metabolic groups, discriminative metabolic features can be extracted, the core metabolism of a species can be identified, etc.
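
Two of these analyses can be illustrated with a short sketch in plain Python. It does not use the Gempipe API: it assumes each strain is summarized by its set of reaction IDs, takes the core metabolism to be the intersection of those sets, and uses the Jaccard distance as one possible dissimilarity measure for building trees or grouping strains. The function names and example data are hypothetical.

```python
from itertools import combinations


def core_reactions(strain_reactions):
    """Return the reactions shared by every strain."""
    sets = [set(rxns) for rxns in strain_reactions.values()]
    return set.intersection(*sets) if sets else set()


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| (0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(strain_reactions):
    """Return the Jaccard distance for every pair of strains."""
    return {
        (s1, s2): jaccard_distance(set(strain_reactions[s1]), set(strain_reactions[s2]))
        for s1, s2 in combinations(sorted(strain_reactions), 2)
    }


# Hypothetical reaction content of three strain-specific models.
strain_reactions = {
    "strainA": {"R1", "R2", "R3"},
    "strainB": {"R2", "R3", "R4"},
    "strainC": {"R2", "R3"},
}
core = core_reactions(strain_reactions)
distances = pairwise_distances(strain_reactions)
print(sorted(core))                       # ['R2', 'R3']
print(distances[("strainA", "strainB")])  # 0.5
```

The resulting distance dictionary can be passed to any hierarchical clustering routine to obtain a tree or to split strains into groups.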

Components and workflow

Gempipe is composed of three command-line programs and an API. The Gempipe workflow is divided into four parts:

  • Part 1. Creation of the draft pan-GSMM and the presence/absence matrix (PAM), starting either from genomes or proteomes (command-line program gempipe recon).

  • Part 2. Manual curation of the draft pan-GSMM, for example using functions provided by the Gempipe API.

  • Part 3. Derivation of strain-specific GSMMs, starting from the PAM and the curated pan-GSMM (command-line program gempipe derive).

  • Part 4. Analysis of the deck of strain-specific GSMMs, for example using functions provided by the Gempipe API.

As a (discouraged) alternative to manual curation, the additional command-line program gempipe autopilot is provided. It internally calls gempipe recon and gempipe derive, linking them by performing an automated gap-filling on the draft pan-GSMM.

The interactive flowchart of Gempipe is shown below. It can be zoomed and panned to see the details. Some nodes are clickable and point to the corresponding section of the documentation.