Part 1: gempipe recon

gempipe recon is designed to reconstruct a draft pan-GSMM and a presence/absence matrix (PAM), starting either from genomes or proteomes. Below are shown all its options:

usage: gempipe recon [-h] [-v] [-c] [-o] [--verbose] [--overwrite] [--dbs]
                     [-t] [-g] [-p] [-gb] [-s] [-b] [--buscoM] [--buscoF]
                     [--ncontigs] [--N50] [--identity] [--coverage] [-rm]
                     [-rp] [-rs] [-mc] [--tcdb] [--dedup] [--norec] [--dbmem]
                     [--sbml] [--nofig] [-md]

gempipe v1.38.4, please cite "TODO". Full documentation available at
https://gempipe.readthedocs.io/en/latest/index.html.

optional arguments:
  -h, --help            Show this help message and exit.
  -v, --version         Show version number and exit.
  -c , --cores          Number of parallel processes to use. (default: 1)
  -o , --outdir         Main output directory (will be created if not
                        existing). (default: ./)
  --verbose             Make stdout messages more verbose, including debug
                        messages. (default: False)
  --overwrite           Delete the working/ directory at the startup.
                        (default: False)
  --dbs                 Path were the needed databases are stored (or
                        downloaded if not already existing). (default:
                        ./working/dbs/)
  -t , --taxids         Taxids of the species to model (comma separated, for
                        example '252393,68334'). (default: -)
  -g , --genomes        Input genome files or folder containing the genomes
                        (see documentation). (default: -)
  -p , --proteomes      Input proteome files or folder containing the
                        proteomes (see documentation). (default: -)
  -gb , --genbanks      Input genbank files (.gb, .gbff) or folder containing
                        the genbanks (see documentation). (default: -)
  -s , --staining       Gram staining, 'pos' or 'neg'. (default: neg)
  -b , --buscodb        Busco database to use ('show' to see the list of
                        available databases). (default: bacteria_odb10)
  --buscoM              Maximum number of missing Busco's single copy
                        orthologs (absolute or percentage). (default: 2%)
  --buscoF              Maximum number of fragmented Busco's single copy
                        orthologs (absolute or percentage). (default: 100%)
  --ncontigs            Maximum number of contigs allowed per genome.
                        (default: 200)
  --N50                 Minimum N50 allowed per genome. (default: 50000)
  --identity            Minimum percentage amino acidic sequence identity to
                        use when aligning against the BiGG gene database.
                        (default: 30)
  --coverage            Minimum percentage coverage to use when aligning
                        against the BiGG gene database. (default: 70)
  -rm , --refmodel      Model to be used as reference. (default: -)
  -rp , --refproteome   Proteome to be used as reference. (default: -)
  -rs , --refspont      Reference gene marking spontaneous reactions.
                        (default: spontaneous)
  -mc , --mancor        Manual corrections to apply during the reference
                        expansion. (default: -)
  --tcdb                Experimental feature: try to build transport reactions
                        using TCDB. (default: False)
  --dedup               Try to remove duplicate metabolites and reactions
                        using MNX annotation, when a reference is provided.
                        (default: False)
  --norec               Skip gene recovery when starting from genomes.
                        (default: False)
  --dbmem               Load the entire eggNOG-mapper database into memory
                        (should speed up the functional annotation step).
                        (default: False)
  --sbml                Save the output GSMMs in SBML format (L3V1 FBC2) in
                        addition to JSON. (default: False)
  --nofig               Skip the generation of figures. (default: False)
  -md , --metadata      Table for manual correction of genome metadata.
                        (default: -)

Starting from genomes

Gempipe accepts both genomes and proteomes. Each file in input is assumed to belong to a different strain. There are several options to specifiy the input genomes:

Specify a folder containing the input genomes. They are all assumed to belong to the same species. For example:

    gempipe recon -g my_genomes/

Specify a list of input genome files (comma separated). They are all assumed to belong to the same species. For example:

    gempipe recon -g my_genomes/GCA_001689725.1.fa,my_genomes/GCA_001756855.1.fa,my_genomes/GCA_003058285.2.fa

Specify a list of input genome files, grouped by species. This follows a rigid sintax: SpA@G1,G2,G3+SpB@G4,G5+SpC@G6,G7.... Please replace spaces with underscores. For example:

    gempipe recon -g Erwinia_dacicola@my_genomes/GCA_001689725.1.fa,my_genomes/GCA_001756855.1.fa,my_genomes/GCA_003058285.2.fa+Erwinia_aphidicola@my_genomes/GCA_016925695.1.fa,my_genomes/GCA_024169515.1.fa

Starting from species taxids

As an alternative starting point, it is possible to insert the list of NCBI Species Taxids under study (comma separated). All the available genome assemblies under those taxids will be automatically dowloaded from NCBI, and grouped based on the species. For example:

    gempipe recon -t 252393,68334

💡 Tip! Each species appearing on NCBI is linked to its specific NCBI Species Taxid. For example, to retrieve the taxid of Erwinia aphidicola, enter the NCBI Taxonomy Database and type the name of the species. Once the webpage for the species is opened, the taxid can be found right below its name: 68334 for E. aphidicola!

Filtering the genomes

Bad-quality genomes in input are detected and ignored in subsequent analysis. Bad-quality is defined by 4 metrics, which can be either techinical or biological:

The maximum number of contigs per genome (--ncontigs).
The minimum N50 per genome (--N50).
The maximum number of Busco’s missing single-copy orthologs per genome (--buscoM) (it can be expressed also in percentage).
The maximum number of Busco’s fragmented single-copy orthologs per genome (--buscoF) (it can be expressed also in percentage).

Please note that the total number of evaluated Busco’s single-copy orthologs depends on the selected Busco database. The default database is bacteria_odb10, but it is strongly suggested to select a more specific database according to the organisms under study. This way, the number the evaluated Busco’s single-copy orthologs will increase. To change the Busco database, use -b/--buscodb.

💡 Tip! To read the list of available Busco databases, type gempipe recon --buscodb show.

The example below deals with many metagenome-derived assemblies: to avoid the loss of key assemblies, the default thresholds are relaxed:

    gempipe recon -t 252393,68334 -b enterobacterales_odb10 --ncontigs 2000 --N50 5000 --buscoM 10% --buscoF 15%

Starting from proteomes

Gempipe gives its best when starting from genomes, as a gene recovery procedure is applied. Anyway, starting directly from proteomes is also allowed (see flowchart), if reliable genome annotations are available. A proteome is here defined as a multifasta file containing all the strain-specific gene sequences written in aminoacidic format. There are several options to specifiy the input proteomes:

Specify a folder containing the input proteomes. They are all assumed to belong to the same species. For example:

    gempipe recon -p my_proteomes/

Specify a list of input proteome files (comma separated). They are all assumed to belong to the same species. For example:

    gempipe recon -p my_proteomes/GCA_001689725.1.fa,my_proteomes/GCA_001756855.1.fa,my_proteomes/GCA_003058285.2.fa

Specify a list of input proteome files, grouped by species. This follows a rigid sintax: SpA@P1,P2,P3+SpB@P4,P5+SpC@P6,P7.... Please replace spaces with underscores. For example:

    gempipe recon -p Erwinia_dacicola@my_proteomes/GCA_001689725.1.fa,my_proteomes/GCA_001756855.1.fa,my_proteomes/GCA_003058285.2.fa+Erwinia_aphidicola@my_proteomes/GCA_016925695.1.fa,my_proteomes/GCA_024169515.1.fa

Starting from genbanks

Proteomes can be also passed as genbank files. In this case, Gempipe will try to recover gene annotations to boost the Memote metrics under the “gene annotation” section. There are several options to specifiy the input genbanks:

Specify a folder containing the input genbanks. They are all assumed to belong to the same species. For example:

    gempipe recon -gb my_genbanks/

Specify a list of input genbank files (comma separated). They are all assumed to belong to the same species. For example:

    gempipe recon -gb my_genbanks/GCA_001689725.1.gb,my_genbanks/GCA_001756855.1.gb,my_genbanks/GCA_003058285.2.gb

Specify a list of input genbank files, grouped by species. This follows a rigid sintax: SpA@P1,P2,P3+SpB@P4,P5+SpC@P6,P7.... Please replace spaces with underscores. For example:

    gempipe recon -gb Erwinia_dacicola@my_genbanks/GCA_001689725.1.gb,my_genbanks/GCA_001756855.1.gb,my_genbanks/GCA_003058285.2.gb+Erwinia_aphidicola@my_genbanks/GCA_016925695.1.gb,my_genbanks/GCA_024169515.1.gb

💡 Tip! Start from genbanks if the highest Memote metrics are desired: only with this input mode, Gempipe will try to improve also the “gene annotation” section.

Creation of a PAM

Genes are clustered based on their amino acidic sequence identity. Each cluster has one representative sequence.

Following the clustering, a presence/absence matrix (PAM) is derived. The PAM has clusters in row and strains in column. If a strain harbors two or more genes belonging to the same cluster, those will appear in the same cell separated by a semicolon (;). As an example, a PAM with just 3 strains is reported below.

	GCA_918698235.1	GCA_014773485.1	GCA_024169515.1
Cluster_0	LBGCNHPF_02539	OEIEKCCN_02232	OCILAMFM_00111
Cluster_1	NaN	OEIEKCCN_03649	OCILAMFM_04027
Cluster_2	NaN	OEIEKCCN_04206	NaN
Cluster_3	NaN	NaN	OCILAMFM_04171
Cluster_4	LBGCNHPF_03507	OEIEKCCN_03842	OCILAMFM_04138
...	...	...	...
Cluster_5475	LBGCNHPF_02700	OEIEKCCN_04289	OCILAMFM_01419
Cluster_5476	LBGCNHPF_01132	NaN	NaN
Cluster_5477	NaN	NaN	OCILAMFM_04217
Cluster_5478	LBGCNHPF_03644	OEIEKCCN_00956	NaN
Cluster_5479	LBGCNHPF_00113	OEIEKCCN_03524;OEIEKCCN_04081	OCILAMFM_03912

5460 rows × 3 columns

Gene recovery

When starting from genomes (-t or -g), Gempipe executes an extensive gene recovery composed by 3 subsequent modules. To skip the gene recovery, activate the --norec option. To have more details on the gene recovery implementation, please read Methods in the Gempipe paper.

Module 1. Sometimes a gene in a genome undergo little mutations, such as base insertions, deletions or sobstitutions. This will probably lead to the formation of premature stop codons in the middle of the sequence. This in turn will lead to the prediction of two or more coding sequences, instead of one, from the same genomic region. Anyway, a similar behaviour can be also be due to sequencing or assembling errors. Gempipe search for proteins broken in two high-identity pieces, and assumes the broke up to be due to technical (non-biological) issues. Sequences recovered in this way have a _frag suffix in the PAM.
Module 2. The prediction of coding sequences in a genome (gene calling) may not be perfect. Some genes may simply not be seen. To cope with this problem, Gempipe searches for extra coding sequences by aligning representative sequences on the genome masked from its predicted genes. Like in all recovery modules, only high-identity and high-coverage alignments are considered. Sequences recovered in this way have a _refound suffix in the PAM.
Module 3. From a bioinformatics point of view, sometimes a coding sequence ends internally to another coding sequence. In these cases, gene callers may have problems in distinguishing the two overlapping genes. Module 3 takes the short fragments recovered in Module 2, and seeks to extend them over the previously masked regions (where other genes were originally annotated by the gene caller). Sequences recovered in this way have a _overlap suffix in the PAM.

⏩ Warning! Gempipe gives its best when starting from genomes: starting from proteomes (multifasta or genbank) will skip the gene recovery.

Draft pan-GSMM reference-free reconstruction

Gempipe reconstructions are based on the BiGG database, preferred over ModelSEED because of its human readable IDs (compare for example the BiGG ID for glucose glc__D with its ModelSEED counterpart cpd00027). These turn out to be particularly convenient when drawing metabolic maps with Escher or Escher-FBA.

Gempipe uses the representative sequences of the clusters to create a reference-free draft pan-GSMM using the same CarveMe gene database and universes, but applying different reconstruction rules. To choose between gram positive and gram negative reconstructions, use the -s/--staining option.

During the reconstruction process, only reactions with genetic support are included, and no automatic gap-filling is performed, this way minimizing the number of false-positives. This approach goes in the opposite direction compared to other famous tools like CarveMe, which include an unskippable gap-filling algorithm needed to ensure “FBA-ready” models in output: this forced gap-filling might be the cause of the inclusion of false-positive reactions (please read Results in the Gempipe paper).

Note that a BiGG reaction can appear in different BiGG models with different protein complex definitions encoded in the GPR. With gempipe recon, a reaction is copied from the selected universe only if at least one of the alternative original protein complexes definitions is fully satisfied. Moreover, if two or more slightly different genes align equally well on the same BiGG gene, all the involved GPRs will take into account these alternative isoforms. These two design decisions should lead to more accurate GPRs, solving two known CarveMe issues, #180 and #182.

Draft pan-GSMM expanded reference-based reconstruction

Gempipe accetps in input an optional reference GSMM (any type: pan or strain-specific): the reference GSMM itself is provided with -rm/--refmodel, while the associated proteome (aminoacidic multifasta) is provided with -rp/--refproteome. Providing a reference, the reference-free reconstruction previously produced (see above) is used to expand the reference itself with new genes and reactions.

In “pure” reference-based reconstructions, orthologs are determined between each strain and the reference, for example via a best-reciprocal hits alignment (BRH), as described in the 2019 Nature Protocol Extension and implemented in a pipeline named Bactabolize. Then, for each strain, a copy of the reference is subtracted from the genes missing an ortholog, leading to the retention of reactions specific to the strain to model. Since it is rare to dispose in advance of a manually curated pan-model for the species under study, usually the curated reference used in the BRH approach is a strain-specific model. This implies that the strain to model will harbour only a subset of the reference strain’s reactions. This clearly limits the exploration of the biodiversity at the strain level. Therefore, reference-based reconstructions made with gempipe recon are always expanded with new reactions coming from the previously produced reference-free reconstruction (see flowchart).

First, for each strain to model, a BRH with the reference is performed. This enables a “translation” of reference gene IDs into the respective cluster IDs (rows of the PAM). Then, the translated reference GSMM is expanded with new reactions and genes coming from the reference-free reconstruction. After this expansion process, the expanded reference GSMM becomes a draft pan-GSMM (see flowchart).

In these reference-based reconstructions, the reference is the cornerstone of the reconstruction, providing the biomass equation, the NGAM energy (non-growth associated maintainence energy), all the chemical formulas and charges assigned to metabolites, and so on. Therefore, the transfer of reactions from the reference-free reconstruction to the reference GSMM is not conducted blindly, but follows some principles:

If the reaction ID is already included in the reference, the corresponding GPR is updated including eventual new gene clusters not yet included.
If the reaction ID is not included in the reference, synonym reactions are searched, having the same reactant and product IDs, ignoring protons. If a synonym reaction is found in the reference, the expansion is limited to new gene clusters that may be included in its GPR.
If a reaction is new, it is added to the reference model. New metabolites and gene clusters are also transferred if needed.

During the reaction transfer, if a metabolite is encoded both in the reference and in the reference-free reconstruction with same ID but different formula and/or charge, the metabolite definition of the reference is maintained. During the reference expansion, however, users are allowed to superimpose specific reaction balances and metabolite formulas and charges using the -m/--mancor option. This will provide a text file of manual corrections to be applied during the reference expansion, particularly useful to avoid unbalanced reactions and stoichiometric inconsistencies. It describes one rule per line, following a rigid syntax: rule.ID:value. Rules are divided in 4 categories:

Formula of a metabolite, with the rule formula. For example:

    formula.isocapcoa:C27H42N7O17P3S

Charge of a metabolite, with the rule charge. For example:

    charge.cdigmp:-2

Definition of a reaction, with the rule reaction. For example:

    reaction.NTD12:dimp_c + h2o_c --> din_c + pi_c + h_c

Exclusion of a reaction from the transfer, with the rule blacklist. For example:

    blacklist.LIPO3S24_BS

Plase note that metabolite IDs must be provided without compartment in the formula and charge rules. It is possible to comment a line (that is, to ignore a rule) prepending a % at the beginnning of the line.

💡 Tip! Often, an artifact gene marking all the spontanous reactions is utilized in GSMMs (usually spontaneous, literally). This gene, here called the “spontaneous” gene, is useful to distinguish spontaneous reactions from reactions that remained without GPR (orphans). In gempipe recon, the “spontaneous” gene adopted in the reference can be specified using -rs/--refspont.

Transport reactions expansion with TCDB

Since the goodness of qualitative growth simulations usually depends on transport reactions, and since the BiGG gene database is small and biased towards model organisms, gempipe recon provides an optional expansion of the tranport reactions using the --tcdb switch (still an experimental implementation).

Briefly, representative sequences of the clusters are aligned on the entire TCDB gene database. For many transport families, BiGG-compatible reaction strings were generated in advance. These are imported into a GSMM as new reactions only if all the components of the trasport protein complex have been detected. Be default, the addition of TCDB-derived transport reactions is disabled, as they could impaire the stoichiometric consistency.

Annotation and duplicates removal

When a draft pan-GSMM has been obtained, it is subjected to metabolites and reactions re-annnotation (see flowchart), performed de novo using MetaNetX v4.4 (MNX). This adds annotations for HMDB, KEGG, MetaCyc, Rhea, ChEBI, and many other databases. As a consequence, Memote metrics of the output GSMMs should improve.

Then, the MNX annotation is optionally used to detect and correct eventual duplicated metabolites and reactions in the draft pan-GSMM, activating the --dedup option. Duplicates are present in the CarveMe universes and could be inherited directly, for example glc_D_B and glc__bD; other duplicates could be formed during the reference expansion. Duplicate metabolites are replaced by a consensus metabolite, giving priority to metabolites in the reference GSMM, if available. If the duplicate and the consensus metabolites are both present in the reference GSMM, they are both retained.

Next, if --dedup is active, the MNX annotation is used to correct for duplicate reactions. As for duplicate metabolites, the consensus reaction is selected giving priority to those in the reference GSMM; if duplicates are present in the reference, they are kept. Otherwise, the GPR of the consensus reaction is updated to incorporate all the gene clusters appearing among the duplicates. By default, the metabolites and reactions deduplication is disabled, as it could impaire the stoichiometric consistency.

The re-annotated draft-pan GSMM can be seen as the ultimate draft pan-GSMM, main output of gempipe recon together with the PAM.

Output files

gempipe recon produces 4 main output files in the specified output directory (-o/--outdir). These are required for manual curating the draft pan-GSMM using the Gempipe API (see the function gempipe.initialize):

draft_panmodel.json. The draft pan-GSMM, which will likely need a manual curation (see Part 2).
draft_panmodel.faa representative aminoacidic sequences of gene clusters modelled in the draft pan-GSMM.
pam.csv. The final presence/absence matrix (PAM), accounting for the gene recovery (if performed).
annotation.csv. Functional annotation of gene clusters made by eggNOG-mapper.

Moreover, Gempipe produces also a working directory (./working/) with all the intermediate files and logs needed during reconstruction. The working directory allows Gempipe to restart from the last completed task in case of interruption. To avoid the use of cached files (i.e., to perform a fresh run), utilize the --overwrite option. Please note that the working directory can include also ~50 GB of databases needed for the functional annotation.

Some files in the working directory are higly informative, and are described below:

filtering/tmetrics.csv. Technical metrics (N50, n_contigs) used to filter good quality genomes.
filtering/bmetrics.csv. Biological metrics (BUSCO) used to filter good quality genomes.
expansion/results.csv. Table listing the reactions of the reference-free reconstruction (rid) and their fate respect to the expanding reference (action). If a reaction ID or a synonym is found in the reference (rid_found or synonym, respectively), then the reaction will not be used (action:ignore), unless some new genes have to be handled to update the GPR (action:update_gpr). If a reference reaction is found (either same ID or a synonym), then the presence of the exact same reactants and products (regardless protons) is checked (same_mids), as well as their formula and charge (same_fc). If the final reaction included ends up to be unbalaced, then balancing suggestions are included (bal_suggestions).
duplicates/dup_m_edits.csv. Table listing the duplicate metabolites (duplicated) and their replacing metabolite (replacement). Formula, charge, presence in the reference and presence in the reference-free reconstruction are reported both for the duplicate and the replacement.
duplicates/dup_m_translations.csv. Table listing the reactions (rid) affected by duplication removal, which are visible before (reaction_old) and after (reaction_new) the removal. Eventual new metabolites created appear under added_mids. If the final reaction ends up to be unbalaced, then balancing suggestions are included (bal_suggestions).
duplicates/dup_r_edits.csv. Table listing the duplicate reactions (duplicated) and their replacing reaction (replacement). Reaction definition, GPR, presence in the reference and presence in the reference-free reconstruction are reported both for the duplicate and the replacement.