Part 3: gempipe derive
gempipe derive is designed to derive strain-specific GSMMs, using as main inputs the PAM (presence/absence matrix) and the pan-GSMM. Below are shown all its options:
usage: gempipe derive [-h] [-v] [-c] [-o] [--verbose] [-im] [-ip] [-ir] [-ig]
[-m] [--minflux] [--biolog] [--sbml] [--skipgf]
[--nofig] [--aux] [--cnps] [--cnps_minmed] [--biosynth]
gempipe v1.38.4, please cite "TODO". Full documentation available at
https://gempipe.readthedocs.io/en/latest/index.html.
optional arguments:
-h, --help Show this help message and exit.
-v, --version Show version number and exit.
-c , --cores How many parallel processes to use. (default: 1)
-o , --outdir Main output directory (will be created if not
existing). (default: ./)
--verbose Make stdout messages more verbose, including debug
messages. (default: False)
-im , --inpanmodel Path to the input pan-model. (default: -)
-ip , --inpam Path to the input PAM. (default: -)
-ir , --inreport Path to the input report file. (default: -)
-ig , --ingannots Path to the input genes annotation file. (default: -)
-m , --media Medium definition file or folder containing media
definitions, to be used during the automatic gap-
filling. (default: -)
--minflux Minimum flux through the objective of strain-specific
models. (default: 0.1)
--biolog Simulate Biolog's utilization tests on strain-specific
models. (default: False)
--sbml Save the output GSMMs in SBML format (L3V1 FBC2) in
addition to JSON. (default: False)
--skipgf Skip the gap-filling step applied to the strain-
specific models. (default: False)
--nofig Skip the generation of figures. (default: False)
--aux Test auxotrophies for aminoacids and vitamins.
(default: False)
--cnps Sistematically simulate growth on all the available
C-N-P-S sources. (default: False)
--cnps_minmed Base the C-N-P-S simulations on a minimal medium
leading to the specified minimum objective value. If 0,
user-defined medium will be used. (default: 0.0)
--biosynth Check biosynthesis of each metabolite while granting
the specified minimum fraction of objective. If 0, this
step will be skipped. (default: 0.0)
Deriving strain-specific models
Three inputs are required by gempipe derive to start:
The pan-GSMM (
-im/--inpanmodel), in its final version. Ideally, the pan-GSMM was produced withgempipe recon(Part 1) and subsequently curated (Part 2).The presence/absence matrix (PAM) (
-ip/--inpam), as it was provided bygempipe recon. As discussed in Part 1, this table contains clusters in row and strains in column, and each cell contains the strain’s genes belonging to the corresponding cluster.The report (
-ir/--inreport) produced bygempipe recon, contaning the strain-to-species relationships.
An example of command line is reported below:
gempipe derive -c 16 -im curation/panmodel.json -ip reconout/pam.csv -ir reconout/report.csv
When gempipe derive starts, a species-specific GSMM is created for each genome appearing in the PAM. This procedure may be called “derivation by subtraction”, meaning that a copy of the pan-GSMM is subtracted from all the genes that are missing from the corresponding strain, according to the PAM content. Removing genes can eventually lead to the removal of metabolic reactions, in accordance to the GPR definitions.
💡 Tip! If gempipe recon started from genbanks, then a genome annotation file was produced. It can be passed to gempipe derive using -ig/--ingannots to improve the “gene annotation” section of the Memote reports.
Gap-filling strain-specific models
After strain-specific GSMMs are derived, they may not grow. The reasons are several:
the biomass equation coming from the pan-GSMM contains some strain-specific precursors. In this case, it would be better to first remove the strain-specific precursors from the biomass assembly during the manual curation in Part 2.
the GPR of some reactions essential to form biomass doesn’t take into account strain-specific gene variants. By design, the clustering step in
gempipe reconis performed at high sequence identity (90%), meaning that two orthologous genes, if sufficiently dissimilar, may be grouped in different clusters even if the function is the same. Despitegempipe recondoes its best to manage these cases, it could be necessary to expand some GPR during the manual curation in Part 2.
To solve the cases where a strain-specific GSMM doesn’t grow, an automated and strain-specific step of gap-filling is performed. As the gap-filling is dependant on the nutritional inputs, one or more growth media recipes, better if minimal, have to be provided (--media): they are assumed to support growth of all the strains in input. Despite this automated gap-filling, each strain that doesn’t grow should deserve a manual inspection, to see if one of the cases discussed above has occurred, leading to an improved manual curation.
Medium recipes can be described using a json file following a rigid syntax, composed by the keys name and exchanges, plus the lower bounds of each component. For example:
{"name": "my_medium_name", "exchanges": {"EX_ca2_e": -1000, "EX_cl_e": -1000, "EX_co2_e": -1000, "EX_cobalt2_e": -1000, "EX_cu2_e": -1000, "EX_h_e": -1000, "EX_fe2_e": -1000, "EX_fe3_e": -1000, "EX_h2o_e": -1000, "EX_pi_e": -1000, "EX_glc__D_e": -1000, "EX_so4_e": -1000, "EX_k_e": -1000, "EX_zn2_e": -1000, "EX_mg2_e": -1000, "EX_mn2_e": -1000, "EX_mobd_e": -1000, "EX_nh4_e": -1000, "EX_o2_e": -1000}}
However, please note that the gap-filling may hide strain-specific auxotrophies. If the detection of auxotrophies is of interest, and the medium provided is not rich (e.g. it does not include amino acids and vitamins), then the gap-filling phase should be skipped using --skipgf.
Biolog® PM simulations
Biolog® PM growth assays are commonly used to drive manual curation and to validate a GSMM. Using the --biolog option, the user can automatically simulate all the main Biolog® PM plates (such as PM1, PM2A, PM3, PM4, and so on) for each modeled strain. Alternatively, the same function can be interactively called via the Gempipe API (see the function gempipe.biolog_preview).
With this feature, substrates composing the Biolog® PM plates are iterated and tested for utilization, see below for an example output. The same substrate (substrate) can be considered as a C, N, P, or S source (source) according to the plate or origin; in these cases, a different growth simulation will be performed for each nutritive source. For example, if the starting C source is glucose (exr_before), then it is first removed and an FBA is performed; then the new C source (exr_after) is introduced, if it is present in the GSMM (exr_after_present), and the FBA is performed again; if the objective value increases, then the strain is assumed to be able to use the substrate (growth=True). For more information please read the Gempipe paper.
| substrate | source | formula | exr_before | exr_before_present | exr_after | exr_after_present | growth | value | status | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | L-Arabinose | C | NaN | EX_glc__D_e | True | EX_arab__L_e | False | NaN | NaN | NaN |
| 1 | D-Saccharic Acid | C | C6H8O8 | EX_glc__D_e | True | EX_glcr_e | True | True | 19.728400 | optimal |
| 2 | D-Galactose | C | C6H12O6 | EX_glc__D_e | True | EX_gal_e | True | False | 0.000000 | optimal |
| 3 | L-Aspartic Acid | N | C4H6NO4 | EX_nh4_e | True | EX_asp__L_e | True | True | 28.743739 | optimal |
| 4 | L-Aspartic Acid | C | C4H6NO4 | EX_glc__D_e | True | EX_asp__L_e | True | True | 17.614535 | optimal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 646 | Thiourea | S | NaN | EX_so4_e | True | NaN | False | NaN | NaN | NaN |
| 647 | D-Serine | N | C3H7NO3 | EX_nh4_e | True | EX_ser__D_e | True | True | 34.731331 | optimal |
| 648 | D-Serine | C | C3H7NO3 | EX_glc__D_e | True | EX_ser__D_e | True | True | 33.295776 | optimal |
| 649 | Succinic Acid | C | C4H4O4 | EX_glc__D_e | True | EX_succ_e | True | True | 17.473351 | optimal |
| 650 | Fumaric Acid | C | C4H2O4 | EX_glc__D_e | True | EX_fum_e | True | True | 17.614535 | optimal |
651 rows × 10 columns
Strain-specific screenings
GSMMs can be used to predict strain-specific metabolic features, on which a multi-strain analysis can be based (see Part 4). With gempipe derive, for each strain, it is possible to predict some of these features. They are binary, meaning that only their presence/absence is discribed, not their magnitude.
Using the --aux option, each strain is screened for auxotrophies for amino acids and vitamins. The output is a binary feature table (aux.csv) having strains in column and amino acids or vitamins in row, for which the auxotrophy is evaluated. Cells contain 1 when the auxotrophy is present, otherwise 0.
Using the --cnps option, the ability to catabolize alternative C/N/P/S sources is evaluated for each strain. Similarly to Biolog substrates, a starting C source is first removed and an FBA is performed; then an alternative C source is introduced and the FBA is performed again; if the objective value increases, this indicates that the strain can potentially catabolize the alternative source. This is iterated for every C source having a dedicated exchange reaction in the GSMM, then alternative N, P, and S sources are evaluated using the same procedure. The output is a binary feature table (cnps.csv) containing 1 where a catabolic activity is detected, otherwise 0. To perform this analysis on a computed minimal medium, add the --cnps_minmed option.
Using the --biosynth option, the potential biosynthesis of target metabolites, is evaluated for each strain. Biomass production is constrained to a user-specified fraction of its maximum. The resulting a binary feature table (biosynth.csv) contains 1 when strains are considered potential producers of the tested metabolite. By default, all cytosolic metabolites are tested.
Deriving species-specific models
A species-specific GSMM is here defined as a GSMM composed by all the reactions always present in all the strain-specific GSMMs of the same species. Having a representative GSMM for each species enables to perform comparisons at the species level, detecting for example which pathways distinguish a species from other phylogenetically close species.
In gempipe derive, after gap-filling the strain-specific GSMMs, one species-specific GSMM is derived for each of the species in input. Moreover, like strain-specific GSMMs, species-specific GSMMs are also gapfilled on the media indicated with --media, unless --skipgf is activated.
Output files
gempipe derive produces 2 main output files in the current directory (-o/--outdir):
strain_models_gf/*.json. The gap-filled strain-specific GSMMs.
species_model_gf/*.json. The gap-filled species-specific GSMMs.
Other useful files are the following:
strain_models/*.json. The strain-specific GSMMs, before gap-filling.
species_models/*.json. The species-specific GSMMs, before gap-filling.
derive_strains.csv. Table showing metrics for each strain-specific GSMM, such as the number of reactions, the objective value and the solver status, before and after the gap-filling step.
derive_species.csv. Table showing metrics for each species-specific GSMM, such as the number of reaction, the objective value and the solver status. One row for each species inputted in
gempipe recon.rpam.csv Binary feature table reporting the presence of each reaction in each strain-specific GSMM.
aux.csv Binary feature table reporting the presence amino acids and vitamins auxotrophies for each strain-sspecific GSMM (requires
--aux).cnps.csv Binary feature table reporting the potential catabolic activity for alternative C, N, P, and S sources (requires
--cnps).biosynth.csv Binary feature table reporting potential biosynthetic activities (requires
--biosynth).