Part 3: gempipe derive

gempipe derive is designed to derive strain-specific GSMMs, using as main inputs the PAM (presence/absence matrix) and the pan-GSMM. Below are shown all its options:

usage: gempipe derive [-h] [-v] [-c] [-o] [--verbose] [-im] [-ip] [-ir] [-ig]
                      [-m] [--minflux] [--biolog] [--sbml] [--skipgf]
                      [--nofig] [--aux] [--cnps] [--cnps_minmed] [--biosynth]

gempipe v1.38.4, please cite "TODO". Full documentation available at
https://gempipe.readthedocs.io/en/latest/index.html.

optional arguments:
  -h, --help           Show this help message and exit.
  -v, --version        Show version number and exit.
  -c , --cores         How many parallel processes to use. (default: 1)
  -o , --outdir        Main output directory (will be created if not
                       existing). (default: ./)
  --verbose            Make stdout messages more verbose, including debug
                       messages. (default: False)
  -im , --inpanmodel   Path to the input pan-model. (default: -)
  -ip , --inpam        Path to the input PAM. (default: -)
  -ir , --inreport     Path to the input report file. (default: -)
  -ig , --ingannots    Path to the input genes annotation file. (default: -)
  -m , --media         Medium definition file or folder containing media
                       definitions, to be used during the automatic gap-
                       filling. (default: -)
  --minflux            Minimum flux through the objective of strain-specific
                       models. (default: 0.1)
  --biolog             Simulate Biolog's utilization tests on strain-specific
                       models. (default: False)
  --sbml               Save the output GSMMs in SBML format (L3V1 FBC2) in
                       addition to JSON. (default: False)
  --skipgf             Skip the gap-filling step applied to the strain-
                       specific models. (default: False)
  --nofig              Skip the generation of figures. (default: False)
  --aux                Test auxotrophies for aminoacids and vitamins.
                       (default: False)
  --cnps               Sistematically simulate growth on all the available
                       C-N-P-S sources. (default: False)
  --cnps_minmed        Base the C-N-P-S simulations on a minimal medium
                       leading to the specified minimum objective value. If 0,
                       user-defined medium will be used. (default: 0.0)
  --biosynth           Check biosynthesis of each metabolite while granting
                       the specified minimum fraction of objective. If 0, this
                       step will be skipped. (default: 0.0)

Deriving strain-specific models

Three inputs are required by gempipe derive to start:

The pan-GSMM (-im/--inpanmodel), in its final version. Ideally, the pan-GSMM was produced with gempipe recon (Part 1) and subsequently curated (Part 2).
The presence/absence matrix (PAM) (-ip/--inpam), as it was provided by gempipe recon. As discussed in Part 1, this table contains clusters in row and strains in column, and each cell contains the strain’s genes belonging to the corresponding cluster.
The report (-ir/--inreport) produced by gempipe recon, contaning the strain-to-species relationships.

An example of command line is reported below:

gempipe derive -c 16 -im curation/panmodel.json -ip reconout/pam.csv -ir reconout/report.csv

When gempipe derive starts, a species-specific GSMM is created for each genome appearing in the PAM. This procedure may be called “derivation by subtraction”, meaning that a copy of the pan-GSMM is subtracted from all the genes that are missing from the corresponding strain, according to the PAM content. Removing genes can eventually lead to the removal of metabolic reactions, in accordance to the GPR definitions.

💡 Tip! If gempipe recon started from genbanks, then a genome annotation file was produced. It can be passed to gempipe derive using -ig/--ingannots to improve the “gene annotation” section of the Memote reports.

Gap-filling strain-specific models

After strain-specific GSMMs are derived, they may not grow. The reasons are several:

the biomass equation coming from the pan-GSMM contains some strain-specific precursors. In this case, it would be better to first remove the strain-specific precursors from the biomass assembly during the manual curation in Part 2.
the GPR of some reactions essential to form biomass doesn’t take into account strain-specific gene variants. By design, the clustering step in gempipe recon is performed at high sequence identity (90%), meaning that two orthologous genes, if sufficiently dissimilar, may be grouped in different clusters even if the function is the same. Despite gempipe recon does its best to manage these cases, it could be necessary to expand some GPR during the manual curation in Part 2.

To solve the cases where a strain-specific GSMM doesn’t grow, an automated and strain-specific step of gap-filling is performed. As the gap-filling is dependant on the nutritional inputs, one or more growth media recipes, better if minimal, have to be provided (--media): they are assumed to support growth of all the strains in input. Despite this automated gap-filling, each strain that doesn’t grow should deserve a manual inspection, to see if one of the cases discussed above has occurred, leading to an improved manual curation.

Medium recipes can be described using a json file following a rigid syntax, composed by the keys name and exchanges, plus the lower bounds of each component. For example:

{"name": "my_medium_name", "exchanges": {"EX_ca2_e": -1000, "EX_cl_e": -1000, "EX_co2_e": -1000, "EX_cobalt2_e": -1000, "EX_cu2_e": -1000, "EX_h_e": -1000, "EX_fe2_e": -1000, "EX_fe3_e": -1000, "EX_h2o_e": -1000, "EX_pi_e": -1000, "EX_glc__D_e": -1000, "EX_so4_e": -1000, "EX_k_e": -1000, "EX_zn2_e": -1000, "EX_mg2_e": -1000, "EX_mn2_e": -1000, "EX_mobd_e": -1000, "EX_nh4_e": -1000, "EX_o2_e": -1000}}

However, please note that the gap-filling may hide strain-specific auxotrophies. If the detection of auxotrophies is of interest, and the medium provided is not rich (e.g. it does not include amino acids and vitamins), then the gap-filling phase should be skipped using --skipgf.

Biolog® PM simulations

Biolog® PM growth assays are commonly used to drive manual curation and to validate a GSMM. Using the --biolog option, the user can automatically simulate all the main Biolog® PM plates (such as PM1, PM2A, PM3, PM4, and so on) for each modeled strain. Alternatively, the same function can be interactively called via the Gempipe API (see the function gempipe.biolog_preview).

With this feature, substrates composing the Biolog® PM plates are iterated and tested for utilization, see below for an example output. The same substrate (substrate) can be considered as a C, N, P, or S source (source) according to the plate or origin; in these cases, a different growth simulation will be performed for each nutritive source. For example, if the starting C source is glucose (exr_before), then it is first removed and an FBA is performed; then the new C source (exr_after) is introduced, if it is present in the GSMM (exr_after_present), and the FBA is performed again; if the objective value increases, then the strain is assumed to be able to use the substrate (growth=True). For more information please read the Gempipe paper.

	substrate	source	formula	exr_before	exr_before_present	exr_after	exr_after_present	growth	value	status
0	L-Arabinose	C	NaN	EX_glc__D_e	True	EX_arab__L_e	False	NaN	NaN	NaN
1	D-Saccharic Acid	C	C6H8O8	EX_glc__D_e	True	EX_glcr_e	True	True	19.728400	optimal
2	D-Galactose	C	C6H12O6	EX_glc__D_e	True	EX_gal_e	True	False	0.000000	optimal
3	L-Aspartic Acid	N	C4H6NO4	EX_nh4_e	True	EX_asp__L_e	True	True	28.743739	optimal
4	L-Aspartic Acid	C	C4H6NO4	EX_glc__D_e	True	EX_asp__L_e	True	True	17.614535	optimal
...	...	...	...	...	...	...	...	...	...	...
646	Thiourea	S	NaN	EX_so4_e	True	NaN	False	NaN	NaN	NaN
647	D-Serine	N	C3H7NO3	EX_nh4_e	True	EX_ser__D_e	True	True	34.731331	optimal
648	D-Serine	C	C3H7NO3	EX_glc__D_e	True	EX_ser__D_e	True	True	33.295776	optimal
649	Succinic Acid	C	C4H4O4	EX_glc__D_e	True	EX_succ_e	True	True	17.473351	optimal
650	Fumaric Acid	C	C4H2O4	EX_glc__D_e	True	EX_fum_e	True	True	17.614535	optimal

651 rows × 10 columns

Strain-specific screenings

GSMMs can be used to predict strain-specific metabolic features, on which a multi-strain analysis can be based (see Part 4). With gempipe derive, for each strain, it is possible to predict some of these features. They are binary, meaning that only their presence/absence is discribed, not their magnitude.

Using the --aux option, each strain is screened for auxotrophies for amino acids and vitamins. The output is a binary feature table (aux.csv) having strains in column and amino acids or vitamins in row, for which the auxotrophy is evaluated. Cells contain 1 when the auxotrophy is present, otherwise 0.

Using the --cnps option, the ability to catabolize alternative C/N/P/S sources is evaluated for each strain. Similarly to Biolog substrates, a starting C source is first removed and an FBA is performed; then an alternative C source is introduced and the FBA is performed again; if the objective value increases, this indicates that the strain can potentially catabolize the alternative source. This is iterated for every C source having a dedicated exchange reaction in the GSMM, then alternative N, P, and S sources are evaluated using the same procedure. The output is a binary feature table (cnps.csv) containing 1 where a catabolic activity is detected, otherwise 0. To perform this analysis on a computed minimal medium, add the --cnps_minmed option.

Using the --biosynth option, the potential biosynthesis of target metabolites, is evaluated for each strain. Biomass production is constrained to a user-specified fraction of its maximum. The resulting a binary feature table (biosynth.csv) contains 1 when strains are considered potential producers of the tested metabolite. By default, all cytosolic metabolites are tested.

Deriving species-specific models

A species-specific GSMM is here defined as a GSMM composed by all the reactions always present in all the strain-specific GSMMs of the same species. Having a representative GSMM for each species enables to perform comparisons at the species level, detecting for example which pathways distinguish a species from other phylogenetically close species.

In gempipe derive, after gap-filling the strain-specific GSMMs, one species-specific GSMM is derived for each of the species in input. Moreover, like strain-specific GSMMs, species-specific GSMMs are also gapfilled on the media indicated with --media, unless --skipgf is activated.

Output files

gempipe derive produces 2 main output files in the current directory (-o/--outdir):

strain_models_gf/*.json. The gap-filled strain-specific GSMMs.
species_model_gf/*.json. The gap-filled species-specific GSMMs.

Other useful files are the following:

strain_models/*.json. The strain-specific GSMMs, before gap-filling.
species_models/*.json. The species-specific GSMMs, before gap-filling.
derive_strains.csv. Table showing metrics for each strain-specific GSMM, such as the number of reactions, the objective value and the solver status, before and after the gap-filling step.
derive_species.csv. Table showing metrics for each species-specific GSMM, such as the number of reaction, the objective value and the solver status. One row for each species inputted in gempipe recon.
rpam.csv Binary feature table reporting the presence of each reaction in each strain-specific GSMM.
aux.csv Binary feature table reporting the presence amino acids and vitamins auxotrophies for each strain-sspecific GSMM (requires --aux).
cnps.csv Binary feature table reporting the potential catabolic activity for alternative C, N, P, and S sources (requires --cnps).
biosynth.csv Binary feature table reporting potential biosynthetic activities (requires --biosynth).