Tutorial: sanity check

The main output of gempipe recon is a folder (-o/--outdir) containing, among the other files, the draft pan-GSMM. Like every draft GSMM, it needs some sanity checks, and eventually some gap-filling. In this tutorial we show some handy functions contained in Gempipe to perform the sanity check. This is not meant to replace the community effort Memote, but just to provide a quick and convenient way to check the main sanity standards before going on deriving strain-specific GSMMs with gempipe derive.

Here we start from an expanded reference-based reconstruction created with gempipe recon, based on a curated model for Lactiplantibacillus plantarum (formerly Lactobacillus plantarum). This reference was originally created in Teusink2006, but the GSMM used during this reconstruction is an updated version taken from the supplementary materials of Mendoza2021. 22 L. plantarum genomes were selected from the strains included in the comparative analysis by Siezen2010.

gempipe recon -s pos -g plantarum_genomes/ -b lactobacillales_odb10 -c 16 -rm from_mendoza/iLP728.xml -rp from_mendoza/protein_fasta.faa -mc mancor.txt --outdir docs/tutoring_materials/plantarum

To begin, we load the library and the draft pan-GSMM we want to check:

import gempipe

# initialize gempipe on the 'gempipe recon' --outdir:
panmodel = gempipe.initialize("tutoring_materials/plantarum")

Loading PAM (tutoring_materials/plantarum/pam.csv)...
Loading functional annotation table (tutoring_materials/plantarum/annotation.csv)...
Loading report table (tutoring_materials/plantarum/report.csv)...
Loading draft pan-GSMM (tutoring_materials/plantarum/draft_panmodel.json)...

We start from setting the unconstrained lower and upper bounds. Originally they were set as -999999/999999, and this doesn’t represent an issue. Anyway, we prefer to have them in the style -1000/1000, that seems more established today. We use the function gempipe.reset_unconstrained_bounds:

# get the current unconstrained lb/ub:
gempipe.get_unconstrained_bounds(panmodel)

(-999999.0, 999999.0)

# set the unconstrained lb/ub to -1000/1000:
gempipe.reset_unconstrained_bounds(panmodel)

# verify the edit:
gempipe.get_unconstrained_bounds(panmodel)

(-1000, 1000)

Below we seek to identify the biomass assembly reaction of this pan-GSMM using the gempipe.search_biomass function. The model appears to contain two alternative definitions, one of them lacking the GAM (growth-associated maintainence) term, which is also set as the current objective. It’s usually adopted to simulate growth in a retentostat, that is a particular kind of chemostat for studying physiology at near-zero growth rates.

💡 Tip! In gempipe.search_biomass, use show_reaction=True to print also the full biomass reaction string.

_ = gempipe.search_biomass(panmodel)

1 : biomass_LPL_RETB_t576_NoATP : R_biomass_equation_LPL_Retentostat_B_t576h_No_ATP_costs : (0.000435, 1000.0)
2 : biomass_LPL60 : R_biomass_equation_LPL_specific : (0.0, 0.0)

gempipe.get_objectives(panmodel)

['biomass_LPL_RETB_t576_NoATP']

Since we are interesting in “classic” growth simulations, we need to (1) reset the objective, and (2) close the bounds of the other biomass reaction:

# set the new objective:
panmodel.objective = 'biomass_LPL60'

# enable biomass formation:
panmodel.reactions.get_by_id('biomass_LPL60').bounds = (0, 1000)

# close the other biomass assembly:
panmodel.reactions.get_by_id('biomass_LPL_RETB_t576_NoATP').bounds = (0, 0)

Now it’s time for a preliminary check of wheter the growth predictions are realistic or not. First of all, we need to define the growth conditions. Instead of using the medium concentrations like in the gap-filling tutorial, here we set the exchange reactions to the uptake and secretion rates experimentally determined in Teusink2006 for the growth rate 0.3 1/h. L. plantarum was grown in a 25 mM glucose chemically defined medium (CDM) with a long list of components: those lacking an experimentally determined rate are left unconstrained.

💡 Tip! In gempipe.set_bounded_uptakes and gempipe.set_bounded_secretions, bounds can be provided also as a pair of (value, error), if the experimental error is available (for example: 'EX_glc__D_e': (11.609, 0.5311)). Otherwise, just an absolute value is sufficient (for example: 'EX_glc__D_e': 11.609).

As we can see below, the predicted growth rate looks realistic. Later in this tutorial we will try to simulate using medium concentrations too.

# define the growth conditions:
def apply_rates(model):
    gempipe.reset_growth_env(model)
    gempipe.set_bounded_uptakes(model, {'EX_glc__D_e': (7.41, 0.339), 'EX_cit_e': (0.6, 0.063), 'EX_ala__L_e': (0.03, 0.006), 'EX_arg__L_e': (0.036, 0.006), 'EX_asp__L_e': (0.147, 0.021), 'EX_cys__L_e': (0.021, 0.009), 'EX_glu__L_e': (0.096, 0.009), 'EX_gly_e': (0.078, 0.021), 'EX_ile__L_e': (0.042, 0.003), 'EX_leu__L_e': (0.066, 0.027), 'EX_lys__L_e': (0.051, 0.006), 'EX_phe__L_e': (0.027, 0.003), 'EX_pro__L_e': (0.054, 0.03), 'EX_ser__L_e': (0.474, 0.318), 'EX_thr__L_e': (0.21, 0.054), 'EX_tyr__L_e': (0.069, 0.033), 'EX_val__L_e': (0.096, 0.015)}) 
    gempipe.set_bounded_secretions(model, {'EX_lac__D_e': (12.15, 0.495), 'EX_pyr_e': (0.06, 0.009), 'EX_for_e': (1.17, 0.513), 'EX_ac_e': (2.04, 0.444), 'EX_etoh_e': (0.42, 0.114), 'EX_succ_e': (0.72, 0.015)})
    gempipe.set_unbounded_exchanges(model, ['EX_h2o_e', 'EX_h_e'])  
    # other components in CDM (no experimental rate provided):
    gempipe.set_unbounded_exchanges(model, ['EX_pnto__R_e', 'EX_btn_e', 'EX_nac_e', 'EX_4abz_e', 'EX_pydam_e', 'EX_pydxn_e', 'EX_ribflv_e', 'EX_thm_e', 'EX_fol_e', 'EX_pi_e', 'EX_na1_e', 'EX_nh4_e', 'EX_mn2_e', 'EX_ade_e', 'EX_gua_e', 'EX_ins_e', 'EX_orot_e', 'EX_thymd_e', 'EX_ura_e', 'EX_xan_e', 'EX_his__L_e', 'EX_met__L_e', 'EX_trp__L_e'])

    
# apply medium to the panmodel:
apply_rates(panmodel)

# simulate biomass production:
panmodel.slim_optimize()

0.36310820624546114

We now check the eventual presence of sink and demand reactions, using the function gempipe.check_sinks_demands. There is a demand reaction for the metabolite “4-hydroxy-5-methyl-3(2H)-furanone” (hmfurn_c). Looking at literature, we know that this metabolite is volatile, and must be able to leave the cell. Without any information on specific transporters, it’s legitimate to leave this demand reaction, otherwise the FBA would result to be infeasible.

_ = gempipe.check_sinks_demands(panmodel)

1 : DM_hmfurn_c : hmfurn_c -->  : (0.0, 1000)

Next, we check that every exchange reaction in the model has an ID starting with EX_. This way, we respect conventions. Here, exchange reactions are defined simply as reaction having a single metabolite involved, present in the extracellular compartment.

_ = gempipe.check_exr_notation(panmodel)

No EX_change reaction with bad ID found.

We now check the presence of metabolic reactions (no exchange, demand or sink reactions) with constrained bounds, which we define as everything except for (-1000, 1000) or reversible, and (-0, 1000) or irreversible. This means that reactions defined to go in the opposite direction will also be highlighted. We note for example, that 3 reactions have been closed, such as the catalase, inherited from the original reference GSMM.

_ = gempipe.check_constrained_metabolic(panmodel)

: CAT : (0.0, 0.0) : 2.0 h2o2_c --> 2.0 h2o_c + o2_c
: RHCYS : (0.0, 0.0) : h2o_c + rhcys_c --> hcys__L_c + rib__D_c
: CYTB_B2 : (0.0, 0.0) : 2.0 h_c + mql7_c + 0.5 o2_c --> h2o_c + 2.0 h_e + mqn7_c
: AH6PI : (-1000.0, 0.0) : ah6p__D_c <-- f6p_c
: CLt3_2pp : (-1000.0, 0.0) : 2.0 cl_p + h_c <-- 2.0 cl_c + h_p
: INOSR : (-1000.0, 0.0) : inost_c + nadp_c <-- 2ins_c + h_c + nadph_c
: SPMDt3i : (-1000.0, 0.0) : h_c + spmd_e <-- h_e + spmd_c
: SUCD1 : (-1000.0, 0.0) : fad_c + succ_c <-- fadh2_c + fum_c

Going on, we check which metabolites are using artificial atoms, here defined as those not appearing on the periodic table. For example, it’s common to see fatty acids linked to ACP (acyl-carrier protein) being modeled with ‘X’ in the formula: in these cases, a common paradigm is used which involves the sobstitution of the ACP part with just an ‘X’. We should be aware of all the modeling paradigms adopted in our GSMMs. Below we get an overview using the gempipe.check_artificial_atoms function. We can see that, apart from ACP-fatty acids, also the oxidized/reduced thioredoxin (trdox_c/trdrd_c) is modeled with X. Moreover, all the transfer-RNA are modeled with ‘R’, together with apoACP_c.

_ = gempipe.check_artificial_atoms(panmodel)

1 : X : but2eACP_c[C4H4OX], 2chdeacp_c[C16H28OX], 2cocdacp_c[C18H32OX], tddec2eACP_c[C12H20OX], tdec2eACP_c[C10H16OX], tpalm2eACP_c[C16H28OX], thex2eACP_c[C6H8OX], toctd2eACP_c[C18H32OX], toct2eACP_c[C8H12OX], tmrs2eACP_c[C14H24OX], 3haACP_c[C4H6O2X], 3hddecACP_c[C12H22O2X], 3hdecACP_c[C10H18O2X], 3hhexACP_c[C6H10O2X], 3hoctACP_c[C8H14O2X], 3hoctaACP_c[C18H34O2X], 3hpalmACP_c[C16H30O2X], 3hmrsACP_c[C14H26O2X], 3oddecACP_c[C12H20O2X], 3odecACP_c[C10H16O2X], 3ohexACP_c[C6H8O2X], 3opalmACP_c[C16H28O2X], 3ooctACP_c[C8H12O2X], 3ooctdACP_c[C18H32O2X], 3omrsACP_c[C14H24O2X], actACP_c[C4H4O2X], ACP_c[X], butACP_c[C4H6OX], cpocdacp_c[C19H34OX], ddcaACP_c[C12H22OX], dcaACP_c[C10H18OX], palmACP_c[C16H30OX], hexACP_c[C6H10OX], malACP_c[C3HO3X], ocdcaACP_c[C18H34OX], ocACP_c[C8H14OX], myrsACP_c[C14H26OX], trdox_c[X], trdrd_c[XH2], 3hcmrs7eACP_c[C14H25O2X], t3c7mrseACP_c[C14H23OX], tdeACP_c[C14H25OX]
2 : R : alatrna_c[C18H28NO17P2R3], apoACP_c[RHO], argtrna_c[C21H36N4O17P2R3], asntrna_c[C19H29N2O18P2R3], asptrna_c[C19H27NO19P2R3], cystrna_c[C18H28NO17P2SR3], glntrna_c[C20H31N2O18P2R3], glutrna_c[C20H29NO19P2R3], glytrna_c[C17H26NO17P2R3], histrna_c[C21H30N3O17P2R3], iletrna_c[C21H34NO17P2R3], leutrna_c[C21H34O17P2R3N], lystrna_c[C21H36N2O17P2R3], mettrna_c[C20H32NO17P2SR3], phetrna_c[C24H32NO17P2R3], protrna_c[C20H30NO17P2R3], sertrna_c[C18H28NO18P2R3], thrtrna_c[C19H30NO18P2R3], trnaala_c[C15H23O16P2R3], trnaarg_c[C15H23O16P2R3], trnaasn_c[C15H23O16P2R3], trnaasp_c[C15H23O16P2R3], trnacys_c[C15H23O16P2R3], trnaglu_c[C15H23O16P2R3], trnagly_c[C15H23O16P2R3], trnahis_c[C15H23O16P2R3], trnaile_c[C15H23O16P2R3], trnaleu_c[C15H23O16P2R3], trnalys_c[C15H23O16P2R3], trnamet_c[C15H23O16P2R3], trnaphe_c[C15H23O16P2R3], trnapro_c[C15H23O16P2R3], trnaser_c[C15H23O16P2R3], trnathr_c[C15H23O16P2R3], trnatrp_c[C15H23O16P2R3], trnatyr_c[C15H23O16P2R3], trnaval_c[C15H23O16P2R3], trptrna_c[C26H33N2O17P2R3], tyrtrna_c[C24H32NO18P2R3], valtrna_c[C20H32NO17P2R3]

Going on, we check that each metabolite in the GSMM has a chemical formula and a charge. We use two Gempipe functions, gempipe.check_missing_formulas and gempipe.check_missing_charges, respectively. We aim at having all metabolites fully defined. Two metabolites without a defined chemical formula pop out: MCOOH_c is the small subunit of the molybdopterin (MPT) synhtese, while MCOSH_c is its sulfurylated form. We are going to fix them in the next few cells.

_ = gempipe.check_missing_formulas(panmodel)

1 : MCOOH_c
2 : MCOSH_c

_ = gempipe.check_missing_charges(panmodel)

No metabolite with missing charge attribute found.

But first, let’s have an overview of which reactions are currently unbalanced in mass or charge. This is a crucial part of the sanity check, as unbalances may alter predictions. Keep in mind that, during gempipe recon, the user can provide a text file of manual corrections (-m/--mancor) to be applied during the phase of reference expansion (see Part 1). These corrections apply to the reactions coming from the reference-free reconstruction to be inserted into the reference GSMM. This means that reactions contained in the reference are assumed as good, and are not affected by gempipe recon. Anyway, as we see below, the reference may still have some imbalances, which we therefore find again in the reconstructed pan-GSMM:

_ = gempipe.check_mass_unbalances(panmodel)

: ACPS1 : apoACP_c + coa_c --> ACP_c + h_c + pap_c : {'X': 1.0}
: THZPSN2 : MCOSH_c + dxyl5p_c + gly_c + nadp_c --> 4mpetz_c + MCOOH_c + co2_c + 2.0 h2o_c + 2.0 h_c + nadph_c : {'H': 1.0, 'S': 1.0}
: MPTS_LPL : 2.0 MCOSH_c + cpmp_c + h2o_c --> 2.0 MCOOH_c + 3.0 h_c + mpt_c : {'H': 2.0, 'S': 2.0}
: MOADCST : MCOOH_c + atp_c + cys__L_c + h2o_c --> MCOSH_c + amp_c + ppi_c + ser__L_c : {'O': 1.0}

_ = gempipe.check_charge_unbalances(panmodel)

1 : ACPS1 : apoACP_c + coa_c --> ACP_c + h_c + pap_c : {'charge': 1.0}
2 : LTAS_LPL : 0.01 dgdag_LPL_c + 0.25 pg_LPL_c --> 0.01 LTA_LPL_c + 0.25 dag_LPL_c : {'charge': 25.0}

In the next cell, we are going to set some chemical formulas and charges in order to: (1) put a formula where was lacking (MTP subunit, see above), and (2) solve all the unbalances of the model. Doing so, as we see below, the resulting pan-GSMM no longer contains unbalanced reactions.

panmodel.metabolites.get_by_id("MCOSH_c").formula = 'XSH'
panmodel.metabolites.get_by_id("MCOOH_c").formula = 'XO'

panmodel.metabolites.get_by_id("apoACP_c").formula = 'X'
panmodel.metabolites.get_by_id("apoACP_c").charge = 1

panmodel.metabolites.get_by_id("ACP_c").formula = 'X'
panmodel.metabolites.get_by_id("ACP_c").charge = 0

panmodel.metabolites.get_by_id('LTA_LPL_c').charge = -2500
panmodel.metabolites.get_by_id('LTAglc_LPL_c').charge = -2500

_ = gempipe.check_mass_unbalances(panmodel)

No mass-unbalanced reactions found.

_ = gempipe.check_charge_unbalances(panmodel)

No charge-unbalanced reactions found.

Now we proceed checking the presence of eventual EGCs (energy-generating cycles). An EGC is defined as a set of reactions producing an energy-containing molecule in absence of any external input. EGCs are artifacts of reconstruction pipelines, and must be solved to prevent wrong predictions. For example, if the EGC for ATP is present, an unrealistic biomass could be predicted via FBA, due to the fact that the cell has more ATP to spend than the pool physiologically available.

Before starting, to have a reference, we compute the current maximal thoeretical biomass yield (g/L). To do so, we have to switch the environment definition from experimental rates (mmol/gDW/h) to medium concentrations (mmol/L). Therefore, we set the exchange reactions to the CDM concentrations as defined in Teusink2006. The biomass results as 4.07 g/L, souspiciously high considering that the original publication reports ~1 g/L.

# define the growth conditions:
def apply_medium(model):
    gempipe.reset_growth_env(model)
    gempipe.set_bounded_uptakes(model, {'EX_glc__D_e': 25, 'EX_ala__L_e': 2.693905040, 'EX_arg__L_e': 0.717566016, 'EX_asp__L_e': 3.155451042, 'EX_cys__L_e': 1.073014510, 'EX_glu__L_e': 3.398355196, 'EX_gly_e': 2.331157586, 'EX_his__L_e': 0.966806316, 'EX_ile__L_e': 1.600853789, 'EX_leu__L_e': 3.621116829, 'EX_lys__L_e': 3.009781791, 'EX_met__L_e': 0.837756689, 'EX_phe__L_e': 1.664749682, 'EX_pro__L_e': 5.862937549, 'EX_ser__L_e': 3.235229749, 'EX_thr__L_e': 1.888851578, 'EX_trp__L_e': 0.244823213, 'EX_tyr__L_e': 1.379767095, 'EX_val__L_e': 2.774268447, 'EX_pnto__R_e': 0.004196919, 'EX_btn_e': 0.010232901, 'EX_nac_e': 0.008122817, 'EX_4abz_e': 0.072918186, 'EX_pydam_e': 0.020737423, 'EX_pydxn_e': 0.009725734, 'EX_ribflv_e': 0.002657031, 'EX_thm_e': 0.002964984, 'EX_fol_e': 0.002265519, 'EX_ade_e': 0.074002812, 'EX_gua_e': 0.0661682, 'EX_ins_e': 0.018640788, 'EX_orot_e': 0.032031365, 'EX_thymd_e': 0.020641539, 'EX_ura_e': 0.089215616, 'EX_xan_e': 0.065741897, 'EX_pi_e': 42.48157966, 'EX_na1_e': 12.19066195, 'EX_cit_e': 2.652637163, 'EX_nh4_e': 5.318382608, 'EX_mn2_e': 0.134807226}) 
    gempipe.set_unbounded_exchanges(model, ['EX_h2o_e', 'EX_h_e'])  

    
# apply medium to the panmodel:
apply_medium(panmodel)

# simulate biomass production:
panmodel.slim_optimize()

4.071587468671884

We now start to verify the EGC for ATP. With gempipe.verify_egc, we can test the presence of EGCs for a particular molecule. As we can see below, a cycle composed by 5 reactions (NH3t, NH4DISex, ATPS3r, NH3c, NH4t) is able to produce ATP indefinitely.

💡 Tip! With the escher=True option, the function will produce an Escher-compatible model, containing just the reactions composing the EGC. This way it will be easier to draw the EGC map, supporting the user in the identification of the fix.

gempipe.verify_egc(panmodel, mid='atp')

atp_c + h2o_c --> adp_c + h_c + pi_c
333.3333333333333 : optimal

NH3t        1000.000000
NH4DISex    1000.000000
ATPS3r       333.333333
NH3c        1000.000000
NH4t       -1000.000000

After drawing the EGC on a Escher map, it’s easy to see that the reaction NH4t, originally not present in the reference but acquired during the reference expansion, is generating a cycle producing an additional extracellular proton that is used by the ATP-synthase to generate ATP. In the reference model, the transport of ammonium (NH4+) was encoded as transport of ammonia (NH3), after proton dissociation in the extracellular compartment. Therefore, we decide to remove the NH4t reaction. Doing so, as we can see, the EGC disappears, and the biomass deflates down to 1.5 g/L.

📌 Note! This pan-GSMM is currently not constrained for representing the lactic fermentation. Switching off the acetate branch will result in a even more deflated biomass yield, comparable to the experimental value.

# remove NH4t reaction:
panmodel.remove_reactions([panmodel.reactions.get_by_id('NH4t')])

# verify the EGC again:
gempipe.verify_egc(panmodel, mid='atp')

atp_c + h2o_c --> adp_c + h_c + pi_c
0 : optimal

# simulate biomass production again:
panmodel.slim_optimize()

1.519447465904365

Now that the ATP EGC is solved, we check all the others. As we can see, no other EGC is detected.

gempipe.verify_egc(panmodel, mid='ctp')

ctp_c + h2o_c --> cdp_c + h_c + pi_c
0 : optimal

gempipe.verify_egc(panmodel, mid='gtp')

gtp_c + h2o_c --> gdp_c + h_c + pi_c
0 : optimal

gempipe.verify_egc(panmodel, mid='utp')

h2o_c + utp_c --> h_c + pi_c + udp_c
0 : optimal

gempipe.verify_egc(panmodel, mid='itp')

h2o_c + itp_c --> h_c + idp_c + pi_c
0 : optimal

gempipe.verify_egc(panmodel, mid='nadh')

nadh_c --> h_c + nad_c
0 : optimal

gempipe.verify_egc(panmodel, mid='nadph')

nadph_c --> h_c + nadp_c
0 : optimal

gempipe.verify_egc(panmodel, mid='fadh2')

fadh2_c --> fad_c + 2.0 h_c
0 : optimal

gempipe.verify_egc(panmodel, mid='accoa')

accoa_c + h2o_c --> ac_c + coa_c + h_c
0 : optimal

gempipe.verify_egc(panmodel, mid='glu__L')

glu__L_c + h2o_c --> akg_c + 2.0 h_c + nh4_c
0 : optimal

gempipe.verify_egc(panmodel, mid='q8h2')

q8h2_c --> 2.0 h_c + q8_c
0 : optimal

We end here the sanity check, so it’s time the save the final pan-GSMM using the appropriate cobrapy function. This way we can continue deriving strain-specific GSMMs, starting from this pan-GSMM and the associated PAM (in this case we didn’t perform any gapfilling).

Keep in mind that gempipe derive will derive GSMMs that have (1) the same objective of the pan-GSMM, and (2) the same reactions bounds of the pan-GSMM. It gives the possibility to indicate one (or more) growth media on which to guarantee the biomass formation, which are know or assumed to sustain the growth of all the input strains.

import cobra

cobra.io.save_json_model(panmodel, 'tutoring_materials/plantarum/panmodel.json')