Tutorial: gap-filling

We generated a draft pan-GSMM and a PAM (presence-absence matrix) using gempipe recon. Taxid 68334 is for Erwinia aphidicola, the species we want to model in this tutorial. Please note that several different species taxids could have been inputted at the same time, but here we are interested in just one species.

gempipe recon -c 8 -s neg -t 68334 -b enterobacterales_odb10 -o docs/tutoring_materials/aphidicola

First of all we load the Gempipe library. Then we load the draft pan-GSMM, the PAM and the functional annotation table, using just one function: gempipe.initialize. Next we load the corresponding universe on which this reference-free reconstruction was based.

import gempipe

# initialize gempipe on the 'gempipe recon' --outdir:
panmodel = gempipe.initialize("tutoring_materials/aphidicola")

Loading PAM (tutoring_materials/aphidicola/pam.csv)...
Loading functional annotation table (tutoring_materials/aphidicola/annotation.csv)...
Loading report table (tutoring_materials/aphidicola/report.csv)...
Loading draft pan-GSMM (tutoring_materials/aphidicola/draft_panmodel.json)...

# grab the gram negative universe:
universe = gempipe.get_universe('neg')

Since we want to check the biomass production for this free-living species, we have to be sure that Growth is the reaction ID set as the current objective. Then we set the growth medium reflecting the concentrations of an old chemically defined medium (CDM) recipe for Erwinia, taken from Grula 1960. As we can see, the biomass production is 0 on this medium, which may be due to a number of things:

some metabolic reaction is missing
some EX_change reaction needs a tuning
the biomass assembly reaction needs adjustments

# check which objective was selected:
gempipe.get_objectives(panmodel)

['Growth']

# define the medium:
def apply_medium(model):
    gempipe.reset_growth_env(model)
    gempipe.set_bounded_uptakes(model, {'EX_k_e': 29.973072, 'EX_asp__L_e': 21.036340, 'EX_pi_e': 19.983346, 'EX_glc__D_e': 16.652235, 'EX_so4_e': 0.125871, 'EX_mg2_e': 0.121714, 'EX_fe2_e': 0.001275, 'EX_fe3_e': 0.001275, 'EX_nh4_e': 0.001275, 'EX_ca2_e': 0.000999, 'EX_zn2_e': 0.000174, 'EX_mn2_e': 0.000118, 'EX_cu2_e': 0.000040})
    gempipe.set_unbounded_exchanges(model, ['EX_h2o_e', 'EX_h_e', 'EX_o2_e'])

    
# apply medium to the panmodel:
apply_medium(panmodel)

# simulate biomass production:
panmodel.slim_optimize()

0.0

Now we have to be shure that the universe, under the same exact conditions, grows. Otherwise, our pan-GSMM will be impossible to gap-fill. Below we can see that even the universe cannot grow. Indeed, using gempipe.check_reactants we can see 2 blocked biomass precursors: chloride and cobalt, trace element present in the generic biomass definition that we are using.

# check if the universe
apply_medium(universe)
universe.slim_optimize()

0.0

# check blocked biomass precursors:
_ = gempipe.check_reactants(universe, 'Growth')

1 : 0.0 : optimal : cl_c : Chloride
2 : 0.0 : optimal : cobalt2_c : Co2+

Instead of removing them from the biomass definition, it’s quicker to open additional exchanges. With gempipe.sensitivity_analysis, we can easily see which EX_change reactions to open: EX_cl_e and EX_cobalt2_e.

# check which EX_change reactions can provide choloride:
gempipe.sensitivity_analysis(universe, mid='cl_c')

{'EX_12dgr160_e': 0.0,
 'EX_12dgr180_e': 0.0,
 'EX_12ppd__R_e': 0.0,
 'EX_xylu__L_e': 0.0,
 'EX_zn2_e': 0.0,
 'EX_cl_e': -2.0}

# check which EX_change reactions can provide colbalt:
gempipe.sensitivity_analysis(universe, mid='cobalt2_c')

{'EX_12dgr160_e': 0.0,
 'EX_12dgr180_e': 0.0,
 'EX_12ppd__R_e': 0.0,
 'EX_xylu__L_e': 0.0,
 'EX_zn2_e': 0.0,
 'EX_cobalt2_e': -2.0}

This way, we can redefine our in-silico medium, adding this 2 EX_change reactions. Now the universe grows, and we can use it to gap-fill our draft pan-GSMM.

# re-define the medium:
def apply_medium(model):
    gempipe.reset_growth_env(model)
    gempipe.set_bounded_uptakes(model, {'EX_k_e': 29.973072, 'EX_asp__L_e': 21.036340, 'EX_pi_e': 19.983346, 'EX_glc__D_e': 16.652235, 'EX_so4_e': 0.125871, 'EX_mg2_e': 0.121714, 'EX_fe2_e': 0.001275, 'EX_fe3_e': 0.001275, 'EX_nh4_e': 0.001275, 'EX_ca2_e': 0.000999, 'EX_zn2_e': 0.000174, 'EX_mn2_e': 0.000118, 'EX_cu2_e': 0.000040})
    gempipe.set_unbounded_exchanges(model, ['EX_h2o_e', 'EX_h_e', 'EX_o2_e'])
    gempipe.set_unbounded_exchanges(model, ['EX_cl_e', 'EX_cobalt2_e'])  

    
# apply new in-silico medium to the universe:
apply_medium(universe)
universe.slim_optimize()

0.056417489421720736

Since the draft pan-GSMM growth is still 0, we check which biomass precursors are blocked:

# pan-GSMM still can't grow:
apply_medium(panmodel)
panmodel.slim_optimize()

0.0

# check blocked biomass precursors:
_ = gempipe.check_reactants(panmodel, 'Growth')

: 0.0 : optimal : cu2_c : Copper
: 0.0 : optimal : fe3_c : Iron (Fe3+)
: 0.0 : optimal : kdo2lipid4_p : KDO(2)-lipid IV(A)
: 0.0 : optimal : pe160_c : Phosphatidylethanolamine (dihexadecanoyl, n-C16:0)
: 0.0 : optimal : pe160_p : Phosphatidylethanolamine (dihexadecanoyl, n-C16:0)
: 0.0 : optimal : pe161_c : Phosphatidylethanolamine (dihexadec-9enoyl, n-C16:1)
: 0.0 : optimal : pe161_p : Phosphatidylethanolamine (dihexadec-9enoyl, n-C16:1)
: 0.0 : optimal : thmpp_c : Thiamine diphosphate

We choose to start from the phosphatidylethanolamine 16:0, to see which reactions are missing. With gempipe.perform_gapfilling it’s possible to focus on the biosythesis of a particular metabolite simply by specifying it’s ID.

💡 Tip! The time requested to solve optimization problems varies with the utilised solver. Gapfilling is an optimization problem, and its computation may run for an unacceptably long time. In these cases, we suggest to try a commercial solver, like for example CPLEX, which is usually faster then the default GLPK. Once installed, Gempipe will automatically switch to CPLEX as default solver.

_ = gempipe.perform_gapfilling(panmodel, universe, minflux=0.1, mid='pe160_c', nsol=5)

Solution 1. Reactions to add: 1.
1 KAS8 B-ketoacyl synthetase (palmitate, n-C16:0)

Solution 2. Reactions to add: 1.
1 KAS7 B-ketoacyl synthetase (n-C16:1)

Solution 3. Reactions to add: 1.
1 KAS17 B-ketoacyl synthetase (n-C18:1)

Solution 4. Reactions to add: 1.
1 KAS13 B-ketoacyl synthetase (octadecanoate)

Solution 5. Reactions to add: 1.
1 KAS13 B-ketoacyl synthetase (octadecanoate)

Mapping these reactions on an Escher map, seems like the fatty acid biosynthesis pathway is missing. This is not possible for a free-living species, which has to be self-sufficient in the biosynthesis of all its biomass precursors. Therefore, we take a look at the corrisponding KEGG module (M00083), to see if the corresponding KEGG Orthologs (KO) are present in this species. We use the gempipe.query_pam function to search for metabolic genes still missing from the model. This function utilizes the PAM and the functional annotation produced by gempipe recon, which were automatically loaded with the gempipe.initialize function called at the beginning.

💡 Tip! With gempipe.query_pam, genes can be searched via KO code, EC code, gene name, function description, and more. For each parameter it is possible to specify, instead of a single value, a list of values, for example ko=['K00209', 'K10780', 'K02371']. This way, more PAM rows may be obtained.

gempipe.query_pam(ko='K00647', name='fabB')

	Erwinia aphidicola GCA_014773485.1	Erwinia aphidicola GCA_024169515.1	Erwinia aphidicola GCA_918698235.1
Cluster_1282	OEIEKCCN_02320	OCILAMFM_00215	LBGCNHPF_01597

gempipe.query_pam(ko='K00059', name='fabG')

	Erwinia aphidicola GCA_014773485.1	Erwinia aphidicola GCA_024169515.1	Erwinia aphidicola GCA_918698235.1
Cluster_3003	OEIEKCCN_00814	OCILAMFM_03222	LBGCNHPF_01227
Cluster_2990	OEIEKCCN_02727	OCILAMFM_00984	LBGCNHPF_02154

gempipe.query_pam(ko='K01716', name='fabA')

	Erwinia aphidicola GCA_014773485.1	Erwinia aphidicola GCA_024169515.1	Erwinia aphidicola GCA_918698235.1
Cluster_3940	OEIEKCCN_00708	OCILAMFM_03060	LBGCNHPF_01333

gempipe.query_pam(ko='K00209', name='fabV')

	Erwinia aphidicola GCA_014773485.1	Erwinia aphidicola GCA_024169515.1	Erwinia aphidicola GCA_918698235.1
Cluster_1346	OEIEKCCN_01394;OEIEKCCN_03959	OCILAMFM_03809	LBGCNHPF_00219

Several alternative KOs can be covered for the same metabolic function (for example fabL K10780 and fabV K00209). Probably, the BiGG genes database is missing good representatives for these KOs. Indeed, the BiGG database is small and biased towards model organisms, so it’s understandable that some metabolic genes are lost during the alignment, despite relaxing the --indentity and --coverage thresholds. As expected, E. aphidicola has all the needed genes to include the KAS / FAS (keto-acyl synthase / fatty-acid synthase) series of reactions. Below we choose to add the FAS series:

# define a GPR for the FAS family of reactions: 
fabB = 'Cluster_1282'
fabG = 'Cluster_2990 or Cluster_3003'
fabA = 'Cluster_3940'
fabV = 'Cluster_1346'  

gpr = f'{fabB} and ({fabG}) and {fabA} and {fabV}'

# copy new reactions from the universe:
gempipe.import_from_universe(panmodel, universe, 'FAS80_L', gpr=gpr) 
gempipe.import_from_universe(panmodel, universe, 'FAS100', gpr=gpr) 
gempipe.import_from_universe(panmodel, universe, 'FAS120', gpr=gpr) 
gempipe.import_from_universe(panmodel, universe, 'FAS140', gpr=gpr) 
gempipe.import_from_universe(panmodel, universe, 'FAS160', gpr=gpr) 
gempipe.import_from_universe(panmodel, universe, 'FAS180', gpr=gpr) 

Checking biomass precursors again, we see we just have effectively gap-filled for the phosphatidylethanolamine 16:0:

# check blocked biomass precursors:
_ = gempipe.check_reactants(panmodel, 'Growth')

: 0.0 : optimal : cu2_c : Copper
: 0.0 : optimal : fe3_c : Iron (Fe3+)
: 0.0 : optimal : kdo2lipid4_p : KDO(2)-lipid IV(A)
: 0.0 : optimal : pe161_c : Phosphatidylethanolamine (dihexadec-9enoyl, n-C16:1)
: 0.0 : optimal : pe161_p : Phosphatidylethanolamine (dihexadec-9enoyl, n-C16:1)
: 0.0 : optimal : thmpp_c : Thiamine diphosphate

Let’s now continue with the next blocked biomass precursor: pe161_c, another phosphatidylethanolamine, but this time with a double bond derived from a desaturase activity, as one could see opening an Escher map. Gap-filling for the biosyntheisis of this precursor we obtain DESATPE160 as the missing desaturase reaction:

_ = gempipe.perform_gapfilling(panmodel, universe, minflux=0.1, mid='pe161_c', nsol=5)

Solution 1. Reactions to add: 1.
1 DESATPE160 PE160 desaturase  pe C160  pe C161d9

Solution 2. Reactions to add: 1.
1 DESATPE160 PE160 desaturase  pe C160  pe C161d9

Solution 3. Reactions to add: 1.
1 DESATPE160 PE160 desaturase  pe C160  pe C161d9

Solution 4. Reactions to add: 3.
1 12DGR161tipp 1,2 diacylglycerol transport via flipping (periplasm to cytoplasm, n-C16:1)
2 CLPNH161pp Cardiolipin hydrolase (periplasm, n-C16:1)
3 DESATPG160 PG160 desaturase  pg C160  pg C161d9

Solution 5. Reactions to add: 1.
1 DESATPE160 PE160 desaturase  pe C160  pe C161d9

This is a little more tricky. Searching this reaction on the BiGG database v1.6, we see it derives from just 1 model: iJN1463, for Pseudomonas putida KT2440. Searching its associated gene PP_0217 in KEGG, we see it isn’t associated with any KO code. Therefore, we try to search PP_0217 in EggNOG. Under the class Gammaproteobacteria, the same of Erwinia, two KO codes are suggested: K00507 and K23054. Anyway, these orthologs seems not to appear in our organism:

gempipe.query_pam(ko=['K00507', 'K23054'])

	Erwinia aphidicola GCA_014773485.1	Erwinia aphidicola GCA_024169515.1	Erwinia aphidicola GCA_918698235.1

Drawing the phosphatidylethanolamine pathway on an Escher map, we see it is complate apart from this desaturase. Therefore, there must be an alternative way to get to this metabolite, or it shuldn’t appear in the biomass definition. Given that we do not dispose phenotipic wet-lab data on membrane composition for our species, we decide to include this reaction without specifying a GPR. Doing so, as expected, all the ramaining phosphatidylethanolamine blocks disappear:

gempipe.import_from_universe(panmodel, universe, 'DESATPE160') 

_ = gempipe.check_reactants(panmodel, 'Growth')

: 0.0 : optimal : cu2_c : Copper
: 0.0 : optimal : fe3_c : Iron (Fe3+)
: 0.0 : optimal : kdo2lipid4_p : KDO(2)-lipid IV(A)
: 0.0 : optimal : thmpp_c : Thiamine diphosphate

We continue the gap-filling with the thiamin biosynthesis. On Escher, drawing the thiamine pathway for the gram negative universe, we see that the metabolite 4-hydroxy-benzyl alcohol (4hba_c) is a product of the thiazole phosphate synthesis (THZPSN), intermediate reaction of the thiamin biosynthetic pathway. Since 4hba_c must be consumed in some way, modelers have encoded several alternatives. One of them is simply a demand reaction to let 4hba_c leave the system. Another one, more biologically meaningful, is to translocate it via a transporter 4HBAt and an associated EX_change reaction:

_ = gempipe.perform_gapfilling(panmodel, universe, minflux=0.1, mid='thmpp_c', nsol=5)

Solution 1. Reactions to add: 1.
1 sink_4hba_c R_sink_4hba_c

Solution 2. Reactions to add: 1.
1 sink_4hba_c R_sink_4hba_c

Solution 3. Reactions to add: 2.
1 4HBAt MNXR68734
2 EX_4hba_e R_EX_4hba_e

Solution 4. Reactions to add: 1.
1 sink_4hba_c R_sink_4hba_c

Solution 5. Reactions to add: 3.
1 4HBADH 4 hydroxy benzyl alcohol dehydrogenase
2 VNDH_2 4-hydroxybenzaldehyde dehydrogenase
3 sink_2ohph_c R_sink_2ohph_c

Searching in literature, we understand that pcaK should be permease for 4hba_c (see Nichols and Harwood, 1997). Since we find a cluster annotate as pcaK, we include the transporter and the associated EX_change reaction in the GSMM. To let the GSMM be a little bit more predictive, we decide not to constrain the transporter bounds. Checking again the blocked biomass precursors, as expected, thmpp_c disappears.

gempipe.query_pam(name='pcaK')

	Erwinia aphidicola GCA_014773485.1	Erwinia aphidicola GCA_024169515.1	Erwinia aphidicola GCA_918698235.1
Cluster_1015	OEIEKCCN_01251	OCILAMFM_03656	LBGCNHPF_00358

gempipe.import_from_universe(panmodel, universe, '4HBAt', bounds=(-1000,1000)) 
gempipe.import_from_universe(panmodel, universe, 'EX_4hba_e', bounds=(0,1000)) 

_ = gempipe.check_reactants(panmodel, 'Growth')

: 0.0 : optimal : cu2_c : Copper
: 0.0 : optimal : fe3_c : Iron (Fe3+)
: 0.0 : optimal : kdo2lipid4_p : KDO(2)-lipid IV(A)

Now it’s the turn of the KDO-lipid-IV (kdo2lipid4_p), a membrane component. From the gap-filling suggestions, 3HAACOAT140 always appears.

_ = gempipe.perform_gapfilling(panmodel, universe, minflux=0.1, mid='kdo2lipid4_p', nsol=5)

Solution 1. Reactions to add: 2.
1 3HAACOAT140 3 Hydroxyacyl ACPCoA Transacylase
2 RECOAH6 3 hydroxyacyl Coa dehydratase   3R  3 hydroxytetradecanoyl CoA

Solution 2. Reactions to add: 2.
1 3HAACOAT140 3 Hydroxyacyl ACPCoA Transacylase
2 RHACOAR140 3R  3 Hydroxyacyl CoANADP oxidoreductase

Solution 3. Reactions to add: 2.
1 3HAACOAT140 3 Hydroxyacyl ACPCoA Transacylase
2 RHACOAR140 3R  3 Hydroxyacyl CoANADP oxidoreductase

Solution 4. Reactions to add: 2.
1 3HAACOAT140 3 Hydroxyacyl ACPCoA Transacylase
2 RECOAH6 3 hydroxyacyl Coa dehydratase   3R  3 hydroxytetradecanoyl CoA

Solution 5. Reactions to add: 2.
1 3HAACOAT140 3 Hydroxyacyl ACPCoA Transacylase
2 RECOAH6 3 hydroxyacyl Coa dehydratase   3R  3 hydroxytetradecanoyl CoA

Once again, searching 3HAACOAT140 in BiGG, we see it’s associated with just 1 model (iJN1463), and just 1 gene: PP_1408. On KEGG, this gene is not annotated with KO or EC codes. Therfore, we search it on EggNOG, revelaing the code K18100 for Gammaproteobacteria. Unfortunately, we do not find any equivalent in our species:

gempipe.query_pam(ko=['K18100'])

	Erwinia aphidicola GCA_014773485.1	Erwinia aphidicola GCA_024169515.1	Erwinia aphidicola GCA_918698235.1

Since all the rest of the KDO-lipid-IV biosynthetic pathway appears complete, we decide to close the gap without GPR. The last two remaining blocks are copper and iron:

gempipe.import_from_universe(panmodel, universe, '3HAACOAT140') 
gempipe.import_from_universe(panmodel, universe, 'RHACOAR140') 

_ = gempipe.check_reactants(panmodel, 'Growth')

1 : 0.0 : optimal : cu2_c : Copper
2 : 0.0 : optimal : fe3_c : Iron (Fe3+)

We first try to compute gap-filling reactions as we did before. Anyway, this time an “infeasible” error appears, meaning that the optimization problem we are asking has no solution. The absence of solution does not depend on the reaction content of the universe, because we know the gap-filling reactions are there, ready to be transfered. In this context, the absence of solution depends on the required minflux parameter, set to 0.1, too much considering the iron content defined in our growth medium: as reported in the paragraph above, it was defined as 0.001275 mmol/L of iron-II plus 0.001275 mmol/L of iron-III. Even considered together, they could never reach the requested minumum 0.1.

_ = gempipe.perform_gapfilling(panmodel, universe, minflux=0.1, mid='fe3_c', nsol=5)

ERROR: cobrapy: gap filling optimization failed (infeasible).

Therefore, we lower the minflux to an amount compatible with our medium, and try the computation again. This time, other two type of errors could appear: (1) the “try lowering the integer threshold” error, or (2) the “check original solver status” error. Both these errors were largely discussed in issue #941 of the cobrapy package.

_ = gempipe.perform_gapfilling(panmodel, universe, minflux=0.0001, mid='fe3_c', nsol=5)

ERROR: cobrapy: gap filling optimization failed (check_original_solver_status).

This time the error is telling us that, simply speaking, there’s some difficulties in handling such low flux values. After having tried several different workarounds, we would suggest to apply the following easy trick. First, boost the nutrient input to an unrealistically high value, then ask the gapfilling again reaising up also the minimal flux through the objective (minflux). This way, the algorithm has no problems in suggesting us the right gap-filling reactions:

gempipe.set_unbounded_exchanges(panmodel, ['EX_fe3_e']) 
gempipe.set_unbounded_exchanges(universe, ['EX_fe3_e']) 


_ = gempipe.perform_gapfilling(panmodel, universe, minflux=0.1, mid='fe3_c', nsol=5)

Solution 1. Reactions to add: 1.
1 FE3t Ferric iron uptake, plasma membrane

Solution 2. Reactions to add: 1.
1 FE3Gabcpp Iron (III) transport via ABC system (GTP) (periplasm)

Solution 3. Reactions to add: 1.
1 FE3abc Iron (III) transport via ABC system

Solution 4. Reactions to add: 1.
1 FE3abcpp Iron (III) transport via ABC system (periplasm to cytoplasm)

Solution 5. Reactions to add: 1.
1 FE3abcpp Iron (III) transport via ABC system (periplasm to cytoplasm)

After adding an iron transporter, only the copper block remains:

gempipe.import_from_universe(panmodel, universe, 'FE3abcpp')

_ = gempipe.check_reactants(panmodel, 'Growth')

1 : 0.0 : optimal : cu2_c : Copper

Since it’s another trace element, the solution is similar to what we just saw for iron. After the addition of a copper transporter, calling again the gempipe.check_reactants function we receive an empty output, meaning that our Erwinia model is able synthetize all precursors defined in the biomass assembly reaction:

gempipe.set_unbounded_exchanges(panmodel, ['EX_cu2_e']) 
gempipe.set_unbounded_exchanges(universe, ['EX_cu2_e']) 


_ = gempipe.perform_gapfilling(panmodel, universe, minflux=0.1, mid='cu2_c', nsol=5)

Solution 1. Reactions to add: 1.
1 CU2tpp Copper transport in via permease (no H+)

Solution 2. Reactions to add: 1.
1 CUabcpp Copper transport via ABC system (periplasm)

Solution 3. Reactions to add: 1.
1 Cuabc Copper transport via ABC system

Solution 4. Reactions to add: 1.
1 Cuabc Copper transport via ABC system

Solution 5. Reactions to add: 1.
1 CU2tpp Copper transport in via permease (no H+)

gempipe.import_from_universe(panmodel, universe, 'CUabcpp')

gempipe.check_reactants(panmodel, 'Growth')

[]

Indeed, we now see our Erwinia model growing on the Erwinia medium defined by Grula 1960:

apply_medium(panmodel)

panmodel.slim_optimize()

0.056417489421720736

We just finshed a gap-filling for the production of biomass. After in-depth sanity checks and gap-fillings, the draft pan-GSMM outputted by gempipe recon could be finally called simply “pan-GSMM”, indicating its final form. From this GSMM, we can now start to derive strain-specific GSMMs with gempipe derive (read Part 3). Assuming that we are now ready to go on with the derivation, we save the pan-GSMM as it will be the main input of gempipe derive:

import cobra

cobra.io.save_json_model(panmodel, 'tutoring_materials/aphidicola/panmodel.json')