{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "6ab8296a-c2cf-4eaf-a47c-87d165e01aa4", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "%load_ext autoreload\n", "%aimport gempipe, gempipe.flowchart\n", "%autoreload 1 " ] }, { "cell_type": "code", "execution_count": 2, "id": "b5a6b58b-aa71-4e3c-be9b-2343ee178888", "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", "\n", "
\n", " \n", " " ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from gempipe import Flowchart\n", "\n", "file = open('flowcharts/part_3.flowchart', 'r')\n", "header = 'flowchart LR \\n'\n", "flowchart = Flowchart(header + file.read())\n", "file.close()\n", "flowchart.render(height=300, zoom=1)" ] }, { "cell_type": "markdown", "id": "42513ac2-81c9-42fa-9f02-0fdf5f977092", "metadata": {}, "source": [ "# Part 3: gempipe derive\n", "\n", "`gempipe derive` is designed to derive strain-specific GSMMs, using as main inputs the PAM (presence/absence matrix) and the pan-GSMM. Below are shown all its options: " ] }, { "cell_type": "code", "execution_count": 1, "id": "10f5f830-219f-46e1-94b0-ea8807aa1344", "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "usage: gempipe derive [-h] [-v] [-c] [-o] [--verbose] [-im] [-ip] [-ir] [-ig]\n", " [-m] [--minflux] [--biolog] [--sbml] [--skipgf]\n", " [--nofig] [--aux] [--cnps] [--cnps_minmed] [--biosynth]\n", "\n", "gempipe v1.38.4, please cite \"TODO\". Full documentation available at\n", "https://gempipe.readthedocs.io/en/latest/index.html.\n", "\n", "optional arguments:\n", " -h, --help Show this help message and exit.\n", " -v, --version Show version number and exit.\n", " -c , --cores How many parallel processes to use. (default: 1)\n", " -o , --outdir Main output directory (will be created if not\n", " existing). (default: ./)\n", " --verbose Make stdout messages more verbose, including debug\n", " messages. (default: False)\n", " -im , --inpanmodel Path to the input pan-model. (default: -)\n", " -ip , --inpam Path to the input PAM. (default: -)\n", " -ir , --inreport Path to the input report file. (default: -)\n", " -ig , --ingannots Path to the input genes annotation file. (default: -)\n", " -m , --media Medium definition file or folder containing media\n", " definitions, to be used during the automatic gap-\n", " filling. (default: -)\n", " --minflux Minimum flux through the objective of strain-specific\n", " models. (default: 0.1)\n", " --biolog Simulate Biolog's utilization tests on strain-specific\n", " models. (default: False)\n", " --sbml Save the output GSMMs in SBML format (L3V1 FBC2) in\n", " addition to JSON. (default: False)\n", " --skipgf Skip the gap-filling step applied to the strain-\n", " specific models. (default: False)\n", " --nofig Skip the generation of figures. (default: False)\n", " --aux Test auxotrophies for aminoacids and vitamins.\n", " (default: False)\n", " --cnps Sistematically simulate growth on all the available\n", " C-N-P-S sources. (default: False)\n", " --cnps_minmed Base the C-N-P-S simulations on a minimal medium\n", " leading to the specified minimum objective value. If 0,\n", " user-defined medium will be used. (default: 0.0)\n", " --biosynth Check biosynthesis of each metabolite while granting\n", " the specified minimum fraction of objective. If 0, this\n", " step will be skipped. (default: 0.0)\n" ] } ], "source": [ "import subprocess\n", "\n", "command = f\"\"\"gempipe derive -h\"\"\"\n", "process = subprocess.Popen(command, shell=True)\n", "response = process.wait()\n" ] }, { "cell_type": "markdown", "id": "a1f1c1a8-4246-4b77-bfcd-755a6bfb1cb2", "metadata": {}, "source": [ "## Deriving strain-specific models\n", "\n", "Three inputs are required by `gempipe derive` to start: \n", "\n", "1. The pan-GSMM (`-im`/`--inpanmodel`), in its final version. Ideally, the pan-GSMM was produced with `gempipe recon` ([Part 1](part_1_gempipe_recon.ipynb)) and subsequently curated ([Part 2](part_2_manual_curation.ipynb)).\n", "\n", "2. The presence/absence matrix (PAM) (`-ip`/`--inpam`), as it was provided by `gempipe recon`. As discussed in [Part 1](https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#creation-of-a-pam), this table contains clusters in row and strains in column, and each cell contains the strain's genes belonging to the corresponding cluster.\n", "\n", "3. The report (`-ir`/`--inreport`) produced by `gempipe recon`, contaning the strain-to-species relationships. \n", "\n", "An example of command line is reported below: \n", "\n", " gempipe derive -c 16 -im curation/panmodel.json -ip reconout/pam.csv -ir reconout/report.csv\n", "\n", "When `gempipe derive` starts, a species-specific GSMM is created for each genome appearing in the PAM. This procedure may be called \"derivation by subtraction\", meaning that a copy of the pan-GSMM is subtracted from all the genes that are missing from the corresponding strain, according to the PAM content. Removing genes can eventually lead to the removal of metabolic reactions, in accordance to the GPR definitions. \n", "\n", "💡 **Tip!** If [`gempipe recon`](part_1_gempipe_recon.ipynb) started from genbanks, then a genome annotation file was produced. It can be passed to `gempipe derive` using `-ig`/`--ingannots` to improve the \"gene annotation\" section of the [Memote reports](https://memote.readthedocs.io/en/latest/index.html)." ] }, { "cell_type": "markdown", "id": "a5690d1d-1238-45eb-87e8-02d1d2b570d8", "metadata": {}, "source": [ "## Gap-filling strain-specific models\n", "\n", "After strain-specific GSMMs are derived, they may not grow. The reasons are several:\n", "\n", "* the biomass equation coming from the pan-GSMM contains some strain-specific precursors. In this case, it would be better to first remove the strain-specific precursors from the biomass assembly during the manual curation in [Part 2](part_2_manual_curation.ipynb).\n", "\n", "* the GPR of some reactions essential to form biomass doesn't take into account strain-specific gene variants. By design, the [clustering step](https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#creation-of-a-pam) in `gempipe recon` is performed at high sequence identity (90%), meaning that two orthologous genes, if sufficiently dissimilar, may be grouped in different clusters even if the function is the same. Despite `gempipe recon` does its best to manage these cases, it could be necessary to expand some GPR during the manual curation in [Part 2](part_2_manual_curation.ipynb).\n", "\n", "To solve the cases where a strain-specific GSMM doesn't grow, an automated and strain-specific step of **gap-filling** is performed. As the gap-filling is dependant on the nutritional inputs, one or more growth media recipes, better if minimal, have to be provided (`--media`): they are assumed to support growth of **all** the strains in input. Despite this automated gap-filling, each strain that doesn't grow should deserve a manual inspection, to see if one of the cases discussed above has occurred, leading to an improved manual curation. \n", "\n", "Medium recipes can be described using a **json file** following a rigid syntax, composed by the keys `name` and `exchanges`, plus the lower bounds of each component. For example:\n", "\n", "```\n", "{\"name\": \"my_medium_name\", \"exchanges\": {\"EX_ca2_e\": -1000, \"EX_cl_e\": -1000, \"EX_co2_e\": -1000, \"EX_cobalt2_e\": -1000, \"EX_cu2_e\": -1000, \"EX_h_e\": -1000, \"EX_fe2_e\": -1000, \"EX_fe3_e\": -1000, \"EX_h2o_e\": -1000, \"EX_pi_e\": -1000, \"EX_glc__D_e\": -1000, \"EX_so4_e\": -1000, \"EX_k_e\": -1000, \"EX_zn2_e\": -1000, \"EX_mg2_e\": -1000, \"EX_mn2_e\": -1000, \"EX_mobd_e\": -1000, \"EX_nh4_e\": -1000, \"EX_o2_e\": -1000}}\n", "```\n", "\n", "However, please note that the gap-filling may hide strain-specific auxotrophies. If the detection of auxotrophies is of interest, and the medium provided is not rich (e.g. it does not include amino acids and vitamins), then the gap-filling phase should be skipped using `--skipgf`.\n" ] }, { "cell_type": "markdown", "id": "bb963baf-27e9-49d5-8aab-ef5963c0580b", "metadata": {}, "source": [ "## Biolog® PM simulations" ] }, { "cell_type": "markdown", "id": "199d5ede-9463-4c16-83a7-9c5d4bcd7f2b", "metadata": {}, "source": [ "[Biolog® PM](https://www.biolog.com/products/metabolic-characterization-microplates/microbial-phenotype/) growth assays are commonly used to drive manual curation and to validate a GSMM. Using the `--biolog` option, the user can automatically simulate all the main [Biolog® PM plates](https://www.biolog.com/wp-content/uploads/2024/06/00A-042-Rev-E-Phenotype-MicroArrays-1-10-Plate-Maps.pdf) (such as PM1, PM2A, PM3, PM4, and so on) for each modeled strain. Alternatively, the same function can be interactively called via the [Gempipe API](autoapi/gempipe/interface/index) (see the function [gempipe.biolog_preview](https://gempipe.readthedocs.io/en/latest/autoapi/gempipe/interface/gaps/index.html#gempipe.interface.gaps.biolog_preview)). \n", "\n", "With this feature, substrates composing the Biolog® PM plates are iterated and tested for utilization, see below for an example output. The same substrate (`substrate`) can be considered as a C, N, P, or S source (`source`) according to the plate or origin; in these cases, a different growth simulation will be performed for each nutritive source. For example, if the starting C source is glucose (`exr_before`), then it is first removed and an FBA is performed; then the new C source (`exr_after`) is introduced, if it is present in the GSMM (`exr_after_present`), and the FBA is performed again; if the objective value increases, then the strain is assumed to be able to use the substrate (`growth=True`). For more information please read the [Gempipe paper](how_to_cite.ipynb)." ] }, { "cell_type": "code", "execution_count": 5, "id": "8cd70cbc-fd35-4394-b516-37a7b67d168e", "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
substratesourceformulaexr_beforeexr_before_presentexr_afterexr_after_presentgrowthvaluestatus
0L-ArabinoseCNaNEX_glc__D_eTrueEX_arab__L_eFalseNaNNaNNaN
1D-Saccharic AcidCC6H8O8EX_glc__D_eTrueEX_glcr_eTrueTrue19.728400optimal
2D-GalactoseCC6H12O6EX_glc__D_eTrueEX_gal_eTrueFalse0.000000optimal
3L-Aspartic AcidNC4H6NO4EX_nh4_eTrueEX_asp__L_eTrueTrue28.743739optimal
4L-Aspartic AcidCC4H6NO4EX_glc__D_eTrueEX_asp__L_eTrueTrue17.614535optimal
.................................
646ThioureaSNaNEX_so4_eTrueNaNFalseNaNNaNNaN
647D-SerineNC3H7NO3EX_nh4_eTrueEX_ser__D_eTrueTrue34.731331optimal
648D-SerineCC3H7NO3EX_glc__D_eTrueEX_ser__D_eTrueTrue33.295776optimal
649Succinic AcidCC4H4O4EX_glc__D_eTrueEX_succ_eTrueTrue17.473351optimal
650Fumaric AcidCC4H2O4EX_glc__D_eTrueEX_fum_eTrueTrue17.614535optimal
\n", "

651 rows Ă— 10 columns

\n", "
" ], "text/plain": [ " substrate source formula exr_before exr_before_present \n", "0 L-Arabinose C NaN EX_glc__D_e True \\\n", "1 D-Saccharic Acid C C6H8O8 EX_glc__D_e True \n", "2 D-Galactose C C6H12O6 EX_glc__D_e True \n", "3 L-Aspartic Acid N C4H6NO4 EX_nh4_e True \n", "4 L-Aspartic Acid C C4H6NO4 EX_glc__D_e True \n", ".. ... ... ... ... ... \n", "646 Thiourea S NaN EX_so4_e True \n", "647 D-Serine N C3H7NO3 EX_nh4_e True \n", "648 D-Serine C C3H7NO3 EX_glc__D_e True \n", "649 Succinic Acid C C4H4O4 EX_glc__D_e True \n", "650 Fumaric Acid C C4H2O4 EX_glc__D_e True \n", "\n", " exr_after exr_after_present growth value status \n", "0 EX_arab__L_e False NaN NaN NaN \n", "1 EX_glcr_e True True 19.728400 optimal \n", "2 EX_gal_e True False 0.000000 optimal \n", "3 EX_asp__L_e True True 28.743739 optimal \n", "4 EX_asp__L_e True True 17.614535 optimal \n", ".. ... ... ... ... ... \n", "646 NaN False NaN NaN NaN \n", "647 EX_ser__D_e True True 34.731331 optimal \n", "648 EX_ser__D_e True True 33.295776 optimal \n", "649 EX_succ_e True True 17.473351 optimal \n", "650 EX_fum_e True True 17.614535 optimal \n", "\n", "[651 rows x 10 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pnd\n", "\n", "biolog_example = pnd.read_csv('tutoring_materials/K27.csv', index_col=0)\n", "biolog_example" ] }, { "cell_type": "markdown", "id": "9d7dc9bf-a7fe-4de6-926f-9a32bb4f7ded", "metadata": {}, "source": [ "## Strain-specific screenings\n", "\n", "GSMMs can be used to predict strain-specific metabolic features, on which a multi-strain analysis can be based (see [Part 4](part_4_multi_strain_analysis.ipynb)). With `gempipe derive`, for each strain, it is possible to predict some of these features. They are binary, meaning that only their presence/absence is discribed, not their magnitude.\n", "\n", "Using the `--aux` option, each strain is screened for auxotrophies for amino acids and vitamins. The output is a binary feature table (`aux.csv`) having strains in column and amino acids or vitamins in row, for which the auxotrophy is evaluated. Cells contain 1 when the auxotrophy is present, otherwise 0.\n", "\n", "Using the `--cnps` option, the ability to catabolize alternative C/N/P/S sources is evaluated for each strain. Similarly to Biolog substrates, a starting C source is first removed and an FBA is performed; then an alternative C source is introduced and the FBA is performed again; if the objective value increases, this indicates that the strain can potentially catabolize the alternative source. This is iterated for every C source having a dedicated exchange reaction in the GSMM, then alternative N, P, and S sources are evaluated using the same procedure. The output is a binary feature table (`cnps.csv`) containing 1 where a catabolic activity is detected, otherwise 0. To perform this analysis on a computed minimal medium, add the `--cnps_minmed` option.\n", "\n", "Using the `--biosynth` option, the potential biosynthesis of target metabolites, is evaluated for each strain. Biomass production is constrained to a user-specified fraction of its maximum. The resulting a binary feature table (`biosynth.csv`) contains 1 when strains are considered potential producers of the tested metabolite. By default, all cytosolic metabolites are tested. \n" ] }, { "cell_type": "markdown", "id": "c91b0213-2c2b-42b4-90d4-d2343b00abed", "metadata": {}, "source": [ "## Deriving species-specific models\n", "\n", "A species-specific GSMM is here defined as a GSMM composed by all the reactions **always** present in **all** the strain-specific GSMMs of the same species. Having a representative GSMM for each species enables to perform **comparisons** at the species level, detecting for example which pathways distinguish a species from other phylogenetically close species.\n", "\n", "In `gempipe derive`, after gap-filling the strain-specific GSMMs, one species-specific GSMM is derived for each of the species in input. Moreover, like strain-specific GSMMs, species-specific GSMMs are also gapfilled on the media indicated with `--media`, unless `--skipgf` is activated. " ] }, { "cell_type": "markdown", "id": "c1e1ffd9-70f0-4da7-adea-601c89c38939", "metadata": {}, "source": [ "## Output files\n", "\n", "`gempipe derive` produces 2 **main** output files in the current directory (`-o`/`--outdir`):\n", "\n", "* **strain_models_gf/*.json.** The gap-filled strain-specific GSMMs.\n", "\n", "* **species_model_gf/*.json.** The gap-filled species-specific GSMMs.\n", "\n", "Other useful files are the following:\n", "\n", "* **strain_models/*.json.** The strain-specific GSMMs, before gap-filling. \n", "\n", "* **species_models/*.json.** The species-specific GSMMs, before gap-filling. \n", "\n", "* **derive_strains.csv.** Table showing metrics for each strain-specific GSMM, such as the number of reactions, the objective value and the solver status, before and after the gap-filling step. \n", "\n", "* **derive_species.csv.** Table showing metrics for each species-specific GSMM, such as the number of reaction, the objective value and the solver status. One row for each species inputted in `gempipe recon`.\n", "\n", "* **rpam.csv** Binary feature table reporting the presence of each reaction in each strain-specific GSMM.\n", "\n", "* **aux.csv** Binary feature table reporting the presence amino acids and vitamins auxotrophies for each strain-sspecific GSMM (requires `--aux`).\n", "\n", "* **cnps.csv** Binary feature table reporting the potential catabolic activity for alternative C, N, P, and S sources (requires `--cnps`).\n", "\n", "* **biosynth.csv** Binary feature table reporting potential biosynthetic activities (requires `--biosynth`)." ] }, { "cell_type": "code", "execution_count": null, "id": "82206a8b-97e7-409b-9716-c8119095f691", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.15" } }, "nbformat": 4, "nbformat_minor": 5 }