{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1ee80d82-557d-483f-98c4-a704a6d931c7",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "\n",
    "Gempipe is a tool for drafting, curating and analyzing pan and multi-strain genome-scale metabolic models (GSMMs or GEMs).\n",
    "\n",
    "## In brief\n",
    "\n",
    "Gempipe can start from genomes or directly from proteomes, if a reliable annotation is available. Genomes are filtered for quality using both technical and biological metrics. Then, genes are annotated and grouped into clusters, and an extensive gene-recovery procedure is applied to counteract possible errors introduced during genome assembling or gene calling. \n",
    "\n",
    "Gene clusters are used to build a reference-free reconstruction based on the [CarveMe semi-curated universe](https://github.com/cdanielmachado/carveme), applying different rules for the generation of GPRs (gene-to-reaction associations), accounting for alternative isoforms and respecting the original enzyme complex definitions stored in [BiGG](http://bigg.ucsd.edu). \n",
    "\n",
    "The reference-free reconstruction is used as a source of new reactions for the expansion of an **optional** user-provided **reference**, thus taking into account the strain-specificity outside the scope of the reference itself. This expansion respects the design decision of the reference in terms of metabolites formula and charge and reactions balance. \n",
    "\n",
    "The resulting draft pan-GSMM is then annotated _de novo_ with accessions from many databases and duplicated metabolites and reactions are optionally removed. \n",
    "\n",
    "Unlike other tools like [CarveMe](https://carveme.readthedocs.io/en/latest/index.html) or [gapseq](https://gapseq.readthedocs.io/en/latest/), and even if a totally automated reconstruction mode ([`gempipe autopilot`](gempipe_autopilot.ipynb)) is provided, manual curation is strongly encouraged. To facilitate [manual curation](part_2_manual_curation.ipynb), Gempipe provides an application programming interface ([API](autoapi/gempipe/interface/index)) with dedicated functions. \n",
    "\n",
    "Once the pan-GSMM is finalized, it is used to derive a strain-specific GSMM for each input genome or proteome, exploiting the gene clusters information, eventually granting biomass production on a set of user-defined growth media. At this point, auxotrophies and growth-enabling substrates can be predicted, and [Biolog® screenings](https://www.biolog.com/products/metabolic-characterization-microplates/microbial-phenotype/) can be simulated.\n",
    "\n",
    "Finally, specific functions of the Gempipe [API](autoapi/gempipe/interface/index) can be used to analyze the deck of strain-specific GSMMs: phylometabolic trees can be created, strains can be divided in homogeneous metabolic groups, discriminative metabolic features can be extracted, core metabolism of species can be identified, etc.\n",
    "\n",
    "## Components and workflow\n",
    "\n",
    "Gempipe is composed by 3 command-line programs and an API. The Gempipe workflow is divided in four parts: \n",
    "\n",
    "* [**Part 1.**](part_1_gempipe_recon.ipynb) Creation of the **draft pan-GSMM** and the presence/absence matrix (**PAM**), starting either from genomes or proteomes (command line program [`gempipe recon`](part_1_gempipe_recon.ipynb)).\n",
    "* [**Part 2.**](part_2_manual_curation.ipynb) Manual curation of the draft pan-GSMM, for example using functions provided by the [Gempipe API](autoapi/gempipe/interface/index). \n",
    "* [**Part 3.**](part_3_gempipe_derive.ipynb) Derivation of strain-specific GSMMs, starting from the PAM and the curated pan-GSMM (command line program [`gempipe derive`](part_3_gempipe_derive.ipynb)). \n",
    "* [**Part 4.**](part_4_multi_strain_analysis.ipynb) Analysis of the deck of strain-specific GSMMs, for example using functions provided by the [Gempipe API](autoapi/gempipe/interface/index). \n",
    "\n",
    "As a (_discouraged_) alternative to the manual curation, the additional command line program [`gempipe autopilot`](gempipe_autopilot.ipynb) is provided, which internally calls [`gempipe recon`](part_1_gempipe_recon.ipynb) and [`gempipe derive`](part_3_gempipe_derive.ipynb), linking them together performing an automated gap-filling on the draft pan-GSMM. \n",
    "\n",
    "Below it is reported the **interactive** flowchart of Gempipe. It can be **zoomed** and **panned** to see the details. Some nodes are **clickable** and point to the corresponding doc section. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "1b67e8db-da1c-4830-85f1-e788ae66cae2",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%aimport gempipe, gempipe.flowchart\n",
    "%autoreload 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "a962bf80-7c09-4868-bd4c-d87229e79691",
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <style> #outcellbox {display: flex; justify-content: center; overflow: hidden; width: 99%; height: 300px; background-color: #ffffff; border: 1px solid grey;} </style>\n",
       "        \n",
       "        <script src=\"https://unpkg.com/@panzoom/panzoom@4.5.1/dist/panzoom.min.js\"></script>\n",
       "\n",
       "        <div class=\"mermaid-dacbb39d-7697-403c-a5ed-a107fbe1e98e\" id=\"outcellbox\"></div> \n",
       "        <script type=\"module\">\n",
       "            import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10.6.1/+esm';\n",
       "            const graphDefinition = 'flowchart LR \\n\\nsubgraph Part_1[Part 1]\\n\\nInput_proteomes{{-p/--proteomes}} --> Proteomes --> BRH\\nInput_genbanks{{-gb/--genbanks}} --> Proteomes\\nInput_genomes{{-g/--genomes}} --> initial_Genomes\\nInput_taxids{{-t/--taxids}} --> NCBI_download --> initial_Genomes{{initial_Genomes}}\\nInput_ref_proteome{{-rp/--refproteome}} --> BRH\\nInput_ref_model{{-rm/--refmodel}} --> BRH --> translated_ref{{translated_ref}}\\nInput_manual_corrections{{-mc/--mancor}} --> expanded_RB_recon\\n\\ninitial_Genomes --> CDS_prediction --> initial_Proteomes{{initial_Proteomes}} --> Bio_metrics --> Busco(Busco)\\ninitial_Genomes --> Tech_metrics --> N50(N50) & n_contigs(n_contigs)\\nN50 & n_contigs & Busco & initial_Genomes & initial_Proteomes --> Filtering \\nFiltering --> Genomes{{Genomes}} & Proteomes{{Proteomes}}\\nBusco_db{{Busco_db}} --> Bio_metrics\\n\\nProteomes --> Clustering --> initial_PAM{{initial_PAM}} & initial_Ref_seqs{{initial_Ref_seqs}}\\ninitial_PAM & initial_Ref_seqs & Proteomes & Genomes --> Gene_recovery\\n\\n\\nGene_recovery --> PAM{{PAM}} & Ref_seqs{{Ref_seqs}}\\nRef_seqs --> Func_annot --> Gene_functions{{Gene_functions}} --> RF_recon\\nBiGG_genes{{BiGG_genes}} & Ref_seqs & Universe{{Universe}} --> RF_recon --> RF_draft_panmodel{{RF_draft_panmodel}}\\neggNOG_db{{eggNOG_db}} --> Func_annot\\n\\nRF_draft_panmodel & translated_ref --> expanded_RB_recon --> Initial_draft_panmodel{{Initial_draft_panmodel}}\\nRF_draft_panmodel -. reference not provided .-> Initial_draft_panmodel\\nInitial_draft_panmodel --> TCDB_expansion --> reannotation[\"reannotation\"] --> deduplication --> draft_panmodel{{draft_panmodel}}\\n\\nend\\n\\n\\n\\nclick Input_proteomes href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#starting-from-proteomes\" \"Link\"\\nclick Input_genomes href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#starting-from-genomes\" \"Link\"\\nclick Input_taxids href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#starting-from-genomes\" \"Link\"\\nclick Input_genbanks href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#starting-from-genbanks\" \"Link\"\\n\\n\\nclick Filtering href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#filtering-the-genomes\" \"Link\"\\nclick Gene_recovery href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#gene-recovery\" \"Link\"\\nclick RF_recon href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#draft-pan-gsmm-reference-free-reconstruction\" \"Link\"\\nclick expanded_RB_recon href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#draft-pan-gsmm--expanded-reference-based-reconstruction\" \"Link\"\\nclick TCDB_expansion href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#transport-reactions-expansion-with-tcdb\" \"Link\"\\nclick reannotation href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#annotation-and-duplicates-removal\" \"Link\"\\nclick deduplication href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#annotation-and-duplicates-removal\" \"Link\"\\n\\n\\nstyle Part_1 fill:white\\n\\n\\n\\n\\nstyle Input_proteomes fill:lightgreen\\nstyle Input_genomes fill:lightgreen\\nstyle Input_taxids fill:lightgreen\\nstyle Input_genbanks fill:lightgreen\\nstyle Input_ref_proteome fill:lightgreen\\nstyle Input_ref_model fill:lightgreen\\nstyle Input_manual_corrections fill:lightgreen\\n\\nstyle initial_Genomes fill:lightblue \\nstyle Busco fill:burlywood\\nstyle N50 fill:burlywood\\nstyle n_contigs fill:burlywood\\nstyle Genomes fill:lightblue\\nstyle initial_Proteomes fill:lightblue\\nstyle Proteomes fill:lightblue\\nstyle initial_PAM fill:lightblue\\nstyle initial_Ref_seqs fill:lightblue\\nstyle PAM fill:gold\\nstyle Ref_seqs fill:lightblue\\nstyle BiGG_genes fill:violet\\nstyle Busco_db fill:violet\\nstyle Universe fill:violet\\nstyle eggNOG_db fill:violet\\nstyle RF_draft_panmodel fill:lightblue\\nstyle translated_ref fill:lightblue\\nstyle Initial_draft_panmodel fill:lightblue\\nstyle draft_panmodel fill:lightblue\\nstyle Gene_functions fill:lightblue\\n\\n\\nstyle Gene_recovery fill:whitesmoke\\nstyle NCBI_download fill:whitesmoke\\nstyle CDS_prediction fill:whitesmoke\\nstyle Filtering fill:whitesmoke\\nstyle Bio_metrics fill:whitesmoke\\nstyle Tech_metrics fill:whitesmoke\\nstyle Clustering fill:whitesmoke\\nstyle BRH fill:whitesmoke\\nstyle RF_recon fill:whitesmoke\\nstyle expanded_RB_recon fill:whitesmoke\\nstyle Func_annot fill:whitesmoke\\nstyle TCDB_expansion fill:whitesmoke\\nstyle reannotation fill:whitesmoke\\nstyle deduplication fill:whitesmoke \\n\\nsubgraph Part_2[Part 2]\\n\\nManual_curation((Manual_curation)) --> panmodel{{panmodel}}\\n\\nend\\n\\ndraft_panmodel{{draft_panmodel}} --> Manual_curation((Manual_curation))\\nUniverse{{Universe}} --> Manual_curation((Manual_curation))\\nGene_functions{{Gene_functions}} --> Manual_curation((Manual_curation))\\nPAM{{PAM}} --> Manual_curation((Manual_curation))\\n\\n\\nclick Manual_curation href \"https://gempipe.readthedocs.io/en/latest/part_2_manual_curation.html\" \"Link\"\\n\\n\\n\\nstyle Part_2 fill:white\\n\\nstyle panmodel fill:gold\\n\\nstyle Manual_curation fill:whitesmoke\\nstyle draft_panmodel fill:lightblue\\nstyle Universe fill:violet\\nstyle Gene_functions fill:lightblue\\nstyle PAM fill:gold\\nsubgraph Part_3[Part 3]\\n\\nstrain_derivation --> strain_models{{strain_models}} --> gapfilling --> gf_strain_models{{gf_strain_models}} \\ngf_strain_models --> species_derivation --> species_models{{species_models}}\\ngf_strain_models --> aux_prediction --> aux{{aux}}\\ngf_strain_models --> cnps_prediction --> cnps{{cnps}}\\ngf_strain_models --> biosynth_prediction --> biosynth{{biosynth}}\\ngf_strain_models --> rpam{{rpam}}\\n\\nend\\n\\npanmodel{{panmodel}} & PAM{{PAM}} --> strain_derivation\\n\\n\\n\\nclick strain_derivation href \"https://gempipe.readthedocs.io/en/latest/part_3_gempipe_derive.html#deriving-strain-specific-models\" \"Link\"\\nclick gapfilling href \"https://gempipe.readthedocs.io/en/latest/part_3_gempipe_derive.html#gap-filling-strain-specific-models\" \"Link\"\\nclick species_derivation href \"https://gempipe.readthedocs.io/en/latest/part_3_gempipe_derive.html#deriving-species-specific-models\" \"Link\"\\nclick aux_prediction href \"https://gempipe.readthedocs.io/en/latest/part_3_gempipe_derive.html#strain-specific-screenings\" \"Link\"\\nclick cnps_prediction href \"https://gempipe.readthedocs.io/en/latest/part_3_gempipe_derive.html#strain-specific-screenings\" \"Link\"\\nclick biosynth_prediction href \"https://gempipe.readthedocs.io/en/latest/part_3_gempipe_derive.html#strain-specific-screenings\" \"Link\"\\n\\n\\n\\nstyle Part_3 fill:white\\n\\nstyle strain_models fill:lightblue\\nstyle gf_strain_models fill:gold\\nstyle species_models fill:gold\\nstyle gapfilling fill:whitesmoke\\nstyle aux_prediction fill:whitesmoke\\nstyle cnps_prediction fill:whitesmoke\\nstyle biosynth_prediction fill:whitesmoke\\nstyle panmodel fill:gold\\nstyle PAM fill:gold\\nstyle aux fill:salmon\\nstyle cnps fill:salmon\\nstyle biosynth fill:salmon\\nstyle rpam fill:salmon\\n\\nstyle strain_derivation fill:whitesmoke\\nstyle species_derivation fill:whitesmoke \\n\\n\\nsubgraph Part_4[Part 4]\\n\\nMulti_strain_analysis((Multi_strain_analysis)) \\n\\nend\\n\\ngf_strain_models{{gf_strain_models}} --> Multi_strain_analysis\\naux{{aux}} --> Multi_strain_analysis\\ncnps{{cnps}} --> Multi_strain_analysis\\nrpam{{rpam}} --> Multi_strain_analysis\\nbiosynth{{biosynth}} --> Multi_strain_analysis\\n\\n\\nclick Multi_strain_analysis href \"https://gempipe.readthedocs.io/en/latest/part_4_multi_strain_analysis.html\" \"Link\"\\n\\n\\n\\nstyle Part_4 fill:white\\n\\n\\nstyle gf_strain_models fill:gold\\nstyle aux fill:salmon\\nstyle cnps fill:salmon\\nstyle rpam fill:salmon\\nstyle biosynth fill:salmon\\nstyle Multi_strain_analysis fill:whitesmoke';\n",
       "            const element = document.querySelector('.mermaid-dacbb39d-7697-403c-a5ed-a107fbe1e98e');\n",
       "            const { svg } = await mermaid.render('graphDiv-dacbb39d-7697-403c-a5ed-a107fbe1e98e', graphDefinition);\n",
       "            element.innerHTML = svg;\n",
       "            \n",
       "            const elem = document.getElementById('graphDiv-dacbb39d-7697-403c-a5ed-a107fbe1e98e');\n",
       "            const panzoom = Panzoom(elem, {maxScale: 50});\n",
       "            panzoom.zoom(2);\n",
       "            elem.parentElement.addEventListener('wheel', panzoom.zoomWithWheel);\n",
       "        </script>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from gempipe import Flowchart\n",
    "\n",
    "file = open('flowcharts/part_1.flowchart', 'r')\n",
    "part_1 = file.read()\n",
    "file.close()\n",
    "\n",
    "file = open('flowcharts/part_2.flowchart', 'r')\n",
    "part_2 = file.read()\n",
    "file.close()\n",
    "\n",
    "file = open('flowcharts/part_3.flowchart', 'r')\n",
    "part_3 = file.read()\n",
    "file.close()\n",
    "\n",
    "file = open('flowcharts/part_4.flowchart', 'r')\n",
    "part_4 = file.read()\n",
    "file.close()\n",
    "\n",
    "header = 'flowchart LR \\n'\n",
    "flowchart = Flowchart(header + part_1 + part_2 + part_3 + part_4)\n",
    "flowchart.render(height=300, zoom=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f9ff72b-e82b-45a0-8c4a-ba438d9ca541",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}