{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "edd270d4-8321-4e78-9e8b-6163ce561319",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%aimport gempipe, gempipe.flowchart\n",
    "%autoreload 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "d3e227e3-3ffe-4b0c-870e-574b51fe4159",
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <style> #outcellbox {display: flex; justify-content: center; overflow: hidden; width: 99%; height: 300px; background-color: #ffffff; border: 1px solid grey;} </style>\n",
       "        \n",
       "        <script src=\"https://unpkg.com/@panzoom/panzoom@4.5.1/dist/panzoom.min.js\"></script>\n",
       "\n",
       "        <div class=\"mermaid-3e09f712-5004-4cd8-b018-ce975b243c75\" id=\"outcellbox\"></div> \n",
       "        <script type=\"module\">\n",
       "            import mermaid from 'https://cdn.jsdelivr.net/npm/mermaid@10.6.1/+esm';\n",
       "            const graphDefinition = 'flowchart TB \\n\\nsubgraph Part_1[Part 1]\\n\\nInput_proteomes{{-p/--proteomes}} --> Proteomes --> BRH\\nInput_genbanks{{-gb/--genbanks}} --> Proteomes\\nInput_genomes{{-g/--genomes}} --> initial_Genomes\\nInput_taxids{{-t/--taxids}} --> NCBI_download --> initial_Genomes{{initial_Genomes}}\\nInput_ref_proteome{{-rp/--refproteome}} --> BRH\\nInput_ref_model{{-rm/--refmodel}} --> BRH --> translated_ref{{translated_ref}}\\nInput_manual_corrections{{-mc/--mancor}} --> expanded_RB_recon\\n\\ninitial_Genomes --> CDS_prediction --> initial_Proteomes{{initial_Proteomes}} --> Bio_metrics --> Busco(Busco)\\ninitial_Genomes --> Tech_metrics --> N50(N50) & n_contigs(n_contigs)\\nN50 & n_contigs & Busco & initial_Genomes & initial_Proteomes --> Filtering \\nFiltering --> Genomes{{Genomes}} & Proteomes{{Proteomes}}\\nBusco_db{{Busco_db}} --> Bio_metrics\\n\\nProteomes --> Clustering --> initial_PAM{{initial_PAM}} & initial_Ref_seqs{{initial_Ref_seqs}}\\ninitial_PAM & initial_Ref_seqs & Proteomes & Genomes --> Gene_recovery\\n\\n\\nGene_recovery --> PAM{{PAM}} & Ref_seqs{{Ref_seqs}}\\nRef_seqs --> Func_annot --> Gene_functions{{Gene_functions}} --> RF_recon\\nBiGG_genes{{BiGG_genes}} & Ref_seqs & Universe{{Universe}} --> RF_recon --> RF_draft_panmodel{{RF_draft_panmodel}}\\neggNOG_db{{eggNOG_db}} --> Func_annot\\n\\nRF_draft_panmodel & translated_ref --> expanded_RB_recon --> Initial_draft_panmodel{{Initial_draft_panmodel}}\\nRF_draft_panmodel -. reference not provided .-> Initial_draft_panmodel\\nInitial_draft_panmodel --> TCDB_expansion --> reannotation[\"reannotation\"] --> deduplication --> draft_panmodel{{draft_panmodel}}\\n\\nend\\n\\n\\n\\nclick Input_proteomes href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#starting-from-proteomes\" \"Link\"\\nclick Input_genomes href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#starting-from-genomes\" \"Link\"\\nclick Input_taxids href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#starting-from-genomes\" \"Link\"\\nclick Input_genbanks href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#starting-from-genbanks\" \"Link\"\\n\\n\\nclick Filtering href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#filtering-the-genomes\" \"Link\"\\nclick Gene_recovery href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#gene-recovery\" \"Link\"\\nclick RF_recon href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#draft-pan-gsmm-reference-free-reconstruction\" \"Link\"\\nclick expanded_RB_recon href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#draft-pan-gsmm--expanded-reference-based-reconstruction\" \"Link\"\\nclick TCDB_expansion href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#transport-reactions-expansion-with-tcdb\" \"Link\"\\nclick reannotation href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#annotation-and-duplicates-removal\" \"Link\"\\nclick deduplication href \"https://gempipe.readthedocs.io/en/latest/part_1_gempipe_recon.html#annotation-and-duplicates-removal\" \"Link\"\\n\\n\\nstyle Part_1 fill:white\\n\\n\\n\\n\\nstyle Input_proteomes fill:lightgreen\\nstyle Input_genomes fill:lightgreen\\nstyle Input_taxids fill:lightgreen\\nstyle Input_genbanks fill:lightgreen\\nstyle Input_ref_proteome fill:lightgreen\\nstyle Input_ref_model fill:lightgreen\\nstyle Input_manual_corrections fill:lightgreen\\n\\nstyle initial_Genomes fill:lightblue \\nstyle Busco fill:burlywood\\nstyle N50 fill:burlywood\\nstyle n_contigs fill:burlywood\\nstyle Genomes fill:lightblue\\nstyle initial_Proteomes fill:lightblue\\nstyle Proteomes fill:lightblue\\nstyle initial_PAM fill:lightblue\\nstyle initial_Ref_seqs fill:lightblue\\nstyle PAM fill:gold\\nstyle Ref_seqs fill:lightblue\\nstyle BiGG_genes fill:violet\\nstyle Busco_db fill:violet\\nstyle Universe fill:violet\\nstyle eggNOG_db fill:violet\\nstyle RF_draft_panmodel fill:lightblue\\nstyle translated_ref fill:lightblue\\nstyle Initial_draft_panmodel fill:lightblue\\nstyle draft_panmodel fill:lightblue\\nstyle Gene_functions fill:lightblue\\n\\n\\nstyle Gene_recovery fill:whitesmoke\\nstyle NCBI_download fill:whitesmoke\\nstyle CDS_prediction fill:whitesmoke\\nstyle Filtering fill:whitesmoke\\nstyle Bio_metrics fill:whitesmoke\\nstyle Tech_metrics fill:whitesmoke\\nstyle Clustering fill:whitesmoke\\nstyle BRH fill:whitesmoke\\nstyle RF_recon fill:whitesmoke\\nstyle expanded_RB_recon fill:whitesmoke\\nstyle Func_annot fill:whitesmoke\\nstyle TCDB_expansion fill:whitesmoke\\nstyle reannotation fill:whitesmoke\\nstyle deduplication fill:whitesmoke \\n\\n';\n",
       "            const element = document.querySelector('.mermaid-3e09f712-5004-4cd8-b018-ce975b243c75');\n",
       "            const { svg } = await mermaid.render('graphDiv-3e09f712-5004-4cd8-b018-ce975b243c75', graphDefinition);\n",
       "            element.innerHTML = svg;\n",
       "            \n",
       "            const elem = document.getElementById('graphDiv-3e09f712-5004-4cd8-b018-ce975b243c75');\n",
       "            const panzoom = Panzoom(elem, {maxScale: 50});\n",
       "            panzoom.zoom(2);\n",
       "            elem.parentElement.addEventListener('wheel', panzoom.zoomWithWheel);\n",
       "        </script>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from gempipe import Flowchart\n",
    "\n",
    "file = open('flowcharts/part_1.flowchart', 'r')\n",
    "header = 'flowchart TB \\n'\n",
    "flowchart = Flowchart(header + file.read())\n",
    "file.close()\n",
    "flowchart.render(height=300, zoom=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fa5b804-736f-4203-89ff-26bcdb958753",
   "metadata": {},
   "source": [
    "# Part 1: gempipe recon\n",
    "\n",
    "`gempipe recon` is designed to reconstruct a **draft pan-GSMM** and a presence/absence matrix (**PAM**), starting either from genomes or proteomes. Below are shown all its options:  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "d0f4ae97-ca78-4732-826d-90d56eb65e8a",
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "usage: gempipe recon [-h] [-v] [-c] [-o] [--verbose] [--overwrite] [--dbs]\n",
      "                     [-t] [-g] [-p] [-gb] [-s] [-b] [--buscoM] [--buscoF]\n",
      "                     [--ncontigs] [--N50] [--identity] [--coverage] [-rm]\n",
      "                     [-rp] [-rs] [-mc] [--tcdb] [--dedup] [--norec] [--dbmem]\n",
      "                     [--sbml] [--nofig] [-md]\n",
      "\n",
      "gempipe v1.38.4, please cite \"TODO\". Full documentation available at\n",
      "https://gempipe.readthedocs.io/en/latest/index.html.\n",
      "\n",
      "optional arguments:\n",
      "  -h, --help            Show this help message and exit.\n",
      "  -v, --version         Show version number and exit.\n",
      "  -c , --cores          Number of parallel processes to use. (default: 1)\n",
      "  -o , --outdir         Main output directory (will be created if not\n",
      "                        existing). (default: ./)\n",
      "  --verbose             Make stdout messages more verbose, including debug\n",
      "                        messages. (default: False)\n",
      "  --overwrite           Delete the working/ directory at the startup.\n",
      "                        (default: False)\n",
      "  --dbs                 Path were the needed databases are stored (or\n",
      "                        downloaded if not already existing). (default:\n",
      "                        ./working/dbs/)\n",
      "  -t , --taxids         Taxids of the species to model (comma separated, for\n",
      "                        example '252393,68334'). (default: -)\n",
      "  -g , --genomes        Input genome files or folder containing the genomes\n",
      "                        (see documentation). (default: -)\n",
      "  -p , --proteomes      Input proteome files or folder containing the\n",
      "                        proteomes (see documentation). (default: -)\n",
      "  -gb , --genbanks      Input genbank files (.gb, .gbff) or folder containing\n",
      "                        the genbanks (see documentation). (default: -)\n",
      "  -s , --staining       Gram staining, 'pos' or 'neg'. (default: neg)\n",
      "  -b , --buscodb        Busco database to use ('show' to see the list of\n",
      "                        available databases). (default: bacteria_odb10)\n",
      "  --buscoM              Maximum number of missing Busco's single copy\n",
      "                        orthologs (absolute or percentage). (default: 2%)\n",
      "  --buscoF              Maximum number of fragmented Busco's single copy\n",
      "                        orthologs (absolute or percentage). (default: 100%)\n",
      "  --ncontigs            Maximum number of contigs allowed per genome.\n",
      "                        (default: 200)\n",
      "  --N50                 Minimum N50 allowed per genome. (default: 50000)\n",
      "  --identity            Minimum percentage amino acidic sequence identity to\n",
      "                        use when aligning against the BiGG gene database.\n",
      "                        (default: 30)\n",
      "  --coverage            Minimum percentage coverage to use when aligning\n",
      "                        against the BiGG gene database. (default: 70)\n",
      "  -rm , --refmodel      Model to be used as reference. (default: -)\n",
      "  -rp , --refproteome   Proteome to be used as reference. (default: -)\n",
      "  -rs , --refspont      Reference gene marking spontaneous reactions.\n",
      "                        (default: spontaneous)\n",
      "  -mc , --mancor        Manual corrections to apply during the reference\n",
      "                        expansion. (default: -)\n",
      "  --tcdb                Experimental feature: try to build transport reactions\n",
      "                        using TCDB. (default: False)\n",
      "  --dedup               Try to remove duplicate metabolites and reactions\n",
      "                        using MNX annotation, when a reference is provided.\n",
      "                        (default: False)\n",
      "  --norec               Skip gene recovery when starting from genomes.\n",
      "                        (default: False)\n",
      "  --dbmem               Load the entire eggNOG-mapper database into memory\n",
      "                        (should speed up the functional annotation step).\n",
      "                        (default: False)\n",
      "  --sbml                Save the output GSMMs in SBML format (L3V1 FBC2) in\n",
      "                        addition to JSON. (default: False)\n",
      "  --nofig               Skip the generation of figures. (default: False)\n",
      "  -md , --metadata      Table for manual correction of genome metadata.\n",
      "                        (default: -)\n"
     ]
    }
   ],
   "source": [
    "import subprocess\n",
    "\n",
    "command = f\"\"\"gempipe recon -h\"\"\"\n",
    "process = subprocess.Popen(command, shell=True)\n",
    "response = process.wait()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1160a253-c948-4f93-885c-1f5037e145b1",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "tags": []
   },
   "source": [
    "## Starting from genomes\n",
    "\n",
    "Gempipe accepts both genomes and proteomes. Each file in input is assumed to belong to a different strain. There are several options to specifiy the input genomes: \n",
    "\n",
    "* Specify a **folder** containing the input genomes. They are all assumed to belong to the same species. For example: \n",
    "\n",
    "```bash\n",
    "    gempipe recon -g my_genomes/\n",
    "```\n",
    "\n",
    "* Specify a list of input genome **files** (comma separated). They are all assumed to belong to the same species. For example: \n",
    "\n",
    "```bash\n",
    "    gempipe recon -g my_genomes/GCA_001689725.1.fa,my_genomes/GCA_001756855.1.fa,my_genomes/GCA_003058285.2.fa\n",
    "```\n",
    "\n",
    "* Specify a list of input genome files, **grouped** by species. This follows a rigid sintax: `SpA@G1,G2,G3+SpB@G4,G5+SpC@G6,G7...`. Please replace spaces with underscores. For example:\n",
    "\n",
    "```bash\n",
    "    gempipe recon -g Erwinia_dacicola@my_genomes/GCA_001689725.1.fa,my_genomes/GCA_001756855.1.fa,my_genomes/GCA_003058285.2.fa+Erwinia_aphidicola@my_genomes/GCA_016925695.1.fa,my_genomes/GCA_024169515.1.fa\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9f0e959f-a5ee-4d9b-b364-e6d293820022",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "tags": []
   },
   "source": [
    "## Starting from species taxids\n",
    "\n",
    "As an alternative starting point, it is possible to insert the list of NCBI Species Taxids under study (comma separated). All the available genome assemblies under those taxids will be **automatically dowloaded** from NCBI, and grouped based on the species. For example: \n",
    "\n",
    "```bash\n",
    "    gempipe recon -t 252393,68334\n",
    "```\n",
    "\n",
    "💡 **Tip!** Each species appearing on NCBI is linked to its specific NCBI Species Taxid. For example, to retrieve the taxid of _Erwinia aphidicola_, enter the [NCBI Taxonomy Database](https://www.ncbi.nlm.nih.gov/taxonomy) and type the name of the species. Once the webpage for the species is [opened](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?command=show&mode=node&id=68334), the taxid can be found right below its name: 68334 for _E. aphidicola_!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d583c8f5-9084-4f5c-862c-834b78d7cda9",
   "metadata": {},
   "source": [
    "## Filtering the genomes\n",
    "\n",
    "Bad-quality genomes in input are detected and **ignored** in subsequent analysis. Bad-quality is defined by 4 metrics, which can be either **techinical** or **biological**: \n",
    "\n",
    "1. The maximum **number of contigs** per genome (`--ncontigs`).\n",
    "2. The minimum **N50** per genome (`--N50`).\n",
    "3. The maximum number of Busco's **missing** single-copy orthologs per genome (`--buscoM`) (it can be expressed also in percentage). \n",
    "4. The maximum number of Busco's **fragmented** single-copy orthologs per genome (`--buscoF`) (it can be expressed also in percentage). \n",
    "\n",
    "Please note that the total number of evaluated Busco's single-copy orthologs depends on the selected Busco **database**. The default database is `bacteria_odb10`, but it is strongly suggested to select a more **specific** database according to the organisms under study. This way, the number the evaluated Busco's single-copy orthologs will **increase**. To change the Busco database, use `-b`/`--buscodb`. \n",
    "\n",
    "💡 **Tip!** To read the list of available Busco databases, type `gempipe recon --buscodb show`.\n",
    "\n",
    "The example below deals with many metagenome-derived assemblies: to avoid the loss of key assemblies, the default thresholds are relaxed:\n",
    "\n",
    "```bash\n",
    "    gempipe recon -t 252393,68334 -b enterobacterales_odb10 --ncontigs 2000 --N50 5000 --buscoM 10% --buscoF 15%\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e77a53dd-a93c-40a3-8649-3a55c0e3aa82",
   "metadata": {},
   "source": [
    "## Starting from proteomes\n",
    "\n",
    "Gempipe gives its best when starting from genomes, as a gene recovery procedure is applied. Anyway, starting **directly** from proteomes is also allowed (see flowchart), if reliable genome annotations are available. A proteome is here defined as a multifasta file containing all the strain-specific gene sequences written in aminoacidic format. There are several options to specifiy the input proteomes: \n",
    "\n",
    "* Specify a **folder** containing the input proteomes. They are all assumed to belong to the same species. For example:\n",
    "\n",
    "```bash\n",
    "    gempipe recon -p my_proteomes/\n",
    "```\n",
    "\n",
    "* Specify a list of input proteome **files** (comma separated). They are all assumed to belong to the same species. For example:\n",
    "\n",
    "```bash\n",
    "    gempipe recon -p my_proteomes/GCA_001689725.1.fa,my_proteomes/GCA_001756855.1.fa,my_proteomes/GCA_003058285.2.fa\n",
    "```\n",
    "\n",
    "* Specify a list of input proteome files, **grouped** by species. This follows a rigid sintax: `SpA@P1,P2,P3+SpB@P4,P5+SpC@P6,P7...`. Please replace spaces with underscores. For example:\n",
    "\n",
    "```bash\n",
    "    gempipe recon -p Erwinia_dacicola@my_proteomes/GCA_001689725.1.fa,my_proteomes/GCA_001756855.1.fa,my_proteomes/GCA_003058285.2.fa+Erwinia_aphidicola@my_proteomes/GCA_016925695.1.fa,my_proteomes/GCA_024169515.1.fa\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "33af101b-ce5b-4073-a68d-3faeac15342b",
   "metadata": {},
   "source": [
    "## Starting from genbanks\n",
    "\n",
    "Proteomes can be also passed as genbank files. In this case, Gempipe will try to recover gene annotations to boost the [Memote metrics](https://memote.readthedocs.io/en/latest/index.html) under the \"gene annotation\" section. There are several options to specifiy the input genbanks: \n",
    "\n",
    "* Specify a **folder** containing the input genbanks. They are all assumed to belong to the same species. For example:\n",
    "\n",
    "```bash\n",
    "    gempipe recon -gb my_genbanks/\n",
    "```\n",
    "\n",
    "* Specify a list of input genbank **files** (comma separated). They are all assumed to belong to the same species. For example:\n",
    "\n",
    "```bash\n",
    "    gempipe recon -gb my_genbanks/GCA_001689725.1.gb,my_genbanks/GCA_001756855.1.gb,my_genbanks/GCA_003058285.2.gb\n",
    "```\n",
    "\n",
    "* Specify a list of input genbank files, **grouped** by species. This follows a rigid sintax: `SpA@P1,P2,P3+SpB@P4,P5+SpC@P6,P7...`. Please replace spaces with underscores. For example:\n",
    "\n",
    "```bash\n",
    "    gempipe recon -gb Erwinia_dacicola@my_genbanks/GCA_001689725.1.gb,my_genbanks/GCA_001756855.1.gb,my_genbanks/GCA_003058285.2.gb+Erwinia_aphidicola@my_genbanks/GCA_016925695.1.gb,my_genbanks/GCA_024169515.1.gb\n",
    "```\n",
    "\n",
    "💡 **Tip!** Start from genbanks if the highest [Memote metrics](https://memote.readthedocs.io/en/latest/index.html) are desired: only with this input mode, Gempipe will try to improve also the \"gene annotation\" section."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1213f022-9b39-4710-80f8-1a89d541fc73",
   "metadata": {},
   "source": [
    "## Creation of a PAM\n",
    "\n",
    "Genes are **clustered** based on their amino acidic sequence **identity**. Each cluster has one **representative** sequence. \n",
    "\n",
    "Following the clustering, a presence/absence matrix (**PAM**) is derived. The PAM has clusters in row and strains in column. If a strain harbors two or more genes belonging to the same cluster, those will appear in the same cell separated by a **semicolon** (`;`). As an example, a PAM with just 3 strains is reported below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7b153734-7d56-4f48-9122-6a3fd5f841f1",
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>GCA_918698235.1</th>\n",
       "      <th>GCA_014773485.1</th>\n",
       "      <th>GCA_024169515.1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Cluster_0</th>\n",
       "      <td>LBGCNHPF_02539</td>\n",
       "      <td>OEIEKCCN_02232</td>\n",
       "      <td>OCILAMFM_00111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cluster_1</th>\n",
       "      <td>NaN</td>\n",
       "      <td>OEIEKCCN_03649</td>\n",
       "      <td>OCILAMFM_04027</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cluster_2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>OEIEKCCN_04206</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cluster_3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>OCILAMFM_04171</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cluster_4</th>\n",
       "      <td>LBGCNHPF_03507</td>\n",
       "      <td>OEIEKCCN_03842</td>\n",
       "      <td>OCILAMFM_04138</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cluster_5475</th>\n",
       "      <td>LBGCNHPF_02700</td>\n",
       "      <td>OEIEKCCN_04289</td>\n",
       "      <td>OCILAMFM_01419</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cluster_5476</th>\n",
       "      <td>LBGCNHPF_01132</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cluster_5477</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>OCILAMFM_04217</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cluster_5478</th>\n",
       "      <td>LBGCNHPF_03644</td>\n",
       "      <td>OEIEKCCN_00956</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cluster_5479</th>\n",
       "      <td>LBGCNHPF_00113</td>\n",
       "      <td>OEIEKCCN_03524;OEIEKCCN_04081</td>\n",
       "      <td>OCILAMFM_03912</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5460 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "             GCA_918698235.1                GCA_014773485.1 GCA_024169515.1\n",
       "Cluster_0     LBGCNHPF_02539                 OEIEKCCN_02232  OCILAMFM_00111\n",
       "Cluster_1                NaN                 OEIEKCCN_03649  OCILAMFM_04027\n",
       "Cluster_2                NaN                 OEIEKCCN_04206             NaN\n",
       "Cluster_3                NaN                            NaN  OCILAMFM_04171\n",
       "Cluster_4     LBGCNHPF_03507                 OEIEKCCN_03842  OCILAMFM_04138\n",
       "...                      ...                            ...             ...\n",
       "Cluster_5475  LBGCNHPF_02700                 OEIEKCCN_04289  OCILAMFM_01419\n",
       "Cluster_5476  LBGCNHPF_01132                            NaN             NaN\n",
       "Cluster_5477             NaN                            NaN  OCILAMFM_04217\n",
       "Cluster_5478  LBGCNHPF_03644                 OEIEKCCN_00956             NaN\n",
       "Cluster_5479  LBGCNHPF_00113  OEIEKCCN_03524;OEIEKCCN_04081  OCILAMFM_03912\n",
       "\n",
       "[5460 rows x 3 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pnd\n",
    "\n",
    "pnd.read_csv('tutoring_materials/aphidicola/pam.csv', index_col=0)#.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1e3586d-ec44-4fd8-8a87-59df3f0bc589",
   "metadata": {},
   "source": [
    "## Gene recovery\n",
    "\n",
    "When starting from genomes (`-t` or `-g`), Gempipe executes an extensive gene recovery composed by 3 subsequent modules. To skip the gene recovery, activate the `--norec` option. To have more details on the gene recovery implementation, please read Methods in the [Gempipe paper](how_to_cite.ipynb).\n",
    "\n",
    "* **Module 1.** Sometimes a gene in a genome undergo little mutations, such as base insertions, deletions or sobstitutions. This will probably lead to the formation of **premature stop** codons in the middle of the sequence. This in turn will lead to the prediction of two or more coding sequences, instead of one, from the same genomic region. Anyway, a similar behaviour can be also be due to sequencing or assembling **errors**. Gempipe search for proteins broken in **two** high-identity pieces, and **assumes** the broke up to be due to technical (non-biological) issues. Sequences recovered in this way have a `_frag` suffix in the PAM. \n",
    "\n",
    "* **Module 2.** The prediction of coding sequences in a genome (gene calling) may not be perfect. Some genes may simply not be seen. To cope with this problem, Gempipe searches for extra coding sequences by aligning representative sequences on the genome **masked** from its predicted genes. Like in all recovery modules, only high-identity and high-coverage alignments are considered. Sequences recovered in this way have a `_refound` suffix in the PAM. \n",
    "\n",
    "* **Module 3.** From a bioinformatics point of view, sometimes a coding sequence ends **internally** to another coding sequence. In these cases, gene callers may have problems in distinguishing the two overlapping genes. Module 3 takes the short fragments recovered in Module 2, and seeks to extend them over the previously masked regions (where other genes were originally annotated by the gene caller). Sequences recovered in this way have a `_overlap` suffix in the PAM. \n",
    "\n",
    "⏩ **Warning!** Gempipe gives its best when starting from genomes: starting from proteomes (multifasta or genbank) will **skip** the gene recovery. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52296514-bf18-41ae-a59d-cfd9cf6c0d53",
   "metadata": {},
   "source": [
    "## Draft pan-GSMM reference-free reconstruction\n",
    "\n",
    "Gempipe reconstructions are based on the [BiGG database](http://bigg.ucsd.edu), preferred over [ModelSEED](https://modelseed.org/biochem/compounds) because of its **human readable** IDs (compare for example the BiGG ID for glucose `glc__D` with its ModelSEED counterpart `cpd00027`). These turn out to be particularly convenient when drawing metabolic maps with [Escher](https://escher.github.io) or [Escher-FBA](https://sbrg.github.io/escher-fba/).\n",
    "\n",
    "Gempipe uses the representative sequences of the clusters to create a reference-**free** draft pan-GSMM using the same [CarveMe](https://carveme.readthedocs.io/en/latest/index.html) gene database and **universes**, but applying different reconstruction rules. To choose between gram positive and gram negative reconstructions, use the `-s`/`--staining` option.\n",
    "\n",
    "During the reconstruction process, only reactions with genetic support are included, and no automatic gap-filling is performed, this way **minimizing** the number of **false-positives**. This approach goes in the opposite direction compared to other famous tools like [CarveMe](https://carveme.readthedocs.io/en/latest/index.html), which include an _unskippable_ gap-filling algorithm needed to ensure \"FBA-ready\" models in output: this forced gap-filling might be the cause of the inclusion of false-positive reactions (please read Results in the [Gempipe paper](how_to_cite.ipynb)).\n",
    "\n",
    "Note that a BiGG reaction can appear in different BiGG models with different **protein complex definitions** encoded in the GPR. With `gempipe recon`, a reaction is copied from the selected universe only if **at least one** of the alternative original protein complexes definitions is **fully** satisfied. Moreover, if two or more slightly different genes align equally well on the same BiGG gene, all the involved GPRs will take into account these alternative **isoforms**. These two design decisions should lead to more accurate GPRs, solving two known CarveMe issues, [#180](https://github.com/cdanielmachado/carveme/issues/180) and [#182](https://github.com/cdanielmachado/carveme/issues/182). \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ca5a804-2487-4f0e-88aa-ccd9fa3d03ae",
   "metadata": {},
   "source": [
    "## Draft pan-GSMM expanded reference-based reconstruction\n",
    "\n",
    "Gempipe accetps in input an **optional reference GSMM** (any type: pan or strain-specific): the reference GSMM itself is provided with `-rm`/`--refmodel`, while the associated proteome (aminoacidic multifasta) is provided with `-rp`/`--refproteome`. Providing a reference, the reference-free reconstruction previously produced (see above) is used to expand the reference itself with new genes and reactions. \n",
    "\n",
    "In \"pure\" reference-based reconstructions, **orthologs** are determined between each strain and the reference, for example via a best-reciprocal hits alignment (**BRH**), as described in the [2019 Nature Protocol Extension](https://doi.org/10.1038/s41596-019-0254-3) and implemented in a pipeline named [Bactabolize](https://doi.org/10.7554/eLife.87406.2). Then, for each strain, a **copy** of the reference is **subtracted** from the genes missing an ortholog, leading to the retention of reactions specific to the strain to model. Since it is **rare** to dispose **in advance** of a manually curated _pan-model_ for the species under study, usually the curated reference used in the BRH approach is a **strain-specific** model. This implies that the strain to model will harbour only a **subset** of the reference strain's reactions. This clearly limits the exploration of the biodiversity at the strain level. Therefore, reference-based reconstructions made with `gempipe recon` are always expanded with new reactions coming from the previously produced reference-free reconstruction (see flowchart).\n",
    "\n",
    "First, for each strain to model, a BRH with the reference is performed. This enables a \"**translation**\" of reference gene IDs into the respective cluster IDs (rows of the PAM). Then, the translated reference GSMM is **expanded** with new reactions and genes coming from the reference-**free** reconstruction. After this expansion process, the expanded reference GSMM becomes a draft pan-GSMM (see flowchart). \n",
    "\n",
    "In these reference-based reconstructions, the reference is the cornerstone of the reconstruction, providing the biomass equation, the NGAM energy (non-growth associated maintainence energy), all the chemical formulas and charges assigned to metabolites, and so on. Therefore, the transfer of reactions from the reference-free reconstruction to the reference GSMM is not conducted blindly, but follows some principles: \n",
    "\n",
    "* If the reaction ID is already included in the reference, the corresponding GPR is updated including eventual new gene clusters not yet included.\n",
    "\n",
    "* If the reaction ID is not included in the reference, **synonym** reactions are searched, having the same reactant and product IDs, ignoring protons. If a synonym reaction is found in the reference, the expansion is limited to new gene clusters that may be included in its GPR.\n",
    "\n",
    "* If a reaction is new, it is added to the reference model. New metabolites and gene clusters are also transferred if needed.\n",
    "\n",
    "During the reaction transfer, if a metabolite is encoded both in the reference and in the reference-free reconstruction with same ID but different formula and/or charge, the metabolite definition of the reference is maintained. During the reference expansion, however, users are allowed to **superimpose** specific reaction balances and metabolite formulas and charges using the `-m`/`--mancor` option. This will provide a text file of **manual corrections** to be applied during the reference expansion, particularly useful to avoid unbalanced reactions and stoichiometric inconsistencies. It describes one rule per line, following a rigid syntax: `rule.ID:value`. Rules are divided in 4 categories:\n",
    "\n",
    "* Formula of a metabolite, with the rule `formula`. For example:\n",
    "\n",
    "```bash\n",
    "    formula.isocapcoa:C27H42N7O17P3S\n",
    "```\n",
    "\n",
    "* Charge of a metabolite, with the rule `charge`. For example:\n",
    "\n",
    "```bash\n",
    "    charge.cdigmp:-2\n",
    "```\n",
    "\n",
    "* Definition of a reaction, with the rule `reaction`. For example:\n",
    "\n",
    "```bash\n",
    "    reaction.NTD12:dimp_c + h2o_c --> din_c + pi_c + h_c\n",
    "```\n",
    "\n",
    "* Exclusion of a reaction from the transfer, with the rule `blacklist`. For example:\n",
    "\n",
    "```bash\n",
    "    blacklist.LIPO3S24_BS\n",
    "```\n",
    "\n",
    "Plase note that metabolite IDs must be provided **without** compartment in the `formula` and `charge` rules. It is possible to comment a line (that is, to ignore a rule) prepending a `%` at the beginnning of the line. \n",
    "\n",
    "💡 **Tip!** Often, an artifact gene marking all the spontanous reactions is utilized in GSMMs (usually `spontaneous`, literally). This gene, here called the \"spontaneous\" gene, is useful to distinguish spontaneous reactions from reactions that remained without GPR (orphans). In `gempipe recon`, the \"spontaneous\" gene adopted in the reference can be specified using `-rs`/`--refspont`."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed0b67b5-c947-4fc5-94db-4a764d1c5a9b",
   "metadata": {},
   "source": [
    "## Transport reactions expansion with TCDB\n",
    "\n",
    "Since the goodness of qualitative growth simulations usually depends on transport reactions, and since the BiGG gene database is small and biased towards model organisms, `gempipe recon` provides an optional expansion of the tranport reactions using the `--tcdb` switch (still an experimental implementation). \n",
    "\n",
    "Briefly, representative sequences of the clusters are aligned on the entire TCDB gene database. For many transport families, BiGG-compatible reaction strings were generated in advance. These are imported into a GSMM as new reactions only if all the components of the trasport protein complex have been detected. Be default, the addition of TCDB-derived transport reactions is disabled, as they could impaire the stoichiometric consistency. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42fe6499-5209-490e-affe-14bb134ff692",
   "metadata": {},
   "source": [
    "## Annotation and duplicates removal\n",
    "\n",
    "When a draft pan-GSMM has been obtained, it is subjected to metabolites and reactions **re-annnotation** (see flowchart), performed _de novo_ using [MetaNetX](https://doi.org/10.1093/nar/gkaa992) v4.4 (MNX). This adds annotations for [HMDB](https://hmdb.ca), [KEGG](https://www.kegg.jp), [MetaCyc](https://metacyc.org), [Rhea](https://www.rhea-db.org), [ChEBI](https://www.ebi.ac.uk/chebi/), and many other databases. As a consequence, [Memote metrics](https://memote.readthedocs.io/en/latest/index.html) of the output GSMMs should improve.\n",
    "\n",
    "Then, the MNX annotation is optionally used to detect and correct eventual duplicated metabolites and reactions in the draft pan-GSMM, activating the `--dedup` option. Duplicates are present in the CarveMe universes and could be inherited directly, for example [`glc_D_B`](http://bigg.ucsd.edu/universal/metabolites/glc_D_B) and [`glc__bD`](http://bigg.ucsd.edu/universal/metabolites/glc__bD); other duplicates could be formed during the reference expansion. Duplicate metabolites are replaced by a consensus metabolite, giving **priority** to metabolites in the reference GSMM, if available. If the duplicate and the consensus metabolites are both present in the reference GSMM, they are both retained.\n",
    "\n",
    "Next, if `--dedup` is active, the MNX annotation is used to correct for **duplicate reactions**. As for duplicate metabolites, the consensus reaction is selected giving **priority** to those in the reference GSMM; if duplicates are present in the reference, they are kept. Otherwise, the GPR of the consensus reaction is updated to incorporate all the gene clusters appearing among the duplicates. By default, the metabolites and reactions deduplication is disabled, as it could impaire the stoichiometric consistency. \n",
    "\n",
    "The **re-annotated** draft-pan GSMM can be seen as the ultimate draft pan-GSMM, main output of `gempipe recon` together with the PAM."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37916c53-9ec6-41ad-a965-7a9636b2cb5c",
   "metadata": {},
   "source": [
    "## Output files\n",
    "\n",
    "`gempipe recon` produces 4 **main** output files in the specified output directory (`-o`/`--outdir`). These are required for manual curating the draft pan-GSMM using the [Gempipe API](autoapi/gempipe/interface/index) (see the function [gempipe.initialize](https://gempipe.readthedocs.io/en/latest/autoapi/gempipe/interface/gaps/index.html#gempipe.interface.gaps.initialize)):\n",
    "\n",
    "* **draft_panmodel.json.** The draft pan-GSMM, which will likely need a manual curation (see [Part 2](part_2_manual_curation.ipynb)).\n",
    "\n",
    "* **draft_panmodel.faa** representative aminoacidic sequences of gene clusters modelled in the draft pan-GSMM. \n",
    "\n",
    "* **pam.csv.** The final presence/absence matrix (PAM), accounting for the gene recovery (if performed).\n",
    "\n",
    "* **annotation.csv.** Functional annotation of gene clusters made by [eggNOG-mapper](http://eggnog-mapper.embl.de).\n",
    "\n",
    "Moreover, Gempipe produces also a working directory (`./working/`) with all the intermediate files and logs needed during reconstruction. The working directory allows Gempipe to **restart** from the last completed task in case of interruption. To avoid the use of cached files (i.e., to perform a fresh run), utilize the `--overwrite` option. Please note that the working directory can include also **~50 GB** of databases needed for the functional annotation. \n",
    "\n",
    "Some files in the working directory are higly informative, and are described below: \n",
    "\n",
    "* **filtering/tmetrics.csv.** Technical metrics (N50, n_contigs) used to filter good quality genomes.\n",
    "\n",
    "* **filtering/bmetrics.csv.** Biological metrics (BUSCO) used to filter good quality genomes.\n",
    "\n",
    "* **expansion/results.csv.** Table listing the reactions of the reference-free reconstruction (`rid`) and their fate respect to the expanding reference (`action`). If a reaction ID or a synonym is found in the reference (`rid_found` or `synonym`, respectively), then the reaction will not be used (`action:ignore`), unless some new genes have to be handled to update the GPR (`action:update_gpr`). If a reference reaction is found (either same ID or a synonym), then the presence of the exact same reactants and products (regardless protons) is checked (`same_mids`), as well as their formula and charge (`same_fc`). If the final reaction included ends up to be unbalaced, then balancing suggestions are included (`bal_suggestions`). \n",
    "\n",
    "* **duplicates/dup_m_edits.csv.** Table listing the duplicate metabolites (`duplicated`) and their replacing metabolite (`replacement`). Formula, charge, presence in the reference and presence in the reference-free reconstruction are reported both for the duplicate and the replacement. \n",
    "\n",
    "* **duplicates/dup_m_translations.csv.** Table listing the reactions (`rid`) affected by duplication removal, which are visible before (`reaction_old`) and after (`reaction_new`) the removal. Eventual new metabolites created appear under `added_mids`. If the final reaction ends up to be unbalaced, then balancing suggestions are included (`bal_suggestions`). \n",
    "\n",
    "* **duplicates/dup_r_edits.csv.** Table listing the duplicate reactions (`duplicated`) and their replacing reaction (`replacement`). Reaction definition, GPR, presence in the reference and presence in the reference-free reconstruction are reported both for the duplicate and the replacement."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "89c2ef68-a721-47c4-9b3b-c3d9ebfaf686",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}