:py:mod:`gempipe.interface.clusters`
====================================

.. py:module:: gempipe.interface.clusters


Module Contents
---------------


Functions
~~~~~~~~~

.. autoapisummary::

   gempipe.interface.clusters.silhouette_analysis
   gempipe.interface.clusters.heatmap_multilayer
   gempipe.interface.clusters.discriminant_feat



.. py:function:: silhouette_analysis(tables, figsize=(10, 5), drop_const=True, ctotest=None, forcen=None, derive_report=None, report_key='species', excludekeys=[], legend_ratio=0.7, outfile=None, verbose=False, anchor=[None, None, None], key_to_color=None)

   Perform a silhouette analysis to detect the optimal number of clusters.

   :param tables: feature table with genome accessions in columns and features in rows. Can also be a dictionary of feature tables (example: ``{'auxotrophies': aux_df, 'substrates': sub_df}``). In this case, any number of tables (pandas.DataFrame) can be used; each table has genome accessions in columns and features in rows. Directly compatible tables are `rpam.csv`, `cnps.csv`, and `aux.csv` (all produced by `gempipe derive`).
   :type tables: pandas.DataFrame
   :param figsize: width and height of the figure.
   :type figsize: int, int
   :param drop_const: if `True`, remove constant features.
   :type drop_const: bool
   :param ctotest: numbers of clusters to test (example: ``[5,7,10]`` to test five, seven and ten clusters). If `None`, every value from 2 to the number of accessions minus 1 is tested.
   :type ctotest: list
   :param forcen: force the number of clusters; otherwise the optimal number is picked according to the silhouette value.
   :type forcen: int
   :param derive_report: report table for the generation of strain-specific GSMMs, made by `gempipe derive` in the output directory (`derive_strains.csv`).
   :type derive_report: pandas.DataFrame
   :param excludekeys: keys (niches/species) not to show in the legend. Known bug: no more than one key is allowed.
   :type excludekeys: list
   :param report_key: name of the attribute (column) appearing in `derive_report`, to be compared to the metabolic clusters. Usually it is 'species' or 'niche'.
   :type report_key: str
   :param legend_ratio: space reserved for the legend.
   :type legend_ratio: float
   :param outfile: filepath used to save the image. If `None`, the image is not saved.
   :type outfile: str
   :param verbose: if `True`, print more log messages.
   :type verbose: bool
   :param anchor: list of tuples (X,Y) for customizing the position of the legends. ``None`` leaves the default positioning.
   :type anchor: list
   :param key_to_color: dict mapping each category in `report_key` to a color in the format ([0:1],[0:1],[0:1]), that is, three values each between 0 and 1. ``None`` leaves the default colors and order in the legend.
   :type key_to_color: dict

   :returns: A tuple containing:

             - matplotlib.figure.Figure: figure representing the silhouette analysis.
             - dict: genome-to-cluster associations.
             - dict: an RGB color for each cluster.
   :rtype: tuple


.. py:function:: heatmap_multilayer(tables, figsize=(10, 5), drop_const=True, derive_report=None, report_key='species', excludekeys=[], acc_to_cluster=None, cluster_to_color=None, legend_ratio=0.7, label_ratio=0.02, outfile=None, verbose=False, anchor=[None, None, None], key_to_color=None, xlabels=False)

   Create a phylo-metabolic dendrogram.

   :param tables: feature table with genome accessions in columns and features in rows. Can also be a dictionary of feature tables (example: ``{'auxotrophies': aux_df, 'substrates': sub_df}``). In this case, any number of tables (pandas.DataFrame) can be used; each table has genome accessions in columns and features in rows. Directly compatible tables are `rpam.csv`, `cnps.csv`, and `aux.csv` (all produced by `gempipe derive`).
   :type tables: pandas.DataFrame
   :param figsize: width and height of the figure.
   :type figsize: int, int
   :param drop_const: if `True`, remove constant features.
   :type drop_const: bool
   :param derive_report: report table for the generation of strain-specific GSMMs, made by `gempipe derive` in the output directory (`derive_strains.csv`).
   :type derive_report: pandas.DataFrame
   :param report_key: name of the attribute (column) appearing in `derive_report`, to be compared to the metabolic clusters. Usually it is 'species' or 'niche'.
   :type report_key: str
   :param excludekeys: keys (niches/species) not to show in the legend. Known bug: no more than one key is allowed.
   :type excludekeys: list
   :param acc_to_cluster: genome-to-cluster associations produced by `silhouette_analysis()`.
   :type acc_to_cluster: dict
   :param cluster_to_color: cluster-to-RGB color associations produced by `silhouette_analysis()`.
   :type cluster_to_color: dict
   :param legend_ratio: space reserved for the legend.
   :type legend_ratio: float
   :param label_ratio: space reserved for the Y-axis labels.
   :type label_ratio: float
   :param outfile: filepath used to save the image. If `None`, the image is not saved.
   :type outfile: str
   :param verbose: if `True`, print more log messages.
   :type verbose: bool
   :param anchor: list of tuples (X,Y) for customizing the position of the legends. ``None`` leaves the default positioning.
   :type anchor: list
   :param key_to_color: dict mapping each category in `report_key` to a color in the format ([0:1],[0:1],[0:1]), that is, three values each between 0 and 1. ``None`` leaves the default colors and order in the legend.
   :type key_to_color: dict
   :param xlabels: if `True`, show x-axis labels (feature IDs).
   :type xlabels: bool

   :returns: A tuple containing:

             - matplotlib.figure.Figure: figure representing the phylometabolic tree and associated heatmap.
             - pandas.DataFrame: table representing the binary features contained in the heatmap.
   :rtype: tuple


.. py:function:: discriminant_feat(binary_feats, acc_to_cluster, cluster_to_color, threshold=0.9)

   Extract discriminant features from clusters of strains.

   :param binary_feats: binary features table such as the one produced by `heatmap_multilayer` (genomes in rows, binary features in columns).
   :type binary_feats: pandas.DataFrame
   :param acc_to_cluster: dictionary such as the one produced by `silhouette_analysis` (accessions as keys, cluster assignments as values).
   :type acc_to_cluster: dict
   :param cluster_to_color: dictionary such as the one produced by `silhouette_analysis` (clusters as keys, colors as values).
   :type cluster_to_color: dict
   :param threshold: a feature is shown if at least one cluster has a relative frequency >= `threshold` and, at the same time, at least one other cluster has a relative frequency <= 1 - `threshold`.
   :type threshold: float

   :returns: A tuple containing:

             - matplotlib.figure.Figure: figure representing the discriminative binary features.
   :rtype: tuple
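
The ``threshold`` rule documented for `discriminant_feat` can be illustrated with a minimal, plain-Python sketch. This is not the gempipe implementation: the helper name ``discriminant_features``, the nested-dict input layout (used here in place of a pandas.DataFrame), and the accession and feature IDs are all assumptions made for illustration. It only shows how a per-cluster relative frequency could be compared against ``threshold`` and ``1 - threshold``.

```python
from collections import defaultdict


def discriminant_features(binary_feats, acc_to_cluster, threshold=0.9):
    """Hypothetical helper: return {feature: {cluster: relative frequency}}.

    binary_feats: {accession: {feature: 0 or 1}} (genomes in rows, features in columns)
    acc_to_cluster: {accession: cluster label}
    """
    # Group the accessions by their assigned cluster.
    members = defaultdict(list)
    for acc, cluster in acc_to_cluster.items():
        if acc in binary_feats:
            members[cluster].append(acc)

    features = sorted({feat for row in binary_feats.values() for feat in row})
    selected = {}
    for feat in features:
        # Relative frequency of the feature within each cluster.
        freqs = {
            cluster: sum(binary_feats[acc].get(feat, 0) for acc in accs) / len(accs)
            for cluster, accs in members.items()
        }
        high = [c for c, f in freqs.items() if f >= threshold]
        low = [c for c, f in freqs.items() if f <= 1 - threshold]
        # Keep the feature only if a "high" cluster and a different "low" cluster exist.
        if any(h != l for h in high for l in low):
            selected[feat] = freqs
    return selected


# Made-up example data: four accessions, three binary features, two clusters.
binary_feats = {
    "G1": {"EX_glc": 1, "EX_lac": 1, "aux_trp": 0},
    "G2": {"EX_glc": 1, "EX_lac": 1, "aux_trp": 0},
    "G3": {"EX_glc": 1, "EX_lac": 0, "aux_trp": 1},
    "G4": {"EX_glc": 1, "EX_lac": 0, "aux_trp": 0},
}
acc_to_cluster = {"G1": 1, "G2": 1, "G3": 2, "G4": 2}

result = discriminant_features(binary_feats, acc_to_cluster, threshold=0.9)
print(result)
```

In this toy data only ``EX_lac`` passes at ``threshold=0.9``: it is present in every genome of cluster 1 and absent from every genome of cluster 2. ``EX_glc`` is constant across clusters, and ``aux_trp`` never reaches the upper bound in any cluster, so both are discarded.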