gempipe.interface.clusters

Module Contents

Functions

silhouette_analysis(tables[, figsize, drop_const, ...])

Perform a silhouette analysis to detect the optimal number of clusters.

heatmap_multilayer(tables[, figsize, drop_const, ...])

Create a phylo-metabolic dendrogram.

discriminant_feat(binary_feats, acc_to_cluster, ...[, ...])

Extract discriminant features from clusters of strains.

gempipe.interface.clusters.silhouette_analysis(tables, figsize=(10, 5), drop_const=True, ctotest=None, forcen=None, derive_report=None, report_key='species', excludekeys=[], legend_ratio=0.7, outfile=None, verbose=False, anchor=[None, None, None], key_to_color=None)[source]

Perform a silhouette analysis to detect the optimal number of clusters.

Parameters:
  • tables (pandas.DataFrame or dict) – feature table with genome accessions in columns and features in rows. Can also be a dictionary holding any number of such tables (example: {'auxotrophies': aux_df, 'substrates': sub_df}). Directly compatible tables are: rpam.csv, cnps.csv, and aux.csv (all produced by gempipe derive).

  • figsize (int, int) – width and height of the figure.

  • drop_const (bool) – if True, remove constant features.

  • ctotest (list) – numbers of clusters to test (example: [5,7,10] to test five, seven and ten clusters). If None, all values from 2 to the number of accessions minus 1 are tested.

  • forcen (int) – force the number of clusters; otherwise the optimal number is picked according to the silhouette value.

  • derive_report (pandas.DataFrame) – report table for the generation of strain-specific GSMMs, made by gempipe derive in the output directory (derive_strains.csv).

  • excludekeys (list) – keys (niches/species) not to show in the legend. Known bug: no more than one key is allowed.

  • report_key (str) – name of the attribute (column) in derive_report to be compared with the metabolic clusters. Usually it is ‘species’ or ‘niche’.

  • legend_ratio (float) – space reserved for the legend.

  • outfile (str) – filepath to be used to save the image. If None it will not be saved.

  • verbose (bool) – if True, print more log messages.

  • anchor (list) – list of tuples (X,Y) for customizing the position of the legends. None leaves the default positioning.

  • key_to_color (dict) – dict mapping each category in report_key to a color in the format ([0:1],[0:1],[0:1]). None will leave default color and order in the legend.

Returns:

A tuple containing:
  • matplotlib.figure.Figure: figure representing the silhouette analysis.

  • dict: genome-to-cluster associations.

  • dict: an RGB color for each cluster.

Return type:

tuple
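
The input layout described above can be sketched with toy data. This is an illustration only: the accession IDs, the feature IDs and the drop_constant_features helper are hypothetical. The helper shows what drop_const=True is assumed to do, and the range shows one reading of the default ctotest (2 up to the number of accessions minus 1, inclusive). Neither is taken from gempipe's implementation.

```python
import pandas as pd

# Toy feature tables (hypothetical data): accessions in columns, features in rows.
aux_df = pd.DataFrame(
    {"GCF_1": [1, 0, 1], "GCF_2": [1, 1, 0], "GCF_3": [1, 0, 0], "GCF_4": [1, 1, 1]},
    index=["aux_his", "aux_trp", "aux_met"],
)
sub_df = pd.DataFrame(
    {"GCF_1": [1, 0], "GCF_2": [0, 0], "GCF_3": [1, 0], "GCF_4": [0, 0]},
    index=["sub_glc", "sub_xyl"],
)

# Several tables can be passed together as a dictionary.
tables = {"auxotrophies": aux_df, "substrates": sub_df}


def drop_constant_features(df):
    """Keep only the rows (features) whose value varies across accessions.

    Hypothetical helper: an assumption about the effect of drop_const=True.
    """
    return df.loc[df.nunique(axis=1) > 1]


filtered = {name: drop_constant_features(df) for name, df in tables.items()}

# Assumed default candidates when ctotest is None:
# every value from 2 to (number of accessions - 1), inclusive.
n_accessions = aux_df.shape[1]
default_ctotest = list(range(2, n_accessions))

print({name: list(df.index) for name, df in filtered.items()})
print(default_ctotest)
```

A dictionary shaped like tables (or a single DataFrame such as aux_df) is what the tables parameter expects.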

gempipe.interface.clusters.heatmap_multilayer(tables, figsize=(10, 5), drop_const=True, derive_report=None, report_key='species', excludekeys=[], acc_to_cluster=None, cluster_to_color=None, legend_ratio=0.7, label_ratio=0.02, outfile=None, verbose=False, anchor=[None, None, None], key_to_color=None, xlabels=False)[source]

Create a phylo-metabolic dendrogram with an associated heatmap.

Parameters:
  • tables (pandas.DataFrame or dict) – feature table with genome accessions in columns and features in rows. Can also be a dictionary holding any number of such tables (example: {'auxotrophies': aux_df, 'substrates': sub_df}). Directly compatible tables are: rpam.csv, cnps.csv, and aux.csv (all produced by gempipe derive).

  • figsize (int, int) – width and height of the figure.

  • drop_const (bool) – if True, remove constant features.

  • derive_report (pandas.DataFrame) – report table for the generation of strain-specific GSMMs, made by gempipe derive in the output directory (derive_strains.csv).

  • report_key (str) – name of the attribute (column) in derive_report to be compared with the metabolic clusters. Usually it is ‘species’ or ‘niche’.

  • excludekeys (list) – keys (niches/species) not to show in the legend. Known bug: no more than one key is allowed.

  • acc_to_cluster (dict) – genome-to-cluster associations produced by silhouette_analysis().

  • cluster_to_color (dict) – cluster-to-RGB color associations produced by silhouette_analysis().

  • legend_ratio (float) – space reserved for the legend.

  • label_ratio (float) – space reserved for the Y-axis labels.

  • outfile (str) – filepath to be used to save the image. If None it will not be saved.

  • verbose (bool) – if True, print more log messages.

  • anchor (list) – list of tuples (X,Y) for customizing the position of the legends. None leaves the default positioning.

  • key_to_color (dict) – dict mapping each category in report_key to a color in the format ([0:1],[0:1],[0:1]). None will leave default color and order in the legend.

  • xlabels (bool) – if True, show x-axis labels (feature IDs).

Returns:

A tuple containing:
  • matplotlib.figure.Figure: figure representing the phylometabolic tree and associated heatmap.

  • pandas.DataFrame: table representing the binary features contained in the heatmap.

Return type:

tuple
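
The key_to_color format, an RGB triplet with each channel between 0 and 1, can be built from ordinary hex color codes. The following is a minimal sketch: the hex_to_unit_rgb helper and the category names are hypothetical, and the categories would need to match the values found in the report_key column of derive_report.

```python
def hex_to_unit_rgb(hex_color):
    """Convert '#rrggbb' into an (r, g, b) tuple with channels in [0, 1].

    Hypothetical helper, not part of gempipe.
    """
    hex_color = hex_color.lstrip("#")
    return tuple(int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4))


# Hypothetical categories, as they might appear in the report_key column.
key_to_color = {
    "species_A": hex_to_unit_rgb("#ff0000"),
    "species_B": hex_to_unit_rgb("#0000ff"),
    "species_C": hex_to_unit_rgb("#808080"),
}

for key, rgb in key_to_color.items():
    print(key, rgb)
```

Because the description of key_to_color mentions both color and order in the legend, the insertion order of the dictionary may matter; that is an inference from the parameter description, not a verified behavior.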

gempipe.interface.clusters.discriminant_feat(binary_feats, acc_to_cluster, cluster_to_color, threshold=0.9)[source]

Extract discriminant features from clusters of strains.

Parameters:
  • binary_feats (pandas.DataFrame) – binary feature table such as the one produced by heatmap_multilayer() (genomes in rows, binary features in columns).

  • acc_to_cluster (dict) – dictionary such as the one produced by silhouette_analysis() (accessions as keys, cluster assignments as values).

  • cluster_to_color (dict) – dictionary such as the one produced by silhouette_analysis() (clusters as keys, colors as values).

  • threshold (float) – a feature is shown if at least one cluster has relative frequency >= threshold and, at the same time, at least one other cluster has relative frequency <= 1-threshold.

Returns:

A tuple containing:
  • matplotlib.figure.Figure: figure representing the discriminative binary features.

Return type:

tuple
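
The threshold rule described in the parameters can be sketched with pandas. This is an illustration under stated assumptions, not gempipe's code: the genome IDs, feature IDs and cluster labels are toy values, and the relative frequency of a feature in a cluster is assumed to be the fraction of that cluster's genomes carrying the feature.

```python
import pandas as pd

# Toy binary feature table (hypothetical): genomes in rows, features in columns.
binary_feats = pd.DataFrame(
    {"f_a": [1, 1, 0, 0], "f_b": [1, 0, 1, 0], "f_c": [1, 1, 1, 1]},
    index=["G1", "G2", "G3", "G4"],
)

# Toy genome-to-cluster associations (hypothetical cluster labels).
acc_to_cluster = {"G1": 0, "G2": 0, "G3": 1, "G4": 1}
threshold = 0.9

# Relative frequency of each feature within each cluster
# (mean of a 0/1 column equals the fraction of genomes with the feature).
clusters = binary_feats.index.to_series().map(acc_to_cluster)
freq = binary_feats.groupby(clusters).mean()

# A feature qualifies if some cluster is at or above the threshold
# while some cluster is at or below 1 - threshold.
high = (freq >= threshold).any(axis=0)
low = (freq <= 1 - threshold).any(axis=0)
mask = high & low
discriminant = mask[mask].index.tolist()

print(freq)
print(discriminant)
```

In this toy table only f_a qualifies: it is present in every genome of one cluster and absent from the other. f_b sits at 0.5 in both clusters, and f_c is present everywhere, so neither separates the clusters.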