gempipe.interface.clusters
Module Contents
Functions
silhouette_analysis | Perform a silhouette analysis to detect the optimal number of clusters.
heatmap_multilayer | Create a phylo-metabolic dendrogram.
discriminant_feat | Extract discriminant features from clusters of strains.
- gempipe.interface.clusters.silhouette_analysis(tables, figsize=(10, 5), drop_const=True, ctotest=None, forcen=None, derive_report=None, report_key='species', excludekeys=[], legend_ratio=0.7, outfile=None, verbose=False, anchor=[None, None, None], key_to_color=None)
Perform a silhouette analysis to detect the optimal number of clusters.
- Parameters:
tables (pnd.DataFrame) – feature table with genome accessions in columns and features in rows. Can also be a dictionary of feature tables (example: {'auxotrophies': aux_df, 'substrates': sub_df}), in which case any number of tables (pandas.DataFrame) can be used, each with genome accessions in columns and features in rows. Directly compatible tables are rpam.csv, cnps.csv, and aux.csv (all produced by gempipe derive).
figsize (int, int) – width and height of the figure.
drop_const (bool) – if True, remove constant features.
ctotest (list) – numbers of clusters to test (example: [5,7,10] to test five, seven and ten clusters). If None, all values from 2 to the number of accessions - 1 will be tested.
forcen (int) – force the number of clusters; otherwise the optimal number will be picked according to the silhouette value.
derive_report (pandas.DataFrame) – report table for the generation of strain-specific GSMMs, made by gempipe derive in the output directory (derive_strains.csv).
excludekeys (list) – keys (niches/species) not to show in the legend. Known bug: no more than one key is allowed.
report_key (str) – name of the attribute (column) appearing in derive_report, to be compared to the metabolic clusters. Usually it is ‘species’ or ‘niche’.
legend_ratio (float) – space reserved for the legend.
outfile (str) – filepath to be used to save the image. If None it will not be saved.
verbose (bool) – if True, print more log messages.
anchor (list) – list of tuples (X,Y) for customizing the position of legends. None will leave default positioning.
key_to_color (dict) – dict mapping each category in report_key to a color in the format ([0:1],[0:1],[0:1]), that is, three values each between 0 and 1. None will leave default color and order in the legend.
- Returns:
- A tuple containing:
matplotlib.figure.Figure: figure representing the silhouette analysis.
dict: genome-to-cluster associations.
dict: an RGB color for each cluster.
- Return type:
tuple
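The ctotest and forcen parameters above describe two selection rules: which cluster counts are tried, and how the final count is chosen. A minimal sketch of those rules in plain Python follows. It is an illustration only, not gempipe's code: `candidate_cluster_counts` and `pick_n_clusters` are hypothetical helpers, the upper bound "number of accessions - 1" is assumed to be inclusive, and "optimal" is assumed to mean the highest silhouette value.

```python
def candidate_cluster_counts(n_accessions, ctotest=None):
    """Return the cluster counts to evaluate (hypothetical helper).

    Mirrors the documented behaviour of `ctotest`: an explicit list is used
    as given; None means every count from 2 up to n_accessions - 1
    (upper bound assumed inclusive).
    """
    if ctotest is not None:
        return sorted(set(ctotest))
    return list(range(2, n_accessions))


def pick_n_clusters(silhouette_by_n, forcen=None):
    """Choose the final number of clusters (hypothetical helper).

    `silhouette_by_n` maps a tested cluster count to its silhouette value.
    `forcen` overrides the choice, as documented; otherwise the count with
    the highest silhouette value is assumed to be the optimal one.
    """
    if forcen is not None:
        return forcen
    return max(silhouette_by_n, key=silhouette_by_n.get)


# Made-up silhouette values, for illustration only.
scores = {2: 0.31, 3: 0.52, 4: 0.47}
best = pick_n_clusters(scores)
forced = pick_n_clusters(scores, forcen=4)
```

With the illustrative scores above, the unforced choice is the count with the largest value, while forcen bypasses the comparison entirely.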
- gempipe.interface.clusters.heatmap_multilayer(tables, figsize=(10, 5), drop_const=True, derive_report=None, report_key='species', excludekeys=[], acc_to_cluster=None, cluster_to_color=None, legend_ratio=0.7, label_ratio=0.02, outfile=None, verbose=False, anchor=[None, None, None], key_to_color=None, xlabels=False)
Create a phylo-metabolic dendrogram.
- Parameters:
tables (pnd.DataFrame) – feature table with genome accessions in columns and features in rows. Can also be a dictionary of feature tables (example: {'auxotrophies': aux_df, 'substrates': sub_df}), in which case any number of tables (pandas.DataFrame) can be used, each with genome accessions in columns and features in rows. Directly compatible tables are rpam.csv, cnps.csv, and aux.csv (all produced by gempipe derive).
figsize (int, int) – width and height of the figure.
drop_const (bool) – if True, remove constant features.
derive_report (pandas.DataFrame) – report table for the generation of strain-specific GSMMs, made by gempipe derive in the output directory (derive_strains.csv).
report_key (str) – name of the attribute (column) appearing in derive_report, to be compared to the metabolic clusters. Usually it is ‘species’ or ‘niche’.
excludekeys (list) – keys (niches/species) not to show in the legend. Known bug: no more than one key is allowed.
acc_to_cluster (dict) – genome-to-cluster associations produced by silhouette_analysis().
cluster_to_color (dict) – cluster-to-RGB color associations produced by silhouette_analysis().
legend_ratio (float) – space reserved for the legend.
label_ratio (float) – space reserved for the Y-axis labels.
outfile (str) – filepath to be used to save the image. If None it will not be saved.
verbose (bool) – if True, print more log messages.
anchor (list) – list of tuples (X,Y) for customizing the position of legends. None will leave default positioning.
key_to_color (dict) – dict mapping each category in report_key to a color in the format ([0:1],[0:1],[0:1]), that is, three values each between 0 and 1. None will leave default color and order in the legend.
xlabels (bool) – if True, show x-axis labels (feature IDs).
- Returns:
- A tuple containing:
matplotlib.figure.Figure: figure representing the phylometabolic tree and associated heatmap.
pnd.DataFrame: table representing the binary features contained in the heatmap.
- Return type:
tuple
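Since acc_to_cluster and cluster_to_color are documented as outputs of silhouette_analysis(), the two functions are naturally chained. The sketch below shows one way to do that. The wrapper name `cluster_and_plot` is hypothetical, the tuple unpacking assumes the return values come in the order listed under Returns, and the tables are assumed to be already loaded by the caller.

```python
def cluster_and_plot(tables, derive_report=None, report_key="species"):
    """Run silhouette_analysis, then heatmap_multilayer (hypothetical wrapper).

    `tables` is a feature table or a dict of feature tables, as described
    in the parameter lists above.
    """
    # Imported inside the function so this sketch loads without gempipe installed.
    from gempipe.interface.clusters import silhouette_analysis, heatmap_multilayer

    # Assumed unpacking order: figure, genome-to-cluster dict, cluster-to-color dict.
    sil_fig, acc_to_cluster, cluster_to_color = silhouette_analysis(
        tables, derive_report=derive_report, report_key=report_key
    )

    # Reuse the cluster assignments and colors so both figures agree.
    heat_fig, binary_feats = heatmap_multilayer(
        tables,
        derive_report=derive_report,
        report_key=report_key,
        acc_to_cluster=acc_to_cluster,
        cluster_to_color=cluster_to_color,
    )

    return {
        "silhouette_figure": sil_fig,
        "heatmap_figure": heat_fig,
        "binary_feats": binary_feats,
        "acc_to_cluster": acc_to_cluster,
        "cluster_to_color": cluster_to_color,
    }
```

Passing the same derive_report and report_key to both calls keeps the legends consistent between the silhouette figure and the heatmap.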
- gempipe.interface.clusters.discriminant_feat(binary_feats, acc_to_cluster, cluster_to_color, threshold=0.9)
Extract discriminant features from clusters of strains.
- Parameters:
binary_feats (pnd.DataFrame) – binary features table such as the one produced by heatmap_multilayer() (genomes in rows, binary features in columns).
acc_to_cluster (dict) – dictionary such as the one produced by silhouette_analysis() (accessions as keys, cluster assignments as values).
cluster_to_color (dict) – dictionary such as the one produced by silhouette_analysis() (clusters as keys, colors as values).
threshold (float) – a feature is shown if at least one cluster has relative frequency >= threshold and, at the same time, at least one other cluster has relative frequency <= 1-threshold.
- Returns:
- A tuple containing:
matplotlib.figure.Figure: figure representing the discriminative binary features.
- Return type:
tuple
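The threshold rule above can be made concrete with a small plain-Python sketch. It is an illustration of the documented criterion, not gempipe's implementation: `cluster_frequencies` and `is_discriminant` are hypothetical helpers, and the binary table is represented as a nested dict instead of a DataFrame to keep the example dependency-free.

```python
from collections import defaultdict


def cluster_frequencies(binary_feats, acc_to_cluster):
    """Relative frequency of each feature within each cluster (hypothetical helper).

    binary_feats: {accession: {feature: 0 or 1}}
    acc_to_cluster: {accession: cluster}
    Returns {feature: {cluster: frequency}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    sizes = defaultdict(int)
    for acc, feats in binary_feats.items():
        cluster = acc_to_cluster[acc]
        sizes[cluster] += 1
        for feat, present in feats.items():
            counts[feat][cluster] += present
    return {
        feat: {cluster: counts[feat][cluster] / sizes[cluster] for cluster in sizes}
        for feat in counts
    }


def is_discriminant(freq_by_cluster, threshold=0.9):
    """Apply the documented rule: one cluster at or above `threshold`,
    and a different cluster at or below 1 - threshold."""
    items = list(freq_by_cluster.items())
    for high_cluster, high in items:
        if high < threshold:
            continue
        for low_cluster, low in items:
            if low_cluster != high_cluster and low <= 1 - threshold:
                return True
    return False


# Toy data, made up for illustration: two clusters of two genomes each.
toy_feats = {
    "A1": {"featX": 1, "featY": 1},
    "A2": {"featX": 1, "featY": 0},
    "B1": {"featX": 0, "featY": 1},
    "B2": {"featX": 0, "featY": 0},
}
toy_clusters = {"A1": 0, "A2": 0, "B1": 1, "B2": 1}

freqs = cluster_frequencies(toy_feats, toy_clusters)
discriminant = [feat for feat, by_cluster in freqs.items() if is_discriminant(by_cluster)]
```

In the toy data, featX is present in every genome of one cluster and absent from the other, so it satisfies the rule; featY is present in half of each cluster and does not.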