iMVP_utils package
Submodules
iMVP_utils.clustering module
- iMVP_utils.clustering.run_HDBSCAN(df=None, X=None, Y=None, soft_clustering=True, min_cluster_size=100, min_samples=10, cluster_selection_epsilon=0.0, cluster_selection_method='eom', draw_condensed_tree=True, core_dist_n_jobs=6)
An implementation of HDBSCAN (CPU version).
- Parameters:
df (pd.DataFrame) – A DataFrame with columns X and Y.
X (iterable) – A list of X values.
Y (iterable) – A list of Y values.
soft_clustering (boolean) – Whether to use soft clustering. Default=True.
min_cluster_size (int) – min_cluster_size in HDBSCAN.
min_samples (int) – min_samples in HDBSCAN.
cluster_selection_epsilon (float) – cluster_selection_epsilon in HDBSCAN.
cluster_selection_method (str) – cluster_selection_method in HDBSCAN. Should be “eom” or “leaf”.
draw_condensed_tree (boolean) – Whether to draw the HDBSCAN condensed tree.
core_dist_n_jobs (int) – core_dist_n_jobs in HDBSCAN.
- Returns:
sequences_onehot – A list of one-hot encoded sequences.
- Return type:
list
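With soft clustering, HDBSCAN-style methods produce a membership vector per point rather than a single label. The sketch below shows one common way to reduce such vectors to hard labels by taking the most probable cluster. It is illustrative only: `labels_from_memberships` and the `min_prob` cutoff are hypothetical names, and whether run_HDBSCAN performs exactly this step is an assumption.

```python
def labels_from_memberships(memberships, min_prob=0.0):
    """Pick the most probable cluster for each point.

    memberships: list of per-point probability lists, one entry per cluster.
    min_prob: hypothetical cutoff; a point whose best probability is not
              above it is labelled -1, the label HDBSCAN uses for noise.
    """
    labels = []
    for probs in memberships:
        # Index of the largest probability (first one wins on ties).
        best = max(range(len(probs)), key=lambda k: probs[k])
        labels.append(best if probs[best] > min_prob else -1)
    return labels


memberships = [
    [0.90, 0.05],  # mostly cluster 0
    [0.10, 0.70],  # mostly cluster 1
    [0.00, 0.00],  # no support for any cluster
]
labels = labels_from_memberships(memberships)
```

Here `labels` holds one integer per input point, with -1 marking points that no cluster claims.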
- iMVP_utils.clustering.run_HDBSCAN_GPU(df=None, X=None, Y=None, min_cluster_size=100, min_samples=10, cluster_selection_epsilon=0.0, cluster_selection_method='eom')
An implementation of HDBSCAN (GPU version). Only the standard (hard) clustering mode is available.
- Parameters:
df (pd.DataFrame) – A DataFrame with columns X and Y.
X (iterable) – A list of X values.
Y (iterable) – A list of Y values.
min_cluster_size (int) – min_cluster_size in HDBSCAN.
min_samples (int) – min_samples in HDBSCAN.
cluster_selection_epsilon (float) – cluster_selection_epsilon in HDBSCAN.
cluster_selection_method (str) – cluster_selection_method in HDBSCAN. Should be “eom” or “leaf”.
- Returns:
sequences_onehot – A list of one-hot encoded sequences.
- Return type:
list
- iMVP_utils.clustering.run_HDBSCAN_subclustering(df=None, target=None, cluster_col='Cluster', soft_clustering=True, min_cluster_size=100, min_samples=10, cluster_selection_epsilon=0.0, cluster_selection_method='eom', draw_condensed_tree=True, core_dist_n_jobs=None)
An implementation of HDBSCAN (CPU version) for further clustering of a subcluster.
- Parameters:
df (pd.DataFrame) – A DataFrame with columns X, Y, and clusters.
soft_clustering (boolean) – Whether to use soft clustering. Default=True.
min_cluster_size (int) – min_cluster_size in HDBSCAN.
min_samples (int) – min_samples in HDBSCAN.
cluster_selection_epsilon (float) – cluster_selection_epsilon in HDBSCAN.
cluster_selection_method (str) – cluster_selection_method in HDBSCAN. Should be “eom” or “leaf”.
draw_condensed_tree (boolean) – Whether to draw the HDBSCAN condensed tree.
core_dist_n_jobs (int) – core_dist_n_jobs in HDBSCAN.
- Returns:
sequences_onehot – A list of one-hot encoded sequences.
- Return type:
list
- iMVP_utils.clustering.run_Leiden(graph, df=None, random_state=42, resolution_parameter=1.0)
Clustering UMAP result with Leiden.
- Parameters:
graph (iGraph object) – An iGraph object computed from UMAP nearest neighbor.
df (pd.DataFrame) – If given, will add a column named “Cluster” to the DataFrame; else will return the labels.
random_state (int) – Random seed.
resolution_parameter (float) – resolution_parameter for Leiden.
- Return type:
pd.DataFrame or a list
- iMVP_utils.clustering.run_Louvain(graph, df=None, random_state=42, resolution_parameter=1.0)
Clustering UMAP result with Louvain.
- Parameters:
graph (iGraph object) – An iGraph object computed from UMAP nearest neighbor.
df (pd.DataFrame) – If given, will add a column named “Cluster” to the DataFrame; else will return the labels.
random_state (int) – Random seed.
resolution_parameter (float) – resolution_parameter for Louvain
- Return type:
pd.DataFrame or a list
iMVP_utils.embedding module
- iMVP_utils.embedding.compute_connectivities_umap(knn_indices, knn_dists, n_obs, n_neighbors, set_op_mix_ratio=1.0, local_connectivity=1.0)
A helper function for Louvain and Leiden. Adapted from Scanpy.
- Parameters:
knn_indices (object) –
knn_dists (object) –
n_obs (int) –
n_neighbors (int) –
set_op_mix_ratio (float) –
local_connectivity (float) –
- iMVP_utils.embedding.get_igraph(onehot_input, random_state=42, metric='euclidean', n_neighbors=20, metric_kwds={}, n_jobs=6, angular=False, verbose=False)
Prepare an iGraph object for Louvain and Leiden.
- Parameters:
onehot_input (np.array) – The one-hot encoded sequences.
random_state (int) – Random seed.
metric (str) – The distance metric. Should match the one used for UMAP.
n_neighbors (int) – The number of neighbors. Should match the one used for UMAP.
metric_kwds (dict) –
angular (boolean) –
verbose (boolean) –
- Return type:
iGraph object
- iMVP_utils.embedding.get_igraph_from_adjacency(adjacency, directed=None)
A helper function for Louvain and Leiden. Adapted from Scanpy.
- Parameters:
adjacency (object) – Generated by compute_connectivities_umap.
- Return type:
iGraph object
- iMVP_utils.embedding.get_sparse_matrix_from_indices_distances_umap(knn_indices, knn_dists, n_obs, n_neighbors)
A helper function for Louvain and Leiden. Adapted from Scanpy.
- Parameters:
knn_indices (object) –
knn_dists (object) –
n_obs (int) –
n_neighbors (int) –
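The parameters above are undocumented, so the following plain-Python sketch only illustrates the general idea of turning k-nearest-neighbor index and distance arrays into a sparse distance structure. The function name `knn_to_sparse_distances`, the dict-based output, and the convention that -1 marks a missing neighbor are all assumptions, not the package's actual behavior.

```python
def knn_to_sparse_distances(knn_indices, knn_dists):
    """Return {(row, col): distance} for every stored neighbor pair.

    knn_indices[i][k] is the index of the k-th neighbor of point i and
    knn_dists[i][k] is the matching distance.
    """
    entries = {}
    for i, (neighbors, dists) in enumerate(zip(knn_indices, knn_dists)):
        for j, d in zip(neighbors, dists):
            if j == -1:   # assumed marker for a missing neighbor
                continue
            if j == i:    # self-distance is zero, so nothing to store
                continue
            entries[(i, j)] = d
    return entries


knn_indices = [[0, 1], [1, 0], [2, -1]]
knn_dists = [[0.0, 0.5], [0.0, 0.5], [0.0, 0.0]]
sparse = knn_to_sparse_distances(knn_indices, knn_dists)
```

Only the two real neighbor pairs survive; point 2 has no valid neighbor and contributes no entry.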
- iMVP_utils.embedding.onehot_encoder_df(df, column='seq', enc_bases='ATCGN')
Generates one-hot encoded sequences from a DataFrame.
- Parameters:
df (pd.DataFrame) – A DataFrame.
column (str or tuple) – The column containing the sequences.
enc_bases (str) – The encoding bases. Default=“ATCGN”.
- Returns:
sequences_onehot – A list of one-hot encoded sequences.
- Return type:
list
- iMVP_utils.embedding.onehot_encoder_iterable(iter_obj, enc_bases='ATCGN')
Generates one-hot encoded sequences from an iterable object.
- Parameters:
iter_obj (iterable) – An iterable object containing the sequences.
enc_bases (str) – The encoding bases. Default=“ATCGN”.
- Returns:
sequences_onehot – A list of one-hot encoded sequences.
- Return type:
list
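To make the role of enc_bases concrete, here is a minimal sketch of one-hot encoding for nucleotide sequences. It assumes each position becomes a block of len(enc_bases) values with a single 1 at the index of that base, and that the blocks are concatenated into one flat vector. The helper name `onehot_encode` is hypothetical, and the package's encoders may lay out the vector or treat N differently.

```python
def onehot_encode(seq, enc_bases="ATCGN"):
    """Encode one sequence as a flat list of 0/1 values.

    Each position contributes len(enc_bases) values, with a single 1 at
    the index of that base in enc_bases.
    """
    index = {base: i for i, base in enumerate(enc_bases)}
    encoded = []
    for base in seq.upper():
        row = [0] * len(enc_bases)
        row[index[base]] = 1  # raises KeyError for bases outside enc_bases
        encoded.extend(row)
    return encoded


sequences_onehot = [onehot_encode(s) for s in ["AT", "GN"]]
```

Under these assumptions a sequence of length n becomes a vector of length n * len(enc_bases), which is the kind of input that run_UMAP expects as onehot_input.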
- iMVP_utils.embedding.run_UMAP(onehot_input, df=None, init='random', random_state=42, min_dist=0.01, n_neighbors=20, densmap=False, verbose=True, n_jobs=6)
An implementation of UMAP (CPU version).
- Parameters:
onehot_input (iterable) – A list of one-hot encoded sequences.
df (pd.DataFrame) – A DataFrame to process. If given, a DataFrame with X and Y columns is returned; otherwise X and Y are returned separately.
init (str) – init value for UMAP.
random_state (int) – Random seed.
min_dist (float) – min_dist for UMAP.
n_neighbors (int) – n_neighbors for UMAP.
densmap (boolean) – Whether to use DensMAP.
verbose (boolean) – Whether to enable verbose output.
- Return type:
A DataFrame or [X and Y]
- iMVP_utils.embedding.run_UMAP_GPU(onehot_input, df=None, init='random', random_state=42, min_dist=0.01, n_neighbors=20, densmap=False, verbose=True)
An implementation of UMAP (GPU version).
- Parameters:
onehot_input (iterable) – A list of one-hot encoded sequences.
df (pd.DataFrame) – A DataFrame to process. If given, a DataFrame with X and Y columns is returned; otherwise X and Y are returned separately.
init (str) – init value for UMAP.
random_state (int) – Random seed.
min_dist (float) – min_dist for UMAP.
n_neighbors (int) – n_neighbors for UMAP.
densmap (boolean) – Whether to use DensMAP.
verbose (boolean) – Whether to enable verbose output.
- Return type:
A DataFrame or [X and Y]
iMVP_utils.interactive module
iMVP_utils.interactive_functions module
iMVP_utils.plots module
iMVP plots
- iMVP_utils.plots.draw_2D_hist(df, vmax=0.05, cmin=None, density=True, xlim=None, ylim=None, bins=[600, 600])
Draws a 2D histogram.
- Parameters:
df (pd.DataFrame) – A DataFrame containing the columns X and Y.
vmax (float) – The vmax parameter for hist2d. Default=0.05.
cmin (float) – The cmin parameter for hist2d. Default=None.
density (boolean) – Whether to draw a density histogram. Default=True.
xlim (tuple) – xlim for hist2d. Default=None.
ylim (tuple) – ylim for hist2d. Default=None.
bins (tuple) – Bin numbers for hist2d. Default=[600, 600].
- Returns:
hist2d (np.array) – A 2D array holding the histogram values. Note that this array has been rotated by 90 degrees to match the real X-Y orientation, so it can be drawn directly with plt.imshow().
edgesX (np.array) – The X edges.
edgesY (np.array) – The Y edges
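The note about rotation can be illustrated with a small plain-Python sketch. A 2D histogram indexed as counts[x_bin][y_bin] does not display correctly as an image, because image rows run top to bottom; rotating it counter-clockwise by 90 degrees puts high Y values in the top row. The helper names `hist2d_counts` and `rotate_90` are hypothetical, and the exact rotation used by draw_2D_hist is an assumption based on the description above.

```python
import bisect


def hist2d_counts(xs, ys, x_edges, y_edges):
    """Count points per bin; result[i][j] is X bin i, Y bin j."""
    nx, ny = len(x_edges) - 1, len(y_edges) - 1
    counts = [[0] * ny for _ in range(nx)]
    for x, y in zip(xs, ys):
        if not (x_edges[0] <= x <= x_edges[-1] and y_edges[0] <= y <= y_edges[-1]):
            continue  # outside the histogram range
        # The last bin is closed on the right, hence the clamp.
        i = min(bisect.bisect_right(x_edges, x) - 1, nx - 1)
        j = min(bisect.bisect_right(y_edges, y) - 1, ny - 1)
        counts[i][j] += 1
    return counts


def rotate_90(counts):
    """Rotate counter-clockwise: rows then run from high Y (top) to low Y."""
    return [list(row) for row in zip(*counts)][::-1]


xs = [0.5, 1.5, 1.5, 1.5]
ys = [0.5, 0.5, 1.5, 1.6]
counts = hist2d_counts(xs, ys, [0, 1, 2], [0, 1, 2])
image = rotate_90(counts)
```

In `image`, the first row corresponds to the upper Y bin and columns run along X, which is the layout an image viewer expects.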
- iMVP_utils.plots.show_logos_cols(prefix, names=None, cols=3, figsize=(8, 8), auto_size=True, auto_width=4, auto_height=1.5, savefig_name=None, dpi=300)
Plots a series of motif logos in PNG format.
- Parameters:
prefix (str) – The name of output path, required. This function will scan all PNG files in this path.
names (tuple) – If given, only plot the given file names. Default=None.
cols (int) – The number of columns. Default=3.
figsize (tuple) – The figsize parameter for matplotlib.pyplot.subplots().
auto_size (boolean) – If True, ignore figsize and compute the width and height automatically.
auto_width (float) – The width factor used for auto_size.
auto_height (float) – The height factor used for auto_size.
savefig_name (str) – The file name to save the plot to; should end with .pdf, .png, etc. If None, the figure will not be drawn.
dpi (int) – The dpi value for the figure.
- Return type:
matplotlib.axes
iMVP_utils.setup module
iMVP_utils.utils module
iMVP helper functions.
- iMVP_utils.utils.extract_fasta_and_draw_motifs(prefix, df, cluster_col='Cluster', filter=None, motif_column='seq', draw_logos=True)
Quickly extracts the sequences stored in a DataFrame into a FASTA file and then draws the motif logos with WebLogo.
- Parameters:
prefix (str) – The name of output path, required.
df (pd.DataFrame) – A DataFrame containing the sequences used, required.
cluster_col (str or tuple) – The column name for the clusters, default=”Cluster”.
filter (str) – The name of a boolean column used for filtering; only rows where it is True are used. Default=None (not applied).
motif_column (str or tuple) – The column containing the motif sequences. Default=“seq”.
draw_logos (boolean) – Whether to use WebLogo to draw logos. Default=True.
- Return type:
None
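The extraction step described above can be sketched without pandas or WebLogo. The example groups records by cluster and builds FASTA text for each group, skipping rows whose filter column is false. Everything here is an assumption made for illustration: `fasta_by_cluster` and `filter_col` are hypothetical names, the records are plain dicts standing in for DataFrame rows, and the header format is invented; the package's actual file naming and headers are not documented here.

```python
from collections import defaultdict


def fasta_by_cluster(records, cluster_col="Cluster", motif_column="seq",
                     filter_col=None):
    """Return {cluster: FASTA text} built from a list of row dicts."""
    grouped = defaultdict(list)
    for n, rec in enumerate(records):
        if filter_col is not None and not rec[filter_col]:
            continue  # keep only rows flagged True
        # Hypothetical header: the row number.
        grouped[rec[cluster_col]].append(f">{n}\n{rec[motif_column]}\n")
    return {cluster: "".join(entries) for cluster, entries in grouped.items()}


records = [
    {"seq": "ACGTA", "Cluster": 1, "keep": True},
    {"seq": "TTGCA", "Cluster": 2, "keep": True},
    {"seq": "ACGTT", "Cluster": 1, "keep": False},
]
fasta = fasta_by_cluster(records, filter_col="keep")
```

Each value in `fasta` could then be written to its own file and passed to a logo-drawing tool.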
- iMVP_utils.utils.hist_to_spots(hist2d, cutoff=5, bins=[600, 600], pixel_lower=1, pixel_upper=10, show_small_clusters_id=True, show_big_clusters_id=True, figsize=(12, 12), figure_name='hist2D.png')
Converts a 2D histogram to spots (clusters).
- Parameters:
hist2d (np.array) – The 2D histogram returned by draw_2D_hist.
cutoff (int) – The cutoff for cv2.threshold, ranging from 0 to 255. Default=5.
bins (tuple) – Should match the bins used for the 2D histogram.
pixel_lower (int) – The lower pixel-count limit for a “small spot”. Spots smaller than this are ignored.
pixel_upper (int) – The upper pixel-count limit for a “small spot”. Spots larger than this are considered “big spots”.
show_small_clusters_id (boolean) – Whether to draw the IDs of small clusters.
show_big_clusters_id (boolean) – Whether to draw the IDs of big clusters.
figsize (tuple) – Figure size for matplotlib.
figure_name (str) – The name of hist2D figure.
- Returns:
axes (matplotlib.axes) – The axes.
dict_cnt_small – A dictionary of {id: locations} for the small spots.
dict_cnt_big – A dictionary of {id: locations} for the big spots.
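The roles of cutoff, pixel_lower, and pixel_upper can be seen in a small plain-Python sketch. It thresholds a grid, collects connected groups of above-cutoff pixels with a flood fill, and sorts them into small and big spots by pixel count. This is not the package's implementation: the source mentions cv2.threshold, whereas this sketch swaps in a hand-written flood fill, and the function name `find_spots`, the 4-connectivity, and the ID numbering are assumptions.

```python
from collections import deque


def find_spots(grid, cutoff=5, pixel_lower=1, pixel_upper=10):
    """Return (small, big) dicts mapping spot id -> list of (row, col)."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    small, big = {}, {}
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or grid[r][c] <= cutoff:
                continue
            # Flood fill (4-connectivity) to collect one spot.
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] > cutoff):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(pixels) < pixel_lower:
                continue  # below the lower limit: ignored
            target = small if len(pixels) <= pixel_upper else big
            target[next_id] = pixels
            next_id += 1
    return small, big


grid = [
    [9, 9, 0, 0, 0],
    [0, 0, 0, 9, 9],
    [0, 0, 0, 9, 9],
]
small, big = find_spots(grid, cutoff=5, pixel_lower=1, pixel_upper=3)
```

With pixel_upper=3, the two-pixel group is a small spot and the four-pixel block is a big spot.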
- iMVP_utils.utils.load_sequences_from_fasta(fn)
Loads sequences from a FASTA file into a pandas DataFrame.
- Parameters:
fn (str) – The file to load.
- Return type:
pd.DataFrame
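For readers unfamiliar with the format, a minimal FASTA reader can be sketched as follows. It returns a list of records rather than a DataFrame so that it runs with the standard library alone. The function name `read_fasta` and the "name" and "seq" keys are hypothetical; the column names produced by load_sequences_from_fasta are not documented here.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into a list of {"name": ..., "seq": ...} records."""
    records, name, chunks = [], None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append({"name": name, "seq": "".join(chunks)})
            name, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        records.append({"name": name, "seq": "".join(chunks)})
    return records


fasta_text = ">site1\nACGTA\nCCGTA\n>site2\nTTGCA\n"
records = read_fasta(io.StringIO(fasta_text))
```

A list of dicts like `records` can be passed straight to `pd.DataFrame(records)` if a DataFrame is wanted.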
- iMVP_utils.utils.phase_shift(df, dict_all_5mers, cluster_id=None, column_motif_F10='motif_F10', current_phase=0, target_base='A')
Perform phase shift.
- Parameters:
df (pd.DataFrame) – A DataFrame object with X and Y column.
dict_all_5mers (dict) – The dictionary from prepare_kmers_dict.
cluster_id (int) – The id of specific cluster.
column_motif_F10 (str) – The column name of the 10-nt flanking sequences.
current_phase (int) – The current phase of the cluster.
target_base (str) – The target base to perform phase matching.
- Return type:
pd.DataFrame
- iMVP_utils.utils.prepare_kmers_dict(df, column='motif_F14')
Prepare all kmers from a DataFrame with flanking 14 nt sequences.
- Parameters:
df (pd.DataFrame) – A DataFrame object with X and Y column.
column (str) – The name of column containing flanking 14 nt sequences.
- Return type:
dict
- iMVP_utils.utils.retrive_clusters(df, edgesX, edgesY, dict_clusters, bins=[600, 600], cluster_ids=None, spot_name='spot')
Annotates sites with clusters.
- Parameters:
df (pd.DataFrame) – A DataFrame object with X and Y column.
edgesX (np.array) – The array of X edges returned by draw_2D_hist.
edgesY (np.array) – The array of Y edges returned by draw_2D_hist.
dict_clusters (dict) – The dictionary generated by hist_to_spots.
bins (tuple) – Should match the bins used for the 2D histogram.
cluster_ids (iterable) – If given, only find clusters with those IDs.
spot_name (str) – The column name of the clusters.
- Return type:
pd.DataFrame
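Annotating sites with spots comes down to locating each (X, Y) point in the histogram grid via the bin edges and then looking up which spot owns that bin. The sketch below shows that lookup with the standard library. The helper names and the `bin_to_spot` mapping from (x_bin, y_bin) to spot ID are hypothetical; hist_to_spots is documented as returning {id: locations}, and how those locations are stored and matched inside retrive_clusters is an assumption.

```python
import bisect


def bin_index(value, edges):
    """Return the index of the bin containing value, or None if outside."""
    if value < edges[0] or value > edges[-1]:
        return None
    # The last bin is closed on the right, hence the clamp.
    return min(bisect.bisect_right(edges, value) - 1, len(edges) - 2)


def annotate(points, edges_x, edges_y, bin_to_spot):
    """Label each (x, y) point with the spot id of its bin, or None."""
    labels = []
    for x, y in points:
        key = (bin_index(x, edges_x), bin_index(y, edges_y))
        labels.append(bin_to_spot.get(key))
    return labels


edges_x = [0.0, 1.0, 2.0, 3.0]
edges_y = [0.0, 1.0, 2.0]
bin_to_spot = {(0, 0): 7, (2, 1): 8}  # hypothetical (x_bin, y_bin) -> spot id
points = [(0.5, 0.2), (2.9, 1.5), (1.5, 0.5), (9.0, 0.5)]
labels = annotate(points, edges_x, edges_y, bin_to_spot)
```

Points that fall in a bin without a spot, or outside the histogram range, receive None.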