iMVP_utils package

Submodules

iMVP_utils.clustering module

iMVP_utils.clustering.run_HDBSCAN(df=None, X=None, Y=None, soft_clustering=True, min_cluster_size=100, min_samples=10, cluster_selection_epsilon=0.0, cluster_selection_method='eom', draw_condensed_tree=True, core_dist_n_jobs=6)

An implementation of HDBSCAN (CPU version).

Parameters:
  • df (pd.DataFrame) – A DataFrame with columns X and Y.

  • X (iterable) – A list of X values.

  • Y (iterable) – A list of Y values.

  • soft_clustering (boolean) – Whether to use soft clustering. Default=True.

  • min_cluster_size (int) – min_cluster_size in HDBSCAN.

  • min_samples (int) – min_samples in HDBSCAN

  • cluster_selection_epsilon (float) – cluster_selection_epsilon in HDBSCAN

  • cluster_selection_method (str) – cluster_selection_method in HDBSCAN. Should be “eom” or “leaf”.

  • draw_condensed_tree (boolean) – Whether to draw the condensed tree of HDBSCAN.

  • core_dist_n_jobs – core_dist_n_jobs in HDBSCAN.

Returns:

sequences_onehot – A list of one-hot encoded sequences.

Return type:

list

iMVP_utils.clustering.run_HDBSCAN_GPU(df=None, X=None, Y=None, min_cluster_size=100, min_samples=10, cluster_selection_epsilon=0.0, cluster_selection_method='eom')

An implementation of HDBSCAN (GPU version). Only the standard clustering mode is available.

Parameters:
  • df (pd.DataFrame) – A DataFrame with columns X and Y.

  • X (iterable) – A list of X values.

  • Y (iterable) – A list of Y values.

  • min_cluster_size (int) – min_cluster_size in HDBSCAN.

  • min_samples (int) – min_samples in HDBSCAN

  • cluster_selection_epsilon (float) – cluster_selection_epsilon in HDBSCAN

  • cluster_selection_method (str) – cluster_selection_method in HDBSCAN. Should be “eom” or “leaf”.

Returns:

sequences_onehot – A list of one-hot encoded sequences.

Return type:

list

iMVP_utils.clustering.run_HDBSCAN_subclustering(df=None, target=None, cluster_col='Cluster', soft_clustering=True, min_cluster_size=100, min_samples=10, cluster_selection_epsilon=0.0, cluster_selection_method='eom', draw_condensed_tree=True, core_dist_n_jobs=None)

An implementation of HDBSCAN (CPU version) for further clustering of a subcluster.

Parameters:
  • df (pd.DataFrame) – A DataFrame with columns X, Y, and clusters.

  • soft_clustering (boolean) – Whether to use soft clustering. Default=True.

  • min_cluster_size (int) – min_cluster_size in HDBSCAN.

  • min_samples (int) – min_samples in HDBSCAN

  • cluster_selection_epsilon (float) – cluster_selection_epsilon in HDBSCAN

  • cluster_selection_method (str) – cluster_selection_method in HDBSCAN. Should be “eom” or “leaf”.

  • draw_condensed_tree (boolean) – Whether to draw the condensed tree of HDBSCAN.

  • core_dist_n_jobs – core_dist_n_jobs in HDBSCAN.

Returns:

sequences_onehot – A list of one-hot encoded sequences.

Return type:

list

iMVP_utils.clustering.run_Leiden(graph, df=None, random_state=42, resolution_parameter=1.0)

Cluster the UMAP result with Leiden.

Parameters:
  • graph (iGraph object) – An iGraph object computed from the UMAP nearest neighbors.

  • df (pd.DataFrame) – If given, a column named “Cluster” is added to the DataFrame; otherwise the labels are returned.

  • random_state (int) – Random seed.

  • resolution_parameter (float) – resolution_parameter for Leiden.

Return type:

pd.DataFrame or a list

iMVP_utils.clustering.run_Louvain(graph, df=None, random_state=42, resolution_parameter=1.0)

Cluster the UMAP result with Louvain.

Parameters:
  • graph (iGraph object) – An iGraph object computed from the UMAP nearest neighbors.

  • df (pd.DataFrame) – If given, a column named “Cluster” is added to the DataFrame; otherwise the labels are returned.

  • random_state (int) – Random seed.

  • resolution_parameter (float) – resolution_parameter for Louvain

Return type:

pd.DataFrame or a list

iMVP_utils.embedding module

iMVP_utils.embedding.compute_connectivities_umap(knn_indices, knn_dists, n_obs, n_neighbors, set_op_mix_ratio=1.0, local_connectivity=1.0)

A helper function for Louvain and Leiden. Adapted from Scanpy.

Parameters:
  • knn_indices (object) –

  • knn_dists (object) –

  • n_obs (int) –

  • n_neighbors (int) –

  • set_op_mix_ratio (float) –

  • local_connectivity (float) –

iMVP_utils.embedding.get_igraph(onehot_input, random_state=42, metric='euclidean', n_neighbors=20, metric_kwds={}, n_jobs=6, angular=False, verbose=False)

Prepare an iGraph object for Louvain and Leiden.

Parameters:
  • onehot_input (np.array) – The one-hot encoded sequences.

  • random_state (int) – Random seed.

  • metric (str) – The same metric that was used for UMAP.

  • n_neighbors (int) – The same n_neighbors that was used for UMAP.

  • metric_kwds (dict) –

  • angular (boolean) –

  • verbose (boolean) –

Return type:

iGraph object

iMVP_utils.embedding.get_igraph_from_adjacency(adjacency, directed=None)

A helper function for Louvain and Leiden. Adapted from Scanpy.

Parameters:

adjacency (object) – The adjacency matrix generated by compute_connectivities_umap.

Return type:

iGraph object

iMVP_utils.embedding.get_sparse_matrix_from_indices_distances_umap(knn_indices, knn_dists, n_obs, n_neighbors)

A helper function for Louvain and Leiden. Adapted from Scanpy.

Parameters:
  • knn_indices (object) –

  • knn_dists (object) –

  • n_obs (int) –

  • n_neighbors (int) –

iMVP_utils.embedding.onehot_encoder_df(df, column='seq', enc_bases='ATCGN')

Generate one-hot encoded sequences from a DataFrame.

Parameters:
  • df (pd.DataFrame) – A DataFrame.

  • column (str or tuple) – The column containing the sequences.

  • enc_bases (str) – The encoding bases. Default=“ATCGN”.

Returns:

sequences_onehot – A list of one-hot encoded sequences.

Return type:

list

iMVP_utils.embedding.onehot_encoder_iterable(iter_obj, enc_bases='ATCGN')

Generate one-hot encoded sequences from an iterable object.

Parameters:
  • iter_obj (iterable) – An iterable object containing the sequences.

  • enc_bases (str) – The encoding bases. Default=“ATCGN”.

Returns:

sequences_onehot – A list of one-hot encoded sequences.

Return type:

list
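The encoding described above can be sketched in plain Python. This is only an illustration, not the package's implementation: the helper name onehot_encode is hypothetical, and the choice to concatenate the per-base vectors into one flat list per sequence is an assumption about the output layout.

```python
def onehot_encode(sequences, enc_bases="ATCGN"):
    """Encode each sequence as a flat list of 0/1 values.

    Assumption: each base becomes a vector of len(enc_bases) entries and the
    per-base vectors are concatenated. The real functions may shape their
    output differently.
    """
    index = {base: i for i, base in enumerate(enc_bases)}
    encoded = []
    for seq in sequences:
        row = []
        for base in seq.upper():
            vec = [0] * len(enc_bases)
            vec[index[base]] = 1  # raises KeyError for a base outside enc_bases
            row.extend(vec)
        encoded.append(row)
    return encoded


example = onehot_encode(["AT", "GN"])
print(example)
```

With the default enc_bases, each base occupies five positions, so a sequence of length n yields 5n values under this layout.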

iMVP_utils.embedding.run_UMAP(onehot_input, df=None, init='random', random_state=42, min_dist=0.01, n_neighbors=20, densmap=False, verbose=True, n_jobs=6)

An implementation of UMAP (CPU version).

Parameters:
  • onehot_input (iterable) – A list of one-hot encoded sequences.

  • df (pd.DataFrame) – A DataFrame to process. If given, a DataFrame with X and Y columns is returned. If not, X and Y are returned separately.

  • init (str) – init value for UMAP.

  • random_state (int) – random seed.

  • min_dist (float) – min_dist for UMAP

  • n_neighbors (int) – n_neighbors for UMAP

  • densmap (boolean) – Whether to use DensMAP.

  • verbose (boolean) – Whether to print verbose output.

Return type:

A DataFrame or [X and Y]

iMVP_utils.embedding.run_UMAP_GPU(onehot_input, df=None, init='random', random_state=42, min_dist=0.01, n_neighbors=20, densmap=False, verbose=True)

An implementation of UMAP (GPU version).

Parameters:
  • onehot_input (iterable) – A list of one-hot encoded sequences.

  • df (pd.DataFrame) – A DataFrame to process. If given, a DataFrame with X and Y columns is returned. If not, X and Y are returned separately.

  • init (str) – init value for UMAP.

  • random_state (int) – random seed.

  • min_dist (float) – min_dist for UMAP

  • n_neighbors (int) – n_neighbors for UMAP

  • densmap (boolean) – Whether to use DensMAP.

  • verbose (boolean) – Whether to print verbose output.

Return type:

A DataFrame or [X and Y]

iMVP_utils.interactive module

iMVP_utils.interactive_functions module

iMVP_utils.plots module

iMVP plots

iMVP_utils.plots.draw_2D_hist(df, vmax=0.05, cmin=None, density=True, xlim=None, ylim=None, bins=[600, 600])

Draw a 2D histogram.

Parameters:
  • df (pd.DataFrame) – A DataFrame containing the columns X and Y.

  • vmax (float) – The vmax parameter for hist2d. Default=0.05.

  • cmin (float) – The cmin parameter for hist2d. Default=None.

  • density (boolean) – Whether to draw a density histogram. Default=True.

  • xlim (tuple) – xlim for hist2d. Default=None.

  • ylim (tuple) – ylim for hist2d. Default=None.

  • bins (tuple) – Bin numbers for hist2d. Default=[600,600]

Returns:

  • hist2d (np.array) – A 2D array holding the values of the histogram. Note that this array has been rotated by 90 degrees to match the real X-Y orientation, so it can be drawn directly with plt.imshow().

  • edgesX (np.array) – The X edges.

  • edgesY (np.array) – The Y edges
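The rotation mentioned for hist2d can be illustrated with a small pure-Python sketch. It is not the package's code: the helper names are hypothetical, the points are assumed to lie inside xlim and ylim, and the counterclockwise direction of the rotation is an assumption. The idea is that a histogram indexed as counts[x_bin][y_bin] has to be turned so that rows run from the largest Y at the top to the smallest at the bottom, which is how an image is displayed.

```python
def hist2d_counts(xs, ys, bins, xlim, ylim):
    """Count points per bin; the result is indexed as counts[x_bin][y_bin]."""
    nx, ny = bins
    counts = [[0] * ny for _ in range(nx)]
    for x, y in zip(xs, ys):
        # Scale each coordinate to a bin index; the upper limit falls in the last bin.
        ix = min(int((x - xlim[0]) / (xlim[1] - xlim[0]) * nx), nx - 1)
        iy = min(int((y - ylim[0]) / (ylim[1] - ylim[0]) * ny), ny - 1)
        counts[ix][iy] += 1
    return counts


def rotate_for_image(counts):
    """Rotate 90 degrees counterclockwise: rows become Y (top = largest), columns become X."""
    nx, ny = len(counts), len(counts[0])
    return [[counts[ix][ny - 1 - r] for ix in range(nx)] for r in range(ny)]


counts = hist2d_counts([0.5, 2.5, 2.5], [0.5, 1.5, 1.5],
                       bins=(3, 2), xlim=(0.0, 3.0), ylim=(0.0, 2.0))
rotated = rotate_for_image(counts)
for row in rotated:
    print(row)
```

In the rotated array the two points with large X and large Y end up in the top-right cell, matching where they would appear in a scatter plot.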

iMVP_utils.plots.show_logos_cols(prefix, names=None, cols=3, figsize=(8, 8), auto_size=True, auto_width=4, auto_height=1.5, savefig_name=None, dpi=300)

Plot a series of motif logos in PNG format.

Parameters:
  • prefix (str) – The name of output path, required. This function will scan all PNG files in this path.

  • names (tuple) – If given, only plot the given file names. Default=None

  • cols (int) – The number of columns. Default=3

  • figsize (tuple) – The figsize parameter for matplotlib.pyplot.subplots().

  • auto_size (boolean) – If True, ignore figsize and compute the width and height automatically.

  • auto_width (float) – The width factor used for auto_size.

  • auto_height (float) – The height factor used for auto_size.

  • savefig_name (str) – The file name of the plot to save; it should end with .pdf, .png, etc. If None, the figure will not be drawn.

  • dpi (int) – The dpi value for the figure.

Return type:

matplotlib.axes

iMVP_utils.setup module

iMVP_utils.utils module

iMVP helper functions.

iMVP_utils.utils.extract_fasta_and_draw_motifs(prefix, df, cluster_col='Cluster', filter=None, motif_column='seq', draw_logos=True)

Quickly extract the sequences stored in a DataFrame into a FASTA file and then draw the motif logos with Weblogo.

Parameters:
  • prefix (str) – The name of output path, required.

  • df (pd.DataFrame) – A DataFrame containing the sequences used, required.

  • cluster_col (str or tuple) – The column name for the clusters, default=“Cluster”.

  • filter (boolean) – The column name used for filtering results; only rows with TRUE values will be used, default=None (not applied).

  • motif_column (str or tuple) – The column containing the motif sequences, default=“seq”.

  • draw_logos (boolean) – Whether to use Weblogo to draw logos, default=True.

Return type:

None

iMVP_utils.utils.hist_to_spots(hist2d, cutoff=5, bins=[600, 600], pixel_lower=1, pixel_upper=10, show_small_clusters_id=True, show_big_clusters_id=True, figsize=(12, 12), figure_name='hist2D.png')

Convert a 2D histogram to spots (clusters).

Parameters:
  • hist2d (np.array) – The 2D histogram (from the draw_2D_hist function).

  • cutoff (int) – The cutoff for cv2.threshold, range from 0 to 255. Default=5.

  • bins (tuple) – Should be equal to that of the hist2D.

  • pixel_lower (int) – The lower pixel limit for a spot to be considered a “small spot”. Spots smaller than this will be ignored.

  • pixel_upper (int) – The upper pixel limit for a spot to be considered a “small spot”. Spots larger than this will be considered a “big spot”.

  • show_small_clusters_id (boolean) – Whether to draw the ids of the small clusters.

  • show_big_clusters_id (boolean) – Whether to draw the ids of the big clusters.

  • figsize (tuple) – Figure size for matplotlib.

  • figure_name (str) – The name of hist2D figure.

Returns:

  • axes (matplotlib.axes) – The axes.

  • dict_cnt_small – A dictionary of {id: locations} for the small spots.

  • dict_cnt_big – A dictionary of {id: locations} for the big spots.
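The thresholding and size classification described by cutoff, pixel_lower, and pixel_upper can be sketched as follows. This is a simplified stand-in, not the package's method: the documentation mentions cv2.threshold, whereas this sketch uses a plain breadth-first search over 4-connected pixels, and the helper name find_spots, the shared id counter, and the list-of-pixels format for the locations are all assumptions.

```python
from collections import deque


def find_spots(grid, cutoff=5, pixel_lower=1, pixel_upper=10):
    """Group pixels above cutoff into connected spots and split them by size."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    small, big = {}, {}
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or grid[r][c] <= cutoff:
                continue
            # Breadth-first search collects one 4-connected spot.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                pr, pc = queue.popleft()
                pixels.append((pr, pc))
                for nr, nc in ((pr + 1, pc), (pr - 1, pc), (pr, pc + 1), (pr, pc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc] and grid[nr][nc] > cutoff):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if len(pixels) < pixel_lower:
                continue  # too small: ignored
            target = small if len(pixels) <= pixel_upper else big
            target[next_id] = pixels
            next_id += 1
    return small, big


grid = [
    [9, 9, 0, 0],
    [9, 0, 0, 7],
    [0, 0, 0, 0],
]
small, big = find_spots(grid, cutoff=5, pixel_lower=1, pixel_upper=2)
print("small:", small)
print("big:", big)
```

Here the three connected pixels in the top-left corner exceed pixel_upper=2 and form a big spot, while the isolated pixel on the right is a small spot.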

iMVP_utils.utils.load_sequences_from_fasta(fn)

Load sequences from a FASTA file into a pandas DataFrame.

Parameters:

fn (str) – The file to load.

Return type:

pd.DataFrame
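For readers unfamiliar with the FASTA layout, the parsing step can be sketched with the standard library alone. This is a hypothetical helper, not the package's implementation: the function name read_fasta and the record keys "name" and "seq" are assumptions, and it returns a list of dictionaries rather than a DataFrame.

```python
import io


def read_fasta(handle):
    """Parse FASTA text from a file-like object into a list of records."""
    records = []
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record and starts a new one.
            if name is not None:
                records.append({"name": name, "seq": "".join(chunks)})
            name, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        records.append({"name": name, "seq": "".join(chunks)})
    return records


fasta_text = ">site1\nACGT\nAC\n>site2\nTTGA\n"
records = read_fasta(io.StringIO(fasta_text))
print(records)
```

A list of records like this can be passed to pd.DataFrame(records) to obtain a table with one row per sequence.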

iMVP_utils.utils.phase_shift(df, dict_all_5mers, cluster_id=None, column_motif_F10='motif_F10', current_phase=0, target_base='A')

Perform phase shift.

Parameters:
  • df (pd.DataFrame) – A DataFrame object with X and Y columns.

  • dict_all_5mers (dict) – The dictionary from prepare_kmers_dict.

  • cluster_id (int) – The id of specific cluster.

  • column_motif_F10 (str) – The column name of the 10-nt flanking sequences.

  • current_phase (int) – The current phase of the cluster.

  • target_base (str) – The target base to perform phase matching.

Return type:

pd.DataFrame

iMVP_utils.utils.prepare_kmers_dict(df, column='motif_F14')

Prepare all kmers from a DataFrame with flanking 14 nt sequences.

Parameters:
  • df (pd.DataFrame) – A DataFrame object with X and Y columns.

  • column (str) – The name of column containing flanking 14 nt sequences.

Return type:

dict

iMVP_utils.utils.retrive_clusters(df, edgesX, edgesY, dict_clusters, bins=[600, 600], cluster_ids=None, spot_name='spot')

Annotate sites with clusters.

Parameters:
  • df (pd.DataFrame) – A DataFrame object with X and Y columns.

  • edgesX (np.array) – The array of X edges generated by hist2D.

  • edgesY (np.array) – The array of Y edges generated by hist2D.

  • dict_clusters (dict) – The dictionary generated by hist_to_spots.

  • bins (tuple) – Should be equal to that of hist2D.

  • cluster_ids (iterable) – If given, only find clusters with those ids.

  • spot_name (str) – The column name of the clusters.

Return type:

pd.DataFrame
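The annotation step amounts to finding the histogram bin of each site and looking up which spot owns that bin. The sketch below shows that idea with the standard library; it is not the package's code. The helper names are hypothetical, and the assumption that dict_clusters stores locations as (x_bin, y_bin) pairs may not match the real format, particularly since the real histogram array is rotated.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Return the bin index of value for ascending edges, or None if outside."""
    if value < edges[0] or value > edges[-1]:
        return None
    # The last edge is treated as inclusive, so it falls in the final bin.
    return min(bisect_right(edges, value) - 1, len(edges) - 2)


def annotate(points, edges_x, edges_y, dict_clusters):
    """Label each (x, y) point with the id of the cluster owning its bin."""
    bin_to_cluster = {b: cid for cid, bins in dict_clusters.items() for b in bins}
    labels = []
    for x, y in points:
        key = (bin_index(x, edges_x), bin_index(y, edges_y))
        labels.append(bin_to_cluster.get(key))  # None when no cluster covers the bin
    return labels


edges_x = [0, 1, 2, 3]
edges_y = [0, 1, 2]
dict_clusters = {1: [(0, 0), (1, 0)], 2: [(2, 1)]}  # hypothetical {id: bins} layout
points = [(0.5, 0.5), (1.2, 0.9), (2.5, 1.5), (0.5, 1.5), (3.0, 2.0)]
labels = annotate(points, edges_x, edges_y, dict_clusters)
print(labels)
```

Sites whose bin belongs to no cluster receive None here; how the real function marks such sites is not stated in this reference.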

Module contents