Motif and Transcription Factor enrichment after Mowgli integration¶
This notebook sums up guidelines to:
extract the top features (e.g., genes or peaks) for all dimensions in the Mowgli embedding.
perform a motif enrichment analysis (using peaks) (as done for Figure 5 of our manuscript).
perform a TF enrichment analysis (using genes) using a collection of TF->Targets.
NOTE #1: This notebook uses both R and Python code. We recommend to copy paste in your local Jupyter or Rstudio session and run the code in the corresponding language.
NOTE #2: The enrichments are performed on human data. Change this code and the databases accordingly if you are working with other species.
NOTE #3: It is possible also to match the TF and motif enrichment to get a better viee of the relationship between the transcriptional and epigenetic landscape. We do not cover here this analysis since it’s very case-specific
Extract top features of a modality from Mowgli¶
We assume that the results of Mowgli integration are in a mdata object that store genes in the rna and peaks in the atac slot.
[ ]:
# Python
# method to extract the top features
def top_mowgli(dim, n, H_mowgli):
"""
Get the top n peaks for a given dimension.
"""
H_scaled = H_mowgli / H_mowgli.sum(axis=1, keepdims=True)
return H_scaled[:, dim].argsort()[::-1][:n]
Extract the top peaks¶
[ ]:
# Python
# Peaks
# we set the number of peaks to look at
n_peaks = 100
# Get the genes or peak dictionaries
H_mowgli_atac = mdata["atac"].uns["H_OT"]
# actual features extraction
mdata["atac"].var_names = mdata["atac"].var_names.str.replace("atac:", "")
top_in_mowgli = mdata["atac"].var.copy()
# Fill the Mowgli top peaks.
for dim in range(H_mowgli_atac.shape[1]):
col_name = f"top_in_dim_{dim}"
idx = top_in_mowgli.index[top_mowgli(dim, n_peaks, H_mowgli_atac)]
top_in_mowgli[col_name] = False
top_in_mowgli.loc[idx, col_name] = True
# Save Mowgli's top peaks.
top_in_mowgli.to_csv("top_peaks_in_mowgli.csv")
Extract the top genes (for other expression-space based enrchments)¶
[ ]:
# Python
# we set the number of peaks to look at
n_genes = 100
H_mowgli_rna = mdata["rna"].uns["H_OT"]
# select the top genes to probe using only the highly variable genes (our universe)
top_in_mowgli = (
mdata["rna"].var.loc[mdata["rna"].var["highly_variable"] == True, :].copy()
) # the var coordinates
for dim in range(H_mowgli_rna.shape[1]):
print(dim)
# name of the column iun the var object that will be used to extract the top peaks for each gfiven dimenssion
col_name = f"top_in_dim_{dim}"
idx = top_in_mowgli.index[
top_mowgli(dim, n_genes, H_mowgli_rna)
] # indices of the top features for that given dimensions. will be used for localizing the peaks afterwasrds
top_in_mowgli[col_name] = False # set all value for that dimesions to False
top_in_mowgli.loc[idx, col_name] = True # set to True only the peaks that are
# Save Mowgli's top genes.
top_in_mowgli.to_csv(
os.path.join(top_feats_dir, f"top_genes_in mowgli.csv"),
)
Motif enrichment¶
This code was used in the original publication to perform motif enrichment analysis from chromatin accessibility (Figure 5-C). This notebook is a summarisation of the code that is stored in our Mowgli reproducibility repository.
[ ]:
# R
# Imports.
library(GenomicRanges)
library(motifmatchr)
library(chromVAR)
library(TFBSTools)
library(JASPAR2022)
library(Signac)
library(BSgenome.Hsapiens.UCSC.hg38)
library(chromVARmotifs)
library(MuData)
[ ]:
# R
# Read atac file.
in_atac <- "top_peaks_in_mowgli.csv" # nolint
peaks_csv <- read.csv(in_atac, row.names = 2)
[ ]:
# R
# Optional: Remove non-canonical chromosomes.
peaks_csv < -peaks_csv[peaks_csv["Chromosome"] != "GL000194.1",]
peaks_csv < -peaks_csv[peaks_csv["Chromosome"] != "GL000205.2",]
peaks_csv < -peaks_csv[peaks_csv["Chromosome"] != "GL000205.2",]
peaks_csv < -peaks_csv[peaks_csv["Chromosome"] != "GL000219.1",]
peaks_csv < -peaks_csv[peaks_csv["Chromosome"] != "GL000219.1",]
peaks_csv < -peaks_csv[peaks_csv["Chromosome"] != "KI270721.1",]
peaks_csv < -peaks_csv[peaks_csv["Chromosome"] != "KI270726.1",]
peaks_csv < -peaks_csv[peaks_csv["Chromosome"] != "KI270726.1",]
peaks_csv < -peaks_csv[peaks_csv["Chromosome"] != "KI270713.1",]
[ ]:
# R
# Convert the peaks to GRanges.
chromosomes <- peaks_csv["Chromosome"][, 1]
ranges <- IRanges::IRanges(
start = peaks_csv["Start"][, 1],
end = peaks_csv["End"][, 1]
)
peaks <- GenomicRanges::GRanges(seqnames = chromosomes, ranges = ranges)
[ ]:
# R
# Get JASPAR motifs.
opts <- list()
opts["species"] <- "Homo sapiens"
opts["collection"] <- "CORE"
motifs <- TFBSTools::getMatrixSet(JASPAR2022::JASPAR2022, opts)
motifs_pwm <- TFBSTools::toPWM(motifs)
# Get cisBP motifs.
data("human_pwms_v2")
# Fuse JASPAR and cisBP motifs.
for (name in names(motifs_pwm)) {
human_pwms_v2[name] <- motifs_pwm[name]
}
[ ]:
# R
# Create a 'dummy' Signac object from the peaks.
# Actually giving peaks_csv is nonsense.
# But we only care about the rownames so it's fine.
assay <- Signac::CreateChromatinAssay(
peaks_csv,
ranges = peaks,
sep = c(":", "-")
)
# Create statistics about peaks.
assay <- Signac::RegionStats(
object = assay,
genome = BSgenome.Hsapiens.UCSC.hg38
)
# Add the downloaded motif PWM annotation.
assay <- Signac::AddMotifs(
object = assay,
genome = BSgenome.Hsapiens.UCSC.hg38,
pfm = human_pwms_v2
)
[ ]:
# R
# Define where to save the motif enrichment outputs.
out_motif <- "motifs_"
# Get all top peaks.
background <- c()
for (dim in 0:49) {
# Get the top peaks for that dimension.
features <- rownames(assay)[peaks_csv[paste0("top_in_dim_", dim)] == "True"]
background <- c(background, features)
}
# Iterate over Mowgli's dimensions.
for (dim in 0:49) {
# Get the top peaks for that dimension.
features <- rownames(assay)[peaks_csv[paste0("top_in_dim_", dim)] == "True"]
# Do motif enrichment analysis.
enriched_motifs <- Signac::FindMotifs(
object = assay,
features = features,
background = background
)
# Save the enrichment.
write.csv(enriched_motifs, paste0(out_motif, dim, ".csv"))
}
TF Enrichment using top genes in mowgli dimensions¶
We perform here a standard TF enrichment using the top features identifuied in the RNA space for each dimension of Mowgli.
In this case example, we made use of the Regulatory Circuits database (link), but we recommend the users to choose the most appropriate TF->Target database according to his domain and prior biological information.
[ ]:
# Reload libraries
library(stats)
library(dplyr)
library(stringr)
[ ]:
# R
# reload GRN and make a TF-> Target list
# network of epithelial cells
grn.path <- "Regulatory_circuits_mammary_epithelial_cell.txt"
grn <- read.table(grn.path, sep="\t", header = F)
colnames(grn) <- c("TF", "Target", "score")
# make a TF -> Target list
tf_list <- split(grn$Target, grn$TF)
[ ]:
# R
# reload top features for RNA and reparse it
top_feats.path <- "top_genes_in_mowgli.csv"
top_feats <- read.table(top_feats.path, sep = ",", header = T)
# set row names to index
row.names(top_feats) <- top_feats$hgnc_symbol
cols_to_keep <- c("highly_variable", grep("top_in_dim", names(top_feats), value = TRUE))
top_feats.filtered <- top_feats %>%
select(all_of(cols_to_keep))
top_feats.filtered <- top_feats.filtered %>%
mutate(
`highly_variable` = as.logical(ifelse(`highly_variable` == "True", TRUE, FALSE)),
across(starts_with("top_in_dim"), ~ as.logical(ifelse(. == "True", TRUE, FALSE)))
)
# define the universe -> in this case, a list of highly variable genes in the RNA slot
universe <- readLines("highly_variable_genes.txt")
universe.len <- length(universe)
In brief: - we loop through each dimension and we select for each dimension the top features - we loop through each TF (using a groupby) and we identify the top sets of features - we make a hypergeometric test - we calculate an enrichment score enrichment - we correct the pvalue using Benjamini-hochberg correction - we write the results to a file (only for significant TFs enriched)
[ ]:
# R
# directory that will store the results
res.dir <- "tf_enrichment"
# colnames of the dataframe storing the results
res.colnames <- c("TF", "number_of_targets", "enrichment_score", "p.val", "p.adjust")
# loop through all the top mowgli dimensions
for (dim in names(top_feats.filtered)[grepl("^top_in", names(top_feats.filtered))]) {
dim_number <- str_extract(dim, "\\d+$")
print(paste("enriching:", dim_number))
# select the top features for that dimensiob
top_genes <- rownames(top_feats.filtered[top_feats.filtered[[dim]] == TRUE, ])
top_genes.len <- length(top_genes) # should always be 100
ratioDim <- top_genes.len / universe.len # number of genes in the universe, shoukd always be the same numbe, useful for the enrichment
# open the output file
output_file_name <- file.path(res.dir, paste0("enriched_TFs_dim", dim_number, ".tsv"))
res.df <- data.frame(matrix(ncol= length(res.colnames), nrow = 0))
colnames(res.df) <- res.colnames
# define the list to store files
p_values <- numeric()
tfs <- character()
n_targets <- numeric()
enrichment_scores <- numeric()
# loop through all the TFs and perform the enrichment
for (tf in names(tf_list)){
targets <- tf_list[[tf]]
x <- length(intersect(targets, top_genes)) # white balls, i.e. how many genes in top dim are in the TF targets
m <- length(intersect(universe, targets)) # number of white balls in the urn, i.e., how many targets are in the universe
n <- universe.len - length(intersect(targets, universe)) # number of black balls in the urn, i.e. how many genes in the universe are NOT in the targets
k <- top_genes.len # the size of the balls drawn, always 100 (the number of top genes in the dimension)
p_value <- phyper(x, m, n, k, lower.tail = FALSE) #select as significantonly the over enriched
n_targets_expressed <- length(intersect(targets, rownames(universe)))
n_targets_feats <- x/universe.len
ratioTargets <- ifelse(n_targets_expressed == 0, 0, x / n_targets_expressed)
enrichment_score <- 1/ (ratioTargets / ratioDim)
# Store results for FDR adjustment
p_values <- c(p_values, p_value)
tfs <- c(tfs, tf)
n_targets <- c(n_targets, x)
enrichment_scores <- c(enrichment_scores, enrichment_score)
}
# Adjust p-values using Benjamini-Hochberg correction
adjusted_p_values <- p.adjust(p_values, method = "BH")
# Combine results into a dataframe
results <- data.frame(TF = tfs, number_of_targets = n_targets, enrichment_score = enrichment_scores, p_value = p_values, adjusted_p_value = adjusted_p_values)
# Filter results for significant adjusted p-values
significant_results <- results[results$adjusted_p_value <= 0.05, ]
# Save significant results to file
output_file_name <- file.path(res.dir, paste0("enriched_TFs_dim", dim_number, ".tsv"))
write.table(significant_results, file = output_file_name, sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE)
}