If you want to integrate multiple scRNA-seq datasets, correcting batch effects is likely to be an essential part of your analysis pipeline. Yet it is also one of the trickiest parts, because so many questions need answering. To what extent does your data suffer from batch effects? Should you correct them or not? How do you pick an effective method for your data? Are you “correcting” away the biological differences?
In this post, we share 4 handy tips to improve your batch effect correction process.
Tip no.1: Start by asking if there are batch effects.
Before correcting batch effects, first assess whether your data has any. Sometimes the variation among samples is purely biological. And sometimes you are lucky enough to work with cell hashing or sample-multiplexed data (where multiple samples are processed together in a single run), which largely avoids technical differences between samples.
There are several tools you can try to assess batch effects in your data:
Principal Component Analysis (PCA): You can identify batch effects by performing PCA on raw data and analyzing the top principal components. The next step is to scatter-plot your data by these top principal components and see whether the data are separated by batches rather than biological sources.
t-SNE or UMAP: You can also use t-SNE or UMAP for batch effect detection. Simply overlay the batch labels on the t-SNE or UMAP plot and see how the cells cluster. In the presence of batch effects, cells from the same batch tend to cluster together, and cells from different batches separate, instead of grouping by biological similarity (similar cell types, similar disease conditions, etc.) (see the picture below).
Clear separation of batches on the UMAP signals batch effects. Visualization by CDIAM Multi-Omics Studio
Clustering: Ideally samples with the same treatment will be clustered together. Therefore, when data are clustered by batches instead of treatments, it can indicate a batch effect. Heatmaps and dendrograms are two common approaches to visualize the clusters.
Quantitative metrics: You can try out some quantitative metrics to identify the batch effects with less human bias. Below is a great summary of metrics by Lütge et al. (2021).
Summary of quantitative metrics to identify batch effects. Source: Lütge et al. (2021)
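As a rough illustration of the quantitative idea, the sketch below scores how well batches mix in a low-dimensional embedding: for each cell, it takes the k nearest neighbours and records the fraction that come from a different batch. This is a deliberately simplified stand-in, not one of the metrics summarized by Lütge et al. (2021); the function names, the simulated 2-D data, and k=10 are all assumptions made for the example.

```python
import math
import random


def knn_batch_mixing(points, batches, k=10):
    """Mean fraction of each cell's k nearest neighbours that belong to a different batch."""
    n = len(points)
    total = 0.0
    for i in range(n):
        # Brute-force distances from cell i to every other cell (fine for small n).
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        total += sum(batches[j] != batches[i] for j in neighbours) / k
    return total / n


def simulate(shift, n_per_batch=100, seed=0):
    """Hypothetical 2-D embedding with two batches; batch B is offset by `shift` on the x-axis."""
    rng = random.Random(seed)
    points, batches = [], []
    for batch, offset in (("A", 0.0), ("B", shift)):
        for _ in range(n_per_batch):
            points.append((rng.gauss(offset, 1.0), rng.gauss(0.0, 1.0)))
            batches.append(batch)
    return points, batches


mixed = knn_batch_mixing(*simulate(shift=0.0))
separated = knn_batch_mixing(*simulate(shift=8.0))
print(f"no batch shift:     {mixed:.2f}")
print(f"strong batch shift: {separated:.2f}")
```

A score close to the value expected by chance (about 0.5 for two equal-sized batches) suggests the batches are well mixed, while a score near 0 suggests they separate. Low mixing can also reflect real biology when batches contain different cell types, so read the number alongside the plots rather than on its own.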
Tip no.2: Test out different batch effect correction methods.
There are many different methods for batch effect correction. So how do you pick an effective method for your data?
First, we recommend having a look at these excellent reviews of batch effect correction methods:
Tran, Hoa Thi Nhu, et al. "A benchmark of batch-effect correction methods for single-cell RNA sequencing data." Genome biology 21 (2020): 1-32.
Luecken, Malte D., and Fabian J. Theis. "Current best practices in single‐cell RNA‐seq analysis: a tutorial." Molecular systems biology 15.6 (2019): e8746.
Luecken, Malte D., et al. "Benchmarking atlas-level data integration in single-cell genomics." Nature methods 19.1 (2022): 41-50.
Chen, Wanqiu, et al. "A multicenter study benchmarking single-cell RNA sequencing technologies using reference samples." Nature Biotechnology 39.9 (2021): 1103-1114.
Popular methods include Harmony (Korsunsky et al., 2019), Mutual Nearest Neighbors (MNN) (Haghverdi et al., 2018), LIGER (Welch et al., 2019), and Seurat Integration (Stuart et al., 2019). Tran et al. (2020) recommended Harmony, LIGER, and Seurat 3, and suggested Harmony as the first method to try because of its shorter runtime. Luecken et al. (2022) also performed a very comprehensive benchmark, suggesting scANVI performs the best, while Harmony is good but less scalable, and Seurat CCA has low scalability (see below). A reasonable starting point is to try scANVI and Harmony first, then consider other methods if the results are not satisfactory.
Batch correction method benchmarking by Luecken et al. (2022)
Keep in mind that different tools may perform better on different data sets, so try a variety of methods.
Tip no.3: Check these signs of over-correction.
We are always advised against “correcting” away the biological signals. Yet how do we know if we are over-correcting our scRNA-seq data?
There are some indicative signs of over-correction:
Distinct cell types are clustered together on the dimensionality reduction plot (PCA, t-SNE or UMAP). Below is a very nice visualization from Andreatta et al. (2024) describing four simulated scenarios: three with increasing levels of batch effects (“batch0” with no batch effects, “batchMild” with mild batch effects, and “batchStrong” with strong batch effects), and one with neither batch effects nor biological cell type signal (simulating the result of an extreme batch-effect “overcorrection”).
Four simulated scenarios where datasets have increasing levels of batch effects (Andreatta et al., 2024)
A complete overlap of samples. Overlap is expected when the experiment is designed around very similar samples with only minor differences. But if you see a complete overlap of samples originating from very different conditions or experiments, you may be over-correcting your data.
A significant portion of cluster-specific markers consists of genes with widespread high expression across various cell types, such as ribosomal genes.
If you see any of these signs, consider trying a different, less aggressive batch correction method.
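The marker-based sign can be checked with a few lines of code. The sketch below computes, for each cluster, the fraction of marker genes that look ribosomal by name and flags clusters above a cutoff. The marker lists are made up for illustration, the 0.3 threshold is an arbitrary assumption, and prefix matching on gene symbols is crude (for example, it would also catch kinase genes such as RPS6KA1), so treat the output as a prompt for closer inspection.

```python
# Gene-symbol prefixes for cytosolic and mitochondrial ribosomal protein genes.
RIBOSOMAL_PREFIXES = ("RPS", "RPL", "MRPS", "MRPL")


def ribosomal_fraction(markers):
    """Fraction of marker genes whose symbol starts with a ribosomal prefix."""
    if not markers:
        return 0.0
    # upper() lets the same check work for mouse-style symbols such as Rps12.
    hits = [g for g in markers if g.upper().startswith(RIBOSOMAL_PREFIXES)]
    return len(hits) / len(markers)


def flag_suspicious_clusters(cluster_markers, threshold=0.3):
    """Return {cluster: fraction} for clusters at or above the (assumed) threshold."""
    flagged = {}
    for cluster, markers in cluster_markers.items():
        fraction = ribosomal_fraction(markers)
        if fraction >= threshold:
            flagged[cluster] = fraction
    return flagged


# Hypothetical top markers per cluster after integration.
cluster_markers = {
    "cluster_0": ["CD3D", "CD3E", "IL7R", "CCR7", "LTB"],
    "cluster_1": ["RPS12", "RPL13", "RPS27", "RPL32", "MALAT1"],
}

flagged = flag_suspicious_clusters(cluster_markers)
for cluster, fraction in flagged.items():
    print(f"{cluster}: {fraction:.0%} of markers look ribosomal")
```

In practice the marker lists would come from your differential expression step, run on the clusters obtained after correction.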
Tip no.4: Beware of the imbalance between samples.
What is sample imbalance? It is the case where samples differ in the number of cell types present, the number of cells per cell type, and cell type proportions. Imbalanced samples are common in cancer biology, where tumors exhibit significant intra-tumoral and intra-patient discrepancies.
Maan et al. (2024) benchmarked five integration techniques across 2,600 integration experiments and found that “sample imbalance has substantial impacts on downstream analyses and the biological interpretation of integration results.” This highlights that sample imbalance must be taken into consideration when we integrate scRNA-seq data.
The authors also presented a refined end-user guideline for integrating data when sample imbalance occurs. See below or read the article for further details.
Guidelines for single-cell data integration in imbalanced settings (Maan et al., 2024)
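Before integrating, it can help to quantify how imbalanced your samples are. The sketch below tabulates cell type proportions per sample and reports, for each cell type, its smallest and largest proportion and the samples where it is absent. It is a simple descriptive check, not the procedure used by Maan et al. (2024); the sample names, cell type labels, and counts are invented for the example.

```python
from collections import Counter


def cell_type_proportions(cells):
    """Turn (sample, cell_type) pairs into {sample: {cell_type: proportion}}."""
    counts = {}
    for sample, cell_type in cells:
        counts.setdefault(sample, Counter())[cell_type] += 1
    return {
        sample: {ct: n / sum(c.values()) for ct, n in c.items()}
        for sample, c in counts.items()
    }


def imbalance_report(proportions):
    """Per cell type: min and max proportion across samples, and samples lacking it."""
    all_types = sorted({ct for props in proportions.values() for ct in props})
    report = {}
    for ct in all_types:
        values = [props.get(ct, 0.0) for props in proportions.values()]
        report[ct] = {
            "min": min(values),
            "max": max(values),
            "missing_in": sorted(s for s, p in proportions.items() if ct not in p),
        }
    return report


# Hypothetical annotations: one (sample, cell_type) pair per cell.
cells = (
    [("sample_1", "T cell")] * 50
    + [("sample_1", "B cell")] * 30
    + [("sample_1", "Monocyte")] * 20
    + [("sample_2", "T cell")] * 90
    + [("sample_2", "Monocyte")] * 10
)

proportions = cell_type_proportions(cells)
report = imbalance_report(proportions)
for cell_type, stats in report.items():
    print(cell_type, stats)
```

A cell type that is missing from some samples, or whose proportion varies widely, is a cue to consult the imbalance guidelines before trusting the integrated clusters.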
Conveniently integrate data with CDIAM Multi-Omics Studio
If you are seeking a convenient platform to correct batch effects and analyze scRNA-seq data, try the CDIAM Multi-Omics Studio. The CDIAM platform enables you to conveniently explore various kinds of omics data with an interactive UI containing preset and customizable workflows for correcting batch effects, diverse scRNA-seq visualizations and analytics, target prioritization pipelines, and biomarker validation pipelines, as well as multi-omics data integration workflows for summarizing insights across omics.
References
CellMixS: quantifying and visualizing batch effects in single-cell RNA-seq data - PMC (nih.gov)
Chapter 2 Batch effect detection | Managing Batch Effects in Microbiome Data (evayiwenwang.github.io)
A benchmark of batch-effect correction methods for single-cell RNA sequencing data - PubMed (nih.gov)
Current best practices in single‐cell RNA‐seq analysis: a tutorial | Molecular Systems Biology (embopress.org)
Benchmarking atlas-level data integration in single-cell genomics | Nature Methods
Semi-supervised integration of single-cell transcriptomics data | Nature Communications
Publication highlight: Benchmarking scRNA-seq batch correction methods - 10x Genomics
BTEP: When should we batch correct scRNA-Seq data? When should we avoid it? (cancer.gov)
The Tricky Problem of ‘Batch Effects' in Biological Data (bigomics.ch)