How to Mess up Single-Cell Data Curation

Mar 5 · 4 min read

Curated single-cell data can be a powerful resource for therapeutic innovation, validating insights, identifying targets and biomarkers, and training AI/ML models in drug research and development. However, properly curating single-cell data is notoriously tricky. Even well-intentioned researchers can stumble into common pitfalls that compromise data integrity and downstream analyses. Here, we’ve rounded up some of the most common ways to mess up single-cell data curation. (Or, if you’d rather not ruin your data, take these as cautionary tales.)

1. Ignore Data Quality and Use a One-Size-Fits-All Filtering Approach

One of the easiest ways to ruin your single-cell data curation is to completely ignore data quality and apply generic filtering parameters across all datasets. Sure, setting arbitrary thresholds for mitochondrial gene expression or read depth might seem convenient, but different datasets—and even different cell types—require tailored approaches [1].

Failing to perform manual quality control (QC) checks means you might:

Retain low-quality cells that introduce noise and artifacts
Remove biologically relevant cells that just happen to have different sequencing characteristics
Overlook batch effects and technical biases that could distort your findings
Be unable to reproduce the findings of the original studies

A better approach? Always perform manual QC checks alongside automated filtering and adjust parameters based on the specific dataset.
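To make the idea of dataset-specific thresholds concrete, here is a minimal sketch in plain Python. It assumes one common heuristic that is not prescribed by this post: flagging cells that sit more than a few median absolute deviations (MADs) from their own dataset's median. The field names (`pct_mito`, `total_counts`), the 3-MAD cutoff, the fixed 10% comparison threshold, and the toy numbers are all illustrative assumptions.

```python
from statistics import median


def mad_bounds(values, n_mads=3.0):
    """Return (lower, upper) bounds n_mads median absolute deviations from the median."""
    values = list(values)
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return center - n_mads * mad, center + n_mads * mad


def flag_low_quality(cells, n_mads=3.0):
    """Flag cells with unusually high mito fraction or low depth for THIS dataset."""
    _, mito_upper = mad_bounds([c["pct_mito"] for c in cells], n_mads)
    depth_lower, _ = mad_bounds([c["total_counts"] for c in cells], n_mads)
    return [
        c["pct_mito"] > mito_upper or c["total_counts"] < depth_lower
        for c in cells
    ]


def make_cells(pct_mito, total_counts):
    """Build toy per-cell QC records (hypothetical field names)."""
    return [{"pct_mito": m, "total_counts": t} for m, t in zip(pct_mito, total_counts)]


# Toy data: two datasets with very different mitochondrial baselines.
counts = [5000, 5200, 4800, 5100, 4900, 900]
low_mito_dataset = make_cells([2, 3, 2.5, 3.5, 3, 25], counts)
high_mito_dataset = make_cells([18, 20, 22, 19, 21, 60], counts)

adaptive_a = flag_low_quality(low_mito_dataset)
adaptive_b = flag_low_quality(high_mito_dataset)

# A single fixed 10% cutoff would discard every cell in the second dataset.
fixed_b = [c["pct_mito"] > 10 for c in high_mito_dataset]
```

In this toy case the adaptive rule flags only the last cell in each dataset, while the fixed cutoff removes the whole high-baseline dataset. The heuristic is not a substitute for manual review: if the MAD is zero, for example, any cell above the median gets flagged, so the distributions still need to be looked at.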

2. Completely Disregard Standardization and Ignore Differences in Normalization Methods

Some authors publish raw data, while others don’t. Some provide count matrices in MTX format, while others use TSV files. Different authors also apply different normalization approaches (CPM, TPM, Seurat’s log-normalization, SCTransform, etc.) and other preprocessing parameters. On one hand, this flexibility allows normalization to be tailored to the nature of each dataset and experiment. On the other, it creates inconsistencies when curating data into a centralized database or integrating datasets across studies.

Here are some surefire ways to create chaos:

Mix raw counts with normalized data across your curated database
Fail to systematically document normalization and preprocessing parameters for each dataset
Neglect efforts to standardize the data

The result? Confounded downstream analyses, misleading biological interpretations, and an inability to compare results across datasets. Standardization of preprocessing methods and metadata annotation is essential for meaningful comparisons and reproducibility.
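One way to guard against mixing raw and normalized matrices is to check what a matrix looks like before processing it, and to write the normalization step into the dataset's record. The sketch below is an assumption-laden illustration, not a description of any particular pipeline: the integer check is only a heuristic, the library-size scaling to 10,000 followed by log1p is one common choice among several, and the `matrix` and `provenance` keys are hypothetical.

```python
import math


def looks_like_raw_counts(matrix):
    """Heuristic: raw counts are non-negative whole numbers."""
    return all(v >= 0 and float(v).is_integer() for row in matrix for v in row)


def log_normalize(matrix, scale=1e4):
    """Scale each cell (row) to `scale` total counts, then apply log1p."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append(
            [math.log1p(v / total * scale) if total else 0.0 for v in row]
        )
    return normalized


def standardize(dataset, scale=1e4):
    """Normalize a raw-count dataset and document how it was done."""
    if not looks_like_raw_counts(dataset["matrix"]):
        raise ValueError("matrix does not look like raw counts; refusing to re-normalize")
    provenance = dict(dataset["provenance"])
    provenance["normalization"] = f"log1p(counts / total * {scale:g})"
    return {"matrix": log_normalize(dataset["matrix"], scale), "provenance": provenance}


# Toy cells-by-genes matrix with a hypothetical provenance record.
raw_dataset = {"matrix": [[1, 0, 3], [0, 0, 10]], "provenance": {"source_format": "mtx"}}
standardized = standardize(raw_dataset)
```

Calling `standardize` a second time on the output raises a `ValueError`, which is the point: already-normalized data should not silently pass through the same step twice.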

3. Neglect Metadata Harmonization

Metadata harmonization is tough—but that doesn’t mean you should ignore it. The ultimate goal of curating a database is to integrate data across studies, cross-validate insights, and build models. Messy terminologies and inconsistent vocabularies can cause major confusion. If you want to maximize disorder, simply:

Ignore variations in sample annotation across datasets
Fail to standardize key variables such as disease state, tissue type, or experimental conditions
Use inconsistent or uncontrolled vocabulary across datasets

Metadata inconsistencies make integration and downstream interpretation nearly impossible. Proper metadata harmonization—using controlled vocabularies, ontologies, and consistent annotations—ensures that datasets can be meaningfully compared and analyzed together.
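As a rough illustration of what a controlled vocabulary does in practice, the sketch below maps free-text disease labels onto canonical terms and reports anything it cannot map, so a curator can review it instead of guessing. The vocabulary here is a tiny hypothetical example, not a real ontology, and the function names are made up for this post.

```python
import re

# Hypothetical controlled vocabulary: canonical term -> accepted synonyms.
DISEASE_VOCAB = {
    "healthy": {"healthy", "normal", "healthy control"},
    "non-small cell lung cancer": {
        "nsclc",
        "non small cell lung cancer",
        "non small cell lung carcinoma",
    },
}


def _key(text):
    """Normalize case, whitespace, hyphens, and underscores for lookup."""
    return re.sub(r"[\s_\-]+", " ", text.strip().lower())


def build_lookup(vocab):
    """Flatten the vocabulary into a normalized-synonym -> canonical-term map."""
    lookup = {}
    for canonical, synonyms in vocab.items():
        for term in synonyms | {canonical}:
            lookup[_key(term)] = canonical
    return lookup


def harmonize(values, vocab):
    """Return (mapped values, sorted list of labels that need manual review)."""
    lookup = build_lookup(vocab)
    mapped = [lookup.get(_key(v)) for v in values]
    unmapped = sorted({v for v, m in zip(values, mapped) if m is None})
    return mapped, unmapped


labels = ["NSCLC", "Non-Small_Cell Lung Cancer", " Normal ", "tumour"]
mapped, needs_review = harmonize(labels, DISEASE_VOCAB)
```

Here the first two labels collapse to the same canonical term, while "tumour" is returned for review because nothing in the vocabulary says which disease it refers to. Leaving ambiguous labels unmapped is a deliberate choice in this sketch: a wrong mapping is harder to spot later than a missing one.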

Want to Avoid the Mess? Here are Some Best Practices

If you actually want to curate high-quality single-cell data, avoid these pitfalls at all costs. Pay close attention to QC and standardization, and invest time in harmonizing metadata. Single-cell data curation is challenging, but getting it right is the key to unlocking robust and reproducible biological insights.

Build Your Standard Operating Procedures (SOP)

Carefully design an SOP for data curation that covers the scenarios you expect to encounter and how to handle each one. Check out the Pythiomics database we built, which includes a highly detailed SOP for curating and processing data.

Strike the Balance – Define What Should Be Tailored, What Should Not, and the Tradeoffs

Do you care more about the reproducibility of individual datasets (e.g., for validating insights on a case-by-case basis) or about building a standardized database for model training and querying? Here are some key tradeoffs to consider:

QC parameters: Do you want to apply a standardized QC approach or customize it per dataset? Standard QC ensures consistency but risks the problems described in pitfall 1, such as retaining low-quality cells or removing biologically relevant ones. Custom QC preserves study-specific characteristics but may introduce some inconsistency across datasets.
Preprocessing parameters: Should all datasets be standardized or tailored individually? Customization improves reproducibility with the original study, but standardization creates a uniform database.
Cell type annotations: Do the authors' original cell type labels matter to you, or would you prefer to standardize them across datasets? Do you have sufficient time and resources to manually harmonize the cell types, or will you rely on automatic mapping (which is less time-consuming but can be less accurate)?

Striking the right balance in these decisions will help you build a robust and usable single-cell database that meets your research objectives.
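One lightweight way to make these decisions explicit is to write them down as a default policy plus documented per-dataset exceptions. The sketch below is only an illustration: the field names, the default values (such as a 10% mitochondrial cutoff and a 200-gene minimum), and the `heart_atlas` dataset ID are hypothetical placeholders, not recommendations.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class CurationPolicy:
    """Database-wide defaults; every value here is an illustrative assumption."""
    qc_mode: str = "standard"                 # "standard" or "per_dataset"
    max_pct_mito: float = 10.0
    min_genes: int = 200
    normalization: str = "log1p_cp10k"
    keep_author_cell_types: bool = True       # retain original labels alongside harmonized ones
    cell_type_harmonization: str = "automatic"  # "automatic" or "manual"


DEFAULT_POLICY = CurationPolicy()

# Hypothetical per-dataset exceptions, recorded next to the defaults.
OVERRIDES = {
    "heart_atlas": {"qc_mode": "per_dataset", "max_pct_mito": 30.0},
}


def policy_for(dataset_id, overrides=OVERRIDES):
    """Return the effective policy and the fields that deviate from the default."""
    policy = replace(DEFAULT_POLICY, **overrides.get(dataset_id, {}))
    deviations = {
        name: value
        for name, value in asdict(policy).items()
        if value != getattr(DEFAULT_POLICY, name)
    }
    return policy, deviations


heart_policy, heart_deviations = policy_for("heart_atlas")
other_policy, other_deviations = policy_for("some_other_dataset")
```

Returning the deviations alongside the policy keeps the tradeoff visible: anyone querying the database can see which datasets were tailored and in what way, instead of assuming everything was processed identically.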

Meet Pythiomics – A Meticulously Curated Multi-omics Database for Drug Research

Thinking of building a single-cell curated database but lacking the resources to do it? Check out our Pythiomics database, a meticulously curated, standardized, and harmonized database that integrates more than 10,000 multi-omics datasets from diverse sources – including single-cell RNA-seq, bulk RNA-seq, proteomics, spatial transcriptomics and more.

Pythiomics addresses the common pitfalls in multi-omics and single-cell data curation by following a carefully designed SOP that includes rigorous QC, standardized preprocessing, and careful metadata harmonization with AI support. The database is interactively accessible via the C-DIAM Multi-omics Studio platform, ready for researchers and data scientists to explore, validate insights, and train AI/ML models for target and biomarker discovery, among other applications.

Pythiomics database statistics as of Jan 2025

Interactive exploration of Pythiomics in the C-DIAM Multi-omics Platform

Learn more about Pythiomics and request access

References

1. Luecken, Malte D., and Fabian J. Theis. "Current best practices in single-cell RNA-seq analysis: a tutorial." Molecular Systems Biology 15.6 (2019): e8746.