Skip to content

Dataset Cleaning Best Practices

A practical guide to cleaning messy data before analysis or model training.

The 80/20 Rule

Data scientists spend ~80% of their time cleaning data. These steps catch 80% of issues:

1. Profile first

Before cleaning anything, profile your data to understand what you're working with:

  • How many rows and columns?
  • What are the data types?
  • How many nulls per column?
  • Are there duplicate rows?
  • What are the value distributions?

Use our Data Profiler for instant profiling.

2. Remove duplicates

Exact duplicate rows are almost always errors.

-- Find duplicates
SELECT *, COUNT(*) as cnt FROM data GROUP BY ALL HAVING cnt > 1;

-- Remove them
CREATE TABLE clean AS SELECT DISTINCT * FROM data;

3. Handle missing values

Strategy When to use
Drop rows < 5% nulls, rows are independent
Fill with mean/median Numeric columns, MCAR (missing completely at random)
Fill with mode Categorical columns
Fill with "Unknown" Categorical, missingness is informative
Forward/backward fill Time series
Drop column > 50% nulls, not critical

4. Fix data types

Common issues: - Numbers stored as strings ("123" instead of 123) - Dates as strings ("2024-01-15" instead of DATE) - Booleans as strings ("true"/"false") - Mixed types in same column

5. Trim whitespace

Leading/trailing spaces cause invisible duplicate categories: - "London " ≠ "London" ≠ " London"

6. Standardize categories

  • Inconsistent casing: "usa", "USA", "Us"
  • Abbreviations: "CA", "California", "Calif."
  • Typos: "Londn", "Londnon"

7. Handle outliers

Method When
IQR (1.5x) General purpose, resistant to extremes
Z-score (> 3) Normal distributions
Domain knowledge age > 150, price < 0
Keep them When extremes are valid data points

8. Validate with schema

Define expected rules and validate:

{
  "columns": [
    { "name": "email", "required": true, "pattern": "^[^@]+@[^@]+$" },
    { "name": "age", "required": true, "min": 0, "max": 150 },
    { "name": "status", "allowedValues": ["active", "inactive", "pending"] }
  ]
}

Use our Schema Validator to automate this.

Common pitfalls

Pitfall Why it's bad Fix
Cleaning before profiling You don't know what to fix Always profile first
Dropping too many rows Loses statistical power Consider imputation
Not documenting changes Can't reproduce results Log every cleaning step
Cleaning training + test together Data leakage Clean separately
Ignoring domain context Removing valid outliers Consult domain experts