Dataset Cleaning Best Practices¶

A practical guide to cleaning messy data before analysis or model training.

The 80/20 Rule¶

Data scientists spend ~80% of their time cleaning data. These steps catch 80% of issues:

1. Profile first¶

Before cleaning anything, profile your data to understand what you're working with:

How many rows and columns?
What are the data types?
How many nulls per column?
Are there duplicate rows?
What are the value distributions?

Use our Data Profiler for instant profiling.

2. Remove duplicates¶

Exact duplicate rows are almost always errors.

-- Find duplicates
SELECT *, COUNT(*) as cnt FROM data GROUP BY ALL HAVING cnt > 1;

-- Remove them
CREATE TABLE clean AS SELECT DISTINCT * FROM data;

3. Handle missing values¶

Strategy	When to use
Drop rows	< 5% nulls, rows are independent
Fill with mean/median	Numeric columns, MCAR (missing completely at random)
Fill with mode	Categorical columns
Fill with "Unknown"	Categorical, missingness is informative
Forward/backward fill	Time series
Drop column	> 50% nulls, not critical

4. Fix data types¶

Common issues: - Numbers stored as strings ("123" instead of 123) - Dates as strings ("2024-01-15" instead of DATE) - Booleans as strings ("true"/"false") - Mixed types in same column

5. Trim whitespace¶

Leading/trailing spaces cause invisible duplicate categories: - "London " ≠ "London" ≠ " London"

6. Standardize categories¶

Inconsistent casing: "usa", "USA", "Us"
Abbreviations: "CA", "California", "Calif."
Typos: "Londn", "Londnon"

7. Handle outliers¶

Method	When
IQR (1.5x)	General purpose, resistant to extremes
Z-score (> 3)	Normal distributions
Domain knowledge	age > 150, price < 0
Keep them	When extremes are valid data points

8. Validate with schema¶

Define expected rules and validate:

{
  "columns": [
    { "name": "email", "required": true, "pattern": "^[^@]+@[^@]+$" },
    { "name": "age", "required": true, "min": 0, "max": 150 },
    { "name": "status", "allowedValues": ["active", "inactive", "pending"] }
  ]
}

Use our Schema Validator to automate this.

Common pitfalls¶

Pitfall	Why it's bad	Fix
Cleaning before profiling	You don't know what to fix	Always profile first
Dropping too many rows	Loses statistical power	Consider imputation
Not documenting changes	Can't reproduce results	Log every cleaning step
Cleaning training + test together	Data leakage	Clean separately
Ignoring domain context	Removing valid outliers	Consult domain experts