Dataset Cleaning Best Practices¶
A practical guide to cleaning messy data before analysis or model training.
The 80/20 Rule¶
Data scientists spend ~80% of their time cleaning data. These steps catch 80% of issues:
1. Profile first¶
Before cleaning anything, profile your data to understand what you're working with:
- How many rows and columns?
- What are the data types?
- How many nulls per column?
- Are there duplicate rows?
- What are the value distributions?
Use our Data Profiler for instant profiling.
2. Remove duplicates¶
Exact duplicate rows are almost always errors.
-- Find duplicates
SELECT *, COUNT(*) as cnt FROM data GROUP BY ALL HAVING cnt > 1;
-- Remove them
CREATE TABLE clean AS SELECT DISTINCT * FROM data;
3. Handle missing values¶
| Strategy | When to use |
|---|---|
| Drop rows | < 5% nulls, rows are independent |
| Fill with mean/median | Numeric columns, MCAR (missing completely at random) |
| Fill with mode | Categorical columns |
| Fill with "Unknown" | Categorical, missingness is informative |
| Forward/backward fill | Time series |
| Drop column | > 50% nulls, not critical |
4. Fix data types¶
Common issues: - Numbers stored as strings ("123" instead of 123) - Dates as strings ("2024-01-15" instead of DATE) - Booleans as strings ("true"/"false") - Mixed types in same column
5. Trim whitespace¶
Leading/trailing spaces cause invisible duplicate categories: - "London " ≠ "London" ≠ " London"
6. Standardize categories¶
- Inconsistent casing: "usa", "USA", "Us"
- Abbreviations: "CA", "California", "Calif."
- Typos: "Londn", "Londnon"
7. Handle outliers¶
| Method | When |
|---|---|
| IQR (1.5x) | General purpose, resistant to extremes |
| Z-score (> 3) | Normal distributions |
| Domain knowledge | age > 150, price < 0 |
| Keep them | When extremes are valid data points |
8. Validate with schema¶
Define expected rules and validate:
{
"columns": [
{ "name": "email", "required": true, "pattern": "^[^@]+@[^@]+$" },
{ "name": "age", "required": true, "min": 0, "max": 150 },
{ "name": "status", "allowedValues": ["active", "inactive", "pending"] }
]
}
Use our Schema Validator to automate this.
Common pitfalls¶
| Pitfall | Why it's bad | Fix |
|---|---|---|
| Cleaning before profiling | You don't know what to fix | Always profile first |
| Dropping too many rows | Loses statistical power | Consider imputation |
| Not documenting changes | Can't reproduce results | Log every cleaning step |
| Cleaning training + test together | Data leakage | Clean separately |
| Ignoring domain context | Removing valid outliers | Consult domain experts |