Ten Data Science Mistakes (and How to Avoid Them)
There’s a lot of noise around tools, frameworks, and model types. But most data science failures aren’t technical; they’re basic. You can avoid them by getting a few fundamentals right.
This post covers ten common mistakes people make when building data-driven solutions, and includes working Python examples so you can test them yourself (or show them to your colleagues when they ignore you).
1. Ignoring Domain Knowledge
You can’t model what you don’t understand.
Without knowing how the business works, you won’t know what the data represents, which features are important, or what the model should optimise. Worse, you might solve the wrong problem entirely.
2. Overvaluing Tools
Python. R. SQL. Spark. Tableau. Whatever. None of these matter if you can’t ask the right questions or structure the problem clearly.
Too many teams spend time debating tech stacks instead of thinking about outcomes. The algorithms underneath haven’t changed that much in decades — logistic regression is still logistic regression.
Use the tools you know. Solve the problem first. Optimise later.
3. Messy Joins and Dirty Data
Joining tables without understanding the data relationships is a fast way to create duplicates, nulls, or broken logic.
Here’s a Python example showing how a bad join can silently duplicate rows:
import pandas as pd
# Orders: one row per order
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 1, 2],
    'order_value': [100, 150, 200]
})
# Customers: multiple addresses per customer
customers = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'address_type': ['home', 'work', 'home'],
    'region': ['North', 'North', 'South']
})
# Join to add region (but accidentally duplicate orders)
merged = pd.merge(orders, customers, on='customer_id', how='left')
print("Merged Data:")
print(merged)
# Try to sum order value per customer
summary = merged.groupby('customer_id')['order_value'].sum()
print("\nIncorrect Total Order Value:")
print(summary)
Merged Data:

|   | order_id | customer_id | order_value | address_type | region |
|---|---|---|---|---|---|
| 0 | 101 | 1 | 100 | home | North |
| 1 | 101 | 1 | 100 | work | North |
| 2 | 102 | 1 | 150 | home | North |
| 3 | 102 | 1 | 150 | work | North |
| 4 | 103 | 2 | 200 | home | South |

Incorrect Total Order Value:

| customer_id | order_value |
|---|---|
| 1 | 500 |
| 2 | 200 |
Fix: Deduplicate Before Joining
import pandas as pd
# Orders: one row per order
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 1, 2],
    'order_value': [100, 150, 200]
})
# Customers: multiple addresses per customer
customers = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'address_type': ['home', 'work', 'home'],
    'region': ['North', 'North', 'South']
})
# Fix: only keep one row per customer
customers_deduped = customers.drop_duplicates(subset='customer_id')
merged_clean = pd.merge(orders, customers_deduped, on='customer_id', how='left')
summary_clean = merged_clean.groupby('customer_id')['order_value'].sum()
print("\nCorrect Total Order Value:")
print(summary_clean)
Correct Total Order Value:

| customer_id | order_value |
|---|---|
| 1 | 250 |
| 2 | 200 |
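pandas can also catch this class of bug at merge time. Here’s a small sketch, reusing the DataFrames from the snippets above, that passes the validate argument to pd.merge so the join fails loudly instead of silently duplicating rows:

# validate='many_to_one' asserts that customer_id is unique on the
# customers side; pandas raises MergeError if it isn't.
try:
    pd.merge(orders, customers, on='customer_id', how='left', validate='many_to_one')
except pd.errors.MergeError as e:
    print("Join rejected:", e)
# The deduplicated table passes the same check without raising.
pd.merge(orders, customers_deduped, on='customer_id', how='left', validate='many_to_one')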
4. Confusing Correlation with Causation
“A and B are correlated, so A must cause B.”
Here’s a simple (fake) example:
import pandas as pd
df = pd.DataFrame({
    'ice_cream_sales': [100, 150, 200, 250],
    'sunburns': [10, 20, 30, 40]
})
correlation = df.corr()
print(correlation)
This shows a perfect correlation of 1.0! But no, selling Cornettos doesn’t cause sunburns. The weather does. Always ask what external factors might be driving your relationships.
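To make the hidden driver explicit, here’s a small sketch with made-up temperature figures added as a third column; both series track temperature, which is what actually links them:

import pandas as pd
# Hypothetical confounder: hot weather drives both series
df = pd.DataFrame({
    'temperature': [20, 25, 30, 35],
    'ice_cream_sales': [100, 150, 200, 250],
    'sunburns': [10, 20, 30, 40]
})
# All three columns correlate strongly; temperature explains the
# apparent sales-to-sunburn link.
print(df.corr())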
5. Creating Redundant Features
Highly correlated features add no new information to a model; they just inflate coefficient variance and make the results harder to interpret.
import pandas as pd
df = pd.DataFrame({
    'income_monthly': [2000, 3000, 4000],
    'income_yearly': [24000, 36000, 48000]  # Just monthly * 12
})
print(df.corr())
The correlation will be 1.0… they’re the same signal. That’s not helpful to a model.
|   | income_monthly | income_yearly |
|---|---|---|
| income_monthly | 1.0 | 1.0 |
| income_yearly | 1.0 | 1.0 |
Fix: drop or combine them before training.
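One way to do that programmatically, sketched below with an arbitrary 0.95 threshold: look at the upper triangle of the correlation matrix and drop one column from each highly correlated pair:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'income_monthly': [2000, 3000, 4000],
    'income_yearly': [24000, 36000, 48000]
})

# Keep only the upper triangle so each pair is considered once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag any column correlated above the threshold with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print("Dropping:", to_drop)
print(df.drop(columns=to_drop))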
6. Skipping Normalisation
Distance-based models (like K-means, k-NN, or SVM with RBF kernel) are sensitive to scale.
Fix: apply StandardScaler before fitting the model.
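A minimal sketch of what that looks like with k-means on made-up data, where one feature (income) would otherwise dominate every distance calculation:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two features on very different scales: income in pounds, age in years
X = np.column_stack([
    rng.normal(50000, 15000, 200),  # income
    rng.normal(40, 12, 200)         # age
])

# Unscaled: income dominates the distance metric, age is effectively ignored
labels_raw = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Scaled: both features contribute equally to the clustering
X_scaled = StandardScaler().fit_transform(X)
labels_scaled = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

print("Cluster sizes (raw):   ", np.bincount(labels_raw))
print("Cluster sizes (scaled):", np.bincount(labels_scaled))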
7. Not Validating
It’s not enough to train and test on the same data and call it a day. Your model must be evaluated on unseen data.
Here’s how to do a simple train/test split:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
X = np.random.rand(100, 1)
y = 3 * X.flatten() + np.random.randn(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)
print("Train MSE:", mean_squared_error(y_train, train_predictions))
print("Test MSE:", mean_squared_error(y_test, test_predictions))
Always track training vs testing error. Big gaps suggest overfitting.
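For a steadier estimate than a single split, k-fold cross-validation averages the error over several held-out folds. A short sketch, reusing X and y from above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: each fold is held out once for evaluation
scores = cross_val_score(LinearRegression(), X, y,
                         scoring='neg_mean_squared_error', cv=5)
print("Cross-validated MSE:", -scores.mean())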
8. Using the Wrong Metric
Don’t blindly optimise for accuracy — especially on imbalanced datasets.
Here’s why:
from sklearn.metrics import accuracy_score, precision_score, recall_score
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, zero_division=0))
print("Recall:", recall_score(y_true, y_pred))
Accuracy: 0.8
Precision: 0.0
Recall: 0.0
It “predicts correctly” 80% of the time, but it never spots the positive case. Metrics should reflect the actual problem you’re solving.
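If missing the positive class is what actually costs you, report a metric that punishes it. A quick sketch, reusing y_true and y_pred from above, with F1 and the full per-class report:

from sklearn.metrics import f1_score, classification_report

# F1 balances precision and recall, so the "always predict 0" model scores 0
print("F1:", f1_score(y_true, y_pred, zero_division=0))
print(classification_report(y_true, y_pred, zero_division=0))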
9. Overfitting with Too Much Model Complexity
If you add enough parameters, you can fit anything… including noise.
Here’s a regression example using PolynomialFeatures:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).flatten() + np.random.normal(0, 0.1, 100)
model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
model.fit(X, y)
plt.scatter(X, y, label='Data')
plt.plot(X, model.predict(X), color='r', label='Degree 9 Fit')
plt.legend()
plt.show()
Looks great, until you test it elsewhere. Overfit models don’t generalise.
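To see that failure, evaluate the same fitted model just outside the range it was trained on. A short sketch reusing model, X, and y from above:

from sklearn.metrics import mean_squared_error

# Points just beyond the training range [0, 10]
X_new = np.linspace(10, 12, 20).reshape(-1, 1)
y_new = np.sin(X_new).flatten()

print("MSE inside training range: ", mean_squared_error(y, model.predict(X)))
print("MSE outside training range:", mean_squared_error(y_new, model.predict(X_new)))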
10. Bad Visualisation
- No 3D pie charts.
- No bar charts with different Y-axis scales.
- No line charts with 30 overlapping series.

Better approach:

- Use consistent colours
- Label your axes
- Keep it simple
- Remove clutter
Make charts that explain, not confuse.
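As a baseline, here’s a minimal matplotlib sketch that follows those rules (the revenue figures are invented for illustration):

import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
revenue = [120, 135, 128, 150, 162, 170]  # made-up figures

fig, ax = plt.subplots()
ax.plot(months, revenue, color='steelblue', marker='o')
ax.set_xlabel('Month')
ax.set_ylabel('Revenue (£k)')
ax.set_title('Monthly Revenue')
# Remove clutter: drop the top and right box lines
for side in ('top', 'right'):
    ax.spines[side].set_visible(False)
plt.show()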
Final Thoughts
These mistakes aren’t advanced. They’re basic.
The best way to avoid them? Slow down. Check your work. Validate your logic. Use the simplest tools that solve the problem, and understand the context before writing a single line of code.
If you’re building something important, get these things right first.