There’s a lot of noise around tools, frameworks, and model types. But most data science failures aren’t technical; they’re basic. You can avoid them by getting a few fundamentals right.

This post covers ten common mistakes people make when building data-driven solutions, and includes working Python examples so you can test them yourself (or show them to your colleagues when they ignore you).

1. Ignoring Domain Knowledge

You can’t model what you don’t understand.

Without knowing how the business works, you won’t know what the data represents, which features are important, or what the model should optimise. Worse, you might solve the wrong problem entirely.

2. Overvaluing Tools

Python. R. SQL. Spark. Tableau. Whatever. None of these matter if you can’t ask the right questions or structure the problem clearly.

Too many teams spend time debating tech stacks instead of thinking about outcomes. The algorithms underneath haven’t changed that much in decades — logistic regression is still logistic regression.

Use the tools you know. Solve the problem first. Optimise later.

3. Messy Joins and Dirty Data

Joining tables without understanding the data relationships is a fast way to create duplicates, nulls, or broken logic.

Here’s a Python example showing how a bad join can silently duplicate rows:

import pandas as pd

# Orders: one row per order
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 1, 2],
    'order_value': [100, 150, 200]
})

# Customers: multiple addresses per customer
customers = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'address_type': ['home', 'work', 'home'],
    'region': ['North', 'North', 'South']
})

# Join to add region (but accidentally duplicate orders)
merged = pd.merge(orders, customers, on='customer_id', how='left')
print("Merged Data:")
print(merged)

# Try to sum order value per customer
summary = merged.groupby('customer_id')['order_value'].sum()
print("\nIncorrect Total Order Value:")
print(summary)
Merged Data:
   order_id  customer_id  order_value address_type region
0       101            1          100         home  North
1       101            1          100         work  North
2       102            1          150         home  North
3       102            1          150         work  North
4       103            2          200         home  South

Incorrect Total Order Value:
customer_id
1    500
2    200

Fix: Deduplicate Before Joining

import pandas as pd

# Orders: one row per order
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 1, 2],
    'order_value': [100, 150, 200]
})

# Customers: multiple addresses per customer
customers = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'address_type': ['home', 'work', 'home'],
    'region': ['North', 'North', 'South']
})

# Fix: only keep one row per customer
customers_deduped = customers.drop_duplicates(subset='customer_id')

merged_clean = pd.merge(orders, customers_deduped, on='customer_id', how='left')
summary_clean = merged_clean.groupby('customer_id')['order_value'].sum()
print("\nCorrect Total Order Value:")
print(summary_clean)

Correct Total Order Value:
customer_id
1    250
2    200
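
pandas can also catch this kind of problem at merge time. Here’s a minimal sketch, reusing the same frames, that asserts the join should be many-to-one via merge’s validate argument; because customers has duplicate customer_id values, the merge raises an error instead of silently duplicating rows:

import pandas as pd

orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 1, 2],
    'order_value': [100, 150, 200]
})

customers = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'address_type': ['home', 'work', 'home'],
    'region': ['North', 'North', 'South']
})

# Assert the expected relationship: many orders, at most one customer row each.
# This raises a MergeError because customers has duplicate customer_id values.
try:
    pd.merge(orders, customers, on='customer_id', how='left',
             validate='many_to_one')
except pd.errors.MergeError as err:
    print("Join failed validation:", err)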

4. Confusing Correlation with Causation

“A and B are correlated, so A must cause B.”

Here’s a simple (fake) example:

import pandas as pd

df = pd.DataFrame({
    'ice_cream_sales': [100, 150, 200, 250],
    'sunburns': [10, 20, 30, 40]
})

correlation = df.corr()
print(correlation)

This shows a perfect correlation of 1.0! But no, selling Cornettos doesn’t cause sunburns. The weather does. Always ask what external factors might be driving your relationships.
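
If you suspect a common cause, measure it and check. A minimal sketch (the temperature figures are invented for illustration): once the likely driver is in the data, you can see it correlates with both variables, which is a strong hint that it, not the ice cream, is doing the work.

import pandas as pd

df = pd.DataFrame({
    'temperature': [20, 25, 30, 35],         # the suspected common cause
    'ice_cream_sales': [100, 150, 200, 250],
    'sunburns': [10, 20, 30, 40]
})

# Temperature correlates strongly with both columns, a clue that the
# ice-cream/sunburn relationship is driven by a third variable.
print(df.corr())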

5. Creating Redundant Features

Highly correlated features add little new information, and in linear models multicollinearity makes coefficient estimates unstable and hard to interpret.

import pandas as pd

df = pd.DataFrame({
    'income_monthly': [2000, 3000, 4000],
    'income_yearly': [24000, 36000, 48000]  # Just monthly * 12
})

print(df.corr())

The correlation will be 1.0… they’re the same signal. That’s not helpful to a model.

                income_monthly  income_yearly
income_monthly             1.0            1.0
income_yearly              1.0            1.0

Fix: drop or combine them before training.
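
One way to do that, as a minimal sketch: compute the correlation matrix and drop one column from any pair above a chosen threshold (the 0.95 cut-off here is an arbitrary example, not a rule).

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'income_monthly': [2000, 3000, 4000],
    'income_yearly': [24000, 36000, 48000]
})

# Absolute correlations, upper triangle only, so each pair is counted once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column that is highly correlated with an earlier one
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)

print("Dropped:", to_drop)
print(df_reduced)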

6. Skipping Normalisation

Distance-based models (like K-means, k-NN, or SVM with an RBF kernel) are sensitive to scale: a feature measured in the tens of thousands will dominate one measured in single digits.

Fix: apply StandardScaler before fitting the model.
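
Here’s a minimal sketch with scikit-learn (the values are made up): income is in the tens of thousands and age in the tens, so without scaling income dominates every distance; StandardScaler puts both features on the same footing before clustering.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Two features on very different scales: age in years, income in pounds
X = np.array([
    [25, 20000],
    [30, 21000],
    [45, 90000],
    [50, 95000],
])

# Scale each feature to zero mean and unit variance before clustering
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)
print("Cluster labels:", labels)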

7. Not Validating

It’s not enough to train and test on the same data and call it a day. Your model must be evaluated on unseen data.

Here’s how to do a simple train/test split:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

X = np.random.rand(100, 1)
y = 3 * X.flatten() + np.random.randn(100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, predictions))

Always track training vs testing error. Big gaps suggest overfitting.
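
A minimal sketch of that check, continuing directly from the example above:

# Compare error on data the model saw with error on data it didn't;
# a large gap between the two suggests overfitting
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))

print("Train MSE:", train_mse)
print("Test MSE:", test_mse)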

8. Using the Wrong Metric

Don’t blindly optimise for accuracy — especially on imbalanced datasets.

Here’s why:

from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, zero_division=0))
print("Recall:", recall_score(y_true, y_pred))

Accuracy: 0.8
Precision: 0.0
Recall: 0.0

It “predicts correctly” 80% of the time, but it never spots the positive case. Metrics should reflect the actual problem you’re solving.
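
For imbalanced problems, a metric like F1 (or a full classification report) tells a more honest story. A minimal sketch on the same labels:

from sklearn.metrics import f1_score, classification_report

y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]

# F1 balances precision and recall, so missing the rare class is punished
print("F1:", f1_score(y_true, y_pred, zero_division=0))
print(classification_report(y_true, y_pred, zero_division=0))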

9. Overfitting with Too Much Model Complexity

If you add enough parameters, you can fit anything… including noise.

Here’s a regression example using PolynomialFeatures:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np

X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).flatten() + np.random.normal(0, 0.1, 100)

model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
model.fit(X, y)

plt.scatter(X, y, label='Data')
plt.plot(X, model.predict(X), color='r', label='Degree 9 Fit')
plt.legend()
plt.show()

Looks great, until you test it elsewhere. Overfit models don’t generalise.
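
A minimal sketch of that check: hold out part of the data and compare training and test error for a simple fit and the degree-9 fit. The exact numbers vary run to run, but if the complex model only wins on the training data, it’s overfitting.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).flatten() + np.random.normal(0, 0.1, 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (3, 9):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")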

10. Bad Visualisation

  • No 3D pie charts.
  • No bar charts with different Y-axis scales.
  • No line charts with 30 overlapping series.

Better approach:

  • Use consistent colours
  • Label your axes
  • Keep it simple
  • Remove clutter

Make charts that explain, not confuse.
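
As a minimal sketch of “keep it simple” (the sales figures are invented): one series, labelled axes, a title, and nothing else.

import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
sales = [120, 135, 150, 145, 170, 180]  # invented figures for illustration

fig, ax = plt.subplots()
ax.plot(months, sales, color='tab:blue', marker='o')
ax.set_xlabel('Month')
ax.set_ylabel('Sales (units)')
ax.set_title('Monthly Sales')

# Remove clutter: drop the top and right spines
for side in ('top', 'right'):
    ax.spines[side].set_visible(False)

plt.show()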

Final Thoughts

These mistakes aren’t advanced. They’re basic.

The best way to avoid them? Slow down. Check your work. Validate your logic. Use the simplest tools that solve the problem, and understand the context before writing a single line of code.

If you’re building something important, get these things right first.
