Ten Data Science Mistakes (and How to Avoid Them)
There’s a lot of noise around tools, frameworks, and model types. But most data science failures aren’t technical; they’re basic. You can avoid them by getting a few fundamentals right.
This post covers ten common mistakes people make when building data-driven solutions, and includes working Python examples so you can test them yourself (or show them to your colleagues when they ignore you).
1. Ignoring Domain Knowledge
You can’t model what you don’t understand.
Without knowing how the business works, you won’t know what the data represents, which features are important, or what the model should optimise. Worse, you might solve the wrong problem entirely.
2. Overvaluing Tools
Python. R. SQL. Spark. Tableau. Whatever. None of these matter if you can’t ask the right questions or structure the problem clearly.
Too many teams spend time debating tech stacks instead of thinking about outcomes. The algorithms underneath haven’t changed that much in decades — logistic regression is still logistic regression.
Use the tools you know. Solve the problem first. Optimise later.
3. Messy Joins and Dirty Data
Joining tables without understanding the data relationships is a fast way to create duplicates, nulls, or broken logic.
Here’s a Python example showing how a bad join can silently duplicate rows:
import pandas as pd
# Orders: one row per order
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 1, 2],
    'order_value': [100, 150, 200]
})
# Customers: multiple addresses per customer
customers = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'address_type': ['home', 'work', 'home'],
    'region': ['North', 'North', 'South']
})
# Join to add region (but accidentally duplicate orders)
merged = pd.merge(orders, customers, on='customer_id', how='left')
print("Merged Data:")
print(merged)
# Try to sum order value per customer
summary = merged.groupby('customer_id')['order_value'].sum()
print("\nIncorrect Total Order Value:")
print(summary)
Merged Data:

|   | order_id | customer_id | order_value | address_type | region |
|---|---|---|---|---|---|
| 0 | 101 | 1 | 100 | home | North |
| 1 | 101 | 1 | 100 | work | North |
| 2 | 102 | 1 | 150 | home | North |
| 3 | 102 | 1 | 150 | work | North |
| 4 | 103 | 2 | 200 | home | South |

Incorrect Total Order Value:

| customer_id | order_value |
|---|---|
| 1 | 500 |
| 2 | 200 |
Fix: Deduplicate Before Joining
import pandas as pd
# Orders: one row per order
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 1, 2],
    'order_value': [100, 150, 200]
})
# Customers: multiple addresses per customer
customers = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'address_type': ['home', 'work', 'home'],
    'region': ['North', 'North', 'South']
})
# Fix: only keep one row per customer
customers_deduped = customers.drop_duplicates(subset='customer_id')
merged_clean = pd.merge(orders, customers_deduped, on='customer_id', how='left')
summary_clean = merged_clean.groupby('customer_id')['order_value'].sum()
print("\nCorrect Total Order Value:")
print(summary_clean)
Correct Total Order Value:

| customer_id | order_value |
|---|---|
| 1 | 250 |
| 2 | 200 |
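pandas can also catch this class of bug at merge time. Here’s a small sketch, reusing the DataFrames from the snippets above, that passes the validate argument to pd.merge so the join fails loudly instead of silently duplicating rows:

# validate='many_to_one' asserts that customer_id is unique on the
# customers side; pandas raises MergeError if it isn't.
try:
    pd.merge(orders, customers, on='customer_id', how='left', validate='many_to_one')
except pd.errors.MergeError as e:
    print("Join rejected:", e)
# The deduplicated table passes the same check without raising.
pd.merge(orders, customers_deduped, on='customer_id', how='left', validate='many_to_one')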
4. Confusing Correlation with Causation
“A and B are correlated, so A must cause B.”
Here’s a simple (fake) example:
import pandas as pd
df = pd.DataFrame({
    'ice_cream_sales': [100, 150, 200, 250],
    'sunburns': [10, 20, 30, 40]
})
correlation = df.corr()
print(correlation)
This shows a perfect correlation of 1.0! But no, selling Cornettos doesn’t cause sunburns. The weather does. Always ask what external factors might be driving your relationships.
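To make the hidden driver explicit, here’s a small sketch with made-up temperature figures added as a third column; both series track temperature, which is what actually links them:

import pandas as pd
# Hypothetical confounder: hot weather drives both series
df = pd.DataFrame({
    'temperature': [20, 25, 30, 35],
    'ice_cream_sales': [100, 150, 200, 250],
    'sunburns': [10, 20, 30, 40]
})
# All three columns correlate strongly; temperature explains the
# apparent sales-to-sunburn link.
print(df.corr())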
5. Creating Redundant Features
Highly correlated features add no new information to a model; they just inflate coefficient variance and make the results harder to interpret.
import pandas as pd
df = pd.DataFrame({
    'income_monthly': [2000, 3000, 4000],
    'income_yearly': [24000, 36000, 48000]  # Just monthly * 12
})
print(df.corr())
The correlation will be 1.0… they’re the same signal. That’s not helpful to a model.
|   | income_monthly | income_yearly |
|---|---|---|
| income_monthly | 1.0 | 1.0 |
| income_yearly | 1.0 | 1.0 |
Fix: drop or combine them before training.
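One way to do that programmatically, sketched below with an arbitrary 0.95 threshold: look at the upper triangle of the correlation matrix and drop one column from each highly correlated pair:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'income_monthly': [2000, 3000, 4000],
    'income_yearly': [24000, 36000, 48000]
})

# Keep only the upper triangle so each pair is considered once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag any column correlated above the threshold with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print("Dropping:", to_drop)
print(df.drop(columns=to_drop))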
6. Skipping Normalisation
Distance-based models (like K-means, k-NN, or SVM with RBF kernel) are sensitive to scale.
Fix: apply StandardScaler before fitting the model.
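A minimal sketch of what that looks like with k-means on made-up data, where one feature (income) would otherwise dominate every distance calculation:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two features on very different scales: income in pounds, age in years
X = np.column_stack([
    rng.normal(50000, 15000, 200),  # income
    rng.normal(40, 12, 200)         # age
])

# Unscaled: income dominates the distance metric, age is effectively ignored
labels_raw = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Scaled: both features contribute equally to the clustering
X_scaled = StandardScaler().fit_transform(X)
labels_scaled = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

print("Cluster sizes (raw):   ", np.bincount(labels_raw))
print("Cluster sizes (scaled):", np.bincount(labels_scaled))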
7. Not Validating
It’s not enough to train and test on the same data and call it a day. Your model must be evaluated on unseen data.
Here’s how to do a simple train/test split:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
X = np.random.rand(100, 1)
y = 3 * X.flatten() + np.random.randn(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)
print("Train MSE:", mean_squared_error(y_train, train_predictions))
print("Test MSE:", mean_squared_error(y_test, test_predictions))
Always track training vs testing error. Big gaps suggest overfitting.
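For a steadier estimate than a single split, k-fold cross-validation averages the error over several held-out folds. A short sketch, reusing X and y from above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: each fold is held out once for evaluation
scores = cross_val_score(LinearRegression(), X, y,
                         scoring='neg_mean_squared_error', cv=5)
print("Cross-validated MSE:", -scores.mean())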
8. Using the Wrong Metric
Don’t blindly optimise for accuracy — especially on imbalanced datasets.
Here’s why:
from sklearn.metrics import accuracy_score, precision_score, recall_score
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, zero_division=0))
print("Recall:", recall_score(y_true, y_pred))
Accuracy: 0.8
Precision: 0.0
Recall: 0.0
It “predicts correctly” 80% of the time, but it never spots the positive case. Metrics should reflect the actual problem you’re solving.
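If missing the positive class is what actually costs you, report a metric that punishes it. A quick sketch, reusing y_true and y_pred from above, with F1 and the full per-class report:

from sklearn.metrics import f1_score, classification_report

# F1 balances precision and recall, so the "always predict 0" model scores 0
print("F1:", f1_score(y_true, y_pred, zero_division=0))
print(classification_report(y_true, y_pred, zero_division=0))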
9. Overfitting with Too Much Model Complexity
If you add enough parameters, you can fit anything… including noise.
Here’s a regression example using PolynomialFeatures:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).flatten() + np.random.normal(0, 0.1, 100)
model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
model.fit(X, y)
plt.scatter(X, y, label='Data')
plt.plot(X, model.predict(X), color='r', label='Degree 9 Fit')
plt.legend()
plt.show()
Looks great, until you test it elsewhere. Overfit models don’t generalise.
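To see that failure, evaluate the same fitted model just outside the range it was trained on. A short sketch reusing model, X, and y from above:

from sklearn.metrics import mean_squared_error

# Points just beyond the training range [0, 10]
X_new = np.linspace(10, 12, 20).reshape(-1, 1)
y_new = np.sin(X_new).flatten()

print("MSE inside training range: ", mean_squared_error(y, model.predict(X)))
print("MSE outside training range:", mean_squared_error(y_new, model.predict(X_new)))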
10. Bad Visualisation
- No 3D pie charts.
- No bar charts with different Y-axis scales.
- No line charts with 30 overlapping series.

Better approach:

- Use consistent colours
- Label your axes
- Keep it simple
- Remove clutter
Make charts that explain, not confuse.
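As a baseline, here’s a minimal matplotlib sketch that follows those rules (the revenue figures are invented for illustration):

import matplotlib.pyplot as plt

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
revenue = [120, 135, 128, 150, 162, 170]  # made-up figures

fig, ax = plt.subplots()
ax.plot(months, revenue, color='steelblue', marker='o')
ax.set_xlabel('Month')
ax.set_ylabel('Revenue (£k)')
ax.set_title('Monthly Revenue')
# Remove clutter: drop the top and right box lines
for side in ('top', 'right'):
    ax.spines[side].set_visible(False)
plt.show()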
Final Thoughts
These mistakes aren’t advanced. They’re basic.
The best way to avoid them? Slow down. Check your work. Validate your logic. Use the simplest tools that solve the problem, and understand the context before writing a single line of code.
If you’re building something important, get these things right first.