Tags: NumPy · Pandas · Matplotlib · Scikit-learn · Python · Data Science

The Python Data Science Stack: NumPy, Pandas, Matplotlib, and Scikit-learn

Four libraries. One stack. The reason nearly every data science workflow in Python starts with the same four imports — and where each one earns its place or shows its limits.

April 3, 2026 · 5 min read

Four libraries. One stack. You will find the same four imports at the top of notebooks from intro courses, production pipelines, and Kaggle grandmaster solutions alike. That level of consensus is rare in software, and it did not happen by accident.

NumPy, Pandas, Matplotlib, and Scikit-learn emerged at different times, solved different problems, and were built by different people. They became a stack because each one filled a gap the others left open, and together they cover most of what classical data science actually requires.

NumPy: The Foundation Everything Else Is Built On

NumPy is a numerical computation library centered on one idea: the n-dimensional array (ndarray). Vectorized operations over ndarrays are implemented in C, which means they are orders of magnitude faster than equivalent Python loops.

Every other library in this stack either depends on NumPy directly or mirrors its array conventions. When Pandas gives you a column, it is by default backed by a NumPy array. When Scikit-learn fits a model, it is doing linear algebra on NumPy arrays. NumPy is the substrate.

import numpy as np

# Vectorized operation — no explicit loop
prices = np.array([10.5, 12.0, 9.8, 11.3, 14.1])
normalized = (prices - prices.mean()) / prices.std()
print(normalized)
# [-0.70  0.31 -1.18 -0.16  1.73] (approximate)
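
The substrate claim is easy to verify: a Pandas column hands back its underlying ndarray on request. A minimal sketch (zero-copy for plain NumPy-backed dtypes; extension dtypes may copy):

import pandas as pd

# By default a Series wraps a NumPy ndarray; .to_numpy() exposes it
s = pd.Series([10.5, 12.0, 9.8], name="price")
arr = s.to_numpy()
print(type(arr))   # <class 'numpy.ndarray'>
print(arr.dtype)   # float64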

Where it breaks: NumPy arrays have no notion of labeled columns, missing values, or mixed types. That is not a flaw — it is a deliberate constraint. The moment you need those things, you reach for Pandas.

Pandas: Tabular Data With Labels and Flexibility

Pandas gives you labeled, column-typed, null-aware tabular data via DataFrame and Series. It is the workhorse of data cleaning, transformation, and exploration.

import pandas as pd

df = pd.read_csv("sales.csv")
monthly = (
    df[df["status"] == "completed"]
    .groupby("month")["revenue"]
    .sum()
    .reset_index()
)
print(monthly.head())

The chaining style above is idiomatic. Most real preprocessing pipelines are 80% Pandas operations: null imputation, type casting, group aggregations, and merges.
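
A sketch of those four operations together, with the column names and the regions.csv lookup file assumed purely for illustration:

import pandas as pd

df = pd.read_csv("sales.csv")
regions = pd.read_csv("regions.csv")               # assumed lookup table

df["revenue"] = df["revenue"].fillna(0.0)          # null imputation
df["month"] = df["month"].astype("category")       # type casting
df = df.merge(regions, on="store_id", how="left")  # merge (store_id assumed)
by_region = df.groupby("region")["revenue"].sum()  # group aggregation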

Where it breaks: Pandas is built for a single machine, and when your dataset stops fitting in memory the DataFrame abstraction starts fighting you. Polars handles larger-than-memory workloads better on a single machine; Spark handles them at cluster scale. Pandas is also not designed for time-series signal processing or high-dimensional arrays — NumPy or domain-specific libraries are better there.
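
For a sense of the Polars alternative, here is the same monthly aggregation written as a lazy query. A sketch assuming a recent Polars release; the lazy scan lets the engine optimize the plan and push the filter down before reading the file:

import polars as pl

monthly = (
    pl.scan_csv("sales.csv")                  # lazy: nothing is read yet
    .filter(pl.col("status") == "completed")
    .group_by("month")
    .agg(pl.col("revenue").sum())
    .collect()                                # executes the optimized plan
)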

Matplotlib: Plotting That Does Exactly What You Tell It

Matplotlib is the lowest-level plotting library in the stack, and that is both its strength and its friction point. You can produce almost any chart type, with full control over every visual element. You can also spend 20 minutes adjusting tick label rotation.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
revenue = [42000, 47500, 51000, 49800, 55200]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(months, revenue, marker="o", linewidth=2)
ax.set_title("Monthly Revenue")
ax.set_ylabel("USD")
plt.tight_layout()
plt.savefig("revenue.png", dpi=150)

Seaborn wraps Matplotlib and makes statistical plots significantly less verbose. Plotly is better for interactive output. But when you need precise control over a publication-quality figure, Matplotlib is still the right tool.
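
As a point of comparison, here is roughly the same chart in Seaborn; a sketch reusing the data above, with most of the styling collapsed into one call:

import seaborn as sns
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
revenue = [42000, 47500, 51000, 49800, 55200]

sns.lineplot(x=months, y=revenue, marker="o")  # styling handled by Seaborn
plt.title("Monthly Revenue")
plt.savefig("revenue_sns.png", dpi=150)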

Where it breaks: The API is stateful and inconsistent in places — the plt.* interface and the ax.* interface do not always behave identically. For quick EDA, Pandas has .plot(), which calls Matplotlib under the hood and requires fewer lines. Use Matplotlib directly when the defaults are not enough.
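
A sketch of that quick-EDA path, with inline data standing in for the monthly frame from the Pandas section:

import pandas as pd

monthly = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May"],
    "revenue": [42000, 47500, 51000, 49800, 55200],
})

# .plot() builds a Matplotlib Axes under the hood
ax = monthly.plot(x="month", y="revenue", marker="o", figsize=(8, 4))
ax.figure.savefig("revenue_quick.png", dpi=150)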

Scikit-learn: Classical ML With a Consistent API

Scikit-learn's best feature is not any single algorithm — it is the API consistency. Every estimator exposes .fit(); predictors add .predict() and transformers add .transform(). That uniformity means you can swap a RandomForestClassifier for a LogisticRegression in one line, or chain preprocessing steps into a Pipeline that helps prevent train/test leakage automatically.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# X (feature matrix) and y (labels) are assumed to be prepared beforehand
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))

Pipeline is underused. Wrapping your scaler and model together prevents the common mistake of fitting the scaler on the full dataset before splitting.
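
A minimal sketch of that pattern, continuing from the X_train/X_test split above and assuming purely numeric features:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The scaler is fit only on training data, inside the pipeline,
# so the scaling statistics never see the test set
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))

Swapping in RandomForestClassifier is a one-line change to the "model" step — exactly the uniformity the section opened with.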

Where it breaks: Scikit-learn is classical ML — it does not do deep learning, and it has no native GPU acceleration. For gradient-boosted trees at scale, XGBoost and LightGBM have largely displaced Scikit-learn's own GradientBoostingClassifier on performance. And its time-series cross-validation support is limited to a few specialized splitters.
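
The main splitter that does exist, TimeSeriesSplit, yields expanding-window folds where every test fold comes strictly after its training data. A minimal sketch, reusing the X from the split above:

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Each training window ends before its test window begins
    assert train_idx.max() < test_idx.min()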

The Honest View of the Stack

The four-library stack is genuinely good at what it was designed for: structured tabular data, classical ML algorithms, and exploratory analysis. It is well-documented, stable, and the hiring pool for it is enormous.

The cracks show at the edges. Memory constraints with Pandas. No GPU path in Scikit-learn. Matplotlib verbosity. NumPy's lack of lazy evaluation. None of these are fatal — they are known limits that push you toward specialized tools when you hit them.

For most analytical and ML tasks that are not deep learning, this stack is still the right starting point. Not because there are no better alternatives for specific problems, but because the ecosystem, documentation, and community around these four libraries have thirty years of compounded investment behind them.

