NumPy, Pandas, Data Cleaning, Visualization (Matplotlib/Seaborn), Statistical Analysis, Feature Engineering, Model Evaluation — data science mastery.
NumPy is the fundamental package for numerical computing in Python. It provides powerful N-dimensional array objects, broadcasting, and vectorized operations that form the backbone of data science.
import numpy as np
# ── Array Creation ──
a = np.array([1, 2, 3, 4, 5]) # 1D array from list
b = np.zeros((3, 4)) # 3x4 array of zeros
c = np.ones((2, 3)) # 2x3 array of ones
d = np.full((3, 3), 7.0) # 3x3 array filled with 7.0
e = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
f = np.linspace(0, 1, 5) # [0.0, 0.25, 0.5, 0.75, 1.0]
g = np.random.randn(3, 3) # 3x3 standard normal distribution
h = np.random.randint(0, 10, (2, 5)) # 2x5 random integers [0, 10)
i = np.eye(4) # 4x4 identity matrix
np.random.seed(42) # Set random seed for reproducibility (returns None, so no assignment)
# ── Array Properties ──
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr.ndim # 2 (number of dimensions)
arr.shape # (2, 3) (rows, columns)
arr.size # 6 (total elements)
arr.dtype # int64 (data type)
arr.itemsize # 8 (bytes per element)
arr.nbytes # 48 (total bytes)

| Operation | Code | Result | Description |
|---|---|---|---|
| Single element | arr[0, 1] | 2 | Row 0, Column 1 |
| Row slice | arr[0, :] | [1, 2, 3] | Entire first row |
| Column slice | arr[:, 1] | [2, 5] | Entire second column |
| Negative index | arr[-1, -1] | 6 | Last element |
| Fancy indexing | arr[[0, 1], [2, 0]] | [3, 4] | Elements (0,2) and (1,0) |
| Boolean mask | arr[arr > 3] | [4, 5, 6] | Elements greater than 3 |
| Where | np.where(arr > 3, arr, 0) | [[0,0,0],[4,5,6]] | Replace with condition |
| Reshape | arr.reshape(3, 2) | 3x2 array | Change shape (must match size) |
| Transpose | arr.T | 3x2 array | Swap rows and columns |
| Flatten | arr.flatten() | [1,2,3,4,5,6] | 1D copy of array |
# ── Vectorized Operations (no Python loops needed) ──
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])
a + b # [11, 22, 33, 44, 55] element-wise addition
a * b # [10, 40, 90, 160, 250] element-wise multiplication
a ** 2 # [1, 4, 9, 16, 25] element-wise power
a > 3 # [False, False, False, True, True]
np.dot(a, b) # 550 dot product
np.cross(a[:3], b[:3]) # [0, 0, 0] cross product (b is a scalar multiple of a)
# ── Aggregation ──
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr.sum() # 21
arr.sum(axis=0) # [5, 7, 9] column sums
arr.sum(axis=1) # [6, 15] row sums
arr.mean() # 3.5
arr.std() # 1.7078
arr.min(), arr.max() # 1, 6
arr.argmax() # 5 (flat index of max)
arr.cumsum() # [1, 3, 6, 10, 15, 21]
# ── Linear Algebra ──
A = np.array([[1, 2], [3, 4]])
A_inv = np.linalg.inv(A) # inverse
det = np.linalg.det(A) # determinant: -2.0
eigvals, eigvecs = np.linalg.eig(A) # eigenvalues/vectors
U, S, Vt = np.linalg.svd(A) # SVD decomposition
np.linalg.solve(A, np.array([5, 6])) # solve Ax = b (b must match A's row count)
# ── Broadcasting Rules ──
# Shapes are compatible when dimensions are equal or one of them is 1
a = np.array([[1], [2], [3]]) # shape (3, 1)
b = np.array([10, 20, 30]) # shape (3,)
a + b # shape (3, 3) — a broadcast along columns

Pandas provides fast, flexible, and expressive data structures designed to make working with structured (tabular) data intuitive. The DataFrame is the primary data structure for data manipulation.
import pandas as pd
import numpy as np
# ── Creating DataFrames ──
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'salary': [50000, 60000, 70000, 55000],
'department': ['Engineering', 'Sales', 'Engineering', 'Marketing']
})
# From numpy array
df = pd.DataFrame(np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
# From CSV / Excel / SQL / JSON
df = pd.read_csv('data.csv', parse_dates=['date'], encoding='utf-8')
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df = pd.read_sql('SELECT * FROM users', conn)
df = pd.read_json('data.json')
# ── Inspecting Data ──
df.head(10) # First 10 rows
df.tail(5) # Last 5 rows
df.info() # Data types, non-null counts, memory usage
df.describe() # Summary statistics for numeric columns
df.shape # (rows, columns)
df.columns # Column names
df.dtypes # Data types per column
df.memory_usage(deep=True) # Memory usage per column
df.value_counts('department') # Frequency count
df.nunique() # Unique values per column

| Operation | Code | Description |
|---|---|---|
| Select column | df['name'] or df.name | Returns Series |
| Select multiple columns | df[['name', 'age']] | Returns DataFrame |
| Row by label | df.loc[0] | Single row by index label |
| Row by position | df.iloc[0] | Single row by integer position |
| Slice rows | df.iloc[0:5, 1:3] | Rows 0-4, cols 1-2 |
| Boolean filter | df[df['age'] > 30] | Rows where age > 30 |
| Multiple conditions | df[(df['age'] > 25) & (df['salary'] > 55000)] | AND: use & not and |
| OR condition | df[(df['dept'] == 'Eng') \| (df['dept'] == 'Sales')] | OR: use \| not or |
| String contains | df[df['name'].str.contains('Ali')] | Filter by substring |
| isin | df[df['dept'].isin(['Eng', 'Sales'])] | Membership filter |
| Query | df.query('age > 30 and salary > 55000') | SQL-like filtering |
| nlargest | df.nlargest(5, 'salary') | Top 5 by salary |
# ── Column Operations ──
df['bonus'] = df['salary'] * 0.1 # New column
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 35, 100],
labels=['young', 'mid', 'senior'])
df.drop('bonus', axis=1, inplace=True) # Drop column
df.rename(columns={'name': 'full_name'}, inplace=True)
# ── Sorting ──
df.sort_values('salary', ascending=False, inplace=True)
df.sort_values(['department', 'age'], ascending=[True, False])
# ── Grouping & Aggregation ──
df.groupby('department')['salary'].mean()
df.groupby('department').agg({
'salary': ['mean', 'median', 'std', 'min', 'max'],
'age': 'mean'
})
df.pivot_table(values='salary', index='department',
columns='age_group', aggfunc='mean', fill_value=0)
# ── Merging & Joining ──
pd.merge(df1, df2, on='id', how='left') # Left join
pd.merge(df1, df2, left_on='id', right_on='user_id', how='inner')
pd.concat([df1, df2], axis=0, ignore_index=True) # Stack vertically
pd.concat([df1, df2], axis=1) # Stack horizontally
# ── Apply & Transform ──
df['salary_category'] = df['salary'].apply(
lambda x: 'high' if x > 65000 else 'medium' if x > 50000 else 'low'
)
df['salary_zscore'] = (df['salary'] - df['salary'].mean()) / df['salary'].std()
# ── Window Functions ──
df['rolling_avg'] = df['salary'].rolling(window=3).mean()
df['rank'] = df.groupby('department')['salary'].rank(ascending=False)
df['pct_change'] = df['salary'].pct_change()
df['cumsum'] = df['salary'].cumsum()

Data cleaning is often said to consume 80% of a data scientist's time. Proper handling of missing values, duplicates, outliers, and inconsistent formatting is critical for reliable analysis.
import pandas as pd
import numpy as np
# ── Missing Value Detection ──
df.isnull().sum() # Count missing values per column
df.isnull().mean() * 100 # Percentage missing per column
df.isnull().sum().sum() # Total missing values
df[df.isnull().any(axis=1)] # Rows with any missing value
df.isnull().any() # Columns with any missing value
# ── Missing Value Treatment ──
# Drop rows/columns
df.dropna(subset=['column_name'], inplace=True) # Drop rows with NaN in col
df.dropna(thresh=5, axis=1) # Drop cols with fewer than 5 non-null
df.dropna(how='all') # Drop rows where ALL values are NaN
# Fill missing values
df['age'] = df['age'].fillna(df['age'].median()) # Median imputation
df['salary'] = df['salary'].fillna(df['salary'].mean()) # Mean imputation
df['department'] = df['department'].fillna('Unknown') # Constant fill
df.ffill() # Forward fill (use previous value; fillna(method=) is deprecated)
df.bfill() # Backward fill (use next value)
df.interpolate(method='linear') # Linear interpolation
# ── Advanced Imputation (sklearn) ──
from sklearn.impute import SimpleImputer, KNNImputer
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')
knn_imputer = KNNImputer(n_neighbors=5)
# ── Duplicate Handling ──
df.duplicated() # Boolean mask of duplicates
df.duplicated(subset=['email']) # Duplicates based on column
df.drop_duplicates(subset=['email'], keep='last', inplace=True)
# ── String Cleaning ──
df['name'] = df['name'].str.strip() # Remove whitespace
df['name'] = df['name'].str.lower() # Lowercase
df['name'] = df['name'].str.title() # Title case
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True) # Keep digits
df['email'] = df['email'].str.lower().str.strip()
# ── IQR Method ──
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)]
# ── Z-Score Method ──
from scipy import stats
z_scores = np.abs(stats.zscore(df['salary']))
outliers = df[z_scores > 3] # Beyond 3 standard deviations
# ── Treatment Options ──
# 1. Remove outliers
df = df[~((df['salary'] < lower_bound) | (df['salary'] > upper_bound))]
# 2. Cap at bounds (Winsorization)
df['salary'] = df['salary'].clip(lower_bound, upper_bound)
# 3. Log transform to reduce skew
df['salary'] = np.log1p(df['salary'])

| Method | Example | Use Case |
|---|---|---|
| astype() | df['age'] = df['age'].astype(int) | Explicit type conversion |
| to_numeric() | pd.to_numeric(df['price'], errors='coerce') | Convert with error handling |
| to_datetime() | pd.to_datetime(df['date'], format='%Y-%m-%d') | Parse dates |
| astype('category') | df['dept'] = df['dept'].astype('category') | Reduce memory for repeated strings |
| pd.CategoricalDtype() | pd.CategoricalDtype(['low','mid','high'], ordered=True) | Ordered categories |
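The conversions above can be sketched on a small hypothetical frame (the column names and values here are illustrative, not from a real dataset):

```python
import pandas as pd

# Hypothetical messy data to illustrate the conversion methods
df = pd.DataFrame({
    'age': ['25', '30', '35'],
    'price': ['19.99', 'N/A', '5.00'],
    'date': ['2024-01-15', '2024-02-20', '2024-03-25'],
    'dept': ['Eng', 'Sales', 'Eng'],
})
df['age'] = df['age'].astype(int)                           # explicit cast
df['price'] = pd.to_numeric(df['price'], errors='coerce')   # 'N/A' becomes NaN
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')  # parse dates
df['dept'] = df['dept'].astype('category')                  # memory-efficient
print(df.dtypes)
```

Note that `errors='coerce'` silently turns unparseable values into NaN, so follow it with a missing-value check.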
Visualization is essential for EDA, finding patterns, and communicating results. Matplotlib provides low-level control; Seaborn provides statistical aesthetics; Plotly adds interactivity.
import matplotlib.pyplot as plt
import numpy as np
# ── Basic Plot Setup ──
plt.figure(figsize=(12, 6)) # Set figure size
plt.style.use('seaborn-v0_8-whitegrid') # Set style
fig, axes = plt.subplots(2, 2, figsize=(12, 10)) # Multi-subplot
# ── Line Plot ──
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label='sin(x)', color='blue', linewidth=2)
plt.plot(x, np.cos(x), label='cos(x)', color='red', linestyle='--')
plt.title('Trigonometric Functions')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('plot.png', dpi=300, bbox_inches='tight')
plt.show()
# ── Bar Chart ──
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 56, 78]
plt.bar(categories, values, color=['#4C72B0', '#55A868', '#C44E52', '#8172B2'])
plt.title('Category Values')
plt.ylabel('Count')
# ── Histogram ──
plt.hist(df['salary'], bins=30, color='skyblue', edgecolor='black', alpha=0.7)
plt.axvline(df['salary'].mean(), color='red', linestyle='--', label=f'Mean: {df["salary"].mean():.0f}')
plt.title('Salary Distribution')
plt.legend()
# ── Scatter Plot ──
plt.scatter(df['age'], df['salary'], c=df['department'].astype('category').cat.codes,
cmap='viridis', alpha=0.6, s=100)
plt.colorbar(label='Department')
plt.xlabel('Age')
plt.ylabel('Salary')

import seaborn as sns
import matplotlib.pyplot as plt
# ── Statistical Plots ──
# Distribution
sns.histplot(data=df, x='salary', kde=True, hue='department', bins=30)
# Box Plot (outliers & quartiles)
sns.boxplot(data=df, x='department', y='salary', palette='Set2')
# Violin Plot (distribution + box plot)
sns.violinplot(data=df, x='department', y='salary', inner='box', palette='muted')
# ── Relationship Plots ──
# Scatter with regression line
sns.regplot(data=df, x='age', y='salary', scatter_kws={'alpha': 0.5})
# Pair plot (all numeric relationships)
sns.pairplot(df[['age', 'salary', 'experience']], hue='department')
# ── Heatmap (Correlation Matrix) ──
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0,
square=True, linewidths=0.5)
# ── Categorical Plots ──
sns.countplot(data=df, x='department', order=df['department'].value_counts().index)
sns.barplot(data=df, x='department', y='salary', estimator='mean', errorbar='sd')
# ── Combined Figure ──
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
sns.histplot(data=df, x='salary', ax=axes[0, 0], kde=True)
sns.boxplot(data=df, x='department', y='salary', ax=axes[0, 1])
sns.scatterplot(data=df, x='age', y='salary', hue='department', ax=axes[1, 0])
sns.countplot(data=df, x='department', ax=axes[1, 1])
plt.tight_layout()
plt.savefig('eda_dashboard.png', dpi=300)

| Goal | Best Chart | Library | When to Use |
|---|---|---|---|
| Compare categories | Bar chart | sns.barplot / sns.countplot | Categorical vs numeric |
| Distribution of 1 variable | Histogram / KDE | sns.histplot / sns.kdeplot | Continuous data distribution |
| Distribution by category | Box plot / Violin | sns.boxplot / sns.violinplot | Compare distributions across groups |
| Relationship between 2 vars | Scatter plot | sns.scatterplot / sns.regplot | Correlation analysis |
| Correlation matrix | Heatmap | sns.heatmap | Multi-variable relationships |
| Time series | Line chart | plt.plot / sns.lineplot | Trends over time |
| Composition | Stacked bar / Pie | plt.pie / stacked bar | Parts of a whole |
| Multi-dimensional | Pair plot / Parallel | sns.pairplot | Explore many variables at once |
Statistical analysis provides the mathematical foundation for drawing conclusions from data. Understanding distributions, hypothesis testing, and correlation is essential for any data scientist.
import scipy.stats as stats
import numpy as np
from scipy.stats import norm, ttest_ind, chi2_contingency, f_oneway
# ── Descriptive Statistics ──
data = np.array([23, 25, 28, 30, 32, 35, 38, 40, 42, 45])
print(f"Mean: {data.mean():.2f}") # 33.80
print(f"Median: {np.median(data):.2f}") # 33.50
print(f"Std: {data.std():.2f}") # 7.04 (population std, ddof=0)
print(f"Variance: {data.var():.2f}") # 49.56
print(f"Skewness: {stats.skew(data):.2f}") # Asymmetry
print(f"Kurtosis: {stats.kurtosis(data):.2f}") # Tailedness
# ── Normal Distribution ──
# PDF at x=0 for standard normal
pdf_val = norm.pdf(0, 0, 1) # 0.3989
# P(X <= 1.96) for standard normal
cdf_val = norm.cdf(1.96, 0, 1) # 0.9750
# Value at 95th percentile
percentile = norm.ppf(0.95, 0, 1) # 1.6449
# ── Hypothesis Testing ──
group_a = np.random.normal(100, 15, 50)
group_b = np.random.normal(105, 15, 50)
# Independent t-test (compare two group means)
t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")
# If p < 0.05, reject null hypothesis
# Paired t-test (before/after on same subjects)
t_stat_paired, p_paired = stats.ttest_rel(before, after)
# ── ANOVA (compare 3+ group means) ──
group1 = np.random.normal(100, 15, 30)
group2 = np.random.normal(105, 15, 30)
group3 = np.random.normal(110, 15, 30)
f_stat, p_value = f_oneway(group1, group2, group3)
# ── Chi-Square Test (categorical data) ──
observed = np.array([[50, 30, 20], [30, 40, 30]])
chi2, p_val, dof, expected = chi2_contingency(observed)
# ── Correlation Tests ──
# Pearson (linear relationship, both continuous)
r, p = stats.pearsonr(df['age'], df['salary'])
# Spearman (monotonic relationship, ordinal OK)
rho, p = stats.spearmanr(df['experience'], df['salary'])
# Kendall (rank-based, small samples)
tau, p = stats.kendalltau(df['rank'], df['score'])

| Question | Test | Data Type | Assumptions | Python Function |
|---|---|---|---|---|
| One mean vs. value | One-sample t-test | Continuous | Normality | stats.ttest_1samp(data, mu0) |
| Two independent means | Independent t-test | Continuous, 2 groups | Normality, equal variance | stats.ttest_ind(g1, g2) |
| Two paired means | Paired t-test | Continuous, paired | Normality of differences | stats.ttest_rel(before, after) |
| 3+ group means | One-way ANOVA | Continuous, 3+ groups | Normality, equal variance | stats.f_oneway(g1, g2, g3) |
| Two categorical vars | Chi-square test | Categorical | Expected counts >= 5 | stats.chi2_contingency(obs) |
| Correlation | Pearson | Continuous | Linearity, normality | stats.pearsonr(x, y) |
| Rank correlation | Spearman | Ordinal/continuous | Monotonic relationship | stats.spearmanr(x, y) |
| Non-normal 2 means | Mann-Whitney U | Continuous/ordinal | Independent samples | stats.mannwhitneyu(g1, g2) |
| Distribution test | Shapiro-Wilk | Continuous | Sample size < 5000 | stats.shapiro(data) |
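A common workflow is to run a normality check first and fall back to a rank-based test when it fails; a minimal sketch using synthetic skewed data (the exponential samples are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
g1 = rng.exponential(scale=2.0, size=40)   # skewed, likely non-normal
g2 = rng.exponential(scale=3.0, size=40)

# Shapiro-Wilk: small p suggests departure from normality
_, p_norm = stats.shapiro(g1)
if p_norm < 0.05:
    # Non-normal data: use rank-based Mann-Whitney U instead of a t-test
    u_stat, p_value = stats.mannwhitneyu(g1, g2)
else:
    t_stat, p_value = stats.ttest_ind(g1, g2)
print(f"p = {p_value:.4f}")
```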
Feature engineering is the art of creating informative features from raw data. Well-engineered features often matter more than the choice of algorithm. This section covers the most important techniques.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
# ── Numerical Feature Transformations ──
scaler = StandardScaler() # Mean=0, Std=1 (sensitive to outliers)
minmax = MinMaxScaler() # [0, 1] range
robust = RobustScaler() # Median=0, IQR=1 (robust to outliers)
X_scaled = scaler.fit_transform(df[['salary', 'age', 'experience']])
X_minmax = minmax.fit_transform(df[['salary', 'age', 'experience']])
# Log transform for skewed data
df['log_salary'] = np.log1p(df['salary']) # log(1 + x)
df['sqrt_salary'] = np.sqrt(df['salary'])
# Box-Cox transform (requires all positive values)
from scipy.stats import boxcox
df['boxcox_salary'], lam = boxcox(df['salary'] + 1)
# ── Categorical Encoding ──
# Label Encoding (ordinal relationship)
le = LabelEncoder()
df['dept_encoded'] = le.fit_transform(df['department'])
# One-Hot Encoding (no ordinal relationship)
df = pd.get_dummies(df, columns=['department'], drop_first=False)
# Ordinal Encoding (explicit ordering)
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df['priority_encoded'] = oe.fit_transform(df[['priority']])
# Target Encoding (risk of leakage — use with CV!)
from sklearn.model_selection import KFold
def target_encode(df, col, target, n_folds=5):
kf = KFold(n_splits=n_folds, shuffle=False)
encoded = pd.Series(index=df.index, dtype=float)
for train_idx, val_idx in kf.split(df):
means = df.iloc[train_idx].groupby(col)[target].mean()
encoded.iloc[val_idx] = df.iloc[val_idx][col].map(means)
return encoded
# ── Date/Time Features ──
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['date'].dt.dayofweek >= 5
df['quarter'] = df['date'].dt.quarter
df['hour'] = df['date'].dt.hour
df['days_since'] = (pd.Timestamp('now') - df['date']).dt.days

| Technique | Description | Code / Library | Best For |
|---|---|---|---|
| Polynomial Features | Create interaction and power terms | sklearn.preprocessing.PolynomialFeatures(degree=2) | Capturing non-linear relationships |
| Binning | Convert continuous to discrete | pd.cut() / pd.qcut() | Non-linear effects, age groups |
| Target Encoding | Replace category with target mean | Custom with KFold (see above) | High-cardinality categories |
| Frequency Encoding | Replace category with count | `df['col'].map(df['col'].value_counts())` | High-cardinality categories (no leakage) |
| Interaction Features | Product/ratio of two features | df['ratio'] = df['a'] / df['b'] | Combined effects |
| Text Features | TF-IDF, word count, embeddings | sklearn.feature_extraction.text.TfidfVectorizer | NLP feature extraction |
| Aggregation | Group statistics per entity | df.groupby('user_id')['amount'].agg(['mean','sum','count']) | User behavior features |
| Lag Features | Previous time step values | df['lag_1'] = df['value'].shift(1) | Time series prediction |
| Rolling Features | Moving window statistics | df['rolling_7'] = df['value'].rolling(7).mean() | Time series smoothing |
| PCA | Dimensionality reduction | sklearn.decomposition.PCA(n_components=0.95) | Reduce features, multicollinearity |
| Method | How It Works | Code | Pros / Cons |
|---|---|---|---|
| Variance Threshold | Drop features with low variance | VarianceThreshold(threshold=0.01) | Simple, fast; misses useful low-variance features |
| Correlation Filter | Remove highly correlated pairs | df.corr() and manual removal | Reduces redundancy; ignores target relationship |
| SelectKBest | Select top K features by statistical test | SelectKBest(f_classif, k=10) | Fast; univariate, ignores feature interactions |
| RFE | Recursively eliminate least important features | RFE(estimator, n_features_to_select=10) | Model-aware; computationally expensive |
| L1 (Lasso) | Linear model that drives weights to zero | Lasso(alpha=0.01) | Built-in selection; linear relationships only |
| Tree Importance | Feature importance from tree models | model.feature_importances_ | Handles non-linearity; can be biased toward high-cardinality features |
| SHAP Values | Game-theory based feature attribution | shap.Explainer(model) | Most interpretable; expensive for large datasets |
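A sketch contrasting a univariate filter (SelectKBest) with model-aware RFE, on synthetic data where only a few features carry signal:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 3 informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=42)

# Univariate: keep the top 3 features by ANOVA F-score (ignores interactions)
skb = SelectKBest(f_classif, k=3).fit(X, y)
print("SelectKBest:", np.where(skb.get_support())[0])

# Model-aware: recursively drop the least important feature per iteration
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print("RFE:", np.where(rfe.support_)[0])
```

The two methods often disagree at the margin; RFE is slower but accounts for how features work together inside the chosen model.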
Proper model evaluation prevents overfitting and ensures your model generalizes to unseen data. This section covers cross-validation, metrics, and evaluation strategies.
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.model_selection import KFold, StratifiedKFold, RepeatedStratifiedKFold
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix, classification_report,
mean_squared_error, mean_absolute_error, r2_score)
from sklearn.ensemble import RandomForestClassifier
import numpy as np
# ── Train/Test Split ──
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y # stratify for classification
)
# ── Cross-Validation ──
# K-Fold CV
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Stratified K-Fold (preserves class ratios)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='f1_macro')
# Repeated Stratified K-Fold (more robust estimate)
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
# ── Classification Metrics ──
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1] # Probability of positive class
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred, average='binary'):.4f}")
print(f"Recall: {recall_score(y_test, y_pred, average='binary'):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred, average='binary'):.4f}")
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"\n{classification_report(y_test, y_pred)}")
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
# [[TN FP]
# [FN TP]]

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import mean_absolute_percentage_error
import numpy as np
# ── Regression Metrics ──
y_pred = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}") # |y - y_hat| mean
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}") # (y - y_hat)^2 mean
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}") # sqrt(MSE)
print(f"R-squared: {r2_score(y_test, y_pred):.4f}") # 1 - SS_res/SS_tot
print(f"MAPE: {mean_absolute_percentage_error(y_test, y_pred):.2%}")
# ── Hyperparameter Tuning ──
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Grid Search (exhaustive)
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [5, 10, 15, None],
'min_samples_split': [2, 5, 10],
'max_features': ['sqrt', 'log2']
}
grid = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid, cv=5, scoring='f1', n_jobs=-1, verbose=1
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.4f}")
# Random Search (faster, samples randomly)
from scipy.stats import randint
param_dist = {
'n_estimators': randint(100, 500),
'max_depth': randint(3, 20),
'min_samples_split': randint(2, 20)
}
random_search = RandomizedSearchCV(
model, param_dist, n_iter=50, cv=5, scoring='f1',
random_state=42, n_jobs=-1
)

| Metric | Formula | Use When | Range |
|---|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classes | [0, 1] |
| Precision | TP / (TP + FP) | FP is costly (spam) | [0, 1] |
| Recall (Sensitivity) | TP / (TP + FN) | FN is costly (disease) | [0, 1] |
| Specificity | TN / (TN + FP) | True negative rate | [0, 1] |
| F1 Score | 2 * P * R / (P + R) | Balance P & R | [0, 1] |
| AUC-ROC | Area under ROC curve | Threshold-independent | [0, 1] |
| Log Loss | -mean(y*log(p) + (1-y)*log(1-p)) | Probabilistic predictions | [0, inf) |
| MAE | mean(\|y - y_hat\|) | Robust to outliers | [0, inf) |
| RMSE | sqrt(mean((y - y_hat)^2)) | Penalizes large errors | [0, inf) |
| R-squared | 1 - SS_res/SS_tot | Explained variance | (-inf, 1] |
| Adjusted R-squared | 1 - (1-R2)*(n-1)/(n-p-1) | Penalizes extra features | (-inf, 1] |
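Specificity and log loss from the table can be computed directly; a small sketch on made-up predictions (sklearn has no specificity scorer, so it is derived from the confusion matrix):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, log_loss

# Illustrative labels, hard predictions, and predicted probabilities
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.6, 0.9, 0.8, 0.4, 0.7, 0.3])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)          # TN / (TN + FP), as in the table
print(f"Specificity: {specificity:.4f}")
print(f"Log loss: {log_loss(y_true, y_prob):.4f}")
```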
The most frequently asked data science interview questions with concise, high-quality answers and code examples.
Answer: Supervised learning uses labeled data (input-output pairs) to learn a mapping function. The model is trained on known examples and evaluated on its ability to predict labels for new data. Examples: classification (spam detection), regression (price prediction).
Unsupervised learning works with unlabeled data to discover hidden patterns, structures, or groupings. Examples: clustering (customer segmentation), dimensionality reduction (PCA), anomaly detection.
The key difference is the presence of ground truth labels during training. Semi-supervised learning bridges both, using small labeled data + large unlabeled data.
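A minimal sketch of the contrast on the iris dataset: the classifier is given the labels, while KMeans sees only the features (the cluster count of 3 is an assumption we supply, not something the data reveals):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: labels y guide the fit
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Train accuracy:", clf.score(X, y))

# Unsupervised: KMeans sees only X and discovers groupings on its own
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster sizes:", np.bincount(km.labels_))
```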
Answer: Imbalanced data (e.g., 99% negative, 1% positive) causes models to be biased toward the majority class.
# 1. Resampling
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
# 2. Class weights (built into sklearn)
model = RandomForestClassifier(class_weight='balanced', random_state=42)
# Or custom: class_weight={'negative': 1, 'positive': 10}
# 3. Evaluation: Use F1, Precision-Recall AUC (NOT accuracy)
from sklearn.metrics import precision_recall_curve, auc
precision, recall, _ = precision_recall_curve(y_test, y_prob)
pr_auc = auc(recall, precision)

Answer: Bias is the error from oversimplifying assumptions (underfitting). Variance is the error from being too sensitive to training data fluctuations (overfitting).
High bias: Model is too simple (linear model for non-linear data). Solution: increase model complexity, add features, reduce regularization.
High variance: Model is too complex (deep decision tree). Solution: simplify model, add regularization, increase training data, use ensemble methods.
The tradeoff: as model complexity increases, bias decreases but variance increases. The optimal model minimizes total error = bias^2 + variance + irreducible noise.
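The tradeoff can be made visible by varying polynomial degree on synthetic data; the degrees used here (1, 4, 15) are illustrative choices for underfit, reasonable fit, and overfit:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 80))[:, None]
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 80)  # noisy sine
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 4, 15):   # high bias, balanced, high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          f"train MSE {mean_squared_error(y_tr, model.predict(X_tr)):.3f}",
          f"test MSE {mean_squared_error(y_te, model.predict(X_te)):.3f}")
```

Typically the degree-1 model has high train and test error (bias), while the degree-15 model has near-zero train error but inflated test error (variance).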
Answer: A/B testing is a controlled experiment to compare two versions (A=control, B=treatment) of a product/feature to determine which performs better on a metric.
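A common analysis for conversion-rate A/B tests is a two-proportion z-test; a sketch with made-up counts (4000 visitors per arm, conversion counts are illustrative):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical results: conversions out of visitors for control (A) and treatment (B)
conv = np.array([200, 236])
n = np.array([4000, 4000])
p = conv / n

p_pool = conv.sum() / n.sum()                        # pooled rate under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1/n[0] + 1/n[1]))
z = (p[1] - p[0]) / se                               # two-proportion z-statistic
p_value = 2 * (1 - norm.cdf(abs(z)))                 # two-sided p-value
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

Before running the test, fix the sample size via a power analysis and avoid peeking at intermediate results, which inflates the false-positive rate.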
Answer: As the number of features increases, the volume of the feature space grows exponentially. Data becomes sparse, distances become less meaningful, and models require exponentially more data. This leads to overfitting and poor generalization.
Solutions: (1) Feature selection — remove irrelevant features using correlation, mutual information, or model-based importance. (2) Dimensionality reduction — PCA, t-SNE, UMAP. (3) Regularization — L1/L2 penalties. (4) Domain knowledge — keep only meaningful features. (5) Autoencoders — learn compressed representations.
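A sketch of solution (2): PCA on synthetic 100-dimensional data that actually lies near a 5-dimensional subspace (the dimensions and noise level are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# 100 observed features generated from only 5 latent directions plus tiny noise
latent = rng.normal(size=(500, 5))
W = rng.normal(size=(5, 100))
X = latent @ W + 0.01 * rng.normal(size=(500, 100))

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95).fit(X)
print("Components kept:", pca.n_components_)   # expected: far fewer than 100
```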
Answer: Correlation measures the statistical relationship between two variables (they move together). Causation means one variable directly influences another. Correlation does not imply causation.
A classic example: ice cream sales and drowning rates are correlated, but the cause is summer heat (confounding variable).
Methods to establish causation: Randomized controlled trials (gold standard), instrumental variables, difference-in-differences, regression discontinuity, and causal inference frameworks (do-calculus, causal DAGs).
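The ice cream example can be simulated: both series are driven by a common "heat" variable, so they correlate strongly despite having no direct link (all values synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
heat = rng.normal(size=1000)                      # confounder: temperature
ice_cream = heat + rng.normal(scale=0.5, size=1000)
drownings = heat + rng.normal(scale=0.5, size=1000)

# Strong correlation arises purely through the shared cause
r = np.corrcoef(ice_cream, drownings)[0, 1]
print(f"r = {r:.2f}")
```

Conditioning on the confounder (e.g. regressing both on `heat` and correlating the residuals) makes the spurious association largely disappear.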
Answer: Type I error (false positive) = rejecting H0 when it is true. Probability = alpha (typically 0.05). Example: convicting an innocent person.
Type II error (false negative) = failing to reject H0 when it is false. Probability = beta. Power = 1 - beta. Example: letting a guilty person go free.
Tradeoff: Decreasing alpha (strict significance) increases beta (more false negatives). The balance depends on the cost of each error. In medical testing: a false negative (missing a disease) is worse than a false positive (unnecessary follow-up test).
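Both error rates can be estimated by simulation; this sketch repeatedly runs t-tests under a true null and under a false null (the 15-point shift and sample sizes are illustrative):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
alpha, n_sims = 0.05, 2000
rejections_h0 = rejections_h1 = 0
for _ in range(n_sims):
    a = rng.normal(100, 15, 30)
    # H0 true (same mean): rejections estimate the Type I error rate
    if ttest_ind(a, rng.normal(100, 15, 30)).pvalue < alpha:
        rejections_h0 += 1
    # H0 false (shifted mean): rejections estimate power = 1 - beta
    if ttest_ind(a, rng.normal(115, 15, 30)).pvalue < alpha:
        rejections_h1 += 1
print(f"Type I rate ~ {rejections_h0 / n_sims:.3f}")   # should be near alpha
print(f"Power ~ {rejections_h1 / n_sims:.3f}")
```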
Answer: My approach follows a systematic process: