NumPy, Pandas, Data Cleaning, Visualization (Matplotlib/Seaborn), Statistical Analysis, Feature Engineering, Model Evaluation — data science mastery.
NumPy is the fundamental package for numerical computing in Python. It provides powerful N-dimensional array objects, broadcasting, and vectorized operations that form the backbone of data science.
import numpy as np
# ── Array Creation ──
a = np.array([1, 2, 3, 4, 5]) # 1D array from list
b = np.zeros((3, 4)) # 3x4 array of zeros
c = np.ones((2, 3)) # 2x3 array of ones
d = np.full((3, 3), 7.0) # 3x3 array filled with 7.0
e = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
f = np.linspace(0, 1, 5) # [0.0, 0.25, 0.5, 0.75, 1.0]
g = np.random.randn(3, 3) # 3x3 standard normal distribution
h = np.random.randint(0, 10, (2, 5)) # 2x5 random integers [0, 10)
i = np.eye(4) # 4x4 identity matrix
np.random.seed(42) # Set random seed for reproducibility (returns None, so no assignment)
# ── Array Properties ──
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr.ndim # 2 (number of dimensions)
arr.shape # (2, 3) (rows, columns)
arr.size # 6 (total elements)
arr.dtype # int64 (data type)
arr.itemsize # 8 (bytes per element)
arr.nbytes # 48 (total bytes)

| Operation | Code | Result | Description |
|---|---|---|---|
| Single element | arr[0, 1] | 2 | Row 0, Column 1 |
| Row slice | arr[0, :] | [1, 2, 3] | Entire first row |
| Column slice | arr[:, 1] | [2, 5] | Entire second column |
| Negative index | arr[-1, -1] | 6 | Last element |
| Fancy indexing | arr[[0, 1], [2, 0]] | [3, 4] | Elements (0,2) and (1,0) |
| Boolean mask | arr[arr > 3] | [4, 5, 6] | Elements greater than 3 |
| Where | np.where(arr > 3, arr, 0) | [[0,0,0],[4,5,6]] | Replace with condition |
| Reshape | arr.reshape(3, 2) | 3x2 array | Change shape (must match size) |
| Transpose | arr.T | 3x2 array | Swap rows and columns |
| Flatten | arr.flatten() | [1,2,3,4,5,6] | 1D copy of array |
# ── Vectorized Operations (no Python loops needed) ──
a = np.array([1, 2, 3, 4, 5])
b = np.array([10, 20, 30, 40, 50])
a + b # [11, 22, 33, 44, 55] element-wise addition
a * b # [10, 40, 90, 160, 250] element-wise multiplication
a ** 2 # [1, 4, 9, 16, 25] element-wise power
a > 3 # [False, False, False, True, True]
np.dot(a, b) # 550 dot product
np.cross(a[:3], b[:3]) # [0, 0, 0] cross product (b is a scalar multiple of a)
# ── Aggregation ──
arr = np.array([[1, 2, 3], [4, 5, 6]])
arr.sum() # 21
arr.sum(axis=0) # [5, 7, 9] column sums
arr.sum(axis=1) # [6, 15] row sums
arr.mean() # 3.5
arr.std() # 1.7078
arr.min(), arr.max() # 1, 6
arr.argmax() # 5 (flat index of max)
arr.cumsum() # [1, 3, 6, 10, 15, 21]
# ── Linear Algebra ──
A = np.array([[1, 2], [3, 4]])
A_inv = np.linalg.inv(A) # inverse
det = np.linalg.det(A) # determinant: -2.0
eigvals, eigvecs = np.linalg.eig(A) # eigenvalues/vectors
U, S, Vt = np.linalg.svd(A) # SVD decomposition
np.linalg.solve(A, np.array([5, 6])) # solve Ax = b (b must match A's row count)
# ── Broadcasting Rules ──
# Shapes are compatible when dimensions are equal or one of them is 1
a = np.array([[1], [2], [3]]) # shape (3, 1)
b = np.array([10, 20, 30]) # shape (3,)
a + b # shape (3, 3) — a broadcast along columns

Pandas provides fast, flexible, and expressive data structures designed to make working with structured (tabular) data intuitive. The DataFrame is the primary data structure for data manipulation.
import pandas as pd
import numpy as np
# ── Creating DataFrames ──
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'age': [25, 30, 35, 28],
'salary': [50000, 60000, 70000, 55000],
'department': ['Engineering', 'Sales', 'Engineering', 'Marketing']
})
# From numpy array
df = pd.DataFrame(np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])
# From CSV / Excel / SQL / JSON
df = pd.read_csv('data.csv', parse_dates=['date'], encoding='utf-8')
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df = pd.read_sql('SELECT * FROM users', conn)
df = pd.read_json('data.json')
# ── Inspecting Data ──
df.head(10) # First 10 rows
df.tail(5) # Last 5 rows
df.info() # Data types, non-null counts, memory usage
df.describe() # Summary statistics for numeric columns
df.shape # (rows, columns)
df.columns # Column names
df.dtypes # Data types per column
df.memory_usage(deep=True) # Memory usage per column
df.value_counts('department') # Frequency count
df.nunique() # Unique values per column

| Operation | Code | Description |
|---|---|---|
| Select column | df['name'] or df.name | Returns Series |
| Select multiple columns | df[['name', 'age']] | Returns DataFrame |
| Row by label | df.loc[0] | Single row by index label |
| Row by position | df.iloc[0] | Single row by integer position |
| Slice rows | df.iloc[0:5, 1:3] | Rows 0-4, cols 1-2 |
| Boolean filter | df[df['age'] > 30] | Rows where age > 30 |
| Multiple conditions | df[(df['age'] > 25) & (df['salary'] > 55000)] | AND: use & not and |
| OR condition | df[(df['dept'] == 'Eng') \| (df['dept'] == 'Sales')] | OR: use \| not or |
| String contains | df[df['name'].str.contains('Ali')] | Filter by substring |
| isin | df[df['dept'].isin(['Eng', 'Sales'])] | Membership filter |
| Query | df.query('age > 30 and salary > 55000') | SQL-like filtering |
| nlargest | df.nlargest(5, 'salary') | Top 5 by salary |
# ── Column Operations ──
df['bonus'] = df['salary'] * 0.1 # New column
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 35, 100],
labels=['young', 'mid', 'senior'])
df.drop('bonus', axis=1, inplace=True) # Drop column
df.rename(columns={'name': 'full_name'}, inplace=True)
# ── Sorting ──
df.sort_values('salary', ascending=False, inplace=True)
df.sort_values(['department', 'age'], ascending=[True, False])
# ── Grouping & Aggregation ──
df.groupby('department')['salary'].mean()
df.groupby('department').agg({
'salary': ['mean', 'median', 'std', 'min', 'max'],
'age': 'mean'
})
df.pivot_table(values='salary', index='department',
columns='age_group', aggfunc='mean', fill_value=0)
# ── Merging & Joining ──
pd.merge(df1, df2, on='id', how='left') # Left join
pd.merge(df1, df2, left_on='id', right_on='user_id', how='inner')
pd.concat([df1, df2], axis=0, ignore_index=True) # Stack vertically
pd.concat([df1, df2], axis=1) # Stack horizontally
# ── Apply & Transform ──
df['salary_category'] = df['salary'].apply(
lambda x: 'high' if x > 65000 else 'medium' if x > 50000 else 'low'
)
df['salary_zscore'] = (df['salary'] - df['salary'].mean()) / df['salary'].std()
# ── Window Functions ──
df['rolling_avg'] = df['salary'].rolling(window=3).mean()
df['rank'] = df.groupby('department')['salary'].rank(ascending=False)
df['pct_change'] = df['salary'].pct_change()
df['cumsum'] = df['salary'].cumsum()

Data cleaning is often said to consume 80% of a data scientist's time. Proper handling of missing values, duplicates, outliers, and inconsistent formatting is critical for reliable analysis.
import pandas as pd
import numpy as np
# ── Missing Value Detection ──
df.isnull().sum() # Count missing values per column
df.isnull().mean() * 100 # Percentage missing per column
df.isnull().sum().sum() # Total missing values
df[df.isnull().any(axis=1)] # Rows with any missing value
df.isnull().any() # Columns with any missing value
# ── Missing Value Treatment ──
# Drop rows/columns
df.dropna(subset=['column_name'], inplace=True) # Drop rows with NaN in col
df.dropna(thresh=5, axis=1) # Drop cols with fewer than 5 non-null
df.dropna(how='all') # Drop rows where ALL values are NaN
# Fill missing values
df['age'] = df['age'].fillna(df['age'].median()) # Median imputation
df['salary'] = df['salary'].fillna(df['salary'].mean()) # Mean imputation
df['department'] = df['department'].fillna('Unknown') # Constant fill
df.ffill() # Forward fill (use previous value; fillna(method=) is deprecated)
df.bfill() # Backward fill (use next value)
df.interpolate(method='linear') # Linear interpolation
# ── Advanced Imputation (sklearn) ──
from sklearn.impute import SimpleImputer, KNNImputer
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')
knn_imputer = KNNImputer(n_neighbors=5)
# ── Duplicate Handling ──
df.duplicated() # Boolean mask of duplicates
df.duplicated(subset=['email']) # Duplicates based on column
df.drop_duplicates(subset=['email'], keep='last', inplace=True)
# ── String Cleaning ──
df['name'] = df['name'].str.strip() # Remove whitespace
df['name'] = df['name'].str.lower() # Lowercase
df['name'] = df['name'].str.title() # Title case
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True) # Keep digits
df['email'] = df['email'].str.lower().str.strip()
# ── IQR Method ──
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)]
# ── Z-Score Method ──
from scipy import stats
z_scores = np.abs(stats.zscore(df['salary']))
outliers = df[z_scores > 3] # Beyond 3 standard deviations
# ── Treatment Options ──
# 1. Remove outliers
df = df[~((df['salary'] < lower_bound) | (df['salary'] > upper_bound))]
# 2. Cap at bounds (Winsorization)
df['salary'] = df['salary'].clip(lower_bound, upper_bound)
# 3. Log transform to reduce skew
df['salary'] = np.log1p(df['salary'])

| Method | Example | Use Case |
|---|---|---|
| astype() | df['age'] = df['age'].astype(int) | Explicit type conversion |
| to_numeric() | pd.to_numeric(df['price'], errors='coerce') | Convert with error handling |
| to_datetime() | pd.to_datetime(df['date'], format='%Y-%m-%d') | Parse dates |
| astype('category') | df['dept'] = df['dept'].astype('category') | Reduce memory for repeated strings |
| pd.CategoricalDtype() | pd.CategoricalDtype(['low','mid','high'], ordered=True) | Ordered categories |
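The conversions above can be sketched on a small hypothetical frame (the column names and values here are illustrative, not from a real dataset):

```python
import pandas as pd

# Hypothetical messy data to illustrate the conversion methods
df = pd.DataFrame({
    'age': ['25', '30', '35'],
    'price': ['19.99', 'N/A', '5.00'],
    'date': ['2024-01-15', '2024-02-20', '2024-03-25'],
    'dept': ['Eng', 'Sales', 'Eng'],
})
df['age'] = df['age'].astype(int)                           # explicit cast
df['price'] = pd.to_numeric(df['price'], errors='coerce')   # 'N/A' becomes NaN
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')  # parse dates
df['dept'] = df['dept'].astype('category')                  # memory-efficient
print(df.dtypes)
```

Note that `errors='coerce'` silently turns unparseable values into NaN, so follow it with a missing-value check.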
Visualization is essential for EDA, finding patterns, and communicating results. Matplotlib provides low-level control; Seaborn provides statistical aesthetics; Plotly adds interactivity.
import matplotlib.pyplot as plt
import numpy as np
# ── Basic Plot Setup ──
plt.figure(figsize=(12, 6)) # Set figure size
plt.style.use('seaborn-v0_8-whitegrid') # Set style
fig, axes = plt.subplots(2, 2, figsize=(12, 10)) # Multi-subplot
# ── Line Plot ──
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label='sin(x)', color='blue', linewidth=2)
plt.plot(x, np.cos(x), label='cos(x)', color='red', linestyle='--')
plt.title('Trigonometric Functions')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('plot.png', dpi=300, bbox_inches='tight')
plt.show()
# ── Bar Chart ──
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 56, 78]
plt.bar(categories, values, color=['#4C72B0', '#55A868', '#C44E52', '#8172B2'])
plt.title('Category Values')
plt.ylabel('Count')
# ── Histogram ──
plt.hist(df['salary'], bins=30, color='skyblue', edgecolor='black', alpha=0.7)
plt.axvline(df['salary'].mean(), color='red', linestyle='--', label=f'Mean: {df["salary"].mean():.0f}')
plt.title('Salary Distribution')
plt.legend()
# ── Scatter Plot ──
plt.scatter(df['age'], df['salary'], c=df['department'].astype('category').cat.codes,
cmap='viridis', alpha=0.6, s=100)
plt.colorbar(label='Department')
plt.xlabel('Age')
plt.ylabel('Salary')

import seaborn as sns
import matplotlib.pyplot as plt
# ── Statistical Plots ──
# Distribution
sns.histplot(data=df, x='salary', kde=True, hue='department', bins=30)
# Box Plot (outliers & quartiles)
sns.boxplot(data=df, x='department', y='salary', palette='Set2')
# Violin Plot (distribution + box plot)
sns.violinplot(data=df, x='department', y='salary', inner='box', palette='muted')
# ── Relationship Plots ──
# Scatter with regression line
sns.regplot(data=df, x='age', y='salary', scatter_kws={'alpha': 0.5})
# Pair plot (all numeric relationships)
sns.pairplot(df[['age', 'salary', 'experience']], hue='department')
# ── Heatmap (Correlation Matrix) ──
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0,
square=True, linewidths=0.5)
# ── Categorical Plots ──
sns.countplot(data=df, x='department', order=df['department'].value_counts().index)
sns.barplot(data=df, x='department', y='salary', estimator='mean', errorbar='sd')
# ── Combined Figure ──
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
sns.histplot(data=df, x='salary', ax=axes[0, 0], kde=True)
sns.boxplot(data=df, x='department', y='salary', ax=axes[0, 1])
sns.scatterplot(data=df, x='age', y='salary', hue='department', ax=axes[1, 0])
sns.countplot(data=df, x='department', ax=axes[1, 1])
plt.tight_layout()
plt.savefig('eda_dashboard.png', dpi=300)

| Goal | Best Chart | Library | When to Use |
|---|---|---|---|
| Compare categories | Bar chart | sns.barplot / sns.countplot | Categorical vs numeric |
| Distribution of 1 variable | Histogram / KDE | sns.histplot / sns.kdeplot | Continuous data distribution |
| Distribution by category | Box plot / Violin | sns.boxplot / sns.violinplot | Compare distributions across groups |
| Relationship between 2 vars | Scatter plot | sns.scatterplot / sns.regplot | Correlation analysis |
| Correlation matrix | Heatmap | sns.heatmap | Multi-variable relationships |
| Time series | Line chart | plt.plot / sns.lineplot | Trends over time |
| Composition | Stacked bar / Pie | plt.pie / stacked bar | Parts of a whole |
| Multi-dimensional | Pair plot / Parallel | sns.pairplot | Explore many variables at once |
Statistical analysis provides the mathematical foundation for drawing conclusions from data. Understanding distributions, hypothesis testing, and correlation is essential for any data scientist.
import scipy.stats as stats
import numpy as np
from scipy.stats import norm, ttest_ind, chi2_contingency, f_oneway
# ── Descriptive Statistics ──
data = np.array([23, 25, 28, 30, 32, 35, 38, 40, 42, 45])
print(f"Mean: {data.mean():.2f}") # 33.80
print(f"Median: {np.median(data):.2f}") # 33.50
print(f"Std: {data.std():.2f}") # 7.04 (population std, ddof=0)
print(f"Variance: {data.var():.2f}") # 49.56
print(f"Skewness: {stats.skew(data):.2f}") # Asymmetry
print(f"Kurtosis: {stats.kurtosis(data):.2f}") # Tailedness
# ── Normal Distribution ──
# PDF at x=0 for standard normal
pdf_val = norm.pdf(0, 0, 1) # 0.3989
# P(X <= 1.96) for standard normal
cdf_val = norm.cdf(1.96, 0, 1) # 0.9750
# Value at 95th percentile
percentile = norm.ppf(0.95, 0, 1) # 1.6449
# ── Hypothesis Testing ──
group_a = np.random.normal(100, 15, 50)
group_b = np.random.normal(105, 15, 50)
# Independent t-test (compare two group means)
t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_value:.4f}")
# If p < 0.05, reject null hypothesis
# Paired t-test (before/after on same subjects)
t_stat_paired, p_paired = stats.ttest_rel(before, after)
# ── ANOVA (compare 3+ group means) ──
group1 = np.random.normal(100, 15, 30)
group2 = np.random.normal(105, 15, 30)
group3 = np.random.normal(110, 15, 30)
f_stat, p_value = f_oneway(group1, group2, group3)
# ── Chi-Square Test (categorical data) ──
observed = np.array([[50, 30, 20], [30, 40, 30]])
chi2, p_val, dof, expected = chi2_contingency(observed)
# ── Correlation Tests ──
# Pearson (linear relationship, both continuous)
r, p = stats.pearsonr(df['age'], df['salary'])
# Spearman (monotonic relationship, ordinal OK)
rho, p = stats.spearmanr(df['experience'], df['salary'])
# Kendall (rank-based, small samples)
tau, p = stats.kendalltau(df['rank'], df['score'])

| Question | Test | Data Type | Assumptions | Python Function |
|---|---|---|---|---|
| One mean vs. value | One-sample t-test | Continuous | Normality | stats.ttest_1samp(data, mu0) |
| Two independent means | Independent t-test | Continuous, 2 groups | Normality, equal variance | stats.ttest_ind(g1, g2) |
| Two paired means | Paired t-test | Continuous, paired | Normality of differences | stats.ttest_rel(before, after) |
| 3+ group means | One-way ANOVA | Continuous, 3+ groups | Normality, equal variance | stats.f_oneway(g1, g2, g3) |
| Two categorical vars | Chi-square test | Categorical | Expected counts >= 5 | stats.chi2_contingency(obs) |
| Correlation | Pearson | Continuous | Linearity, normality | stats.pearsonr(x, y) |
| Rank correlation | Spearman | Ordinal/continuous | Monotonic relationship | stats.spearmanr(x, y) |
| Non-normal 2 means | Mann-Whitney U | Continuous/ordinal | Independent samples | stats.mannwhitneyu(g1, g2) |
| Distribution test | Shapiro-Wilk | Continuous | Sample size < 5000 | stats.shapiro(data) |
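A common workflow is to run a normality check first and fall back to a rank-based test when it fails; a minimal sketch using synthetic skewed data (the exponential samples are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
g1 = rng.exponential(scale=2.0, size=40)   # skewed, likely non-normal
g2 = rng.exponential(scale=3.0, size=40)

# Shapiro-Wilk: small p suggests departure from normality
_, p_norm = stats.shapiro(g1)
if p_norm < 0.05:
    # Non-normal data: use rank-based Mann-Whitney U instead of a t-test
    u_stat, p_value = stats.mannwhitneyu(g1, g2)
else:
    t_stat, p_value = stats.ttest_ind(g1, g2)
print(f"p = {p_value:.4f}")
```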
Feature engineering is the art of creating informative features from raw data. Well-engineered features often matter more than the choice of algorithm. This section covers the most important techniques.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
# ── Numerical Feature Transformations ──
scaler = StandardScaler() # Mean=0, Std=1 (sensitive to outliers)
minmax = MinMaxScaler() # [0, 1] range
robust = RobustScaler() # Median=0, IQR=1 (robust to outliers)
X_scaled = scaler.fit_transform(df[['salary', 'age', 'experience']])
X_minmax = minmax.fit_transform(df[['salary', 'age', 'experience']])
# Log transform for skewed data
df['log_salary'] = np.log1p(df['salary']) # log(1 + x)
df['sqrt_salary'] = np.sqrt(df['salary'])
# Box-Cox transform (requires all positive values)
from scipy.stats import boxcox
df['boxcox_salary'], lam = boxcox(df['salary'] + 1)
# ── Categorical Encoding ──
# Label Encoding (ordinal relationship)
le = LabelEncoder()
df['dept_encoded'] = le.fit_transform(df['department'])
# One-Hot Encoding (no ordinal relationship)
df = pd.get_dummies(df, columns=['department'], drop_first=False)
# Ordinal Encoding (explicit ordering)
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df['priority_encoded'] = oe.fit_transform(df[['priority']])
# Target Encoding (risk of leakage — use with CV!)
from sklearn.model_selection import KFold
def target_encode(df, col, target, n_folds=5):
kf = KFold(n_splits=n_folds, shuffle=False)
encoded = pd.Series(index=df.index, dtype=float)
for train_idx, val_idx in kf.split(df):
means = df.iloc[train_idx].groupby(col)[target].mean()
encoded.iloc[val_idx] = df.iloc[val_idx][col].map(means)
return encoded
# ── Date/Time Features ──
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['date'].dt.dayofweek >= 5
df['quarter'] = df['date'].dt.quarter
df['hour'] = df['date'].dt.hour
df['days_since'] = (pd.Timestamp('now') - df['date']).dt.days

| Technique | Description | Code / Library | Best For |
|---|---|---|---|
| Polynomial Features | Create interaction and power terms | sklearn.preprocessing.PolynomialFeatures(degree=2) | Capturing non-linear relationships |
| Binning | Convert continuous to discrete | pd.cut() / pd.qcut() | Non-linear effects, age groups |
| Target Encoding | Replace category with target mean | Custom with KFold (see above) | High-cardinality categories |
| Frequency Encoding | Replace category with count | `df['col'].map(df['col'].value_counts())` | High-cardinality categories (no leakage) |
| Interaction Features | Product/ratio of two features | df['ratio'] = df['a'] / df['b'] | Combined effects |
| Text Features | TF-IDF, word count, embeddings | sklearn.feature_extraction.text.TfidfVectorizer | NLP feature extraction |
| Aggregation | Group statistics per entity | df.groupby('user_id')['amount'].agg(['mean','sum','count']) | User behavior features |
| Lag Features | Previous time step values | df['lag_1'] = df['value'].shift(1) | Time series prediction |
| Rolling Features | Moving window statistics | df['rolling_7'] = df['value'].rolling(7).mean() | Time series smoothing |
| PCA | Dimensionality reduction | sklearn.decomposition.PCA(n_components=0.95) | Reduce features, multicollinearity |
| Method | How It Works | Code | Pros / Cons |
|---|---|---|---|
| Variance Threshold | Drop features with low variance | VarianceThreshold(threshold=0.01) | Simple, fast; misses useful low-variance features |
| Correlation Filter | Remove highly correlated pairs | df.corr() and manual removal | Reduces redundancy; ignores target relationship |
| SelectKBest | Select top K features by statistical test | SelectKBest(f_classif, k=10) | Fast; univariate, ignores feature interactions |
| RFE | Recursively eliminate least important features | RFE(estimator, n_features_to_select=10) | Model-aware; computationally expensive |
| L1 (Lasso) | Linear model that drives weights to zero | Lasso(alpha=0.01) | Built-in selection; linear relationships only |
| Tree Importance | Feature importance from tree models | model.feature_importances_ | Handles non-linearity; can be biased toward high-cardinality features |
| SHAP Values | Game-theory based feature attribution | shap.Explainer(model) | Most interpretable; expensive for large datasets |
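A sketch contrasting a univariate filter (SelectKBest) with model-aware RFE, on synthetic data where only a few features carry signal:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 3 informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=42)

# Univariate: keep the top 3 features by ANOVA F-score (ignores interactions)
skb = SelectKBest(f_classif, k=3).fit(X, y)
print("SelectKBest:", np.where(skb.get_support())[0])

# Model-aware: recursively drop the least important feature per iteration
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print("RFE:", np.where(rfe.support_)[0])
```

The two methods often disagree at the margin; RFE is slower but accounts for how features work together inside the chosen model.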
Proper model evaluation prevents overfitting and ensures your model generalizes to unseen data. This section covers cross-validation, metrics, and evaluation strategies.
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.model_selection import KFold, StratifiedKFold, RepeatedStratifiedKFold
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix, classification_report,
mean_squared_error, mean_absolute_error, r2_score)
from sklearn.ensemble import RandomForestClassifier
import numpy as np
# ── Train/Test Split ──
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y # stratify for classification
)
# ── Cross-Validation ──
# K-Fold CV
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"CV Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f})")
# Stratified K-Fold (preserves class ratios)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='f1_macro')
# Repeated Stratified K-Fold (more robust estimate)
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
# ── Classification Metrics ──
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1] # Probability of positive class
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred, average='binary'):.4f}")
print(f"Recall: {recall_score(y_test, y_pred, average='binary'):.4f}")
print(f"F1 Score: {f1_score(y_test, y_pred, average='binary'):.4f}")
print(f"AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"\n{classification_report(y_test, y_pred)}")
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
# [[TN FP]
# [FN TP]]

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import mean_absolute_percentage_error
import numpy as np
# ── Regression Metrics ──
y_pred = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}") # |y - y_hat| mean
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}") # (y - y_hat)^2 mean
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}") # sqrt(MSE)
print(f"R-squared: {r2_score(y_test, y_pred):.4f}") # 1 - SS_res/SS_tot
print(f"MAPE: {mean_absolute_percentage_error(y_test, y_pred):.2%}")
# ── Hyperparameter Tuning ──
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Grid Search (exhaustive)
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [5, 10, 15, None],
'min_samples_split': [2, 5, 10],
'max_features': ['sqrt', 'log2']
}
grid = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid, cv=5, scoring='f1', n_jobs=-1, verbose=1
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.4f}")
# Random Search (faster, samples randomly)
from scipy.stats import randint
param_dist = {
'n_estimators': randint(100, 500),
'max_depth': randint(3, 20),
'min_samples_split': randint(2, 20)
}
random_search = RandomizedSearchCV(
model, param_dist, n_iter=50, cv=5, scoring='f1',
random_state=42, n_jobs=-1
)

| Metric | Formula | Use When | Range |
|---|---|---|---|
| Accuracy | (TP + TN) / Total | Balanced classes | [0, 1] |
| Precision | TP / (TP + FP) | FP is costly (spam) | [0, 1] |
| Recall (Sensitivity) | TP / (TP + FN) | FN is costly (disease) | [0, 1] |
| Specificity | TN / (TN + FP) | True negative rate | [0, 1] |
| F1 Score | 2 * P * R / (P + R) | Balance P & R | [0, 1] |
| AUC-ROC | Area under ROC curve | Threshold-independent | [0, 1] |
| Log Loss | -mean(y*log(p) + (1-y)*log(1-p)) | Probabilistic predictions | [0, inf) |
| MAE | mean(\|y - y_hat\|) | Robust to outliers | [0, inf) |
| RMSE | sqrt(mean((y - y_hat)^2)) | Penalizes large errors | [0, inf) |
| R-squared | 1 - SS_res/SS_tot | Explained variance | (-inf, 1] |
| Adjusted R-squared | 1 - (1-R2)*(n-1)/(n-p-1) | Penalizes extra features | (-inf, 1] |
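Specificity and log loss from the table can be computed directly; a small sketch on made-up predictions (sklearn has no specificity scorer, so it is derived from the confusion matrix):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, log_loss

# Illustrative labels, hard predictions, and predicted probabilities
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.6, 0.9, 0.8, 0.4, 0.7, 0.3])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)          # TN / (TN + FP), as in the table
print(f"Specificity: {specificity:.4f}")
print(f"Log loss: {log_loss(y_true, y_prob):.4f}")
```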
The most frequently asked data science interview questions with concise, high-quality answers and code examples.
Answer: Supervised learning uses labeled data (input-output pairs) to learn a mapping function. The model is trained on known examples and evaluated on its ability to predict labels for new data. Examples: classification (spam detection), regression (price prediction).
Unsupervised learning works with unlabeled data to discover hidden patterns, structures, or groupings. Examples: clustering (customer segmentation), dimensionality reduction (PCA), anomaly detection.
The key difference is the presence of ground truth labels during training. Semi-supervised learning bridges both, using small labeled data + large unlabeled data.
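A minimal sketch of the contrast on the iris dataset: the classifier is given the labels, while KMeans sees only the features (the cluster count of 3 is an assumption we supply, not something the data reveals):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: labels y guide the fit
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Train accuracy:", clf.score(X, y))

# Unsupervised: KMeans sees only X and discovers groupings on its own
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster sizes:", np.bincount(km.labels_))
```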
Answer: Imbalanced data (e.g., 99% negative, 1% positive) causes models to be biased toward the majority class.
# 1. Resampling
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
# 2. Class weights (built into sklearn)
model = RandomForestClassifier(class_weight='balanced', random_state=42)
# Or custom: class_weight={'negative': 1, 'positive': 10}
# 3. Evaluation: Use F1, Precision-Recall AUC (NOT accuracy)
from sklearn.metrics import precision_recall_curve, auc
precision, recall, _ = precision_recall_curve(y_test, y_prob)
pr_auc = auc(recall, precision)

Answer: Bias is the error from oversimplifying assumptions (underfitting). Variance is the error from being too sensitive to training data fluctuations (overfitting).
High bias: Model is too simple (linear model for non-linear data). Solution: increase model complexity, add features, reduce regularization.
High variance: Model is too complex (deep decision tree). Solution: simplify model, add regularization, increase training data, use ensemble methods.
The tradeoff: as model complexity increases, bias decreases but variance increases. The optimal model minimizes total error = bias^2 + variance + irreducible noise.
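The tradeoff can be made visible by varying polynomial degree on synthetic data; the degrees used here (1, 4, 15) are illustrative choices for underfit, reasonable fit, and overfit:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 80))[:, None]
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 80)  # noisy sine
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 4, 15):   # high bias, balanced, high variance
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          f"train MSE {mean_squared_error(y_tr, model.predict(X_tr)):.3f}",
          f"test MSE {mean_squared_error(y_te, model.predict(X_te)):.3f}")
```

Typically the degree-1 model has high train and test error (bias), while the degree-15 model has near-zero train error but inflated test error (variance).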
Answer: A/B testing is a controlled experiment to compare two versions (A=control, B=treatment) of a product/feature to determine which performs better on a metric.
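A common analysis for conversion-rate A/B tests is a two-proportion z-test; a sketch with made-up counts (4000 visitors per arm, conversion counts are illustrative):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical results: conversions out of visitors for control (A) and treatment (B)
conv = np.array([200, 236])
n = np.array([4000, 4000])
p = conv / n

p_pool = conv.sum() / n.sum()                        # pooled rate under H0
se = np.sqrt(p_pool * (1 - p_pool) * (1/n[0] + 1/n[1]))
z = (p[1] - p[0]) / se                               # two-proportion z-statistic
p_value = 2 * (1 - norm.cdf(abs(z)))                 # two-sided p-value
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

Before running the test, fix the sample size via a power analysis and avoid peeking at intermediate results, which inflates the false-positive rate.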
Answer: As the number of features increases, the volume of the feature space grows exponentially. Data becomes sparse, distances become less meaningful, and models require exponentially more data. This leads to overfitting and poor generalization.
Solutions: (1) Feature selection — remove irrelevant features using correlation, mutual information, or model-based importance. (2) Dimensionality reduction — PCA, t-SNE, UMAP. (3) Regularization — L1/L2 penalties. (4) Domain knowledge — keep only meaningful features. (5) Autoencoders — learn compressed representations.
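A sketch of solution (2): PCA on synthetic 100-dimensional data that actually lies near a 5-dimensional subspace (the dimensions and noise level are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# 100 observed features generated from only 5 latent directions plus tiny noise
latent = rng.normal(size=(500, 5))
W = rng.normal(size=(5, 100))
X = latent @ W + 0.01 * rng.normal(size=(500, 100))

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95).fit(X)
print("Components kept:", pca.n_components_)   # expected: far fewer than 100
```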
Answer: Correlation measures the statistical relationship between two variables (they move together). Causation means one variable directly influences another. Correlation does not imply causation.
A classic example: ice cream sales and drowning rates are correlated, but the cause is summer heat (confounding variable).
Methods to establish causation: Randomized controlled trials (gold standard), instrumental variables, difference-in-differences, regression discontinuity, and causal inference frameworks (do-calculus, causal DAGs).
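The ice cream example can be simulated: both series are driven by a common "heat" variable, so they correlate strongly despite having no direct link (all values synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
heat = rng.normal(size=1000)                      # confounder: temperature
ice_cream = heat + rng.normal(scale=0.5, size=1000)
drownings = heat + rng.normal(scale=0.5, size=1000)

# Strong correlation arises purely through the shared cause
r = np.corrcoef(ice_cream, drownings)[0, 1]
print(f"r = {r:.2f}")
```

Conditioning on the confounder (e.g. regressing both on `heat` and correlating the residuals) makes the spurious association largely disappear.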
Answer: Type I error (false positive) = rejecting H0 when it is true. Probability = alpha (typically 0.05). Example: convicting an innocent person.
Type II error (false negative) = failing to reject H0 when it is false. Probability = beta. Power = 1 - beta. Example: letting a guilty person go free.
Tradeoff: Decreasing alpha (strict significance) increases beta (more false negatives). The balance depends on the cost of each error. In medical testing: a false negative (missing a disease) is worse than a false positive (unnecessary follow-up test).
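Both error rates can be estimated by simulation; this sketch repeatedly runs t-tests under a true null and under a false null (the 15-point shift and sample sizes are illustrative):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
alpha, n_sims = 0.05, 2000
rejections_h0 = rejections_h1 = 0
for _ in range(n_sims):
    a = rng.normal(100, 15, 30)
    # H0 true (same mean): rejections estimate the Type I error rate
    if ttest_ind(a, rng.normal(100, 15, 30)).pvalue < alpha:
        rejections_h0 += 1
    # H0 false (shifted mean): rejections estimate power = 1 - beta
    if ttest_ind(a, rng.normal(115, 15, 30)).pvalue < alpha:
        rejections_h1 += 1
print(f"Type I rate ~ {rejections_h0 / n_sims:.3f}")   # should be near alpha
print(f"Power ~ {rejections_h1 / n_sims:.3f}")
```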
Answer: My approach follows a systematic process: