Titanic ML Production Pipeline

Case Brief

A working notebook is not a working system.

The research project produced a well-performing model — 83% cross-validation accuracy — through iterative experimentation in Jupyter. But a notebook runs top-to-bottom once, with no input validation, no reusable components, and no way to serve predictions without re-executing cells.

This project is the second half: taking every decision from that research — the features, the preprocessing logic, the model configuration — and rebuilding it as a proper software system. Same model, done properly.

At a Glance

	Notebook (Research)	Pipeline (This Project)
Transformations	Inline pandas cells	Custom sklearn classes
Parameters	Hardcoded	YAML configuration file
Validation	None	Schema + dtype checks
Reproducibility	Re-run notebook top-to-bottom	Load serialized `.pkl` artifacts
Prediction path	Manual cell execution	`make_predictions()` function
Extendability	Edit cells, break things	Add modules, keep the rest intact

Architecture

The system is organized around six explicit module boundaries, each with a single responsibility:

Module	Responsibility
`src/config/`	Single source of truth for all paths, parameters, and schema expectations
`src/data_manager/`	Data I/O, input validation, and artifact serialization
`src/features/`	Custom sklearn-compatible transformers
`src/pipeline.py`	Feature pipeline and model pipeline definitions
`src/train_pipeline.py`	Training orchestration — fit, evaluate, save
`src/predict.py`	Load persisted artifacts, validate input, serve predictions

Nothing crosses those boundaries. The data manager does not know about features. The prediction interface does not know about training. Each module is independently readable, testable, and replaceable.

Configuration as a Contract

All paths, model hyperparameters, imputation strategies, feature lists, and expected dtypes live in a single configuration.yml file. Nothing is hardcoded across scripts.

Yaml

model_params:
  SEED: 20250708
  learning_rate: 0.1
  max_depth: 3
  max_features: "sqrt"
  n_estimators: 200
  subsample: 1.0

preprocessing_params:
  age_imputer_strategy: "median"
  embarked_imputer_strategy: "frequent"

This makes the system auditable at a glance — one file describes every assumption the pipeline makes about data and model behavior. It also means changing a hyperparameter or swapping a file path requires editing exactly one line, in exactly one place.
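One way such a file could be consumed is sketched below. It assumes PyYAML and plain dataclasses; the class names and `parse_config` are hypothetical and not taken from the project, which may well use a different loader. The values mirror the excerpt above.

```python
from dataclasses import dataclass

import yaml  # PyYAML, assumed to be installed


@dataclass(frozen=True)
class ModelParams:
    SEED: int
    learning_rate: float
    max_depth: int
    max_features: str
    n_estimators: int
    subsample: float


@dataclass(frozen=True)
class PreprocessingParams:
    age_imputer_strategy: str
    embarked_imputer_strategy: str


@dataclass(frozen=True)
class Config:
    model_params: ModelParams
    preprocessing_params: PreprocessingParams


def parse_config(text: str) -> Config:
    """Parse YAML text into typed, read-only config objects (hypothetical helper).

    A missing or misspelled key raises TypeError from the dataclass
    constructor, so a bad file fails at load time rather than mid-training.
    """
    raw = yaml.safe_load(text)
    return Config(
        model_params=ModelParams(**raw["model_params"]),
        preprocessing_params=PreprocessingParams(**raw["preprocessing_params"]),
    )


EXAMPLE_YAML = """
model_params:
  SEED: 20250708
  learning_rate: 0.1
  max_depth: 3
  max_features: "sqrt"
  n_estimators: 200
  subsample: 1.0

preprocessing_params:
  age_imputer_strategy: "median"
  embarked_imputer_strategy: "frequent"
"""

config = parse_config(EXAMPLE_YAML)
```

Freezing the dataclasses keeps the loaded configuration read-only, so no script can quietly override a parameter after it has been read from the file.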

Feature Engineering

The feature-engineering steps that drove most of the model's lift in the research phase are now implemented as six sklearn transformer classes. Each exposes a fit() step and a transform() step: stateful transformers learn their parameters from training data in fit() and apply those same parameters to new data in transform(), while stateless ones apply fixed rules.

This distinction matters in practice: a transformer fitted on training data must apply its learned parameters at inference time. Notebook code re-run on test data recomputes its statistics from that test data. A fitted sklearn transformer does not.
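The fit/transform split can be illustrated with a small stand-in for a per-class median imputer. This is a sketch under assumptions, not the project's `GroupMedianImputer`: the class name, the fallback to a global median, and the toy data are all invented for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class GroupMedianImputerSketch(BaseEstimator, TransformerMixin):
    """Illustrative only: fills a column with group medians learned in fit()."""

    def __init__(self, variable: str = "Fare", group_var: str = "Pclass"):
        self.variable = variable
        self.group_var = group_var

    def fit(self, X: pd.DataFrame, y=None):
        # Learned state: one median per group, taken from training data only.
        self.group_medians_ = X.groupby(self.group_var)[self.variable].median().to_dict()
        # Assumed fallback for groups never seen during training.
        self.global_median_ = X[self.variable].median()
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()  # never mutate the caller's frame
        fill = X[self.group_var].map(self.group_medians_).fillna(self.global_median_)
        X[self.variable] = X[self.variable].fillna(fill)
        return X


# Toy data, invented for the example.
train = pd.DataFrame({"Pclass": [1, 1, 1, 3, 3], "Fare": [80.0, 100.0, 60.0, 8.0, 10.0]})
new = pd.DataFrame({"Pclass": [1, 1], "Fare": [float("nan"), 500.0]})

imputer = GroupMedianImputerSketch().fit(train)
imputed = imputer.transform(new)
```

The missing first-class fare in `new` is filled with 80.0, the first-class median from `train`. The 500.0 fare sitting next to it in the new data has no influence, which is exactly the behavior inline notebook code does not guarantee.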

Transformer	What it does
`TitleExtractor`	Parses social titles from passenger names (Mr, Mrs, Miss, Master, Rare) and one-hot encodes them. Stores the training-set column structure so inference never silently misaligns.
`CapFareOutliers`	Fits IQR-based fare bounds per passenger class on training data and applies those same bounds at inference — the upper limit for first-class fares comes from training, not from whatever the new data contains.
`GroupMedianImputer`	Fills missing fares using the per-class median learned from training data.
`AgeGroupEncoder`	Converts continuous age into five fixed bins (Child, Teen, Adult, Middle-aged, Senior). Stateless — boundaries are fixed by domain logic.
`IsFamilyOnBoard`	Binary flag from `SibSp` and `Parch`. Traveling alone was a stronger predictor than either raw column. Stateless.
`TicketCounter`	Counts passengers sharing the same ticket number — a group travel signal that SibSp/Parch alone miss. Stateless.
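The column-alignment point made for `TitleExtractor` can be sketched as follows. The regular expression, the `Title_` prefix, and the class name are assumptions for illustration; the project's implementation may differ in all three.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

COMMON_TITLES = {"Mr", "Mrs", "Miss", "Master"}


class TitleExtractorSketch(BaseEstimator, TransformerMixin):
    """Illustrative only: one-hot encodes titles with columns fixed at fit time."""

    def _titles(self, X: pd.DataFrame) -> pd.Series:
        # "Braund, Mr. Owen Harris" -> "Mr": the text between ", " and the next "."
        raw = X["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
        # Anything outside the common set (or unparseable) becomes "Rare".
        return raw.where(raw.isin(COMMON_TITLES), "Rare")

    def fit(self, X: pd.DataFrame, y=None):
        # Remember the exact one-hot columns produced by the training data.
        self.columns_ = sorted(f"Title_{t}" for t in self._titles(X).unique())
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        dummies = pd.get_dummies(self._titles(X), prefix="Title", dtype=int)
        # Reindex to the training layout: absent titles become all-zero columns.
        dummies = dummies.reindex(columns=self.columns_, fill_value=0)
        return pd.concat([X.drop(columns=["Name"]), dummies], axis=1)


train = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Minahan, Dr. William Edward",
]})
single = pd.DataFrame({"Name": ["Futrelle, Mrs. Jacques Heath"]})

extractor = TitleExtractorSketch().fit(train)
encoded = extractor.transform(single)
```

A single record contains only one title, so a naive `get_dummies` call would emit one column. Reindexing against `columns_` restores the full five-column layout the downstream model was trained on.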

These transformers are assembled into a sequential pipeline alongside standard imputation, encoding, and scaling steps:

Python

feature_pipeline = Pipeline([
    ("age_imputer",             MeanMedianImputer(variables=["Age"], ...)),
    ("embarked_imputer",        CategoricalImputer(variables=["Embarked"], ...)),
    ("fare_imputer",            GroupMedianImputer(variable="Fare", group_var="Pclass")),
    ("fare_capping",            CapFareOutliers()),
    ("sex_encoder",             CustomMapping(variable="Sex", mapping={"male": 0, "female": 1})),
    ("embarked_encoder",        SklearnTransformerWrapper(OneHotEncoder(drop="first"), ...)),
    ("title_extractor",         TitleExtractor()),
    ("age_group_feature",       AgeGroupEncoder()),
    ("isfamilyonboard_feature", IsFamilyOnBoard()),
    ("ticket_size_feature",     TicketCounter()),
    ("drop_features",           DropFeatures([...])),
    ("scaling",                 SklearnTransformerWrapper(StandardScaler(), ...)),
])

The pipeline transforms the raw 11-column input into a clean 13-feature numerical matrix — then serializes the entire fitted object to disk so inference requires no retraining.
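The persistence step might look roughly like the sketch below. It uses joblib with a two-step toy pipeline standing in for the real one; the file name and the toy data are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the real feature pipeline; any fitted sklearn Pipeline persists the same way.
toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaling", StandardScaler()),
])

# Toy training matrix (two numeric columns, one missing value), invented for the example.
X_train = np.array([[22.0, 7.25], [38.0, 71.28], [np.nan, 7.92], [35.0, 53.10]])
toy_pipeline.fit(X_train)

with tempfile.TemporaryDirectory() as tmp:
    artifact_path = Path(tmp) / "feature_pipeline.pkl"  # hypothetical file name
    joblib.dump(toy_pipeline, artifact_path)   # write the fitted object to disk
    restored = joblib.load(artifact_path)      # later: load without refitting

# The restored pipeline carries the learned medians and scaling statistics.
X_new = np.array([[np.nan, 30.0]])
same_output = np.allclose(toy_pipeline.transform(X_new), restored.transform(X_new))
```

Pickle-based artifacts are tied to the library versions that produced them and should only be loaded from trusted sources, so pinning scikit-learn alongside the saved file is a sensible companion step.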

Validation & Prediction

Before any transformation runs, validate_data() enforces the data contract on every input — whether that input is a batch DataFrame or a single passenger record passed as a dictionary:

Is the input actually a DataFrame after normalization?
Are all required columns present?
Does each column match its expected dtype?
Are missing values surfaced before they silently propagate?

Schema violations raise immediately with a clear message. Missing values trigger a log warning and continue to imputation downstream. A corrupted prediction is a harder problem than a failed one.
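The checks listed above could be sketched as follows. The schema here is a hypothetical four-column subset and the function name is invented; the project keeps its real expectations in the YAML configuration.

```python
import logging

import pandas as pd
from pandas.api import types as ptypes

logger = logging.getLogger(__name__)

# Hypothetical schema: column name -> dtype check (the real one lives in the config file).
EXPECTED_SCHEMA = {
    "Pclass": ptypes.is_integer_dtype,
    "Sex": ptypes.is_string_dtype,
    "Age": ptypes.is_float_dtype,
    "Fare": ptypes.is_float_dtype,
}


def validate_data_sketch(data, schema=EXPECTED_SCHEMA) -> pd.DataFrame:
    """Raise on schema violations; warn (and continue) on missing values."""
    if not isinstance(data, pd.DataFrame):
        raise TypeError(f"Expected a DataFrame, got {type(data).__name__}")

    missing_cols = sorted(set(schema) - set(data.columns))
    if missing_cols:
        raise ValueError(f"Missing required columns: {missing_cols}")

    wrong = {col: str(data[col].dtype) for col, check in schema.items() if not check(data[col])}
    if wrong:
        raise ValueError(f"Unexpected dtypes: {wrong}")

    # Missing values are surfaced, not rejected: imputation handles them downstream.
    null_counts = data[list(schema)].isna().sum()
    null_counts = null_counts[null_counts > 0]
    if not null_counts.empty:
        logger.warning("Missing values left for imputation: %s", null_counts.to_dict())
    return data


# A single record with a missing age passes validation and triggers only a warning.
record = pd.DataFrame([{"Pclass": 3, "Sex": "male", "Age": float("nan"), "Fare": 7.25}])
validated = validate_data_sketch(record)
```

Raising on structural problems while only logging missing values keeps the two failure modes separate: a malformed request stops immediately, and an incomplete one proceeds to the imputers that were fitted for exactly that case.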

Python

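import numpy as np
import pandas as pd

# feature_pipeline and model_pipeline are the fitted artifacts loaded at module level;
# validate_data is the input validator described above.
def make_predictions(data: pd.DataFrame | dict | None = None) -> np.ndarray:
    # A single passenger record arrives as a dict; wrap it into a one-row DataFrame.
    if isinstance(data, dict):
        data = pd.DataFrame([data])

    # Enforce the data contract before any transformation runs.
    validate_data(data)
    X_transformed = feature_pipeline.transform(data[feature_pipeline.feature_names_in_])

    return model_pipeline.predict(X_transformed)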

The function accepts both shapes of input intentionally — a DataFrame maps to batch inference, a dictionary maps to the natural structure of a single API request body. Either way, validation runs first.

Results

The model is Gradient Boosting — selected in the research phase for consistent accuracy and low cross-validation variance. The pipeline adds no new features and changes no hyperparameters. Performance is identical to the notebook prototype, which is the point.

Metric	Value
Training accuracy	80%+
Cross-validation accuracy	~83%
Kaggle public leaderboard	~79%
End-to-end training time	< 30 seconds
Prediction latency (single record)	< 1 ms

The ~4pp gap between cross-validation and leaderboard is expected — Kaggle's held-out test set is never seen during training or tuning, and the training set of 891 records limits generalization regardless of model choice.

Reflection

The clearest thing this project demonstrates is not technical — it is judgment about when to stop experimenting and start building.

The research phase was the right place to try algorithms, engineer features freely, and iterate without consequence. The pipeline phase is the right place to lock those decisions into a structure that another person — or a future version of yourself — can run, inspect, and modify without reconstructing the reasoning from scratch.

The skill is knowing which phase you are in, and not conflating the two.

This project shows the translation between the two: modeling decisions converted into a Python workflow with clear module boundaries, saved artifacts, and a consistent path from raw data to predictions — built to be extended rather than re-explained every time it runs.

Future Steps

The architecture is structured so the remaining steps are additions, not rewrites:

Next Step	What it enables
Testing (pytest)	Unit tests for each transformer and the validator against synthetic data
API layer (FastAPI)	`make_predictions()` maps directly to a POST endpoint; input validation is already handled
Containerization (Docker)	Runtime dependencies are minimal — Python, scikit-learn, pandas, joblib
CI/CD (GitHub Actions)	Automated retraining and artifact publishing on push to main
Model monitoring	Drift detection on input distributions and prediction confidence over time
