Titanic ML Production Pipeline

Turned the Titanic notebook workflow into a modular Python ML system — configuration-driven, validated, with custom transformers, persisted artifacts, and a clean prediction interface.

Python · Machine learning

Published August 2025

Case Brief

A working notebook is not a working system.

The research project produced a well-performing model — 83% cross-validation accuracy — through iterative experimentation in Jupyter. But a notebook runs top-to-bottom once, with no input validation, no reusable components, and no way to serve predictions without re-executing cells.

This project is the second half: taking every decision from that research — the features, the preprocessing logic, the model configuration — and rebuilding it as a proper software system. Same model, done properly.

At a Glance

| | Notebook (Research) | Pipeline (This Project) |
|---|---|---|
| Transformations | Inline pandas cells | Custom sklearn classes |
| Parameters | Hardcoded | YAML configuration file |
| Validation | None | Schema + dtype checks |
| Reproducibility | Re-run notebook top-to-bottom | Load serialized .pkl artifacts |
| Prediction path | Manual cell execution | make_predictions() function |
| Extendability | Edit cells, break things | Add modules, keep the rest intact |

Architecture

The system is organized around six explicit module boundaries, each with a single responsibility:

| Module | Responsibility |
|---|---|
| src/config/ | Single source of truth for all paths, parameters, and schema expectations |
| src/data_manager/ | Data I/O, input validation, and artifact serialization |
| src/features/ | Custom sklearn-compatible transformers |
| src/pipeline.py | Feature pipeline and model pipeline definitions |
| src/train_pipeline.py | Training orchestration: fit, evaluate, save |
| src/predict.py | Load persisted artifacts, validate input, serve predictions |

Responsibilities do not leak across those boundaries. The data manager does not know about features, and the prediction interface does not know about training. Each module can be read, tested, and replaced independently.

Directory

src/
├── config/
│   ├── configuration.yml        ← All parameters, paths, schema
│   └── core.py                  ← Config loader
├── data_manager/
│   ├── data_loader.py           ← CSV I/O, artifact serialization
│   ├── data_validator.py        ← Schema and dtype checks
│   └── datasets/                ← Raw and processed data
├── features/
│   └── feature_engineering.py   ← Custom sklearn transformers
├── pipeline.py                  ← Pipeline definitions
├── train_pipeline.py            ← Training orchestration
└── predict.py                   ← Prediction interface

Configuration as a Contract

All paths, model hyperparameters, imputation strategies, feature lists, and expected dtypes live in a single configuration.yml file. Nothing is hardcoded across scripts.

Yaml

model_params:
  SEED: 20250708
  learning_rate: 0.1
  max_depth: 3
  max_features: "sqrt"
  n_estimators: 200
  subsample: 1.0

preprocessing_params:
  age_imputer_strategy: "median"
  embarked_imputer_strategy: "frequent"

This makes the system auditable at a glance — one file describes every assumption the pipeline makes about data and model behavior. It also means changing a hyperparameter or swapping a file path requires editing exactly one line, in exactly one place.
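A minimal sketch of what a loader for this file could look like. It is not the project's actual core.py: the use of PyYAML, the Config dataclass, and the load_config() name are all assumptions made for illustration, and the YAML is inlined so the example runs on its own.

```python
from dataclasses import dataclass

import yaml  # PyYAML; assumed here, the project may use a different parser

# Inlined copy of the excerpt above so the sketch is self-contained.
CONFIG_TEXT = """
model_params:
  SEED: 20250708
  learning_rate: 0.1
  max_depth: 3
  max_features: "sqrt"
  n_estimators: 200
  subsample: 1.0
preprocessing_params:
  age_imputer_strategy: "median"
  embarked_imputer_strategy: "frequent"
"""

REQUIRED_SECTIONS = ("model_params", "preprocessing_params")


@dataclass(frozen=True)
class Config:
    """Hypothetical container for the parsed sections."""
    model_params: dict
    preprocessing_params: dict


def load_config(text: str) -> Config:
    """Parse YAML text and fail fast if a required section is missing."""
    raw = yaml.safe_load(text) or {}
    missing = [name for name in REQUIRED_SECTIONS if name not in raw]
    if missing:
        raise KeyError(f"configuration is missing sections: {missing}")
    return Config(**{name: raw[name] for name in REQUIRED_SECTIONS})


config = load_config(CONFIG_TEXT)
print(config.model_params["n_estimators"])
```

Failing at load time when a section is missing keeps the "contract" idea honest: a broken configuration stops the run before any data is touched.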

Feature Engineering

The feature engineering that drove most of the model's lift in the research phase is now implemented as six sklearn transformer classes. Each has a fit() step and a transform() step: the stateful ones learn their parameters from training data in fit() and apply that same learned logic to new data in transform(), while the stateless ones have nothing to learn in fit().

This distinction matters in practice: a transformer fitted on training data must apply its learned parameters at inference time. Notebook code re-run on test data recomputes its statistics from that test data; a fitted sklearn transformer reuses what it learned in training.
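The fit/transform split can be illustrated with per-class fare capping. The class below is a hypothetical stand-in, not the project's CapFareOutliers: the FareCapper name, the 1.5 × IQR rule, and the constructor arguments are assumptions, and the data is a toy example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FareCapper(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: cap a variable at per-group upper bounds learned in fit()."""

    def __init__(self, variable="Fare", group_var="Pclass", factor=1.5):
        self.variable = variable
        self.group_var = group_var
        self.factor = factor

    def fit(self, X, y=None):
        grouped = X.groupby(self.group_var)[self.variable]
        q1, q3 = grouped.quantile(0.25), grouped.quantile(0.75)
        # Upper bound per group, computed from the training data only.
        self.upper_bounds_ = (q3 + self.factor * (q3 - q1)).to_dict()
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        bounds = X[self.group_var].map(self.upper_bounds_)
        # Replace values above their group's bound with the bound.
        # Rows from a group not seen in fit() have no bound and stay unchanged.
        X[self.variable] = X[self.variable].where(~(X[self.variable] > bounds), bounds)
        return X


train = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "Fare": [50.0, 60.0, 70.0, 80.0, 7.0, 8.0, 9.0, 10.0],
})
new = pd.DataFrame({"Pclass": [1, 3], "Fare": [500.0, 500.0]})

capper = FareCapper().fit(train)
capped = capper.transform(new)
print(capped)
```

Both new fares are capped at bounds derived from train (95.0 for class 1, 11.5 for class 3); the values in new play no part in computing them.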

| Transformer | What it does |
|---|---|
| TitleExtractor | Parses social titles from passenger names (Mr, Mrs, Miss, Master, Rare) and one-hot encodes them. Stores the training-set column structure so inference never silently misaligns. |
| CapFareOutliers | Fits IQR-based fare bounds per passenger class on training data and applies those same bounds at inference: the upper limit for first-class fares comes from training, not from whatever the new data contains. |
| GroupMedianImputer | Fills missing fares using the per-class median learned from training data. |
| AgeGroupEncoder | Converts continuous age into five fixed bins (Child, Teen, Adult, Middle-aged, Senior). Stateless; boundaries are fixed by domain logic. |
| IsFamilyOnBoard | Binary flag from SibSp and Parch. Traveling alone was a stronger predictor than either raw column. Stateless. |
| TicketCounter | Counts passengers sharing the same ticket number, a group travel signal that SibSp/Parch alone miss. Stateless. |
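The column-alignment idea described for TitleExtractor can be sketched as follows. This is an illustration under assumptions, not the project's code: the TitleOneHot name, the regular expression, and the rule that everything outside the four common titles becomes "Rare" are all hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TitleOneHot(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: one-hot encode titles with a column layout fixed in fit()."""

    COMMON = ("Mr", "Mrs", "Miss", "Master")

    def _titles(self, X):
        # "Braund, Mr. Owen Harris" -> "Mr"; uncommon or unparsable titles -> "Rare".
        raw = X["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
        return raw.where(raw.isin(self.COMMON), "Rare")

    def fit(self, X, y=None):
        dummies = pd.get_dummies(self._titles(X), prefix="Title")
        self.columns_ = list(dummies.columns)  # column structure seen in training
        return self

    def transform(self, X):
        dummies = pd.get_dummies(self._titles(X), prefix="Title")
        # Align to the training layout: absent titles become 0, unseen columns are dropped.
        dummies = dummies.reindex(columns=self.columns_, fill_value=0).astype(int)
        return pd.concat([X.drop(columns="Name"), dummies], axis=1)


train = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
    "Uruchurtu, Don. Manuel E",
]})
single = pd.DataFrame({"Name": ["Kelly, Mr. James"]})

encoder = TitleOneHot().fit(train)
encoded = encoder.transform(single)
print(encoded)
```

A single record contains only one title, so a naive get_dummies() call would produce one column. Reindexing to the stored layout gives the same width as in training, which is what the downstream model expects.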

These transformers are assembled into a sequential pipeline alongside standard imputation, encoding, and scaling steps:

Python

feature_pipeline = Pipeline([
    ("age_imputer",             MeanMedianImputer(variables=["Age"], ...)),
    ("embarked_imputer",        CategoricalImputer(variables=["Embarked"], ...)),
    ("fare_imputer",            GroupMedianImputer(variable="Fare", group_var="Pclass")),
    ("fare_capping",            CapFareOutliers()),
    ("sex_encoder",             CustomMapping(variable="Sex", mapping={"male": 0, "female": 1})),
    ("embarked_encoder",        SklearnTransformerWrapper(OneHotEncoder(drop="first"), ...)),
    ("title_extractor",         TitleExtractor()),
    ("age_group_feature",       AgeGroupEncoder()),
    ("isfamilyonboard_feature", IsFamilyOnBoard()),
    ("ticket_size_feature",     TicketCounter()),
    ("drop_features",           DropFeatures([...])),
    ("scaling",                 SklearnTransformerWrapper(StandardScaler(), ...)),
])

The pipeline transforms the raw 11-column input into a clean 13-feature numerical matrix — then serializes the entire fitted object to disk so inference requires no retraining.
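The save-then-load round trip might look roughly like this. The sketch assumes joblib as the serializer, uses a one-step stand-in pipeline instead of the real one, and the file name is hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in for the fitted feature pipeline; the real one has many more steps.
X_train = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.92], [35.0, 53.10]])
pipeline = Pipeline([("scaling", StandardScaler())]).fit(X_train)

with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "feature_pipeline.pkl"  # hypothetical file name
    joblib.dump(pipeline, artifact)    # training side: save the fitted object
    restored = joblib.load(artifact)   # prediction side: load it, do not refit

# The restored pipeline applies the statistics learned from X_train.
X_new = np.array([[30.0, 10.0]])
same = np.allclose(pipeline.transform(X_new), restored.transform(X_new))
print(same)
```

Pickled artifacts are generally tied to the library versions that wrote them, so it is worth pinning scikit-learn in the environment that loads the file.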

Validation & Prediction

Before any transformation runs, validate_data() enforces the data contract on every input — whether that input is a batch DataFrame or a single passenger record passed as a dictionary:

  • Is the input actually a DataFrame after normalization?
  • Are all required columns present?
  • Does each column match its expected dtype?
  • Are missing values surfaced before they silently propagate?

Schema violations raise immediately with a clear message. Missing values trigger a log warning and continue to imputation downstream. A corrupted prediction is a harder problem than a failed one.
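A sketch of how the four checks could be ordered. It is not the project's data_validator.py: the three-column schema, the use of pandas dtype predicates, and the exception types are assumptions chosen to keep the example short.

```python
import logging

import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype, is_string_dtype

logger = logging.getLogger(__name__)

# Hypothetical schema excerpt; the real expectations live in configuration.yml.
EXPECTED = {
    "Pclass": is_integer_dtype,
    "Sex": is_string_dtype,
    "Age": is_float_dtype,
}


def validate_data(data, expected=EXPECTED):
    """Raise on contract violations; warn and continue on missing values."""
    if not isinstance(data, pd.DataFrame):
        raise TypeError(f"expected a DataFrame, got {type(data).__name__}")

    missing_cols = [col for col in expected if col not in data.columns]
    if missing_cols:
        raise ValueError(f"missing required columns: {missing_cols}")

    wrong = [col for col, check in expected.items() if not check(data[col])]
    if wrong:
        raise TypeError(f"unexpected dtype in columns: {wrong}")

    # Missing values are surfaced here but left for the imputers to fill.
    null_counts = data[list(expected)].isna().sum()
    null_counts = null_counts[null_counts > 0]
    if not null_counts.empty:
        logger.warning("missing values by column: %s", null_counts.to_dict())
    return data


batch = pd.DataFrame({"Pclass": [3, 1], "Sex": ["male", "female"], "Age": [22.0, None]})
validated = validate_data(batch)  # passes, logging a warning about Age
```

Structural problems (wrong type, missing column, wrong dtype) stop execution, while missing values only log, matching the split between "raise immediately" and "warn and continue" described above.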

Python

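import numpy as np
import pandas as pd


def make_predictions(data: pd.DataFrame | dict) -> np.ndarray:
    # A single record arrives as a dict; wrap it as a one-row DataFrame.
    if isinstance(data, dict):
        data = pd.DataFrame([data])

    # Raises on schema or dtype violations before any transformation runs.
    validate_data(data)
    # feature_pipeline and model_pipeline are the fitted artifacts loaded from disk.
    X_transformed = feature_pipeline.transform(data[feature_pipeline.feature_names_in_])

    return model_pipeline.predict(X_transformed)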

The function accepts both shapes of input intentionally — a DataFrame maps to batch inference, a dictionary maps to the natural structure of a single API request body. Either way, validation runs first.

Results

The model is Gradient Boosting — selected in the research phase for consistent accuracy and low cross-validation variance. The pipeline adds no new features and changes no hyperparameters. Performance is identical to the notebook prototype, which is the point.
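For reference, the model_params block from the configuration maps onto scikit-learn's GradientBoostingClassifier roughly as below. Two things are assumptions: that SEED is passed as random_state, and that 5-fold cross-validation is used. The data is synthetic, so the printed scores say nothing about the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Values copied from the model_params block of configuration.yml.
model_params = {
    "SEED": 20250708,
    "learning_rate": 0.1,
    "max_depth": 3,
    "max_features": "sqrt",
    "n_estimators": 200,
    "subsample": 1.0,
}

# Assumption: SEED is handed to the estimator as random_state.
params = {key: value for key, value in model_params.items() if key != "SEED"}
model = GradientBoostingClassifier(random_state=model_params["SEED"], **params)

# Synthetic stand-in with the same shape as the prepared training matrix
# (891 rows, 13 features); it is not the Titanic data.
X, y = make_classification(n_samples=891, n_features=13, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"mean={scores.mean():.3f} std={scores.std():.3f}")
```

Reporting the standard deviation next to the mean is one way to check the "low cross-validation variance" criterion mentioned above.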

| Metric | Value |
|---|---|
| Training accuracy | 80%+ |
| Cross-validation accuracy | ~83% |
| Kaggle public leaderboard | ~79% |
| End-to-end training time | < 30 seconds |
| Prediction latency (single record) | < 1 ms |

The ~4pp gap between cross-validation and leaderboard is expected — Kaggle's held-out test set is never seen during training or tuning, and the training set of 891 records limits generalization regardless of model choice.

Reflection

The clearest thing this project demonstrates is not technical — it is judgment about when to stop experimenting and start building.

The research phase was the right place to try algorithms, engineer features freely, and iterate without consequence. The pipeline phase is the right place to lock those decisions into a structure that another person — or a future version of yourself — can run, inspect, and modify without reconstructing the reasoning from scratch.

The skill is knowing which phase you are in, and not conflating the two.

This project shows the translation between the two: modeling decisions converted into a Python workflow with clear module boundaries, saved artifacts, and a consistent path from raw data to predictions — built to be extended rather than re-explained every time it runs.

Future Steps

The architecture is structured so the remaining steps are additions, not rewrites:

| Next Step | What it enables |
|---|---|
| Testing (pytest) | Unit tests for each transformer and the validator against synthetic data |
| API layer (FastAPI) | make_predictions() maps directly to a POST endpoint; input validation is already handled |
| Containerization (Docker) | Runtime dependencies are minimal: Python, scikit-learn, pandas, joblib |
| CI/CD (GitHub Actions) | Automated retraining and artifact publishing on push to main |
| Model monitoring | Drift detection on input distributions and prediction confidence over time |
