Titanic Survival Prediction Machine Learning Research

Used the Kaggle Titanic dataset as a controlled machine-learning exploration sandbox. Discovered insights through EDA, engineered defensible features, benchmarked 8 model families, and selected Gradient Boosting for its stability-performance tradeoff.

Python · Scikit-learn · XGBoost · CatBoost · Pandas

Published August 2025

Case Brief

The Kaggle Titanic survival dataset is the "Hello World" of machine learning, but it serves as an excellent constrained sandbox to practice disciplined modeling.

This project was the research and prototyping phase of a two-part machine learning workflow. Before building a production-ready pipeline, I needed to uncover the true predictive signals. The goal wasn't just to maximize a leaderboard score through brute-force complexity, but to show a clean modeling journey: auditing data, engineering defensible features, benchmarking algorithms, and ultimately selecting the most stable model structure.

The output? A finalized Gradient Boosting model and a refined set of preprocessing logic ready to be modularized.

The Constraint Sandbox

The simplicity of the Kaggle Titanic dataset makes every shortcut easy to see. The raw datasets are small:

| Split | Rows | Purpose |
| --- | --- | --- |
| train.csv | 891 | Training data with the Survived target |
| test.csv | 418 | Holdout passenger records for Kaggle submission |

Constraints bred creativity. For instance, Age was missing for 177 passengers, Cabin for 687, and Embarked for 2. Rather than treating these gaps purely as "data cleaning" chores, I treated them as modeling decisions. Cabin was too sparse to lean on without creating brittle signals. Age required imputation that didn't destroy its variance.

EDA & Analytical Decisions

My exploratory data analysis focused heavily on relationships that could generalize. The important choice was restraint—avoiding the trap of overfitting an 891-row sample.

| Signal | What I found | Modeling implication |
| --- | --- | --- |
| Sex | Survival was overwhelmingly biased toward females. | Formed the core baseline feature. |
| Passenger Class | 1st class passengers survived at a much higher rate than 3rd class. | Socio-economic advantage was a strong proxy worth preserving. |
| Age | Median age was similar across survival groups, but young children survived more. | Median imputation by gender/class, followed by grouping into Age Bands. |
| Family | Survivors were more likely to have some family, but large families perished. | Combined SibSp and Parch into a new isfamilyonboard indicator. |
| Fare | Heavily skewed with extreme outliers. | Cap fare outliers by class using IQR-based bounds. |
| Name/Title | Titles carried deep social context (e.g., Master vs. Mr). | Extracted and grouped titles. |
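
The Age decision can be sketched as follows. This is a minimal illustration on a made-up six-row frame, not the notebook's code: the column names follow the Kaggle schema, but the band edges and labels are assumptions, since the actual cut points are not stated here.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample; column names follow the Kaggle Titanic schema.
df = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [20.0, 30.0, np.nan, 40.0, 50.0, np.nan],
})

# Fill each missing Age with the median of its (Sex, Pclass) group.
group_median = df.groupby(["Sex", "Pclass"])["Age"].transform("median")
df["Age"] = df["Age"].fillna(group_median)

# Assumed band edges and labels (hypothetical, for illustration only).
bins = [0, 12, 18, 35, 60, np.inf]
labels = ["Child", "Teen", "YoungAdult", "Adult", "Senior"]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)

print(df)
```

Group medians keep imputed ages closer to each passenger's likely profile than a single global median would. In a real pipeline the medians would be learned on the training split only and then reused on new data.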

Feature Engineering

Notebook 02_feature_engineering.ipynb transitioned raw signals into 12 concrete, modeled features. The logic was crafted so it wouldn't silently break on out-of-sample data.

| Engineered Feature | Transformation Strategy |
| --- | --- |
| AgeGroup | Filled missing values strategically, then bucketed continuous Age into fixed categorical bands. |
| Title_Miss, Title_Mr, Title_Mrs, Title_Rare | Parsed the raw string Name field, bucketed rare titles together, and applied one-hot encoding. |
| Fare | Applied IQR-based upper limits strictly grouped by Pclass. |
| isfamilyonboard | Converted traveling party size into a binary flag indicating whether a passenger traveled with family or alone. |
| TicketGroupSize | Counted passengers sharing the same ticket number, finding group travel signals that native family columns missed. |
| Sex & Embarked | Standard numerical and one-hot encoding mappings. |
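
A few of these transformations could look roughly like the sketch below. It runs on five illustrative rows (the shared ticket is made up), and several details are assumptions rather than the notebook's actual logic: the regex, the set of "common" titles, and the use of `drop_first=True`, which happens to leave the four Title_* columns listed above by dropping Master as the reference level.

```python
import pandas as pd

# Illustrative rows only; the shared ticket number is invented.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
        "Palsson, Master. Gosta Leonard",
        "Uruchurtu, Don. Manuel E",
    ],
    "SibSp":  [1, 1, 0, 3, 0],
    "Parch":  [0, 0, 0, 1, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "349909", "PC 17599"],
})

# Title: the text between the comma and the first period in Name.
title = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Assumed grouping: anything outside the four common titles becomes "Rare".
common = {"Mr", "Mrs", "Miss", "Master"}
df["Title"] = title.where(title.isin(common), "Rare")
title_dummies = pd.get_dummies(df["Title"], prefix="Title", drop_first=True)

# Binary family flag: 1 if any sibling, spouse, parent, or child is aboard.
df["isfamilyonboard"] = ((df["SibSp"] + df["Parch"]) > 0).astype(int)

# Number of passengers in this frame sharing each ticket number.
df["TicketGroupSize"] = df["Ticket"].map(df["Ticket"].value_counts())

print(pd.concat([df[["Title", "isfamilyonboard", "TicketGroupSize"]], title_dummies], axis=1))
```

Whether ticket counts are computed on the training rows alone or on train and test combined is not specified above; the sketch simply counts within whatever frame it is given.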

These transformations dramatically improved the model input space. More importantly, they clarified exactly what custom sklearn transformers I'd need to write for the production pipeline version of this project.
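
As one possible shape for such a transformer, here is a sketch of the per-class fare cap. The class name `ClassFareCapper`, the `k=1.5` multiplier, and the fallback for unseen classes are all hypothetical choices for illustration; only the idea of an IQR-based upper limit grouped by Pclass comes from the research above.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ClassFareCapper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip Fare at Q3 + k * IQR, learned per Pclass."""

    def __init__(self, k=1.5):
        self.k = k  # assumed multiplier; the source only says "IQR-based"

    def fit(self, X, y=None):
        q1 = X.groupby("Pclass")["Fare"].quantile(0.25)
        q3 = X.groupby("Pclass")["Fare"].quantile(0.75)
        self.upper_ = (q3 + self.k * (q3 - q1)).to_dict()
        # Fallback cap for a class value never seen during fit.
        self.default_upper_ = max(self.upper_.values())
        return self

    def transform(self, X):
        X = X.copy()
        caps = X["Pclass"].map(self.upper_).fillna(self.default_upper_)
        X["Fare"] = X["Fare"].clip(upper=caps)
        return X


# Made-up fares: caps are learned on `train` and reused on `new`.
train = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    "Fare":   [10.0, 20.0, 30.0, 40.0, 500.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})
new = pd.DataFrame({"Pclass": [1, 3], "Fare": [60.0, 50.0]})

capper = ClassFareCapper().fit(train)
capped_train = capper.transform(train)
capped_new = capper.transform(new)
print(capper.upper_)
print(capped_new)
```

Learning the bounds in `fit` and applying them in `transform` is what keeps the cap from silently changing when the logic meets out-of-sample data.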

Algorithm Comparison & Tuning

I started with a simple baseline: a predictor that uses only Sex. It reached 0.776 validation accuracy. A model that cannot clearly beat that baseline isn't adding enough value.

I benchmarked eight supervised approaches to see which algorithms best fit the feature space:
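
A gender-only baseline of this kind takes only a few lines. The sketch below uses a made-up five-row validation frame, so its accuracy is not the 0.776 reported above; the function name `gender_baseline` is hypothetical.

```python
import pandas as pd


def gender_baseline(df):
    """Predict survival (1) for every female passenger and 0 otherwise."""
    return (df["Sex"] == "female").astype(int)


# Made-up validation rows, for illustration only.
valid = pd.DataFrame({
    "Sex":      ["female", "male", "male", "female", "male"],
    "Survived": [1, 0, 1, 1, 0],
})

preds = gender_baseline(valid)
accuracy = (preds == valid["Survived"]).mean()
print(f"baseline accuracy: {accuracy:.3f}")
```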

| Model | Validation Accuracy |
| --- | --- |
| Baseline (Gender-only) | 0.776 |
| Logistic Regression | 0.798 |
| Decision Tree | 0.771 |
| Random Forest | 0.780 |
| LightGBM | 0.816 |
| K-Nearest Neighbors | 0.816 |
| CatBoost | 0.812 |
| Gradient Boosting | 0.830 |
| XGBoost | 0.834 |
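
A single-split benchmark like this one can be sketched as a loop over a dictionary of estimators. The example is an assumption-laden stand-in: it uses synthetic data from `make_classification` instead of the engineered Titanic features, a 25% stratified validation split, default hyperparameters, and only the scikit-learn models, so its scores will not match the table. XGBoost, LightGBM, and CatBoost classifiers expose the same `fit`/`predict` interface and could be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an 891-row, 12-feature matrix (not Titanic data).
X, y = make_classification(
    n_samples=891, n_features=12, n_informative=6, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit every model on the same split and record validation accuracy.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_valid, model.predict(X_valid))

for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22}{acc:.3f}")
```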

The Final Cut: XGBoost vs. Gradient Boosting

First-pass accuracy is misleading with only 891 rows, as a single split can overstate quality. I performed 5-fold cross-validation and hyperparameter grid searches on the top two:

| Candidate | Tuned Validation Accuracy | Best CV Accuracy | CV Standard Deviation |
| --- | --- | --- | --- |
| XGBoost | 0.812 | 0.828 | 0.036 |
| Gradient Boosting | 0.830 | 0.826 | 0.023 |

XGBoost showed a slightly higher best cross-validation accuracy (0.828 vs. 0.826) but a fold-to-fold standard deviation about 1.6 times larger (0.036 vs. 0.023). Gradient Boosting gave nearly the same average cross-validated accuracy with a much tighter spread across folds, indicating more consistent, stable performance.
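
The mean and spread compared here can be read straight from a grid search. The sketch below assumes scikit-learn's `GridSearchCV` with 5 folds on synthetic data; the parameter grid is hypothetical and much smaller than a real search would be.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the Titanic features.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=6, random_state=0
)

# Hypothetical, deliberately tiny grid to keep the example fast.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

# Mean and standard deviation of fold accuracy for the best configuration.
best = search.best_index_
mean_acc = search.cv_results_["mean_test_score"][best]
std_acc = search.cv_results_["std_test_score"][best]

print(f"best params: {search.best_params_}")
print(f"CV accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Reporting the standard deviation next to the mean is what makes the stability comparison possible; two candidates with near-identical means can differ a lot in how much they swing between folds.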

Final Setup

The selected model was a specifically restrained GradientBoostingClassifier:

```python
from sklearn.ensemble import GradientBoostingClassifier

GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    max_features="sqrt",
    subsample=1.0,
    random_state=20250708,
)
```

Final performance on the validation split was 0.830 accuracy (precision 0.802, recall 0.747, F1 0.774). The confusion matrix showed 120 true negatives and 65 true positives, with only 16 false positives; the reported recall implies about 22 false negatives. This shows the model learned substantially beyond the gender baseline.
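
The reported metrics can be re-derived from the confusion-matrix counts as a quick arithmetic check. The false-negative count of 22 is an assumption: it is not stated above, but it is the integer implied by 65 true positives and a recall of 0.747.

```python
# Counts reported in the text.
tn, fp, tp = 120, 16, 65
# Assumption: inferred from recall = 0.747 (65 / 0.747 is about 87 positives).
fn = 22

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# -> accuracy=0.830 precision=0.802 recall=0.747 f1=0.774
```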

Next Steps

The research phase did exactly what it was supposed to do. It acted as an exploration sandbox to identify the exact feature transformations and model family worth keeping.

The main gap? Experiment hygiene. Re-running cells from top-to-bottom works for research, but not for software.

Instead of treating the Kaggle submission (submission.csv) as the finish line, this research became the blueprint for step two: translating the notebook workflows into a robust, object-oriented system. (Read Part 2: Titanic ML Production Pipeline)
