Titanic Survival Prediction Machine Learning Research

Case Brief

The Kaggle Titanic survival dataset is the "Hello World" of machine learning, but it serves as an excellent constrained sandbox to practice disciplined modeling.

This project was the research and prototyping phase of a two-part machine learning workflow. Before building a production-ready pipeline, I needed to uncover the true predictive signals. The goal wasn't just to maximize a leaderboard score through brute-force complexity, but to show a clean modeling journey: auditing data, engineering defensible features, benchmarking algorithms, and ultimately selecting the most stable model structure.

The output? A finalized Gradient Boosting model and a refined set of preprocessing logic ready to be modularized.

The Constraint Sandbox

The simplicity of the Kaggle Titanic dataset makes every shortcut easy to see. The raw datasets are small:

Split	Rows	Purpose
`train.csv`	891	Training data with the `Survived` target
`test.csv`	418	Holdout passenger records for Kaggle submission

Constraints bred creativity. In `train.csv`, Age was missing for 177 passengers, Cabin for 687, and Embarked for 2. Rather than treating these gaps purely as "data cleaning" chores, I treated them as modeling decisions. Cabin was too sparse to lean on without creating brittle signals. Age required imputation that didn't destroy its variance.
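A missing-value audit like the one described above can be sketched in a few lines of pandas. The helper name `missing_report` and the toy frame are illustrative assumptions, not code from the notebooks.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column, worst first."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "share": (counts / len(df)).round(3),
    })
    # Keep only the columns that actually have gaps.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "Q"],
})
print(missing_report(toy))
```

Run against the real training file, a report of this shape would surface the Age, Cabin, and Embarked gaps in one view, which makes it easier to decide column by column whether to impute, flag, or drop.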

EDA & Analytical Decisions

My exploratory data analysis focused heavily on relationships that could generalize. The important choice was restraint—avoiding the trap of overfitting an 891-row sample.

Signal	What I found	Modeling implication
Sex	Survival was overwhelmingly biased toward females.	Formed the core baseline feature.
Passenger Class	1st class passengers survived at a much higher rate than 3rd class.	Socio-economic advantage was a strong proxy worth preserving.
Age	Median age was similar across survival groups, but young children survived more.	Median imputation by gender/class, followed by grouping into Age Bands.
Family	Survivors were more likely to have some family aboard, but passengers in large families survived less often.	Combined `SibSp` and `Parch` into a new `isfamilyonboard` indicator.
Fare	Heavily skewed with extreme outliers.	Cap fare outliers by class using IQR-based bounds.
Name/Title	Titles carried deep social context (e.g., Master vs. Mr)	Extracted and grouped titles.
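The Age row above (group-wise median imputation, then banding) could look roughly like the sketch below. The band edges, the labels, and the function name `impute_and_band_age` are hypothetical; the notebook's actual cut points are not given here.

```python
import numpy as np
import pandas as pd

# Hypothetical band edges and labels, chosen only for illustration.
AGE_BINS = [0, 12, 18, 35, 60, np.inf]
AGE_LABELS = ["Child", "Teen", "YoungAdult", "Adult", "Senior"]


def impute_and_band_age(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Median age within each (Sex, Pclass) group, broadcast back to every row.
    group_median = out.groupby(["Sex", "Pclass"])["Age"].transform("median")
    # Fall back to the overall median if a whole group has no known ages.
    out["Age"] = out["Age"].fillna(group_median).fillna(out["Age"].median())
    # Left-closed bins: [0, 12), [12, 18), [18, 35), [35, 60), [60, inf).
    out["AgeGroup"] = pd.cut(out["Age"], bins=AGE_BINS, labels=AGE_LABELS, right=False)
    return out


toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan],
})
result = impute_and_band_age(toy)
print(result)
```

Filling by group rather than with one global median keeps more of the spread in Age. This sketch computes the medians on whatever frame it receives; for out-of-sample data the medians would need to be learned on the training set and reused.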

Feature Engineering

Notebook 02_feature_engineering.ipynb transitioned raw signals into 12 concrete, modeled features. The logic was crafted so it wouldn't silently break on out-of-sample data.

Engineered Feature	Transformation Strategy
`AgeGroup`	Filled missing values strategically, then bucketed continuous Age into fixed categorical bands.
`Title_Miss`, `Title_Mr`, `Title_Mrs`, `Title_Rare`	Parsed the raw string `Name` field, bucketed rare titles together, and applied one-hot encoding.
`Fare`	Applied IQR-based upper limits strictly grouped by `Pclass`.
`isfamilyonboard`	Converted traveling party size into a binary flag indicating whether a passenger had family aboard or was traveling alone.
`TicketGroupSize`	Counted passengers sharing the same ticket number—finding group travel signals that native family columns missed.
`Sex` & `Embarked`	Standard numerical and one-hot encoding mappings.
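Three of the features in the table above can be sketched together. The title groupings, the alias map, and the flag polarity (1 means family aboard) are assumptions. The four listed `Title_*` columns are consistent with five title groups one-hot encoded with the first category dropped, but that is an inference rather than something stated in the notebooks.

```python
import pandas as pd

# Assumption: these four titles are kept and everything else becomes "Rare".
COMMON_TITLES = {"Mr", "Mrs", "Miss", "Master"}
# Assumption: variant spellings are folded into their common equivalents.
TITLE_ALIASES = {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}


def add_group_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Names look like "Surname, Title. Given names": take the text between ", " and ".".
    title = out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    title = title.replace(TITLE_ALIASES)
    out["Title"] = title.where(title.isin(COMMON_TITLES), "Rare")
    # 1 if the passenger has any sibling, spouse, parent, or child aboard.
    out["isfamilyonboard"] = ((out["SibSp"] + out["Parch"]) > 0).astype(int)
    # Number of rows in this frame that share the passenger's ticket number.
    out["TicketGroupSize"] = out["Ticket"].map(out["Ticket"].value_counts())
    return out


# Made-up passengers for illustration.
toy = pd.DataFrame({
    "Name": [
        "Doe, Mr. John",
        "Doe, Mrs. Jane",
        "Roe, Miss. Ann",
        "Poe, Master. Tim",
        "Moe, Dr. Sam",
    ],
    "SibSp": [1, 1, 0, 3, 0],
    "Parch": [0, 0, 0, 1, 0],
    "Ticket": ["A1", "A1", "B2", "C3", "D4"],
})
features = add_group_features(toy)
encoded = pd.get_dummies(features, columns=["Title"], prefix="Title", drop_first=True)
print(encoded.filter(like="Title_").columns.tolist())
# → ['Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Rare']
```

Counting ticket numbers on the combined train and test frames would give fuller group sizes than counting on the training rows alone, at the cost of reading test-set structure during feature construction.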

These transformations gave the models a cleaner, more informative input space. More importantly, they clarified exactly what custom sklearn transformers I'd need to write for the production pipeline version of this project.
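As one illustration of what such a custom sklearn transformer might look like, here is a sketch of class-wise fare capping. The class name `FareCapper` and the multiplier `k=1.5` (the conventional Tukey fence) are assumptions; the source only says the caps are IQR-based and grouped by `Pclass`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FareCapper(BaseEstimator, TransformerMixin):
    """Clip Fare at Q3 + k * IQR, with the bound learned separately per Pclass."""

    def __init__(self, k: float = 1.5):
        self.k = k

    def fit(self, X: pd.DataFrame, y=None):
        q1 = X.groupby("Pclass")["Fare"].quantile(0.25)
        q3 = X.groupby("Pclass")["Fare"].quantile(0.75)
        self.upper_ = (q3 + self.k * (q3 - q1)).to_dict()
        # Fallback cap for a class that was never seen during fit.
        self.default_upper_ = max(self.upper_.values())
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        out = X.copy()
        caps = out["Pclass"].map(self.upper_).fillna(self.default_upper_)
        # Element-wise clip: each row uses the cap of its own class.
        out["Fare"] = out["Fare"].clip(upper=caps)
        return out


# Made-up fares: one extreme value per class.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 1, 3, 3, 3, 3, 3],
    "Fare": [10.0, 20.0, 30.0, 40.0, 500.0, 5.0, 7.0, 9.0, 11.0, 100.0],
})
capper = FareCapper().fit(toy)
capped = capper.transform(toy)
print(capper.upper_)  # → {1: 70.0, 3: 17.0}
```

Because the bounds are stored in `fit` and only applied in `transform`, the same caps learned on training data carry over unchanged to validation and test rows.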

Algorithm Comparison & Tuning

I started with a simple baseline: a predictor that uses only Sex. It yielded 0.776 accuracy. A model that cannot clearly beat that baseline isn't adding enough value.
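A gender-only baseline of this kind takes only a few lines. The sketch below assumes the rule "predict survival for every female passenger, death for every male passenger", and the function name is illustrative.

```python
import pandas as pd


def gender_baseline_accuracy(df: pd.DataFrame) -> float:
    """Accuracy of predicting 1 for every female passenger and 0 for every male."""
    predicted = (df["Sex"] == "female").astype(int)
    return float((predicted == df["Survived"]).mean())


# Made-up rows for illustration.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Survived": [1, 0, 0, 0],
})
print(gender_baseline_accuracy(toy))  # → 0.75
```

Scoring this rule on the same validation split as the candidate models keeps the comparison like for like.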

I benchmarked eight supervised approaches to see which algorithms fit the feature space best:

Model	Validation Accuracy
Baseline (Gender-only)	0.776
Logistic Regression	0.798
Decision Tree	0.771
Random Forest	0.780
LightGBM	0.816
K-Nearest Neighbors	0.816
CatBoost	0.812
Gradient Boosting	0.830
XGBoost	0.834

The Final Cut: XGBoost vs. Gradient Boosting

First-pass accuracy can be misleading with only 891 rows, because a single split can overstate quality. I ran 5-fold cross-validation and hyperparameter grid searches on the top two:

Candidate	Tuned Validation Accuracy	Best CV Accuracy	CV Standard Deviation
XGBoost	0.812	0.828	0.036
Gradient Boosting	0.830	0.826	0.023

XGBoost showed a slightly higher best cross-validation accuracy (0.828 vs. 0.826), but its fold-to-fold standard deviation was roughly 1.6 times larger (0.036 vs. 0.023), and its tuned validation accuracy fell to 0.812. Gradient Boosting delivered nearly the same cross-validated accuracy with more consistent, stable performance across folds.
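The mean and spread behind a comparison like this can be read directly from scikit-learn's `GridSearchCV`. The sketch below runs on synthetic data from `make_classification`, not the Titanic features, and uses a deliberately tiny, hypothetical grid. The notebook's real search space is not shown, and XGBoost is left out so that the example depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in with the same shape as the engineered data (891 rows, 12 columns).
X, y = make_classification(n_samples=891, n_features=12, n_informative=6, random_state=0)

# Hypothetical grid, kept small so the example runs quickly.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=cv,
)
search.fit(X, y)

# Mean and fold-to-fold standard deviation for the winning configuration.
best = search.best_index_
mean_acc = search.cv_results_["mean_test_score"][best]
std_acc = search.cv_results_["std_test_score"][best]
print(f"best params: {search.best_params_}")
print(f"CV accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```

Reporting the standard deviation next to the mean is what allows two candidates with near-identical averages to be separated on stability.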

Final Setup

The selected model was a deliberately restrained `GradientBoostingClassifier`:

Python

GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    max_features="sqrt",
    subsample=1.0,
    random_state=20250708
)

Final performance on the validation split was 0.830 accuracy (precision 0.802, recall 0.747, F1 0.774). The confusion matrix showed 120 true negatives and 65 true positives, with only 16 false positives, indicating that the model learned substantially beyond the gender baseline.
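The reported metrics can be cross-checked from the confusion-matrix counts. The false-negative count is not stated in the text; the value of 22 used below is inferred from the reported recall, since 65 / (65 + 22) = 0.747, and should be read as an assumption.

```python
# Counts from the text; FN is inferred from recall = 0.747, not stated directly.
tn, fp, tp = 120, 16, 65
fn = 22  # assumption: 65 / (65 + 22) rounds to 0.747

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"n={total} acc={accuracy:.3f} prec={precision:.3f} rec={recall:.3f} f1={f1:.3f}")
# → n=223 acc=0.830 prec=0.802 rec=0.747 f1=0.774
```

With that inferred count, all four figures agree with the reported values, and the implied validation set of 223 rows is about a quarter of the 891 training rows.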

Next Steps

The research phase did exactly what it was supposed to do. It acted as an exploration sandbox to identify the exact feature transformations and model family worth keeping.

The main gap? Experiment hygiene. Re-running cells from top to bottom works for research, but not for software.

Instead of treating the Kaggle submission (submission.csv) as the finish line, this research became the blueprint for step two: translating the notebook workflows into a robust, object-oriented system. (Read Part 2: Titanic ML Production Pipeline)
