• Alex Ricciardi

Predict Titanic Survival

My Codecademy practice project from the Data Scientist Path Foundations of Machine Learning: Supervised Learning Course, Logistic Regression.

▪ Overview

The RMS Titanic set sail on its maiden voyage in 1912, crossing the Atlantic from Southampton, England to New York City. The ship never completed the voyage, sinking to the bottom of the Atlantic Ocean after hitting an iceberg, bringing down 1,502 of 2,224 passengers onboard.

In this project I will create a Logistic Regression model that predicts which passengers survived the sinking of the Titanic based on age, sex and purchased ticket class.

The data I will be using for training our model is provided by Kaggle.

▪ Project Goal

Predict which passengers survived the sinking of the Titanic using machine learning logistic regression models.

▪ Project Requirements

  • Python3

  • Machine Learning

  • The Python Libraries:




▪ Links:

Project Python Code (Jupyter Notebook)

Preprocessing the Data

The file passengers.csv contains the data of 892 passengers onboard the Titanic when it sank that fateful day.

▪ The Data

▪ Clean The Data

Given the saying, “women and children first,” Sex and Age seem like good features to predict survival. All values female are replaced with 1 and all values male are replaced with 0.

Replacing multiple missing values, or nans, in the Age column with the titanic passengers mean age.


# Passenger 888
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64


# Passenger 888 updated
0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: Age, Length: 891, dtype: float64

▪ Utilize The Feature Pclass

Given the strict class system onboard the Titanic, dividing the Pclass feature into

the sub-features FirstClass and SecondClass will help modeling the prediction of who is more likely survive the sinking of the Titanic.

The feature FirstClass attributes the value 1 to passengers in first class and the values 0 for all other passengers.

The feature SecondClass attributes the value 1 to passengers in second class and the values 0 for all other passengers.

Logistic regression

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable.

Select and Split the Data

Now that the data is clean, to build my model, I selected the columns Sex, Age, FirstClass, and SecondClass (independent variables) and store them in a variable named features. I selected the column Survived (binary dependent variable) and store it a variable named survival.

features = passengers[['Sex', 'Age', 'FirstClass', 'SecondClass']]
survival = passengers.Survived

# Perform train, test, split
features_train, features_test, labels_train,  labels_test = train_test_split(features,survival,test_size=0.25, random_state=42)

Normalize the Data

Since I am using Logistic Regression models, I need to implement the Regularization method on the features, I need to scale the feature data, for that purpose, I created a StandardScaler object, .fit_transform() it on the training features, and .transform() the test features.

# Scale the feature data so it has mean = 0 and standard deviation = 1
scaler = StandardScaler()
norm_train_features = scaler.fit_transform(features_train)
norm_test_features = scaler.fit_transform(features_test)

Create and Evaluate the Model


# Create and train the model
model = LogisticRegression() , labels_train)

Train Datasets Coefficient of Determination

# Score the model on the train dataset
model.score(norm_train_features , labels_train)


Test Datasets Coefficient of Determination

# Score the model on the test dataset
model.score(norm_test_features , labels_test)


Features Coefficient of Determination


[('Sex', 1.2127928945459108), 
('Age', -0.366264984201663), 
('FirstClass', 0.8732654291428307), 
('SecondClass', 0.5229584041786293)]

Predict with the Model

Let’s use the model to make predictions on the survival of a few fateful passengers. Provided in the code editor is information for 3rd class passenger Jack and 1st class passenger Rose, from the movie Titanic, stored in NumPy arrays. The arrays store 4 feature values, in the following order:

  • Sex, represented by a 0 for male and 1 for female

  • Age, represented as an integer in years

  • FirstClass, with a 1 indicating the passenger is in first class

  • SecondClass, with a 1 indicating the passenger is in second class

The third array, John_Doe, is a random passenger array

# Sample passenger features
# Male, 20 years old, No-first class, No-Second class
Jack = np.array([0.0,20.0,0.0,0.0]) 
# Female, 17 years old, Yes-first class, No-Second class
Rose = np.array([1.0,17.0,1.0,0.0])
# Female, 49 years old, no-first class, No-Second class
John_Doe = np.array([1.0,49.0,0.0,0.0])

Model's Predictions

# Combine passenger arrays
sample_passengers = np.array([Jack , Rose, John_Doe])
# Scale the sample passenger features
sample_passengers = scaler.transform(sample_passengers)
# Make survival predictions

# Output
[0 1 0]

Jack and John_Doe are predicted to Not survive

Model's Probability of Survival Values

prob = model.predict_proba(sample_passengers)

prob_df = pd.DataFrame({
    'Passenger':['Rose', 'Jack', 'John_Doe'], 
    '% Likely To Survive':[ val[0]*100 for val in prob], 
    '% Likely To Not-Survive':[ val[1]*100 for val in prob]})

  • Jack, a 20 years old male in 3rd class, is 91.50% likely to Not-survive the sinking of the Titanic.

  • John_Doe, a 49 years old female in 3rd class, is 38.74% likely to Not-survive the sinking of the Titanic.

  • Rose, a 17 years old female in 1st class, is 10.57% likely to Not-survive the sinking of the Titanic.

4 views0 comments

© 2020 by Alex Ricciardi.