Kaggle Titanic Improved

Preface

Before you begin reading this article, I have a confession to make. Although I believe the data analysis presented here is more systematic and rigorous, the final score it produces may not be as good as hoped. Nevertheless, I still want to share my thought process; after all, the data analysis in the “Base Model” article left much to be desired.

1. Improvements

1.1 Exploratory Data Analysis (EDA)

First, a simple idea: could gender be related to survival rate? After all, “ladies first.”

import seaborn as sns
import matplotlib.pyplot as plt

# How many male and female passengers are in the training set?
sns.countplot(x='Sex', data=traindata)
plt.title('Distribution of Sex')
plt.savefig('/deeplearning/KaggleProjects/titanic/Photoes/Sex_Numbers.png')
plt.show()

# Survival rate by sex
sns.barplot(x='Sex', y='Survived', data=traindata)
plt.title('Sex_Survived')
plt.savefig('/deeplearning/KaggleProjects/titanic/Photoes/Sex_Survived.png')
plt.show()

Figure-1 Sex_Numbers

Figure-2 Sex_Survived

We hypothesize that social status could be correlated with survival rate, as depicted in movies where lower-class workers often had fewer opportunities to escape.

# Overlaid histograms of ticket class for survivors vs. non-survivors
plt.hist([data_all[data_all['Survived'] == 1]['Pclass'],
          data_all[data_all['Survived'] == 0]['Pclass']],
         color=['g', 'r'], label=['Survived', 'Dead'])
plt.title('Pclass_Survived')
plt.xlabel('Pclass')
plt.ylabel('Numbers')
plt.legend()
plt.savefig('/deeplearning/KaggleProjects/titanic/Photoes/Pclass_Survived.png')
plt.show()

Figure-3 Pclass_Survived

# Overlaid histograms of fare for survivors vs. non-survivors
plt.hist([data_all[data_all['Survived'] == 1]['Fare'],
          data_all[data_all['Survived'] == 0]['Fare']],
         color=['g', 'r'], bins=50, label=['Survived', 'Dead'])
plt.title('Fare_Survived')
plt.xlabel('Fare')
plt.ylabel('Numbers')
plt.legend()
plt.savefig('/deeplearning/KaggleProjects/titanic/Photoes/Fare_Survived.png')
plt.show()

Figure-4 Fare_Survived

Indeed, as it turns out, a higher ticket class is associated with a higher probability of survival.

In other words, higher social status increases the rate of survival.
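To put numbers on this claim rather than reading it off the bar charts, we can compute the survival rate per class directly. A minimal sketch, assuming traindata is the labelled training DataFrame used above:

# Survival rate by ticket class on the training set
print(traindata.groupby('Pclass')['Survived'].mean())

# Sex and class together, since both effects showed up in the plots
print(traindata.groupby(['Pclass', 'Sex'])['Survived'].mean())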

1.2 Feature Engineering

1.2.1 Cabin

Let’s shift our focus back to the data we previously discarded. We’ll start by analyzing the ‘Cabin’ column.

print(data_all['Cabin'].unique())
[nan 'C85' 'C123' 'E46' 'G6' 'C103' 'D56' 'A6' 'C23 C25 C27' 'B78' 'D33'
'B30' 'C52' 'B28' 'C83' 'F33' 'F G73' 'E31' 'A5' 'D10 D12' 'D26' 'C110'
'B58 B60' 'E101' 'F E69' 'D47' 'B86' 'F2' 'C2' 'E33' 'B19' 'A7' 'C49'
'F4' 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35' 'C87' 'B77'
'E67' 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'D' 'C22 C26' 'C106'
'C65' 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124' 'C91'
'E40' 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E10' 'E44' 'A34'
'C104' 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20' 'B79'
'E25' 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101' 'C68'
'A10' 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48' 'E58'
'C126' 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'F G63' 'C62 C64' 'E24' 'C90'
'C45' 'E8' 'B101' 'D45' 'C46' 'D30' 'E121' 'D11' 'E77' 'F38' 'B3' 'D6'
'B82 B84' 'D17' 'A36' 'B102' 'B69' 'E49' 'C47' 'D28' 'E17' 'A24' 'C50'
'B42' 'C148' 'B45' 'B36' 'A21' 'D34' 'A9' 'C31' 'B61' 'C53' 'D43' 'C130'
'C132' 'C55 C57' 'C116' 'F' 'A29' 'C6' 'C28' 'C51' 'C97' 'D22' 'B10'
'E45' 'E52' 'A11' 'B11' 'C80' 'C89' 'F E46' 'B26' 'F E57' 'A18' 'E60'
'E39 E41' 'B52 B54 B56' 'C39' 'B24' 'D40' 'D38' 'C105']

We notice that ‘Cabin’ is composed of a letter followed by digits: the letter identifies the deck, and the digits give the room number on that deck. Clearly, the deck letter holds more meaningful information for our analysis.

Figure-5 Cabin

From the chart we can see how the ‘Cabin’ feature is distributed; as expected, survival chances were higher for passengers on the upper decks.
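For reference, here is a rough numeric check of that claim (a sketch only, not the code behind Figure-5): take the first letter of the known cabins and look at the survival rate per deck.

# Survival rate by deck letter, using only rows where Cabin and Survived are known
known = data_all.dropna(subset=['Cabin', 'Survived']).copy()
known['Deck'] = known['Cabin'].str[0]
print(known.groupby('Deck')['Survived'].mean().sort_values(ascending=False))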

As a preliminary step, I fill the missing values with ‘U’ (unknown), and then extract the first letter of each cabin.

data_all['Cabin'].fillna('U',inplace=True)
data_all['Cabin'] = data_all['Cabin'].map(lambda s: s[0])
print(data_all['Cabin'].value_counts())
Cabin
U 1014
C 94
B 65
D 46
E 41
A 22
F 21
G 5
T 1

Interestingly, we have no information about what the ‘T’ cabin was. (Could it be that someone slept on the mast?) It is not a problem, though: since there is only a single ‘T’ record, we can manually assign it to another deck.

Note that this record comes from the training data!

data_all.loc[(data_all['Cabin'] == 'T'), 'Cabin'] = 'F'

Next, a natural idea is to reassign the ‘U’ records to real decks based on their fare levels.

print(data_all.groupby("Cabin")['Fare'].max().sort_values())
print(data_all.groupby("Cabin")['Fare'].min().sort_values())
print(data_all.groupby("Cabin")['Fare'].mean().sort_values())
# Maximum fare per deck
Cabin
G     16.7000
F     39.0000
A     81.8583
D    113.2750
E    134.5000
C    263.0000
B    512.3292
U    512.3292

# Minimum fare per deck
Cabin
A     0.0000
B     0.0000
U     0.0000
F     7.2292
E     8.0500
G    10.4625
D    12.8750
C    25.7000

# Mean fare per deck
Cabin
G     14.205000
F     18.079367
U     19.132707
A     41.244314
D     53.007339
E     54.564634
C    107.926598
B    122.383078
However, while grouping we discovered that there were passengers in the ‘A’ and ‘B’ cabins who boarded for free! Perhaps they were invited aboard as aristocrats.

print(data_all[data_all['Fare'] == 0])
     PassengerId  Survived  Pclass                                   Name   Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked
179 180 0.0 3 Leonard, Mr. Lionel male 36.0 0 0 LINE 0.0 U S
263 264 0.0 1 Harrison, Mr. William male 40.0 0 0 112059 0.0 B S
271 272 1.0 3 Tornquist, Mr. William Henry male 25.0 0 0 LINE 0.0 U S
277 278 0.0 2 Parkes, Mr. Francis "Frank" male NaN 0 0 239853 0.0 U S
302 303 0.0 3 Johnson, Mr. William Cahoone Jr male 19.0 0 0 LINE 0.0 U S
413 414 0.0 2 Cunningham, Mr. Alfred Fleming male NaN 0 0 239853 0.0 U S
466 467 0.0 2 Campbell, Mr. William male NaN 0 0 239853 0.0 U S
481 482 0.0 2 Frost, Mr. Anthony Wood "Archie" male NaN 0 0 239854 0.0 U S
597 598 0.0 3 Johnson, Mr. Alfred male 49.0 0 0 LINE 0.0 U S
633 634 0.0 1 Parr, Mr. William Henry Marsh male NaN 0 0 112052 0.0 U S
674 675 0.0 2 Watson, Mr. Ennis Hastings male NaN 0 0 239856 0.0 U S
732 733 0.0 2 Knight, Mr. Robert J male NaN 0 0 239855 0.0 U S
806 807 0.0 1 Andrews, Mr. Thomas Jr male 39.0 0 0 112050 0.0 A S
815 816 0.0 1 Fry, Mr. Richard male NaN 0 0 112058 0.0 B S
822 823 0.0 1 Reuchlin, Jonkheer. John George male 38.0 0 0 19972 0.0 U S
1157 1158 NaN 1 Chisholm, Mr. Roderick Robert Crispin male NaN 0 0 112051 0.0 U S
1263 1264 NaN 1 Ismay, Mr. Joseph Bruce male 49.0 0 0 112058 0.0 B S

After careful consideration, I decided to assign the zero-fare, first-class passengers with an unknown cabin to the ‘B’ deck.

data_all.loc[(data_all['Fare'] == 0) & (data_all['Pclass'] == 1) &(data_all['Cabin'] == 'U'), 'Cabin'] = 'B'

Next, we can assign the remaining ‘U’ cabins to decks based on the fare ranges observed for each deck.

def cabin_estimator(i):
    """Assign a deck letter to an unknown cabin based on the fare i."""
    if i < 16:
        a = "G"
    elif i < 27:
        a = "F"
    elif i < 47:
        a = "A"
    elif i < 53:
        a = "E"
    elif i < 54:
        a = "D"
    elif i < 116:
        a = "C"
    else:
        a = "B"
    return a

data_all.loc[data_all['Cabin'] == 'U', 'Cabin'] = data_all.loc[data_all['Cabin'] == 'U', 'Fare'].apply(cabin_estimator)

Afterward, I merged the decks into coarser groups. This reduces the number of categories the model has to learn and helps mitigate the potential for overfitting.

data_all['Cabin'] = data_all['Cabin'].replace(['A','B'], 'AB')
data_all['Cabin'] = data_all['Cabin'].replace(['C','D','E'], "CDE")
data_all['Cabin'] = data_all['Cabin'].replace(['F','G'], 'FG')
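As a quick sanity check (illustrative only, not part of the original pipeline), we can look at the counts and survival rates of the merged groups:

# Counts and survival rate for the merged cabin groups
print(data_all['Cabin'].value_counts())
print(data_all.groupby('Cabin')['Survived'].mean())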

1.2.2 Name

Passenger names are also valuable data: the title reflects social status, and the surname can indicate whether a passenger has family members on board. Here, we will extract only the title.

data_all['Title'] = data_all['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())
print(data_all['Title'].unique())
['Mr' 'Mrs' 'Miss' 'Master' 'Don' 'Rev' 'Dr' 'Mme' 'Ms' 'Major' 'Lady'
'Sir' 'Mlle' 'Col' 'Capt' 'the Countess' 'Jonkheer' 'Dona']

We can also group the titles. I will not explain the grouping rationale here and simply give the mapping directly.

data_all['Title'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','the Countess','Jonkheer','Col','Rev','Capt','Sir','Don','Dona'],
['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr','Mrs'],inplace=True)
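To confirm the mapping collapsed the rare titles as intended, a short check like the following could be used (an illustrative sketch):

# After the replacement, only five title groups should remain
print(data_all['Title'].value_counts())
print(data_all.groupby('Title')['Survived'].mean().sort_values(ascending=False))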

1.2.3 Age

A passenger's title also reflects their age, so we can impute the missing ages using the average age associated with each title.

print(data_all.groupby("Title")['Age'].mean().sort_values())
Title
Master     5.482642
Miss      21.834533
Mr        32.524493
Mrs       37.046243
Other     44.923077

# Fill the missing ages with the (rounded) mean age of each title group
data_all.loc[(data_all['Age'].isnull()) & (data_all['Title'] == 'Mr'), 'Age'] = 33
data_all.loc[(data_all['Age'].isnull()) & (data_all['Title'] == 'Mrs'), 'Age'] = 37
data_all.loc[(data_all['Age'].isnull()) & (data_all['Title'] == 'Master'), 'Age'] = 5
data_all.loc[(data_all['Age'].isnull()) & (data_all['Title'] == 'Miss'), 'Age'] = 22
data_all.loc[(data_all['Age'].isnull()) & (data_all['Title'] == 'Other'), 'Age'] = 45
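Before moving on, we can confirm that the columns we touched no longer contain missing values. A minimal sketch:

# Verify that Cabin, Title and Age are now fully populated
print(data_all[['Cabin', 'Title', 'Age']].isnull().sum())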

With these processing steps completed, the missing data has been filled in, and we are ready to proceed to the next phase.

1.3 Others

The remaining preprocessing steps and the final model code are the same as in the previous article. I will come back and fill in the gaps once my skills have improved.