Kaggle Titanic Improved

Preface

Before you begin reading this article, I have a confession to make. Although I believe the data analysis presented here is more systematic and rigorous, the final score it produces may not be as good as hoped. Nevertheless, I still want to share my thought process; after all, the data analysis in the “Base Model” article left much to be desired.

1. Improvements

1.1 Exploratory Data Analysis (EDA)

First, a simple idea: could gender be related to survival rate? After all, “ladies first.”

import seaborn as sns
import matplotlib.pyplot as plt

# How many male and female passengers are in the training set?
sns.countplot(x='Sex', data=traindata)
plt.title('Distribution of Sex')
plt.savefig('/deeplearning/KaggleProjects/titanic/Photoes/Sex_Numbers.png')
plt.show()

# Survival rate by sex
sns.barplot(x='Sex', y='Survived', data=traindata)
plt.title('Sex_Survived')
plt.savefig('/deeplearning/KaggleProjects/titanic/Photoes/Sex_Survived.png')
plt.show()

Figure-1 Sex_Numbers

Figure-2 Sex_Survived

We hypothesize that social status could be correlated with survival rate, as depicted in movies where lower-class workers often had fewer opportunities to escape.

# Overlaid histograms of ticket class for survivors vs. non-survivors
plt.hist([data_all[data_all['Survived'] == 1]['Pclass'],
          data_all[data_all['Survived'] == 0]['Pclass']],
         color=['g', 'r'], label=['Survived', 'Dead'])
plt.title('Pclass_Survived')
plt.xlabel('Pclass')
plt.ylabel('Numbers')
plt.legend()
plt.savefig('/deeplearning/KaggleProjects/titanic/Photoes/Pclass_Survived.png')
plt.show()

Figure-3 Pclass_Survived

# Overlaid histograms of fare for survivors vs. non-survivors
plt.hist([data_all[data_all['Survived'] == 1]['Fare'],
          data_all[data_all['Survived'] == 0]['Fare']],
         color=['g', 'r'], bins=50, label=['Survived', 'Dead'])
plt.title('Fare_Survived')
plt.xlabel('Fare')
plt.ylabel('Numbers')
plt.legend()
plt.savefig('/deeplearning/KaggleProjects/titanic/Photoes/Fare_Survived.png')
plt.show()

Figure-4 Fare_Survived

Indeed, as it turns out, a higher ticket class is associated with a higher probability of survival.

In other words, higher social status increases the rate of survival.
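To put numbers on this claim rather than reading it off the bar charts, we can compute the survival rate per class directly. A minimal sketch, assuming traindata is the labelled training DataFrame used above:

# Survival rate by ticket class on the training set
print(traindata.groupby('Pclass')['Survived'].mean())

# Sex and class together, since both effects showed up in the plots
print(traindata.groupby(['Pclass', 'Sex'])['Survived'].mean())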

1.2 Feature Engineering

1.2.1 Cabin

Let’s shift our focus back to the data we previously discarded. We’ll start by analyzing the ‘Cabin’ column.

print(data_all['Cabin'].unique())
[nan 'C85' 'C123' 'E46' 'G6' 'C103' 'D56' 'A6' 'C23 C25 C27' 'B78' 'D33'
'B30' 'C52' 'B28' 'C83' 'F33' 'F G73' 'E31' 'A5' 'D10 D12' 'D26' 'C110'
'B58 B60' 'E101' 'F E69' 'D47' 'B86' 'F2' 'C2' 'E33' 'B19' 'A7' 'C49'
'F4' 'A32' 'B4' 'B80' 'A31' 'D36' 'D15' 'C93' 'C78' 'D35' 'C87' 'B77'
'E67' 'B94' 'C125' 'C99' 'C118' 'D7' 'A19' 'B49' 'D' 'C22 C26' 'C106'
'C65' 'E36' 'C54' 'B57 B59 B63 B66' 'C7' 'E34' 'C32' 'B18' 'C124' 'C91'
'E40' 'T' 'C128' 'D37' 'B35' 'E50' 'C82' 'B96 B98' 'E10' 'E44' 'A34'
'C104' 'C111' 'C92' 'E38' 'D21' 'E12' 'E63' 'A14' 'B37' 'C30' 'D20' 'B79'
'E25' 'D46' 'B73' 'C95' 'B38' 'B39' 'B22' 'C86' 'C70' 'A16' 'C101' 'C68'
'A10' 'E68' 'B41' 'A20' 'D19' 'D50' 'D9' 'A23' 'B50' 'A26' 'D48' 'E58'
'C126' 'B71' 'B51 B53 B55' 'D49' 'B5' 'B20' 'F G63' 'C62 C64' 'E24' 'C90'
'C45' 'E8' 'B101' 'D45' 'C46' 'D30' 'E121' 'D11' 'E77' 'F38' 'B3' 'D6'
'B82 B84' 'D17' 'A36' 'B102' 'B69' 'E49' 'C47' 'D28' 'E17' 'A24' 'C50'
'B42' 'C148' 'B45' 'B36' 'A21' 'D34' 'A9' 'C31' 'B61' 'C53' 'D43' 'C130'
'C132' 'C55 C57' 'C116' 'F' 'A29' 'C6' 'C28' 'C51' 'C97' 'D22' 'B10'
'E45' 'E52' 'A11' 'B11' 'C80' 'C89' 'F E46' 'B26' 'F E57' 'A18' 'E60'
'E39 E41' 'B52 B54 B56' 'C39' 'B24' 'D40' 'D38' 'C105']

We notice that ‘Cabin’ is composed of a letter followed by digits: the letter identifies the deck, and the digits give the room number on that deck. Clearly, the deck letter holds more meaningful information for our analysis.

Figure-5 Cabin

From the chart we can see how the ‘Cabin’ feature is distributed; as expected, survival chances were higher for passengers on the upper decks.
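For reference, here is a rough numeric check of that claim (a sketch only, not the code behind Figure-5): take the first letter of the known cabins and look at the survival rate per deck.

# Survival rate by deck letter, using only rows where Cabin and Survived are known
known = data_all.dropna(subset=['Cabin', 'Survived']).copy()
known['Deck'] = known['Cabin'].str[0]
print(known.groupby('Deck')['Survived'].mean().sort_values(ascending=False))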

As a preliminary step, I fill the missing values with ‘U’ (unknown), and then extract the first letter of each cabin.

data_all['Cabin'].fillna('U',inplace=True)
data_all['Cabin'] = data_all['Cabin'].map(lambda s: s[0])
print(data_all['Cabin'].value_counts())
Cabin
U 1014
C 94
B 65
D 46
E 41
A 22
F 21
G 5
T 1

Interestingly, we have no information about what the ‘T’ cabin was. (Could it be that someone slept on the mast?) It is not a problem, though: since there is only a single ‘T’ record, we can manually assign it to another deck.

Note that this record comes from the training data!

data_all.loc[(data_all['Cabin'] == 'T'), 'Cabin'] = 'F'

Next, a natural idea is to reassign the ‘U’ records to real decks based on their fare levels.

print(data_all.groupby("Cabin")['Fare'].max().sort_values())
print(data_all.groupby("Cabin")['Fare'].min().sort_values())
print(data_all.groupby("Cabin")['Fare'].mean().sort_values())
# Maximum fare per deck
Cabin
G     16.7000
F     39.0000
A     81.8583
D    113.2750
E    134.5000
C    263.0000
B    512.3292
U    512.3292

# Minimum fare per deck
Cabin
A     0.0000
B     0.0000
U     0.0000
F     7.2292
E     8.0500
G    10.4625
D    12.8750
C    25.7000

# Mean fare per deck
Cabin
G     14.205000
F     18.079367
U     19.132707
A     41.244314
D     53.007339
E     54.564634
C    107.926598
B    122.383078
However, while grouping we discovered that there were passengers in the ‘A’ and ‘B’ cabins who boarded for free! Perhaps they were invited aboard as aristocrats.

print(data_all[data_all['Fare'] == 0])
     PassengerId  Survived  Pclass                                   Name   Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked
179 180 0.0 3 Leonard, Mr. Lionel male 36.0 0 0 LINE 0.0 U S
263 264 0.0 1 Harrison, Mr. William male 40.0 0 0 112059 0.0 B S
271 272 1.0 3 Tornquist, Mr. William Henry male 25.0 0 0 LINE 0.0 U S
277 278 0.0 2 Parkes, Mr. Francis "Frank" male NaN 0 0 239853 0.0 U S
302 303 0.0 3 Johnson, Mr. William Cahoone Jr male 19.0 0 0 LINE 0.0 U S
413 414 0.0 2 Cunningham, Mr. Alfred Fleming male NaN 0 0 239853 0.0 U S
466 467 0.0 2 Campbell, Mr. William male NaN 0 0 239853 0.0 U S
481 482 0.0 2 Frost, Mr. Anthony Wood "Archie" male NaN 0 0 239854 0.0 U S
597 598 0.0 3 Johnson, Mr. Alfred male 49.0 0 0 LINE 0.0 U S
633 634 0.0 1 Parr, Mr. William Henry Marsh male NaN 0 0 112052 0.0 U S
674 675 0.0 2 Watson, Mr. Ennis Hastings male NaN 0 0 239856 0.0 U S
732 733 0.0 2 Knight, Mr. Robert J male NaN 0 0 239855 0.0 U S
806 807 0.0 1 Andrews, Mr. Thomas Jr male 39.0 0 0 112050 0.0 A S
815 816 0.0 1 Fry, Mr. Richard male NaN 0 0 112058 0.0 B S
822 823 0.0 1 Reuchlin, Jonkheer. John George male 38.0 0 0 19972 0.0 U S
1157 1158 NaN 1 Chisholm, Mr. Roderick Robert Crispin male NaN 0 0 112051 0.0 U S
1263 1264 NaN 1 Ismay, Mr. Joseph Bruce male 49.0 0 0 112058 0.0 B S

After careful consideration, I decided to assign the zero-fare, first-class passengers with an unknown cabin to the ‘B’ deck.

data_all.loc[(data_all['Fare'] == 0) & (data_all['Pclass'] == 1) &(data_all['Cabin'] == 'U'), 'Cabin'] = 'B'

Next, we can assign the remaining ‘U’ cabins to decks based on the fare ranges observed for each deck.

def cabin_estimator(i):
    """Assign a deck letter to an unknown cabin based on the fare i."""
    if i < 16:
        a = "G"
    elif i < 27:
        a = "F"
    elif i < 47:
        a = "A"
    elif i < 53:
        a = "E"
    elif i < 54:
        a = "D"
    elif i < 116:
        a = "C"
    else:
        a = "B"
    return a

data_all.loc[data_all['Cabin'] == 'U', 'Cabin'] = data_all.loc[data_all['Cabin'] == 'U', 'Fare'].apply(cabin_estimator)

Afterward, I merged the decks into coarser groups. This reduces the number of categories the model has to learn and helps mitigate the potential for overfitting.

data_all['Cabin'] = data_all['Cabin'].replace(['A','B'], 'AB')
data_all['Cabin'] = data_all['Cabin'].replace(['C','D','E'], "CDE")
data_all['Cabin'] = data_all['Cabin'].replace(['F','G'], 'FG')
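As a quick sanity check (illustrative only, not part of the original pipeline), we can look at the counts and survival rates of the merged groups:

# Counts and survival rate for the merged cabin groups
print(data_all['Cabin'].value_counts())
print(data_all.groupby('Cabin')['Survived'].mean())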

1.2.2 Name

Passenger names are also valuable data: the title reflects social status, and the surname can indicate whether a passenger has family members on board. Here, we will extract only the title.

data_all['Title'] = data_all['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())
print(data_all['Title'].unique())
['Mr' 'Mrs' 'Miss' 'Master' 'Don' 'Rev' 'Dr' 'Mme' 'Ms' 'Major' 'Lady'
'Sir' 'Mlle' 'Col' 'Capt' 'the Countess' 'Jonkheer' 'Dona']

We can also group the titles. I will not explain the grouping rationale here and simply give the mapping directly.

data_all['Title'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','the Countess','Jonkheer','Col','Rev','Capt','Sir','Don','Dona'],
['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr','Mrs'],inplace=True)
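To confirm the mapping collapsed the rare titles as intended, a short check like the following could be used (an illustrative sketch):

# After the replacement, only five title groups should remain
print(data_all['Title'].value_counts())
print(data_all.groupby('Title')['Survived'].mean().sort_values(ascending=False))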

1.2.3 Age

A passenger's title also reflects their age, so we can impute the missing ages using the average age associated with each title.

print(data_all.groupby("Title")['Age'].mean().sort_values())
Title
Master     5.482642
Miss      21.834533
Mr        32.524493
Mrs       37.046243
Other     44.923077

# Fill the missing ages with the (rounded) mean age of each title group
data_all.loc[(data_all['Age'].isnull()) & (data_all['Title'] == 'Mr'), 'Age'] = 33
data_all.loc[(data_all['Age'].isnull()) & (data_all['Title'] == 'Mrs'), 'Age'] = 37
data_all.loc[(data_all['Age'].isnull()) & (data_all['Title'] == 'Master'), 'Age'] = 5
data_all.loc[(data_all['Age'].isnull()) & (data_all['Title'] == 'Miss'), 'Age'] = 22
data_all.loc[(data_all['Age'].isnull()) & (data_all['Title'] == 'Other'), 'Age'] = 45
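Before moving on, we can confirm that the columns we touched no longer contain missing values. A minimal sketch:

# Verify that Cabin, Title and Age are now fully populated
print(data_all[['Cabin', 'Title', 'Age']].isnull().sum())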

With these processing steps completed, the missing data has been filled in, and we are ready to proceed to the next phase.

1.3 Others

The remaining preprocessing steps and the final model code are the same as in the previous article. I will come back and fill in the gaps once my skills have improved.