Kaggle Titanic

Preface

I appreciate your understanding regarding any potential errors or omissions, as I’m still in the learning process. Feel free to engage in discussions – I’m here to help and discuss any topics you’d like to explore.

Learning is a continuous journey, and open discussions can lead to valuable insights and growth.

This article begins by exploring the most straightforward approach to the problem.

If you wish to follow along with my code, please visit my GitHub repository using the following link:

Linermao’s GitHub: https://github.com/Linermao/Kaggle-Code

1. Competition Overview

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

2. Data

2.1 Download Data

First, install the Kaggle CLI. If you’ve already done this, skip this step.

pip install kaggle

Then, you can run this command to download the data for the Titanic competition.

kaggle competitions download -c titanic
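
The CLI downloads the data as a zip archive (for this competition it is typically named titanic.zip); extract it before use:

unzip titanic.zip -d titanic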

Alternatively, you can log in to Kaggle to download the data.

Careful! There are three .csv files in the download, but the only files we need are train.csv and test.csv.

2.2 Data Extraction

To start, we need to find out what data is available and what each field means.

import pandas as pd
path_traindata = "/deeplearning/KaggleProjects/titanic/train.csv"
path_testdata = "/deeplearning/KaggleProjects/titanic/test.csv"
traindata = pd.read_csv(path_traindata)
testdata = pd.read_csv(path_testdata)

data_all = pd.concat([traindata, testdata], ignore_index=True)
# combine train and test so that preprocessing is applied to both at once
print(data_all.info())
print(data_all.describe())
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  1309 non-null   int64
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64
 3   Name         1309 non-null   object
 4   Sex          1309 non-null   object
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64
 7   Parch        1309 non-null   int64
 8   Ticket       1309 non-null   object
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object
 11  Embarked     1307 non-null   object

None
       PassengerId    Survived       Pclass          Age        SibSp        Parch         Fare
count  1309.000000  891.000000  1309.000000  1046.000000  1309.000000  1309.000000  1308.000000
mean    655.000000    0.383838     2.294882    29.881138     0.498854     0.385027    33.295479
std     378.020061    0.486592     0.837836    14.413493     1.041658     0.865560    51.758668
min       1.000000    0.000000     1.000000     0.170000     0.000000     0.000000     0.000000
25%     328.000000    0.000000     2.000000    21.000000     0.000000     0.000000     7.895800
50%     655.000000    0.000000     3.000000    28.000000     0.000000     0.000000    14.454200
75%     982.000000    1.000000     3.000000    39.000000     1.000000     0.000000    31.275000
max    1309.000000    1.000000     3.000000    80.000000     8.000000     9.000000   512.329200

From the Competition, we know:

Variable  Definition                                  Key
Survival  Survival                                    0 = No, 1 = Yes
Pclass    Ticket class                                1 = 1st, 2 = 2nd, 3 = 3rd
Sex       Sex
Age       Age in years
SibSp     # of siblings / spouses aboard the Titanic
Parch     # of parents / children aboard the Titanic
Ticket    Ticket number
Fare      Passenger fare
Cabin     Cabin number
Embarked  Port of Embarkation                         C = Cherbourg, Q = Queenstown, S = Southampton

2.3 Data Cleaning

2.3.1 Imputing Missing Values

We observe that ‘Age’, ‘Fare’, ‘Cabin’, and ‘Embarked’ contain missing values; ‘Cabin’ is particularly notable, with roughly 77% of its values missing (1014 of 1309).

The missing values in ‘Survived’ arise from the fact that they need to be predicted based on the model we establish for the test set. Therefore, they can be disregarded for now.
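
To see the exact counts per column, we can sum the null flags (these numbers follow directly from the info() output above):

print(data_all.isnull().sum())
PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2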

Let’s focus on ‘Age’ as a starting point for analysis.

In handling missing values, the most commonly used method is to replace them with the mean.

Thus, we can use the following code for the ‘Age’ variable:

data_all['Age'] = data_all['Age'].fillna(data_all['Age'].mean())
# check whether any missing values remain
# print(data_all['Age'].isnull().any())
# False indicates there are no missing values left

It’s worth mentioning that using the mean to fill in missing values is an extremely rudimentary and hasty approach. Therefore, we will further optimize our missing value imputation method in subsequent steps.
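
As a preview of what such an optimization could look like, here is a minimal sketch of one common alternative (my own illustration, not the method used in this baseline): filling each missing ‘Age’ with the median age of passengers who share the same ‘Pclass’ and ‘Sex’, applied instead of (not after) the mean fill above.

# hypothetical refinement: group-wise median imputation instead of the global mean
data_all['Age'] = (data_all.groupby(['Pclass', 'Sex'])['Age']
                   .transform(lambda s: s.fillna(s.median())))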

For ‘Fare’, which has only one missing value, we can replace it with the mean value as well.

data_all['Fare'] = data_all['Fare'].fillna(data_all['Fare'].mean())
# print(data_all['Fare'].isnull().any())

Regarding ‘Embarked’, we could analyze its distribution pattern as a reference for our next steps.

print(data_all['Embarked'].value_counts())
Embarked
S    914
C    270
Q    123

We can see that the majority of passengers embarked at ‘S’, so we can use ‘S’ to impute the missing values.

Interestingly, you can print the rows with missing ‘Embarked’ values and confirm through online sources that these two passengers did indeed board at Southampton.

print(data_all[data_all[['Embarked']].isnull().any(axis=1)])
     PassengerId  Survived  Pclass                                        Name     Sex   Age  SibSp  Parch  Ticket  Fare Cabin Embarked
61            62       1.0       1                         Icard, Miss. Amelie  female  38.0      0      0  113572  80.0   B28      NaN
829          830       1.0       1  Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0      0      0  113572  80.0   B28      NaN

Impute the missing values.

data_all['Embarked'] = data_all['Embarked'].fillna('S')
# print(data_all['Embarked'].isnull().any())

Now, let’s shift our focus to ‘Cabin’. Because so many of its values are missing, it is advisable to drop the column for now. We will certainly revisit it and discuss it separately later.

data_all.drop('Cabin',axis=1,inplace=True)

Lastly, after imputing all the missing values, let’s perform a final check. We exclude ‘Survived’, whose missing entries belong to the test set and are expected.

print(data_all.drop(columns='Survived').isnull().any().any())
False

2.4 Feature Engineering

Firstly, we observe that the ‘Ticket’ column contains a mix of characters and numbers, which could introduce unnecessary complexity for our initial model. The same applies to the ‘Name’ column. Therefore, we drop both for now and will revisit them later.

data_all.drop('Ticket',axis=1,inplace=True)
data_all.drop('Name',axis=1,inplace=True)

Subsequently, we discover that ‘SibSp’ and ‘Parch’ can be combined into a new variable, which I will call ‘FamilySize’: the sum of a passenger’s sibling/spouse count and parent/child count, plus 1 to account for the passenger themselves.

data_all['FamilySize'] = data_all['SibSp'] + data_all['Parch'] + 1

Next, we will bin ‘Age’ into intervals. Binning can help us extract deeper patterns from the data.

data_all['Age'] = pd.cut(data_all['Age'], 5)
# cut divides the data into a specified number of equi-width intervals
print(data_all['Age'].value_counts())
Age
(16.136, 32.102]    787
(32.102, 48.068]    269
(0.0902, 16.136]    134
(48.068, 64.034]    106
(64.034, 80.0]       13

For ‘Fare’, we can apply a similar binning method. However, there are some additional aspects to consider.

To begin, let’s look at the summary statistics for ‘Fare’.

print(data_all['Fare'].describe())
count    1309.000000
mean       33.295479
std        51.738879
min         0.000000
25%         7.895800
50%        14.454200
75%        31.275000
max       512.329200

We observe that while the maximum value is quite high, the mean and median values are comparatively smaller. This suggests that a small portion of individuals purchased expensive tickets, prompting the need to classify this subgroup separately.

The minimum value of 0 indicates that some individuals boarded for free, possibly crew or employees. We will delve into this in more detail later.

We can define our own ranges to roughly categorize the fares into tiers. The following is one good example.

print(data_all['Fare'].describe(percentiles = [0.6,0.9,0.98]))
count    1309.000000
mean       33.295479
std        51.738879
min         0.000000
50%        14.454200
60%        21.679200
90%        78.019980
98%       221.779200
max       512.329200

In fact, we can observe that only four individuals purchased the most expensive tickets.

The joy of the affluent is beyond comprehension.

print(data_all['Fare'][data_all['Fare'] > 300])
258     512.3292
679     512.3292
737     512.3292
1234    512.3292

Hence, we can proceed with our categorization.

bins = [0, 14, 78, 220, 500, 600]
labels = ['VeryLow','Low', 'Middle', 'High', 'VeryHigh']
data_all['Fare'] = pd.cut(data_all['Fare'], bins=bins, labels=labels, right=False)
print(data_all['Fare'].value_counts())
Fare
VeryLow     641
Low         537
Middle      102
High         25
VeryHigh      4

2.5 One-Hot Encoding

Up to this point, we have performed a rough first pass of selection and transformation. However, some of the resulting values are not yet in integer or floating-point format, so they cannot be fed directly into a model for training. Another transformation is required.

Here is a simple example:

In the ‘Sex’ column, the values are either ‘male’ or ‘female’; we convert these to 1 and 0 respectively, making them compatible with modeling.

We can use the ‘LabelEncoder’ class from the ‘sklearn.preprocessing’ module for this purpose.

from sklearn.preprocessing import LabelEncoder

# 'Embarked' must be encoded too, or it would remain a string column
map_features = ['Sex', 'Pclass', 'Age', 'Fare', 'Embarked']
for feature in map_features:
    data_all[feature] = LabelEncoder().fit_transform(data_all[feature])
print(data_all.info())
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  1309 non-null   int64
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64
 3   Sex          1309 non-null   int64
 4   Age          1309 non-null   int64
 5   SibSp        1309 non-null   int64
 6   Parch        1309 non-null   int64
 7   Fare         1309 non-null   int64
 8   Embarked     1309 non-null   int64
 9   FamilySize   1309 non-null   int64

We can observe that, except for ‘Survived’, all other columns have been converted to integer type.

Then we can proceed to utilize the One-Hot Encoding technique.

Many algorithms assume that integer-encoded values carry an ordinal meaning and that the numeric distances between them are meaningful. For categories such as ‘Embarked’, no such ordering exists, so we one-hot encode these variables afterwards.

For example:

We transform the ‘Sex’ column into two columns, ‘Sex_0’ and ‘Sex_1’. If ‘Sex’ is 1, then ‘Sex_1’ is set to 1 and ‘Sex_0’ is set to 0, and vice versa.

We employ the ‘get_dummies’ function from the ‘pandas’ module for this purpose.

map_features_2 = ['Pclass','Sex','Embarked','FamilySize','Fare','Age']
encoded_features = pd.get_dummies(data_all[map_features_2], columns=map_features_2)
print(encoded_features.info())
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Pclass_0       1309 non-null   bool
 1   Pclass_1       1309 non-null   bool
 2   Pclass_2       1309 non-null   bool
 3   Sex_0          1309 non-null   bool
 4   Sex_1          1309 non-null   bool
 5   Embarked_0     1309 non-null   bool
 6   Embarked_1     1309 non-null   bool
 7   Embarked_2     1309 non-null   bool
 8   FamilySize_1   1309 non-null   bool
 9   FamilySize_2   1309 non-null   bool
 10  FamilySize_3   1309 non-null   bool
 11  FamilySize_4   1309 non-null   bool
 12  FamilySize_5   1309 non-null   bool
 13  FamilySize_6   1309 non-null   bool
 14  FamilySize_7   1309 non-null   bool
 15  FamilySize_8   1309 non-null   bool
 16  FamilySize_11  1309 non-null   bool
 17  Fare_0         1309 non-null   bool
 18  Fare_1         1309 non-null   bool
 19  Fare_2         1309 non-null   bool
 20  Fare_3         1309 non-null   bool
 21  Fare_4         1309 non-null   bool
 22  Age_0          1309 non-null   bool
 23  Age_1          1309 non-null   bool
 24  Age_2          1309 non-null   bool
 25  Age_3          1309 non-null   bool
 26  Age_4          1309 non-null   bool

Next, we can proceed to separate the original dataset into training and testing sets, preparing for the model training.

train_x = encoded_features.iloc[:traindata.shape[0]]
test_x = encoded_features.iloc[traindata.shape[0]:]

train_y = data_all['Survived'].iloc[:traindata.shape[0]]
# the test set's 'Survived' values are all NaN - they are what we must predict

3. Model

In this phase, we need to choose an appropriate model to fit the data and select the best-performing model.

3.1 Linear Regression

For a binary classification problem, we can start with the simplest option: a linear regression model whose continuous output we threshold at 0.5 to obtain 0/1 labels.

What we need to do is quite straightforward. We just need to invoke ‘LinearRegression’ from the ‘sklearn’ package.

from sklearn.linear_model import LinearRegression
model_LinearRegression = LinearRegression()
model_LinearRegression.fit(train_x, train_y)
test_y = model_LinearRegression.predict(test_x)
# convert the continuous predictions to binary labels
threshold = 0.5
test_y = (test_y > threshold).astype(int)
print("Predicted y:", test_y)

3.2 Logistic Regression

Logistic regression is another standard method for binary classification; unlike linear regression, it predicts class labels directly.

from sklearn.linear_model import LogisticRegression
model_LogisticRegression = LogisticRegression()
model_LogisticRegression.fit(train_x, train_y)
test_y = model_LogisticRegression.predict(test_x).astype(int)
print("Predicted y:", test_y)

3.3 Random Forest
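
A random forest, which aggregates the votes of many decision trees, is often a strong out-of-the-box choice for tabular data like ours.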

from sklearn.ensemble import RandomForestClassifier
model_RandomForestClassifier = RandomForestClassifier(n_estimators=100, random_state=42)
model_RandomForestClassifier.fit(train_x, train_y)
test_y = model_RandomForestClassifier.predict(test_x).astype(int)
print("Predicted y:", test_y)

3.4 Output Results

Please pay attention to the required output format as specified in the competition description.

Submission File Format:
You should submit a csv file with exactly 418 entries plus a header row. Your submission will show an error if you have extra columns (beyond PassengerId and Survived) or rows.

The file should have exactly 2 columns:

  • PassengerId (sorted in any order)
  • Survived (contains your binary predictions: 1 for survived, 0 for deceased)

predictions_df = pd.DataFrame({'PassengerId': testdata['PassengerId'], 'Survived': test_y})

output_file = "/deeplearning/KaggleProjects/titanic/Output/Predictions.csv"
# adjust the path and filename as needed
predictions_df.to_csv(output_file, index=False)
print("Predictions saved to:", output_file)

By following my code, you might achieve a score similar to the following:

Figure-1 Score

Is it over at this point? No, not by a long shot. What we’ve done so far is just completing a basic model. Let’s not forget that we’ve discarded a significant amount of data and made several simplifications along the way.