Titanic Data Prep with Pandas

Posted By: Brenna Starkey, undergraduate student

May 1, 2021 · 8 min read

When creating my first data science model, I chose to explore the Titanic Dataset. I made this decision because it is fun, accessible, well documented, and serves as almost everyone’s first project. To start, I joined the well-known Kaggle Titanic Competition.

During my first two submissions, I used the NumPy, Scikit-learn, Matplotlib, and Seaborn packages to clean, model, and visualize the data. My results were a 0.76 accuracy with a Random Forest model and then 0.81 with Logistic Regression. Overall, it was a great first experience with predictive modeling, but I wanted to focus on data preparation, and the Titanic dataset was almost too easy…

So, I dirtied it up and created my own version of the Titanic Dataset, which I am calling the “Stinky” Titanic Dataset. Then, I learned simple data preparation with the Pandas library and applied it to the data. At first I was interested in data cleaning applications such as Tableau and Alteryx, but quickly learned the limitations of these user interfaces. Instead, I built a Jupyter Notebook and wrote my own code to clean the data. In this article, I will explain the differences between my dataset and the original and then walk through cleaning it with Pandas.

Stinky Titanic Dataset

The “Stinky” Dataset and the Jupyter Notebook are linked on my GitHub so you can follow along and run the upcoming code.

First, we import the library and load our data. I have named our dataset df because this type is called a DataFrame in Pandas. It structures the data into labeled, two-dimensional arrays, similar to a spreadsheet. After these initial steps, we can visualize the data with the head() function, which displays the first 5 rows by default or the first n rows passed as a parameter. In this example and for the rest of this article I will be showing the first 10.

import pandas as pd
df = pd.read_excel('dirty_titanic_df.xlsx')
df.head(10)

The major differences in this “Stinky” dataset from the original are in the Name and Age columns. A comma delimiter was added after the titles in the Name attribute, and the Age column was dispersed with various data types. For instance, it contains a mixture of integer values and strings, such as in row 5. The code below produces detailed information about the structure of the columns.

df.info()
df.describe(include='all')  # generates descriptive statistics

The describe() function further explores the composition of the data and summarizes the central tendency and shape of each distribution. The include='all' parameter specifies that all columns of the input should be included rather than just the numeric ones.

Cleaning the Attributes

Before we can formulate any type of model or machine learning algorithm, we must clean up the data. This is the most influential step in our model preparation. In this article I will focus on removing irrelevant attributes, handling null/missing data, and grouping attributes into identifying quantities.
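To see which columns need attention before diving in, a quick per-column null count helps (a minimal sketch; the exact counts depend on the “Stinky” file):

# count missing values in each column to see where cleaning is needed
df.isnull().sum()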

Survived:

df['Survived'].dtype

As we can see from the output, the Survived data type is currently an integer, but we can easily change all the entries to boolean values by applying the astype() function to the column.

df['Survived'] = df['Survived'].astype('bool')

Pclass:

The next attribute in the DataFrame is Pclass, representing the ticket class (1–3). If we run the following code, we receive the output 3, indicating three unique values.

df['Pclass'].nunique()

Thus, we can change this data type to be categorical. The benefits of this conversion are outlined in the Pandas user guide.

df['Pclass'] = df['Pclass'].astype('category')
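One benefit is lower memory usage, which we could verify by comparing the column’s footprint as plain integers versus as a category (a small sketch, not from the original notebook):

# memory footprint of the column as a plain int64 column vs. as a category
print(df['Pclass'].astype('int64').memory_usage(deep=True))
print(df['Pclass'].memory_usage(deep=True))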

Name:

One of the most prominent differences in my “Stinky” dataset is the added delimiter in the Name attribute. Below we will split on the comma delimiter and keep just the titles. We will then create a new Title attribute, which will be more useful than the distinct names. Inserting a new column into the DataFrame is done by specifying the index, name, and default value to set all the rows to. After extracting the titles, we can delete the Name attribute since it won’t be necessary anymore.

df.insert(4, 'Title', '')
# split Name on the comma and keep the first piece, which holds the title in this dataset
df['Title'] = df['Name'].str.split(',', expand=True)[0]
df = df.drop(columns=['Name'])
df.head(10)
df['Title'].unique()  # displays all unique title values

We can see from the code above that we have 18 distinct titles. Below, running the crosstab() function will display a frequency table of Title against Sex.

pd.crosstab(df['Title'], df['Sex'])

Now, we can group the common titles together and replace all of the uncommon ones with the ‘Rare’ title.

df['Title'] = df['Title'].replace(['Lady ', 'Countess ', 'Capt ', 'Col ', 'Dr ', 'Major ', 'Rev', 'Sir'], 'Rare')
df['Title'] = df['Title'].replace('Mlle ', 'Miss')
df['Title'] = df['Title'].replace('Ms ', 'Miss')
df['Title'] = df['Title'].replace('Mme ', 'Mrs')
df['Title'].unique()

After creating these identifiers, we can display the survival rate for each title.

df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

Furthermore, we can band these together and convert them into numeric values by mapping each title to a number.

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
df['Title'] = df['Title'].map(title_mapping)
df['Title'] = df['Title'].fillna(0)
df['Title'] = df['Title'].round(0).astype(int)

Once again, we can use the crosstab() function to visualize the distribution between titles and sex.

pd.crosstab(df['Title'], df['Sex'])

Sex:

As for the Sex attribute, there is no data preparation needed, but at this point we can start to form our assumptions about the model. We expect that a higher proportion of females survived than males, which we can verify as part of our predictive analysis. The following code locates the female and male rows respectively and computes each survival rate.

women = df.loc[df['Sex'] == 'female']["Survived"]
rate_women = sum(women)/len(women)
men = df.loc[df['Sex'] == 'male']["Survived"]
rate_men = sum(men)/len(men)
print("% of women who survived:", rate_women)
print("% of men who survived:", rate_men)

Age:

The code below returns 177, indicating there are 177 null values for the Age attribute. Since age plays a large role in our intuitive assumptions about survival rate, we must fill in those null values. I decided to replace them with the median age since there are a couple of outliers.

df['Age'].isnull().sum()

The following code returns statistics about the data that guided me to choose the median values.

df.sort_values('Age', ascending=False).head(10)
df['Age'].min()
df['Age'].max()
df['Age'].describe()
df['Age'].dtype

First, we will convert the attribute from a string object to a numeric type and fill in any invalid parsing with null (NaN), as indicated by the errors='coerce' parameter.

df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df.head(10)

Next, we will fill in all those null values with the median using fillna(). After that, I noticed the data type was floating-point, so I converted it to an integer with astype().

df['Age'] = df['Age'].fillna((df['Age'].median()))
df['Age'] = df['Age'].astype(int)
df.head(10)

SibSp/Parch:

The SibSp column represents the number of siblings or spouses the passenger had aboard, and Parch represents the number of parents or children accompanying them. We can observe the correlation between these attributes and survival below.

df[['SibSp', 'Survived']].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)
df[['Parch', 'Survived']].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Aside from visualizing the data to make the predictive model, there are no other cleaning steps I took for these two attributes.

Ticket:

After observing that the Ticket column contains 687 unique values and considering whether the ticket number contributes to survival rate, I decided to drop the Ticket column to make the data easier to work with. We can confirm the unique count and then drop the column, as shown below.
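A quick check of the unique-value count (a small sketch; the 687 figure comes from the article itself):

# number of distinct ticket values, checked before dropping the column
df['Ticket'].nunique()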

df = df.drop(labels=['Ticket'], axis=1)

Fare:

Let’s convert our Fare attribute to two decimal places to represent the dollar-and-cent amount with the round() function.

df.sort_values('Fare',ascending=False)
df['Fare'] = df['Fare'].round(2)
df.head(10)

Cabin:

Once again, because the Cabin attribute is mostly missing values, it will not be necessary to include this column in our analysis.

df = df.drop(labels=['Cabin'],axis=1)
df.head(10)

Embarked:

The following code finds the most frequent Embarked location, fills in the null values with it, and then displays the survival rate by port.

freq_port = df['Embarked'].dropna().mode()[0]
df['Embarked'] = df['Embarked'].fillna(freq_port)
df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

Next, we can convert the three locations into numeric representatives.

df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)
df.head(10)

Conclusion

Now that we have prepped the “Stinky” Titanic Dataset using the Pandas library, the data is ready to be analyzed and modeled. This process of cleaning and preparing the data is a pivotal part of data science and the first step toward creating beautiful visualizations and predictive analytics.
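As a last sanity check before modeling, it can be worth confirming that every column has a usable dtype and that no nulls remain (a small sketch, not part of the original notebook):

# verify the final column dtypes after cleaning
df.dtypes
# total count of remaining missing values across all columns (should be 0)
df.isnull().sum().sum()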

Brenna Starkey

Applied Mathematics and Computer Science Undergraduate Student