I recently discovered Kaggle, a platform owned by Google where you can compete in machine learning challenges and improve your skills. Kaggle has a challenge designed for beginners in deep learning and data science, and in this post you will see an easy way to approach its main problem, so you can solve it yourself and then try to improve your personal score.
A Titanic Problem
The problem in the beginners' challenge is pretty simple to understand: it consists of developing a model that correctly predicts who survived the sinking of the Titanic. Kaggle provides all the data needed to train a neural network, but before designing an architecture we need to do some data analysis. I used Pandas and TensorFlow to develop this neural network, but you are free to use any library you want. So let's begin. Kaggle gives us CSV files with all the necessary data, and the values follow this structure:
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
Age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
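
As a minimal sketch of this first step, the CSV can be loaded with Pandas and checked for missing values. The three rows below are made-up toy data, not real passengers, and the capitalized column names assume the header used in Kaggle's `train.csv`; `load_passengers` is a hypothetical helper name.

```python
import io

import pandas as pd

# Toy rows for illustration only (not real passengers). The header is assumed
# to match the one in Kaggle's train.csv.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,Passenger A,male,22,1,0,T1,7.25,,S
2,1,1,Passenger B,female,38,1,0,T2,71.28,C85,C
3,1,3,Passenger C,female,,0,0,T3,7.92,,S
"""


def load_passengers(source) -> pd.DataFrame:
    """Read a Titanic-style CSV from a file path or a file-like object."""
    return pd.read_csv(source)


df = load_passengers(io.StringIO(SAMPLE_CSV))

# Count the missing values per column and show only the columns that have any.
missing = df.isna().sum()
print(missing[missing > 0])
```

With the real file you would pass its path instead of the `io.StringIO` buffer, for example `load_passengers("train.csv")`.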
Data Preprocessing
There are some things we need to do in order to have a good dataset. The first is to check whether there are null values in our data. In this case the age column contains some missing values, so to resolve this we compute the mean age for men and for women and fill each missing value with the mean for that passenger's sex. I also split the age column into ranges and put passengers into these groups. Then we need to transform non-integer values into integer values. For example, the Sex column has the values ‘male’ and ‘female’, so we can turn them into 1 and 0, and we can do the same with the embarked column. After this we can delete the columns with many missing values or those we don't think are relevant for prediction, like ‘name’ or ‘cabin’. Finally you need to normalize all the data. In my case I decided to normalize it to the range [-1, 1], which improves the training and performance of your neural network.
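The steps above can be sketched with Pandas roughly as follows. This is an illustration under assumptions, not my exact code: the age bin edges, the integer codes for the ports, dropping the Ticket column, and min-max scaling as the way to reach [-1, 1] are all choices made for this example, and the toy rows are invented.

```python
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Fill each missing age with the mean age of passengers of the same sex.
    df["Age"] = df["Age"].fillna(df.groupby("Sex")["Age"].transform("mean"))
    # Put passengers into age ranges (these bin edges are an assumption).
    df["AgeGroup"] = pd.cut(
        df["Age"], bins=[0, 12, 18, 40, 60, 120], labels=False, include_lowest=True
    )
    # Turn text categories into integers (the port codes are an assumption).
    df["Sex"] = df["Sex"].map({"male": 1, "female": 0})
    df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    # Drop text columns; Ticket is dropped here too so only numbers remain.
    df = df.drop(columns=["Name", "Cabin", "Ticket"], errors="ignore")
    # Min-max scale every feature (not the label) to the range [-1, 1].
    features = [c for c in df.columns if c != "Survived"]
    x = df[features].astype(float)
    lo, hi = x.min(), x.max()
    span = (hi - lo).replace(0.0, 1.0)  # avoid dividing by zero on constant columns
    df[features] = 2.0 * (x - lo) / span - 1.0
    return df


# Invented toy rows, only to exercise the function.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Name": ["a", "b", "c", "d", "e", "f"],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, None, 54.0, 14.0, None],
    "Fare": [7.25, 71.28, 7.92, 51.86, 30.07, 8.05],
    "Cabin": [None, "C85", None, "E46", None, None],
    "Embarked": ["S", "C", "S", "S", "C", "Q"],
})
clean = preprocess(toy)
print(clean)
```

Note that this sketch does not handle missing values in the Embarked column; if your data has any, they need to be filled before scaling.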
Once we have finished data preprocessing, we can start the data analysis and see which values carry more weight in the prediction. As we can see, Sex is one of these high-weight values: most men died and most women survived. We can repeat this process over all the data and see how much weight each value has.
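One simple way to check how much a column matters is to compare the survival rate of each group in it. The sketch below assumes the `Survived`, `Sex` and `Pclass` column names from Kaggle's `train.csv`; `survival_rate_by` is a hypothetical helper, and the rows are invented toy data, so the printed rates say nothing about the real dataset.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of survivors in each group of `column`, highest first."""
    return df.groupby(column)["Survived"].mean().sort_values(ascending=False)


# Invented toy rows, only to show the call.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Pclass": [3, 3, 1, 1, 2],
    "Survived": [0, 0, 1, 1, 1],
})

rates = survival_rate_by(toy, "Sex")
print(rates)
```

The same call works for any other column, for example `survival_rate_by(toy, "Pclass")`.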
I will show you two more examples: the passenger's ticket class and age. Ticket class is important because it also corresponds to where passengers were inside the ship; higher classes were closer to the deck. This factor is better represented in the graph. It is important to say that if you want to improve the accuracy of your model, you should start reconsidering the columns we deleted earlier during preprocessing.
Architecture
To solve this problem I chose a really simple architecture with just two hidden layers. I found that, for this problem and with the preprocessing I did, the more layers or neurons you add, the less accuracy you get. The model uses hyperbolic tangent as the activation function for the first and hidden layers and a ReLU activation function for the output layer. I trained it for 1000 epochs with a batch size of 16, and the loss was calculated with mean squared error.
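A model along these lines can be sketched with Keras as shown below. The tanh hidden layers, ReLU output, mean squared error loss, batch size and epoch count come from the description above; the layer sizes (16 and 8 units), the Adam optimizer and the accuracy metric are assumptions made for this example, and `build_model` and `train` are hypothetical helper names.

```python
import tensorflow as tf


def build_model(n_features: int, hidden_units=(16, 8)) -> tf.keras.Model:
    """Two tanh hidden layers and a single ReLU output unit."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(hidden_units[0], activation="tanh"),  # size is an assumption
        tf.keras.layers.Dense(hidden_units[1], activation="tanh"),  # size is an assumption
        tf.keras.layers.Dense(1, activation="relu"),
    ])
    model.compile(
        optimizer="adam",  # assumption: the optimizer is not stated in the post
        loss="mean_squared_error",
        metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.5)],
    )
    return model


def train(model, features, labels, epochs=1000, batch_size=16):
    """Fit the model with the batch size and epoch count described above."""
    return model.fit(
        features, labels, epochs=epochs, batch_size=batch_size, verbose=0
    )
```

Here `features` would be the normalized columns as a float array and `labels` the survival column, for example `train(build_model(features.shape[1]), features, labels)`.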
With this model I reached the top 25%, and I think that is the best you can get with a model this simple. To improve your score you need to take a different approach to preprocessing, as I said before. I also think a good solution would be to add another model that uses a different algorithm and perform a double or even triple validation.
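One possible reading of this idea is to let two or three models vote on each passenger. The sketch below is only an assumption about how that could look, not something I tested on the challenge: `majority_vote` is a hypothetical helper, and the three prediction lists are invented.

```python
import numpy as np


def majority_vote(predictions) -> np.ndarray:
    """Combine 0/1 predictions from several models by majority vote.

    `predictions` holds one sequence per model, all of the same length.
    A tie (possible with an even number of models) is counted as 0.
    """
    stacked = np.stack([np.asarray(p) for p in predictions])  # (n_models, n_passengers)
    return (stacked.mean(axis=0) > 0.5).astype(int)


# Invented predictions from three hypothetical models for three passengers.
combined = majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]])
print(combined)
```

An odd number of models avoids ties, which is one reason to prefer three voters over two.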
Links
Titanic Challenge: https://www.kaggle.com/c/titanic