The Titanic ML project on Kaggle

Greetings,

I hope everyone is doing well.

I am trying to do the Titanic challenge on Kaggle. Can you please assist me with the following queries:

  1. Do I need to create categories for Age before I start?

  2. I dropped the following columns because I think they are completely unnecessary and cannot be used: Cabin, Ticket, Name. Am I right here? Should I drop Embarked as well?

  3. How do I determine which model and features to use? To me it seems like a regression task; are there metrics that I can use?

The data and my Jupyter notebook are here:

Thanks.

Kind Regards,
Atrox.

  1. You can try binning the Age variable and compare whether it leads to better performance of the model, or models.

  2. Again, you can compare the effects of dropping certain columns by experimenting. When deciding which features to drop, common sense might tell us that some features are unrelated to the chance of survival. But is that really the case? We have to examine the data. If you look into the survival rates of passengers who embarked at different ports (Embarked), do you see a pattern emerge? Basic data exploration will tell you that Sex and Pclass are two big factors influencing the chance of survival, but some may argue that the titles extracted from Name (Mr, Mrs, Miss, Sir, Lady, etc.) contain similarly important information, and, to a lesser extent, so does Cabin (the first letter of Cabin indicates which deck the cabin is located on, which is related to Pclass). There is no definitive answer for feature selection. Sometimes seemingly useless features can yield important information after good feature engineering, but sometimes promising-looking features do not significantly improve model performance once included in training. The sketch after this list shows how to check a few of these patterns.

  3. While the model is meant to ‘predict’ which passengers survived, why do you think it looks like a regression task?
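
To make point 2 concrete, here is a minimal sketch of how to check those patterns yourself: survival rate by Embarked, by a Title extracted from Name, and by the deck letter taken from Cabin. It assumes the competition’s standard train.csv and its usual column names; the regex is just one illustrative way to pull out the title.

```python
import pandas as pd

# Assumes the standard Kaggle Titanic train.csv with its usual columns
# (Name, Cabin, Embarked, Survived).
df = pd.read_csv("train.csv")

# Survival rate by port of embarkation -- is there a pattern?
print(df.groupby("Embarked")["Survived"].mean())

# Title extracted from Name, e.g. "Braund, Mr. Owen Harris" -> "Mr".
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(df.groupby("Title")["Survived"].mean())

# First letter of Cabin indicates the deck; missing cabins become "Unknown".
df["Deck"] = df["Cabin"].str[0].fillna("Unknown")
print(df.groupby("Deck")["Survived"].mean())
```

If a column such as Embarked or Deck shows clearly different survival rates across its groups, that is a hint it carries signal worth keeping or engineering rather than dropping outright.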

  1. How do I see if it is effective?

  2. Okay, I understand. Are there some data analysis methods that you can advise me to use in this case? Perhaps some articles, for example?

  3. I am not sure about it being a regression model; that is why I wanted to ask how to determine which model to use. I thought it was a regression model since I saw some correlations between age and survival rate and so on. I thought I could just fit a regression line to this data.

  1. You train a model with the Age variable as it is (numeric values), make predictions on a validation set or test set, and get an accuracy score. Then train another model with the Age variable binned (in one or more ways), get another score (or scores), and compare; see the sketch below.

This tutorial from Kaggle’s Intermediate Machine Learning mini-course uses this comparative approach on the effect of different treatments of categorical variables, though with a different dataset.
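
Here is a rough sketch of that comparison on the Titanic data, again assuming the standard train.csv. The feature list, the imputation, the choice of random forest, and the age bin edges are my own illustrative choices, not the "right" ones.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")

# Deliberately simple preprocessing so the comparison stays readable.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]

X_train, X_valid, y_train, y_valid = train_test_split(
    df[features], df["Survived"], test_size=0.2, random_state=0
)

def validation_accuracy(X_tr, X_va):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_tr, y_train)
    return accuracy_score(y_valid, model.predict(X_va))

# Model 1: Age as a raw numeric column.
print("numeric Age:", validation_accuracy(X_train, X_valid))

# Model 2: Age binned into a handful of ordinal categories (bin edges are arbitrary here).
bins = [0, 12, 18, 35, 60, 100]
X_train_binned = X_train.assign(Age=pd.cut(X_train["Age"], bins=bins, labels=False))
X_valid_binned = X_valid.assign(Age=pd.cut(X_valid["Age"], bins=bins, labels=False))
print("binned Age: ", validation_accuracy(X_train_binned, X_valid_binned))
```

Whichever version scores higher on the held-out data is the treatment worth keeping; you can repeat the same pattern for any other feature-engineering decision.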

  2. For the Titanic challenge, there are tons of notebooks and tutorials on Kaggle’s competition page. You may look into the ones with the most votes, but not the ones with a 100% accuracy score, which achieve it by cheating for vanity.

  3. I think there is some confusion about the terms used. In supervised learning we have two major types of task: classification and regression. The following description is taken from the scikit-learn tutorial:

  • classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. An example of a classification problem would be handwritten digit recognition, in which the aim is to assign each input vector to one of a finite number of discrete categories. Another way to think of classification is as a discrete (as opposed to continuous) form of supervised learning where one has a limited number of categories and for each of the n samples provided, one is to try to label them with the correct category or class.
  • regression: if the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the length of a salmon as a function of its age and weight.

While you are not wrong in saying that a regression line can be fitted between certain variables and the survival rate (in fact, this is how logistic regression operates), in the end the final predictions are not survival rates but concrete answers as to whether a passenger with the given features survived or not. So it is a classification problem; the sketch below shows the difference.
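
A small sketch of that distinction, assuming train.csv and a deliberately simple preprocessing (not your actual notebook): logistic regression fits probabilities, but what you submit to Kaggle are the hard 0/1 class labels.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
X = df[["Pclass", "Sex", "Age", "Fare"]]
y = df["Survived"]
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# The "regression" part: a fitted probability of survival between 0 and 1...
print(clf.predict_proba(X_valid)[:5, 1])

# ...but the submission needs hard 0/1 labels (thresholded at 0.5 by default),
# which is why this is a classification task.
print(clf.predict(X_valid)[:5])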

I am not sure whether you are referring to machine learning algorithms (like logistic regression, SVM, kNN, decision trees, random forest) when you talk about which model to use. Any algorithm that can do classification can be used; of course, you would pick the one with the best performance. For the different performance metrics available for classification, you may consult the Metrics and scoring section of scikit-learn’s documentation or this article, but Kaggle’s Titanic competition just uses simple classification accuracy, i.e. the percentage of passengers whose fate you correctly predict. For feature selection, again you may consult the corresponding section of scikit-learn’s documentation or this article.
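
If it helps, here is a sketch of comparing a few classification algorithms with cross-validated accuracy, the same metric the competition scores on. The preprocessing and feature list are again just placeholders.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

df = pd.read_csv("train.csv")
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
X = df[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]]
y = df["Survived"]

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Five-fold cross-validated accuracy: the metric the competition scores on.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```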
