Health Costs with Linear Regression - Data Shuffle? or Not?

Tell us what’s happening:
I successfully found a solution with my code, but along the way I discovered something else. We have to split the original dataframe so that 80% of it becomes train_data and the remaining 20% becomes test_data. With those two datasets, the model’s Mean Absolute Error (MAE) is less than 3500. But when you shuffle the dataset first (using df.sample) before splitting, and then follow the rest of the steps exactly as before, i.e. the same model and procedures used for the non-shuffled dataframe, the model’s MAE becomes much higher (~4100).

The remaining steps are:

  • Creating features and labels for training and testing
  • Normalizing the values with “keras.layers.Normalization”
  • Building the model with “Adam” as the optimizer and “mae” as the loss function
  • Training the model and evaluating it

These steps are done identically for the un-shuffled and the shuffled dataset, yet the results are very different.

My point is that the model should not behave so differently just because the data is shuffled, right? If the results change this much depending on whether the sample is shuffled, can the model really be relied on? What would we need to do to reduce the MAE for any sample?

Your code so far
You can test it yourself by uncommenting one line and commenting out the other: one line leaves the data un-shuffled, the other shuffles it:
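A minimal sketch of the setup described above, so the two lines can be compared. It assumes the challenge’s insurance.csv file with an “expenses” label column; the dummy encoding of the categorical columns and the layer sizes are assumptions, not necessarily the original poster’s exact code:

```python
import pandas as pd
from tensorflow import keras

# Assumed setup: the challenge's insurance dataset, 'expenses' is the label.
df = pd.read_csv('insurance.csv')

# Encode the categorical columns as numbers (one common approach).
df = pd.get_dummies(df, columns=['sex', 'smoker', 'region'])

# Uncomment exactly one of the two lines below:
# df = df.reset_index(drop=True)                               # no shuffle
df = df.sample(frac=1, random_state=0).reset_index(drop=True)  # shuffle first

# 80/20 split by position.
split = int(len(df) * 0.8)
train_data, test_data = df.iloc[:split].copy(), df.iloc[split:].copy()

# Features and labels.
train_labels = train_data.pop('expenses').to_numpy(dtype='float32')
test_labels = test_data.pop('expenses').to_numpy(dtype='float32')
train_features = train_data.to_numpy(dtype='float32')
test_features = test_data.to_numpy(dtype='float32')

# Normalization layer adapted to the training features.
normalizer = keras.layers.Normalization()
normalizer.adapt(train_features)

# Model with 'Adam' as the optimizer and 'mae' as the loss, as described above.
model = keras.Sequential([
    normalizer,
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mae', metrics=['mae'])

# Train and evaluate.
model.fit(train_features, train_labels, epochs=100, verbose=0)
loss, mae = model.evaluate(test_features, test_labels, verbose=0)
print(f'Test MAE: {mae:.0f}')
```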

Your browser information:

User Agent is: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36

Challenge: Linear Regression Health Costs Calculator

Link to the challenge:

I do. But a shuffled 80% split and an un-shuffled 80% split give very different outputs (Mean Absolute Error). That is what I am talking about.

Yes. In my experience with this dataset, taking different samples when splitting into train and test datasets can cause the MAE to vary by several hundred points. If you use df.sample with different random states to split the dataset before training, you will notice this variation.
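For illustration, a minimal sketch of measuring that spread, assuming the pipeline from the sketch above has been wrapped in a hypothetical train_and_eval(train_df, test_df) helper that returns the test MAE:

```python
# `train_and_eval` is a hypothetical helper wrapping the model building,
# training and evaluation steps from the earlier sketch.
maes = []
for seed in range(5):
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    split = int(len(shuffled) * 0.8)
    maes.append(train_and_eval(shuffled.iloc[:split], shuffled.iloc[split:]))

print(maes)  # the spread across seeds shows the sampling effect
```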

I think it is part of the nature of machine learning that model performance is affected by the sample used in training, just like any other activity that involves sampling. K-fold cross validation, which uses different portions of the data to train and test a model on different iterations, may give a better idea of how accurately the model will perform in practice.
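A minimal sketch of 5-fold cross validation, assuming the features X and labels y are already NumPy arrays and that a build_model() helper returns a freshly compiled Keras model (both names are placeholders, not from the posts above):

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_maes = []
for train_idx, test_idx in kf.split(X):
    model = build_model()  # placeholder: returns a fresh compiled model
    model.fit(X[train_idx], y[train_idx], epochs=100, verbose=0)
    _, mae = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    fold_maes.append(mae)

print('MAE per fold:', [round(m) for m in fold_maes])
print(f'Mean: {np.mean(fold_maes):.0f}, std: {np.std(fold_maes):.0f}')
```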

When the sampling effect can determine whether a model passes the test or not, that means the model’s normal range on the accuracy metric is too close to the pass/fail line. To improve the general accuracy level, you may tweak your model by adding neurons or layers, or you may do some feature engineering. In my experience, if you feed in the right set of features, even a model with a single dense layer, which is equivalent to multiple linear regression, can get the MAE below 1,400.
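A sketch of the kind of feature engineering meant here. The specific interaction features below (smoker × bmi and an obese-smoker flag) are assumptions about what “the right set of features” might look like, not necessarily the ones that reached 1,400:

```python
import pandas as pd
from tensorflow import keras

df = pd.read_csv('insurance.csv')  # assumed file name, as in the earlier sketch
df = pd.get_dummies(df, columns=['sex', 'region'])
df['smoker'] = (df['smoker'] == 'yes').astype('float32')

# Candidate engineered features (assumptions for illustration).
df['smoker_bmi'] = df['smoker'] * df['bmi']              # interaction term
df['obese_smoker'] = df['smoker'] * (df['bmi'] >= 30.0)  # obese-smoker flag

labels = df.pop('expenses').to_numpy(dtype='float32')
features = df.to_numpy(dtype='float32')

# A single Dense layer with no activation is multiple linear regression.
normalizer = keras.layers.Normalization()
normalizer.adapt(features)
model = keras.Sequential([normalizer, keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mae')
# ...split, fit and evaluate as in the earlier sketches.
```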

Thanks for the appreciation. I took some hints from this forum, and I try to give back. :smiley:

A further thought on the sampling effect: from the plots of predicted values against true values, it’s apparent that in general all the models predict well for expenses under 15K or 16K, but struggle to varying degrees for values above that. Each high-expenses case can contribute an error of several thousand up to tens of thousands. So if the test dataset happens to contain more high-expenses cases, the model’s mean absolute error will be higher, and vice versa.
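A minimal sketch of that diagnostic plot, assuming a trained model and the held-out test_features / test_labels from the earlier sketch:

```python
import matplotlib.pyplot as plt

preds = model.predict(test_features).flatten()

lims = [0, max(float(test_labels.max()), float(preds.max()))]
plt.scatter(test_labels, preds, alpha=0.5)
plt.plot(lims, lims, 'r--')  # perfect-prediction reference line
plt.xlabel('True expenses')
plt.ylabel('Predicted expenses')
plt.show()
```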
