Health Costs with Linear Regression - Data Shuffle? or Not?

Tell us what’s happening:
I successfully found a solution with my code, but along the way I discovered something else. We have to split the original dataframe so that 80% of it becomes train_data and the remaining 20% becomes test_data. With those two datasets, the model’s Mean Absolute Error (MAE) is less than 3500. But when you shuffle the dataset first (using df.sample) before splitting, and then follow the rest of the steps exactly as before, i.e. the same model and procedures used for the non-shuffled dataframe, the model’s MAE becomes much higher (~4100).

The remaining steps are:

  • Creating features and labels for training and testing
  • Normalizing the values with “keras.layers.Normalization”
  • Building the model with “Adam” as the optimizer and “mae” as the loss function
  • Training the model and evaluating it

These steps are done identically for the un-shuffled and the shuffled dataset, yet the results are very different.

My point is that the model should not behave so differently just because the data is shuffled, right? If the results change this much depending on whether the sample is shuffled, can the model really be relied on? What would we need to do to reduce the MAE for any sample?

Your code so far
You can test it yourself by uncommenting one line and commenting out the other: one line leaves the data un-shuffled, the other shuffles it:
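A minimal sketch of the setup described above, so the two lines can be compared. It assumes the challenge’s insurance.csv file with an “expenses” label column; the dummy encoding of the categorical columns and the layer sizes are assumptions, not necessarily the original poster’s exact code:

```python
import pandas as pd
from tensorflow import keras

# Assumed setup: the challenge's insurance dataset, 'expenses' is the label.
df = pd.read_csv('insurance.csv')

# Encode the categorical columns as numbers (one common approach).
df = pd.get_dummies(df, columns=['sex', 'smoker', 'region'])

# Uncomment exactly one of the two lines below:
# df = df.reset_index(drop=True)                               # no shuffle
df = df.sample(frac=1, random_state=0).reset_index(drop=True)  # shuffle first

# 80/20 split by position.
split = int(len(df) * 0.8)
train_data, test_data = df.iloc[:split].copy(), df.iloc[split:].copy()

# Features and labels.
train_labels = train_data.pop('expenses').to_numpy(dtype='float32')
test_labels = test_data.pop('expenses').to_numpy(dtype='float32')
train_features = train_data.to_numpy(dtype='float32')
test_features = test_data.to_numpy(dtype='float32')

# Normalization layer adapted to the training features.
normalizer = keras.layers.Normalization()
normalizer.adapt(train_features)

# Model with 'Adam' as the optimizer and 'mae' as the loss, as described above.
model = keras.Sequential([
    normalizer,
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mae', metrics=['mae'])

# Train and evaluate.
model.fit(train_features, train_labels, epochs=100, verbose=0)
loss, mae = model.evaluate(test_features, test_labels, verbose=0)
print(f'Test MAE: {mae:.0f}')
```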

Your browser information:

User Agent is: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36

Challenge: Linear Regression Health Costs Calculator

Link to the challenge:

I do. But a shuffled 80% split and an un-shuffled 80% split give very different outputs (Mean Absolute Error). That is what I am talking about.

Yes. In my experience with this dataset, taking different samples when splitting into train and test datasets can cause the MAE to vary by several hundred points. If you use df.sample with different random states to split the dataset before training, you will notice this variation.
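For illustration, a minimal sketch of measuring that spread, assuming the pipeline from the sketch above has been wrapped in a hypothetical train_and_eval(train_df, test_df) helper that returns the test MAE:

```python
# `train_and_eval` is a hypothetical helper wrapping the model building,
# training and evaluation steps from the earlier sketch.
maes = []
for seed in range(5):
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    split = int(len(shuffled) * 0.8)
    maes.append(train_and_eval(shuffled.iloc[:split], shuffled.iloc[split:]))

print(maes)  # the spread across seeds shows the sampling effect
```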

I think it is part of the nature of machine learning that model performance is affected by the sample used in training, just like any other activity that involves sampling. K-fold cross validation, which uses different portions of the data to train and test a model on different iterations, may give a better idea of how accurately the model will perform in practice.
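A minimal sketch of 5-fold cross validation, assuming the features X and labels y are already NumPy arrays and that a build_model() helper returns a freshly compiled Keras model (both names are placeholders, not from the posts above):

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_maes = []
for train_idx, test_idx in kf.split(X):
    model = build_model()  # placeholder: returns a fresh compiled model
    model.fit(X[train_idx], y[train_idx], epochs=100, verbose=0)
    _, mae = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    fold_maes.append(mae)

print('MAE per fold:', [round(m) for m in fold_maes])
print(f'Mean: {np.mean(fold_maes):.0f}, std: {np.std(fold_maes):.0f}')
```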

When the sampling effect can determine whether a model passes the test or not, that means the model’s normal range on the accuracy metric is too close to the pass/fail line. To improve the general accuracy level, you may tweak your model by adding neurons or layers, or you may do some feature engineering. In my experience, if you feed in the right set of features, even a model with a single dense layer, which is equivalent to multiple linear regression, can get the MAE below 1,400.
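A sketch of the kind of feature engineering meant here. The specific interaction features below (smoker × bmi and an obese-smoker flag) are assumptions about what “the right set of features” might look like, not necessarily the ones that reached 1,400:

```python
import pandas as pd
from tensorflow import keras

df = pd.read_csv('insurance.csv')  # assumed file name, as in the earlier sketch
df = pd.get_dummies(df, columns=['sex', 'region'])
df['smoker'] = (df['smoker'] == 'yes').astype('float32')

# Candidate engineered features (assumptions for illustration).
df['smoker_bmi'] = df['smoker'] * df['bmi']              # interaction term
df['obese_smoker'] = df['smoker'] * (df['bmi'] >= 30.0)  # obese-smoker flag

labels = df.pop('expenses').to_numpy(dtype='float32')
features = df.to_numpy(dtype='float32')

# A single Dense layer with no activation is multiple linear regression.
normalizer = keras.layers.Normalization()
normalizer.adapt(features)
model = keras.Sequential([normalizer, keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mae')
# ...split, fit and evaluate as in the earlier sketches.
```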

Thanks for the appreciation. I took some hints from this forum, and I try to give back. :smiley:

A further thought on the sampling effect: from the plots of predicted values against true values, it’s apparent that in general all the models predict well for expenses under 15K or 16K, but struggle to varying degrees for values above that. Each high-expenses case can contribute an error of several thousand up to tens of thousands. So if the test dataset happens to contain more high-expenses cases, the model’s mean absolute error will be higher, and vice versa.
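A minimal sketch of that diagnostic plot, assuming a trained model and the held-out test_features / test_labels from the earlier sketch:

```python
import matplotlib.pyplot as plt

preds = model.predict(test_features).flatten()

lims = [0, max(float(test_labels.max()), float(preds.max()))]
plt.scatter(test_labels, preds, alpha=0.5)
plt.plot(lims, lims, 'r--')  # perfect-prediction reference line
plt.xlabel('True expenses')
plt.ylabel('Predicted expenses')
plt.show()
```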
