I’m working on the heat map portion of the project, and it is failing. The first error is in one of the fields, so I don’t think I have a variable issue, but there are 3 or 4 differences between the actual and expected output.
Each one is a tenth off, and I’m wondering if there are differences between my rounding and others’ — I’d like to see your rounding code if you have it. Otherwise, I’m wondering if I should just modify the test to put the actual values in the expected, because I have completed the ‘spirit’ of the project.
It’s not rounding; it looks like it’s the data cleaning that’s causing the correlation to be off a bit, which in turn throws the heatmap off a bit (post a live project link and I could be more certain).
Each time you do one of these steps, you are picking the rows that match the quantile provided. So at the end, df_heat contains the rows that are under the 0.975 quantile for weight. All five conditions have to be satisfied simultaneously.
As far as simultaneous satisfaction goes, I think that is achieved here: I filter out anything that satisfies the negation of each condition, so I first keep only things above the 2.5% quantile and then, from that set, keep things below the 97.5% quantile. The end result should be everything in the range [2.5% quantile, 97.5% quantile], inclusive on both ends.
Your proposed solution being to put everything into one loc check does seem to give me a different resulting data frame but…
Actually sitting here thinking it through, I think that I figured out why it is different.
It stands to reason that logically the two operations should come out the same, except that they happen at different times. Putting all of the checks in one line means every quantile is calculated on the whole of the data. Putting the checks on different lines means the data is filtered on a per-step basis, so the quantile changes at each step. We get two different dataframes, and that is what is causing the problem.
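To make the difference concrete, here is a minimal sketch with made-up data (the column names `height` and `weight` and all the numbers are illustrative assumptions, not the project’s real dataset):

```python
import numpy as np
import pandas as pd

# Illustrative data only -- 'height' and 'weight' stand in for the
# project's real columns.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 1000),
    "weight": rng.normal(70, 15, 1000),
})

# One-shot: every cutoff is computed on the ORIGINAL frame, then a
# single combined mask is applied.
one_shot = df[
    (df["height"] >= df["height"].quantile(0.025))
    & (df["height"] <= df["height"].quantile(0.975))
    & (df["weight"] >= df["weight"].quantile(0.025))
    & (df["weight"] <= df["weight"].quantile(0.975))
]

# Step-by-step: each cutoff is recomputed on the already-filtered
# frame, so it drifts a little with every step.
step = df[df["height"] >= df["height"].quantile(0.025)]
step = step[step["height"] <= step["height"].quantile(0.975)]
step = step[step["weight"] >= step["weight"].quantile(0.025)]
step = step[step["weight"] <= step["weight"].quantile(0.975)]

# The weight cutoff the step-by-step version sees is not the one the
# one-shot version used, because it is taken from the filtered frame.
q_weight_full = df["weight"].quantile(0.975)
q_weight_after = df[df["height"] >= df["height"].quantile(0.025)]["weight"].quantile(0.975)
print(q_weight_full, q_weight_after)  # almost surely different values
print(len(one_shot), len(step))       # sizes typically differ as well
```

The cutoffs drift because each `.quantile()` call sees only the survivors of the previous filter, which is exactly why the order of the steps matters in one version and not the other.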
I messed around with this over several days until I looked at that bit of code for the nth time and realized what it was actually doing and why it was wrong. If I remember correctly, I also compared the dataframe sizes for both methods. They are likely to differ, and the wrong way is likely to produce a different size each time you reorder the filters. The right way should give the same size regardless of order.
It now occurs to me that this would be a great place for a test on the data frame size, similar to the one in the time series visualizer project.
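A minimal sketch of what such a size test could look like — `clean_data`, the toy 0–99 data, and the expected count are all illustrative assumptions, not the project’s actual code or numbers:

```python
import pandas as pd

def clean_data(df):
    """One-shot filtering: both cutoffs computed on the original frame."""
    return df[
        (df["height"] >= df["height"].quantile(0.025))
        & (df["height"] <= df["height"].quantile(0.975))
    ]

def test_df_heat_size():
    # With heights 0..99 and pandas' default linear interpolation, the
    # cutoffs are 2.475 and 96.525, keeping the integers 3..96 -> 94 rows.
    df = pd.DataFrame({"height": range(100)})
    assert len(clean_data(df)) == 94

test_df_heat_size()
```

Pinning the row count like this would catch the step-by-step bug immediately, since recomputing the quantiles per step shifts how many rows survive.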
Wow good thing I found this.
I was completely puzzled as to why the test was failing on exactly one number xD
But yeah, doing them in one go results in a dataframe 63259 entries long, whereas with the other it’s a couple hundred shorter (and it does indeed vary slightly depending on the order).
Quick question: is there a benefit to doing this with df.loc[condition] instead of just df[condition]?