Rounding error in python medical data visualizer

I’m working on the heat map portion of the project, and its test is failing. The first difference is in one of the value fields rather than a label, so I don’t think I have variable issues, but there are three or four diffs between the actual and expected output.

Each one is a tenth off, and I’m wondering if my rounding differs from others’; I’d like to see your rounding code if you have it. Otherwise, should I just modify the test to use my actual values as the expected ones, since I have completed the ‘spirit’ of the project?

Any help would be greatly appreciated :smiley:

Diff output

First differing element 9:
'0.2'
'0.3'

['0.0',
 '0.0',
 '-0.0',
 '0.0',
 '-0.1',
 '0.5',
 '0.0',
 '0.1',
 '0.1',

- '0.2',
?    ^
+ '0.3',
?    ^

  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '0.2',
  '0.1',
  '0.0',
  '0.2',
  '0.1',
  '0.0',
  '0.1',
  '-0.0',
  '-0.1',
  '0.1',
  '0.0',
  '0.2',
  '0.0',
  '0.1',
  '-0.0',
  '-0.0',
  '0.1',
  '0.0',
  '0.1',
  '0.4',
  '-0.0',
  '-0.0',
  '0.3',
  '0.2',
  '0.1',
  '-0.0',
  '0.0',
  '0.0',
  '-0.0',
  '-0.0',
  '-0.0',
  '0.2',
  '0.1',
  '0.1',
  '0.0',
  '0.0',
  '0.0',
  '0.0',
  '0.3',
  '0.0',
  '-0.0',
  '0.0',
  '-0.0',
  '-0.0',
  '-0.0',
  '0.0',
  '0.0',
  '-0.0',
  '0.0',
  '0.0',
  '0.0',
  '0.2',
  '0.0',
  '-0.0',
  '0.2',

- '0.0',
?    ^
+ '0.1',
?    ^

  '0.3',
  '0.2',
  '0.1',
  '-0.0',
  '-0.0',
  '-0.0',
  '-0.0',
  '0.1',
  '-0.1',

- '-0.2',
?     ^
+ '-0.1',
?     ^

  '0.7',
  '0.0',
  '0.2',
  '0.1',
  '0.1',
  '-0.0',
  '0.0',
  '-0.0',
  '0.1'] : Expected different values in heat map.

Code SPOILER

# Clean the data

df_heat = df.loc[df['ap_lo'] <= df['ap_hi']]
df_heat = df_heat.loc[df_heat['height'] >= df_heat['height'].quantile(0.025)]
df_heat = df_heat.loc[df_heat['height'] <= df_heat['height'].quantile(0.975)]
df_heat = df_heat.loc[df_heat['weight'] >= df_heat['weight'].quantile(0.025)]
df_heat = df_heat.loc[df_heat['weight'] <= df_heat['weight'].quantile(0.975)]
print(df_heat.head())

# Calculate the correlation matrix
corr = df_heat.corr()
print(corr)

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr))

# Set up the matplotlib figure
fig, ax = plt.subplots()

# Draw the heatmap with 'sns.heatmap()'
ax = sns.heatmap(corr, vmin=-0.16, vmax=0.32, annot=True, fmt=".1f", mask=mask, ax=ax)

It’s not rounding; it looks like the data cleaning is what’s throwing the correlations off a bit, which in turn throws the heatmap off a bit (post a live project link and I could be more certain).

Each time you do one of these steps, you are picking the rows that match a quantile computed on the already-filtered data. So at the end, df_heat contains the rows under the 0.975 weight quantile of whatever survived the previous filters. All five conditions have to be satisfied simultaneously, with every quantile computed on the original data.


@jeremy.a.gray , I appreciate you taking a look. Here is my live project link.

https://repl.it/@mcfarke311/boilerplate-medical-data-visualizer#medical_data_visualizer.py

As far as simultaneous satisfaction goes, I think that is achieved here: I filter out anything that fails each condition in turn, keeping only rows above the 2.5% quantile and then, from that set, rows below the 97.5% quantile. The end result should be everything in [2.5% quantile, 97.5% quantile], inclusive on both ends.

Hmmm.

@jeremy.a.gray - that does seem to fix the problem… so thank you for that!

But I’m still very curious as to what the difference is here.

I did

df_heat = df.loc[df['ap_lo'] <= df['ap_hi']]

df_heat = df_heat.loc[df_heat['height'] >= df_heat['height'].quantile(0.025)]
df_heat = df_heat.loc[df_heat['height'] <= df_heat['height'].quantile(0.975)]
df_heat = df_heat.loc[df_heat['weight'] >= df_heat['weight'].quantile(0.025)]
df_heat = df_heat.loc[df_heat['weight'] <= df_heat['weight'].quantile(0.975)]

and you suggested the following

df_heat = df.loc[(df['ap_lo'] <= df['ap_hi'])
                 & (df['height'] >= df['height'].quantile(0.025))
                 & (df['height'] <= df['height'].quantile(0.975))
                 & (df['weight'] >= df['weight'].quantile(0.025))
                 & (df['weight'] <= df['weight'].quantile(0.975))]

Your proposed solution of putting everything into one loc check does give me a different resulting data frame, but…

Actually sitting here thinking it through, I think that I figured out why it is different.

It stands to reason that logically the two operations should come out the same, except that they happen at different times. Putting all of the checks in one line means every quantile is calculated on the whole of the data. Putting the checks on different lines means the data is filtered step by step, so each quantile is computed on an already-shrunken frame and shifts at every step. We get two different dataframes, and that is what causes the problem.
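The drift is easy to demonstrate on a toy frame (all column names and numbers below are invented for illustration, not from the project’s dataset):

```python
import pandas as pd

# Toy data: the weight quantile shifts once rows have already
# been removed by an earlier height filter.
df = pd.DataFrame({
    'height': [150, 155, 160, 165, 170, 175, 180, 185, 190, 195],
    'weight': [45, 50, 55, 60, 65, 70, 75, 80, 120, 130],
})

# Sequential filtering: the weight quantile is computed on the
# frame that the height filter already shrank.
step = df[df['height'] >= df['height'].quantile(0.25)]
step = step[step['weight'] <= step['weight'].quantile(0.75)]

# Combined filtering: every quantile is computed on the original frame.
combined = df[(df['height'] >= df['height'].quantile(0.25)) &
              (df['weight'] <= df['weight'].quantile(0.75))]

print(len(step), len(combined))  # → 5 4 (sequential keeps one extra row)
```

Same conditions, different thresholds, different frames.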

That’s it exactly.

I messed around with this over a period of several days until I looked at that bit of code for the nth time and realized what it was actually doing and why it was wrong. If I remember correctly, I also compared the data frame sizes produced by both methods. They are likely to differ, and with the step-by-step method the size is likely to change if you reorder the filters; the combined method gives the same size regardless of order.

It now occurs to me that this would be a great place for a test on the data frame size, similar to the one in the time series visualizer project.

Wow, good thing I found this.
I was completely puzzled as to why the test was failing on exactly one number xD
But yeah, doing them in one go results in a 63259-entry dataframe, whereas with the other it’s a couple hundred shorter (and the size does indeed vary slightly depending on the order).

Quick question: is there a benefit to doing this with df.loc[condition] instead of just df[condition]?

I’m glad this helped you! :slight_smile:

Not in this situation. But according to the Zen of Python, “explicit is better than implicit”.

There are some situations where you must use loc, though. Check out the following on SO.
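One common case where .loc matters is assignment. A small sketch (made-up data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
mask = df['a'] > 1

# For *reading*, df[mask] and df.loc[mask] behave the same.
# For *writing*, chained indexing like df[mask]['b'] = 0 may modify a
# temporary copy (raising SettingWithCopyWarning and leaving df unchanged),
# while .loc addresses the original frame directly:
df.loc[mask, 'b'] = 0

print(df['b'].tolist())  # → [4, 0, 0]
```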