Data Analysis with Python Projects - Medical Data Visualizer

Tell us what’s happening:

Hi,
I’m trying to create a correlation matrix, but the test fails because it expects negative zeros (-0.0) in some places, whereas I get only positive zeros (0.0). What’s wrong?
At first, I thought that I needed to transform my matrix with abs() for each zero. But then I realised that the -0.0 values are hard-coded in the test module. How can I pass the test? I should not modify the test module, because it belongs to a certification project.
Thanks!

Your code so far

corr = df_heat.corr().round(decimals=1)
# this is what I tried: .transform(lambda x: [0.0 if i == 0.0 else i for i in x])

Your browser information:

User Agent is: Mozilla/5.0 (X11; Linux x86_64; rv:129.0) Gecko/20100101 Firefox/129.0

Challenge Information:

Data Analysis with Python Projects - Medical Data Visualizer

Here is the output of the test:

AssertionError: Lists differ: ['0.0', '0.0', '0.0', '0.0', '-0.1', '0.5', '0.0', '0.1',[579 chars]0.1'] != ['0.0', '0.0', '-0.0', '0.0', '-0.1', '0.5', '0.0', '0.1'[601 chars]0.1']

First differing element 2:
'0.0'
'-0.0'
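
For context, -0.0 is a distinct floating-point value that nevertheless compares equal to 0.0, so an equality check cannot tell the two apart. Below is a minimal sketch in plain Python of how a small negative number becomes '-0.0' once it is rounded or formatted to one decimal. The sample value -0.04 is an illustrative assumption, not a value taken from the dataset.

```python
import math

small_negative = -0.04  # hypothetical correlation value, for illustration only

rounded = round(small_negative, 1)          # the sign survives rounding
formatted = format(small_negative, ".1f")   # same effect as a ".1f" format spec

# Negative zero compares equal to positive zero...
print(rounded == 0.0)                # True
# ...but its sign bit is still set, which copysign exposes.
print(math.copysign(1.0, rounded))   # -1.0
print(str(rounded), formatted)       # -0.0 -0.0
```

Because of that equality, an expression such as `0.0 if i == 0.0 else i` replaces every -0.0 with 0.0 instead of preserving it.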

I’m trying to see whether this is a library version problem. I tried to use the library versions from the original GitHub project, but replit.com doesn’t let me do that. That’s why I had to remove the versions from the requirements.txt file. Here is what I have in this file:

seaborn
pandas
matplotlib
numpy

I couldn’t find a way to test other versions. I tried to install them manually (with “pip install …”), but there are always some errors.

pip install should work although there may be a chain of dependencies. What errors do you get?

Please share your full code for testing.

I’m not sure if I can share my code, as this project is part of the certification.
I was able to install Seaborn 0.13.2 and Pandas 1.5.3 manually.
But then NumPy was no longer compatible. I then tried to install NumPy 3.1.3 and 3.2.2 manually, but the installation did not succeed (on replit.com).
The only way to run the project without library problems is to remove the versions from the requirements.txt file. But then the test module complains that my zeros (0.0) are not negative zeros (-0.0).

I get this error with Seaborn 0.13.2 and Pandas 1.5.3 (and the default version of NumPy):

Traceback (most recent call last):
  File "/home/runner/boilerplate-medical-data-visualizer/main.py", line 2, in <module>
    import medical_data_visualizer
  File "/home/runner/boilerplate-medical-data-visualizer/medical_data_visualizer.py", line 1, in <module>
    import pandas as pd
  File "/home/runner/boilerplate-medical-data-visualizer/.pythonlibs/lib/python3.12/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat import is_numpy_dev as _is_numpy_dev  # pyright: ignore # noqa:F401
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/boilerplate-medical-data-visualizer/.pythonlibs/lib/python3.12/site-packages/pandas/compat/__init__.py", line 18, in <module>
    from pandas.compat.numpy import (
  File "/home/runner/boilerplate-medical-data-visualizer/.pythonlibs/lib/python3.12/site-packages/pandas/compat/numpy/__init__.py", line 4, in <module>
    from pandas.util.version import Version
  File "/home/runner/boilerplate-medical-data-visualizer/.pythonlibs/lib/python3.12/site-packages/pandas/util/__init__.py", line 2, in <module>
    from pandas.util._decorators import (  # noqa:F401
  File "/home/runner/boilerplate-medical-data-visualizer/.pythonlibs/lib/python3.12/site-packages/pandas/util/_decorators.py", line 14, in <module>
    from pandas._libs.properties import cache_readonly
  File "/home/runner/boilerplate-medical-data-visualizer/.pythonlibs/lib/python3.12/site-packages/pandas/_libs/__init__.py", line 13, in <module>
    from pandas._libs.interval import Interval
  File "pandas/_libs/interval.pyx", line 1, in init pandas._libs.interval
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
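
A ValueError of this kind usually means that a compiled extension (here pandas) was built against a different NumPy binary interface than the one currently installed. It is commonly reported when an older pandas build runs against NumPy 2.x, although whether that applies to this environment is an assumption. A first diagnostic step is to list what is actually installed. The sketch below uses only the standard library, and the helper name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    report = installed_versions(["numpy", "pandas", "seaborn", "matplotlib"])
    for name, ver in report.items():
        print(f"{name}: {ver or 'not installed'}")
```

Reading package metadata this way does not import pandas, so it still works when `import pandas` itself fails.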

I see other people sharing their code, so I hope this is OK. Here is my code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# 1
df = pd.read_csv('medical_examination.csv')

# 2 BMI=m/h^2
df['overweight'] = ((df['weight'] / ((df['height'] / 100)**2))
                    > 25).astype(int)

# 3
df['cholesterol'] = df['cholesterol'].apply(lambda x: 0 if x == 1 else 1)
df['gluc'] = df['gluc'].apply(lambda x: 0 if x == 1 else 1)


# 4
def draw_cat_plot():
    # 5
    df_cat = pd.melt(df,
                     id_vars='cardio',
                     value_vars=[
                         'cholesterol', 'gluc', 'smoke', 'alco', 'active',
                         'overweight'
                     ])

    # 6
    df_cat = df_cat.groupby(['cardio', 'variable',
                             'value']).size().reset_index(name='total')

    # 7
    cardio0 = df_cat[df_cat['cardio'] == 0]
    cardio1 = df_cat[df_cat['cardio'] == 1]

    # 8
    fig, axs = plt.subplots(ncols=2, figsize=(15, 5))
    sns.countplot(data=cardio0, x='variable', hue='value',
                  ax=axs[0]).set(title='cardio = 0', ylabel="total")
    axs[0].legend([], [], frameon=False)
    sns.countplot(data=cardio1, x='variable', hue='value',
                  ax=axs[1]).set(title='cardio = 1', ylabel="total")
    sns.move_legend(axs[1], "right", bbox_to_anchor=(1.15, 0.5))

    # 9
    fig.savefig('catplot.png')
    return fig


# 10
def draw_heat_map():
    # 11
    df_heat = df.copy()
    df_heat = df_heat[df_heat['ap_lo'] <= df_heat['ap_hi']]
    df_heat = df_heat[(df_heat['height'] >= df_heat['height'].quantile(0.025))
                      &
                      (df_heat['height'] <= df_heat['height'].quantile(0.975))]
    df_heat = df_heat[(df_heat['weight'] >= df_heat['weight'].quantile(0.025))
                      &
                      (df_heat['weight'] <= df_heat['weight'].quantile(0.975))]

    # 12
    corr = df_heat.corr().round(decimals=1)
    # Tried: .transform(lambda x: [0.0 if i == 0.0 else i for i in x])

    # 13
    mask = np.triu(np.ones_like(corr, dtype=bool))

    # 14
    fig, ax = plt.subplots(ncols=1, figsize=(10, 10))

    # 15
    sns.heatmap(corr, mask=mask, annot=True, fmt=".1f")

    # 16
    fig.savefig('heatmap.png')
    return fig

I tested on my computer, and the problem with negative zeros is gone (I get them now). But there are still some differences between the heat map values from my code and the test values.

...['0.0', '0.0', '-0.0', '0.0', '-0.1', '0.5', '0.0', '0.1', '0.1', '0.2', '0.0', '0.0', '0.0', '0.0', '0.0', '0.0', '0.2', '0.1', '0.0', '0.2', '0.1', '0.0', '0.1', '-0.0', '-0.1', '0.1', '0.0', '0.1', '0.0', '0.1', '-0.0', '-0.0', '0.1', '0.0', '0.1', '0.4', '-0.0', '-0.0', '0.3', '0.2', '0.1', '-0.0', '0.0', '0.0', '-0.0', '-0.0', '-0.0', '0.2', '0.1', '0.1', '0.0', '0.0', '0.0', '0.0', '0.3', '0.0', '-0.0', '0.0', '-0.0', '-0.0', '-0.0', '0.0', '0.0', '-0.0', '0.0', '0.0', '0.0', '0.2', '0.0', '-0.0', '0.2', '0.1', '0.3', '0.2', '0.1', '-0.0', '-0.0', '-0.0', '-0.0', '0.1', '-0.1', '-0.2', '0.7', '0.0', '0.2', '0.1', '0.1', '-0.0', '0.0', '-0.0', '0.1']
F
======================================================================
FAIL: test_heat_map_values (test_module.HeatMapTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/media/luca/maxone/DataScience/freeCodeCamp/Data Analysis with Python/Projects/3-Medical Data Visualizer/boilerplate-medical-data-visualizer/test_module.py", line 47, in test_heat_map_values
    self.assertEqual(actual, expected, "Expected different values in heat map.")
AssertionError: Lists differ: ['0.0[59 chars], '0.2', '0.0', '0.0', '0.0', '0.0', '0.0', '0[548 chars]0.1'] != ['0.0[59 chars], '0.3', '0.0', '0.0', '0.0', '0.0', '0.0', '0[548 chars]0.1']

First differing element 9:
'0.2'
'0.3'

Diff is 1023 characters long. Set self.maxDiff to None to see it. : Expected different values in heat map.
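
The full diff can be displayed without editing the test module, because maxDiff is an ordinary attribute of unittest.TestCase. Here is a minimal sketch of the mechanism; the helper `list_diff_message` and the sample lists are hypothetical and only stand in for the real heat map labels.

```python
import unittest


def list_diff_message(actual, expected, max_diff=None):
    """Compare two lists the way assertEqual does and return the failure text."""
    case = unittest.TestCase()   # a bare TestCase is enough to call assert methods
    case.maxDiff = max_diff      # None disables truncation of long diffs
    try:
        case.assertEqual(actual, expected)
    except AssertionError as exc:
        return str(exc)
    return ""


# Hypothetical sample data: 200 labels with a single mismatch.
expected = ["0.0"] * 200
actual = list(expected)
actual[9] = "0.2"

print(list_diff_message(actual, expected))
```

When running the real tests, setting `unittest.TestCase.maxDiff = None` in your own runner script before the tests start should have the same effect on every test case that does not override the attribute.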

To better compare my results with the test values, here are the image I get (first image) and the example result (second image):


It seems to me that the problem is the way my data is rounded (because the differences are only 0.1, and there are very few of them).

For the other part of the project, I’m aware that my catplot.png is not correct (but the test doesn’t complain about it).

The problem may arise from the way you filter the data for the correlation. Consider different ways of filtering with multiple conditions. You may notice that the rows of the dataframe differ depending on how you do the filtering.

Thank you!
If I filter in one go (i.e. with &s), it works. But now, I have to solve the problem of the first chart.

Filtering on multiple conditions in separate lines may look equivalent to combining them with &. In many cases it is, but here we are filtering by percentiles. If we first remove the highest and lowest 2.5% by height and then filter on weight in a separate line, the weight percentiles are computed on the 95% of cases that remain, not on the original population, and so the result differs.
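
The difference can be reproduced on a small example. The sketch below does not use the project data: it builds a synthetic, hypothetical DataFrame of 100 rows, with column names borrowed from the thread, and compares the two filtering styles.

```python
import pandas as pd

# Synthetic, hypothetical data: 100 rows with distinct heights and weights.
# (i * 37) % 100 is just a deterministic way to shuffle the values 0..99.
df = pd.DataFrame({
    "height": [150 + i for i in range(100)],
    "weight": [50 + (i * 37) % 100 for i in range(100)],
})

# Combined: every quantile is computed on the original 100 rows.
combined = df[
    (df["height"] >= df["height"].quantile(0.025))
    & (df["height"] <= df["height"].quantile(0.975))
    & (df["weight"] >= df["weight"].quantile(0.025))
    & (df["weight"] <= df["weight"].quantile(0.975))
]

# Sequential: the weight quantiles are computed on the rows that survived
# the height filter, so the cut-offs shift.
step = df[
    (df["height"] >= df["height"].quantile(0.025))
    & (df["height"] <= df["height"].quantile(0.975))
]
sequential = step[
    (step["weight"] >= step["weight"].quantile(0.025))
    & (step["weight"] <= step["weight"].quantile(0.975))
]

print(len(combined), len(sequential))  # 89 88
```

Here the sequential version drops one more row, because the lower weight cut-off moves up once the extreme heights are gone. With the real dataset the numbers will differ, but the mechanism is the same.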
