Medical Data Visualizer Confusion

After major frustrations and a few days of trying to figure out this challenge I’m still completely lost starting with the step:

"Create DataFrame for cat plot using pd.melt " and “Group and reformat the data to split it by ‘cardio’. Show the counts of each feature. You will have to rename one of the collumns for the catplot to work correctly.”

My main problem is I’m not familiar enough with matplotlib or seaborn to understand what my data is ultimately supposed to look like before even attempting to make correct graphs for it.

I’ve watched the tutorials twice now and searched the pandas, Matplotlib, and seaborn docs, and I’m still confused. The Matplotlib and seaborn websites give examples, but they don’t show the underlying data those examples use, so it’s very difficult for me to understand how I need to format the medical data in this challenge to graph it correctly. I also don’t see any bar chart examples like the one in the first image, where we graph the counts of 1s and 0s next to one another on the x-axis, and that adds to the difficulty of making any progress.

Any advice on where I can go to learn how the data is supposed to be formatted, or other lessons / tutorials I should do before this challenge?

Thanks


https://www.geeksforgeeks.org/python-pandas-melt/
You can read an explanation of pd.melt there.

After way more hours and days than this should have taken, I’m finally done with it.

I know many others will struggle with this lesson, so below are the things that helped me and may come in handy for you also (along with a bit of ranting):

.melt(): make sure to use the correct id_vars. The point is to arrange your data so that, in the next step, you can count the values of each variable for each cardio group.
.groupby() and .count(): use these methods on your long, melted data to restructure it so that it plots correctly.
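
To make the two tips above concrete, here is a minimal sketch on a made-up four-row table. The values are invented and only two feature columns are used; the real challenge data has more columns, so treat this as an illustration of the shapes, not as the solution.

```python
import pandas as pd

# Toy stand-in for the medical data (values are made up for illustration).
df = pd.DataFrame({
    "cardio": [0, 0, 1, 1],
    "smoke":  [0, 1, 1, 1],
    "alco":   [0, 0, 0, 1],
})

# Wide -> long: keep 'cardio' as the identifier and stack the feature columns.
df_long = pd.melt(df, id_vars=["cardio"], value_vars=["smoke", "alco"])

# Count the rows for every (cardio, variable, value) combination that occurs.
df_counts = (
    df_long.groupby(["cardio", "variable", "value"])
    .size()
    .reset_index(name="total")
)

print(df_long)
print(df_counts)
```

Printing both frames side by side is the quickest way I know to see what "long" data means: one row per original cell, then one row per counted combination.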

Personally, I found the Seaborn documentation to be very lacking. I wish FCC didn’t use it for these early assignments because of the horrible documentation and the lack of online resources available to use or learn it.

For example, I spent a very long time searching for heatmap fmt string options and it took forever to find how to display it as a float with one decimal. Almost all online tutorials for Seaborn are horrible because most simply take the Seaborn examples and change the variable or data names and don’t expand on anything at all.
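
For anyone hunting for the same thing: as far as I can tell, the fmt argument takes an ordinary Python format spec, so ".1f" gives a float with one decimal. A small sketch on random toy data (the column names a, b, c are placeholders, not from the challenge):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random toy data, only there to have a correlation matrix to draw.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
corr = toy.corr()

fig, ax = plt.subplots(figsize=(4, 4))
# annot=True writes the value into each cell; fmt=".1f" formats it
# as a float with one decimal place.
sns.heatmap(corr, annot=True, fmt=".1f", ax=ax)

labels = [t.get_text() for t in ax.texts]
print(labels)
```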

I also think the learning in this activity should be scaffolded by doing a few simpler charts first before jumping straight into drawing simultaneous charts with one dataset. It felt like the learner had to skip many important steps in the education process to get to that point.

For beginners who are just learning how to clean and manipulate DataFrames in pandas, and who don’t yet know the data shapes seaborn requires, it is overwhelming, to say the least.

Anyway, I was very frustrated and I’m sure I won’t be alone. I hope others have a better experience with this problem than I did.


Hey man I’m suffering the exact same s*** as you and I’m quite frustrated bc this lesson is more about guessing what the correct data format for the catplot is instead of analyzing data … reframing and groupby etc is all clear … I’m just trying to figure out the format for the catplot. more specifically: how do you plot both the “1” and “0” bars for each variable? can you help me with that?

After melting, your data should look like this:

        cardio     variable  value
0            0  cholesterol      0
1            1  cholesterol      1
2            1  cholesterol      1
3            1  cholesterol      0
4            0  cholesterol      0
...        ...          ...    ...
419995       0   overweight      1
419996       1   overweight      1
419997       1   overweight      1
419998       1   overweight      1
419999       0   overweight      0

Then, after groupby and counting, your final data for graphing should look like so:

    cardio    variable  value  total
0        0      active      0   6378
1        0      active      1  28643
2        0        alco      0  33080
..     ...         ...    ...    ...
21       1  overweight      1  24440
22       1       smoke      0  32050
23       1       smoke      1   2929

After your data is formatted correctly it’s all about tinkering with the seaborn params to get it to show up correctly.
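
As a rough starting point for that tinkering, here is a sketch that plots pre-counted data in the shape shown above. The counts are made up and only two variables are included; the parameter choices are one combination that gives side-by-side 0 and 1 bars per variable, not necessarily the only one.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd
import seaborn as sns

# Hypothetical pre-counted data (counts invented for illustration).
df_counts = pd.DataFrame({
    "cardio":   [0, 0, 0, 0, 1, 1, 1, 1],
    "variable": ["active", "active", "smoke", "smoke"] * 2,
    "value":    [0, 1, 0, 1] * 2,
    "total":    [3, 7, 8, 2, 4, 6, 7, 3],
})

# kind="bar" draws the 'total' column as bar heights,
# hue="value" puts the 0 and 1 bars next to each other,
# col="cardio" makes one panel per cardio value.
g = sns.catplot(
    data=df_counts,
    kind="bar",
    x="variable",
    y="total",
    hue="value",
    col="cardio",
)
fig = g.fig  # the underlying matplotlib figure, which is what savefig needs
```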

I was thinking about this exercise a lot again last night. I feel like if it just had a few tests for us to get the data formatted correctly to step people through the process a little that would be a lot more helpful than just 4 tests at the end.

Anyway good luck.


Thank you man! I will keep on messing around with this. I think I’m on a good path, but that’s not what I (literally) signed up for :confused: Thanks a lot for your hint!

Hi, I get the same result without renaming a column, and it works well with catplot.

Actually, you can complete the exercise without creating a separate grouped DataFrame. With kind="count", catplot does the counting itself:
sns.catplot(data=df_cat, kind="count", x="variable", hue="value", col="cardio")
The col="cardio" argument splits the data by cardio and draws a separate plot for each of its two values.
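
A runnable sketch of this approach on made-up data. One difference from the grouped version: kind="count" labels the y-axis "count". If a "total" label is wanted (my guess is that this is what the column rename in the instructions is for, but treat that as an assumption), set_axis_labels can change it afterwards.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd
import seaborn as sns

# Toy stand-in for the medical data (values are made up).
df = pd.DataFrame({
    "cardio": [0, 0, 1, 1, 1],
    "smoke":  [0, 1, 1, 1, 0],
    "alco":   [0, 0, 0, 1, 1],
})
df_cat = pd.melt(df, id_vars=["cardio"], value_vars=["smoke", "alco"])

# catplot counts the rows itself, so no groupby is needed.
g = sns.catplot(data=df_cat, kind="count", x="variable", hue="value", col="cardio")

# Optional: relabel the y-axis from "count" to "total".
g.set_axis_labels("variable", "total")
```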


Works for me! Thanks!

@pschorey Thank you for posting this; it let me read up on seaborn and carry on writing the arguments for the melt function. I’m not sure how far I’ve succeeded, though, because I am now unable to group the data for cardio against the different variables. My code so far:

def draw_cat_plot():
    # Create DataFrame for cat plot using pd.melt using just the values from 'cholesterol', 'gluc', 'smoke', 'alco', 'active', and 'overweight'.
    df_cat = pd.melt(df, id_vars=['cardio'], value_vars=['cholesterol', 'gluc', 'smoke', 'alco', 'active', 'overweight'])

    # Group and reformat the data to split it by 'cardio'. Show the counts of each feature. You will have to rename one of the collumns for the catplot to work correctly.

    df_cat = None

@pschorey @ArbyC: I don’t know why, but I have tried this numerous times and can’t seem to get it right.

I also looked on Stack Overflow and found something about applying groupby to the result of melt(). I changed my code and now I get a different error. Correct me if I’m wrong: once you build the DataFrame with pd.melt, you use "cardio" as the id variable, and the other columns (cholesterol, smoke, alco, etc.) become the variables. That seems correct? So why does my next line, where I group by 'variable' and 'value', error out saying it can’t interpret 'variable'?

def draw_cat_plot():
    # Create DataFrame for cat plot using pd.melt using just the values from 'cholesterol', 'gluc', 'smoke', 'alco', 'active', and 'overweight'.
    df_cat = pd.melt(df, id_vars=['cardio'], value_vars=['cholesterol', 'gluc', 'smoke', 'alco', 'active', 'overweight'])

    # Group and reformat the data to split it by 'cardio'. Show the counts of each feature. You will have to rename one of the collumns for the catplot to work correctly.

    df_cat = pd.melt(df_cat).groupby(['variable', 'value']).size().to_frame(name='total')

    # Draw the catplot with 'sns.catplot()'
    fig = sns.catplot(
        x='variable',
        y='total',
        hue='value',
        col='cardio',
        data=df_cat
    )
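
One possible explanation for the "can't interpret" error, assuming it comes from seaborn looking up a column by name: after groupby(...).size().to_frame(name='total'), the grouping keys live in the index, so 'total' is the only real column. The toy sketch below (made-up data) shows what reset_index() changes. It also leaves out the second pd.melt and adds 'cardio' to the grouping keys, so that col='cardio' has a column to refer to.

```python
import pandas as pd

# Toy stand-in for the medical data (values are made up).
df = pd.DataFrame({
    "cardio": [0, 1, 1],
    "smoke":  [0, 1, 0],
    "alco":   [1, 1, 0],
})
df_cat = pd.melt(df, id_vars=["cardio"], value_vars=["smoke", "alco"])

# Group keys end up in the index here, not in the columns.
keys_in_index = (
    df_cat.groupby(["cardio", "variable", "value"])
    .size()
    .to_frame(name="total")
)
print(list(keys_in_index.columns))

# reset_index() moves the keys back into ordinary columns,
# which is what x=, hue= and col= look for.
keys_in_columns = keys_in_index.reset_index()
print(list(keys_in_columns.columns))
```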

I had a lot of trouble here as well. Did you create a total column set to 1 and then count? I don’t really know why it’s needed or whether there is another option.

No, I haven’t. But as per my initial post, doesn’t melt() take the DataFrame, the id variables, and the variable names as arguments? The assignment says "you have to rename one of the columns", so I’m sorry, I don’t follow.


@niagodiokou can you share the solution here.
Thanks in advance

@ArbyC Thanks for sharing

Hi, can someone please assist? I can’t get the graphs to come out as expected:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Import data
df = pd.read_csv('medical_examination.csv')

# Add 'overweight' column
# BMI = weight in kg / (height in m)^2; height is in cm, so divide by 100 before squaring
df['overweight'] = df['weight'] / ((df['height'] / 100) ** 2)
df['overweight'] = df['overweight'].apply(lambda x : 1 if x > 25 else 0)

# Normalize data by making 0 always good and 1 always bad. If the value of 'cholestorol' or 'gluc' is 1, make the value 0. If the value is more than 1, make the value 1.
df['gluc'] = df['gluc'].apply(lambda x : 0 if x ==1 else 1)

# Rewritten below, because the normalization applies to both gluc and cholesterol
df.loc[df['cholesterol'] == 1, 'cholesterol'] = 0
df.loc[df['cholesterol'] > 1, 'cholesterol'] = 1


# Draw Categorical Plot
def draw_cat_plot():
    # Create DataFrame for cat plot using `pd.melt` using just the values from 'cholesterol', 'gluc', 'smoke', 'alco', 'active', and 'overweight'.
    df_cat = pd.melt(df, id_vars = ['cardio'], value_vars = ['cholesterol', 'gluc', 'smoke', 'alco', 'active', 'overweight'])


    # Group and reformat the data to split it by 'cardio'. Show the counts of each feature. You will have to rename one of the collumns for the catplot to work correctly.

    df_cat['total'] = 1
    df_cat = df_cat.groupby(['cardio','variable', 'value'], as_index = False).count()

    # Draw the catplot with 'sns.catplot()'
    fig = sns.catplot(
      x = 'variable',
      y = 'total',
      kind = 'bar',
      col = 'cardio',
      data = df_cat
    )


    # Do not modify the next two lines
    fig.savefig('catplot.png')
    return fig


# Draw Heat Map
def draw_heat_map():
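    # Clean the data: keep rows where ap_lo <= ap_hi and where height and
    # weight lie between their 2.5th and 97.5th percentiles. Each comparison
    # needs its own parentheses because & binds more tightly than >= and <=.
    df_heat = df[
        (df['ap_lo'] <= df['ap_hi']) &
        (df['height'] >= df['height'].quantile(0.025)) &
        (df['height'] <= df['height'].quantile(0.975)) &
        (df['weight'] >= df['weight'].quantile(0.025)) &
        (df['weight'] <= df['weight'].quantile(0.975))
    ]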

    # Calculate the correlation matrix
    # https://heartbeat.fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07
    # corr = sns.heatmap(df_heat.corr(), annot = True)
    corr = df_heat.corr()

    # Generate a mask for the upper triangle
    # np.triu() returns a copy of a matrix with the elements below the k-th diagonal zeroed.
    mask = np.triu(corr)


    # Set up the matplotlib figure
    fig, ax = plt.subplots(figsize=(9,9))

    # Draw the heatmap with 'sns.heatmap()'
    # sns.heatmap(corr, linewidths=1, mask=mask, vmax=.3, center=0.09,square=True, cbar_kws = {'orientation' : 'horizontal'})

    sns.catplot(data=df_cat, kind='count', x='variable', hue='value', col='cardio')


    # Do not modify the next two lines
    fig.savefig('heatmap.png')
    return fig

Try just this:

df_cat = pd.melt(df, value_vars=["active", "alco", "cholesterol", "gluc", "overweight", "smoke"], id_vars="cardio")

g = sns.catplot(data=df_cat, kind="count",  x="variable", hue="value", col="cardio")

The groupby is not necessary.


@ArbyC Thank you for your feedback. I made the changes below to the code, but the error persists. This assignment is giving me the toughest time so far…

def draw_cat_plot():
    # Create DataFrame for cat plot using `pd.melt` using just the values from 'cholesterol', 'gluc', 'smoke', 'alco', 'active', and 'overweight'.
    df_cat = pd.melt(df, value_vars=["active", "alco", "cholesterol", "gluc", "overweight", "smoke"], id_vars="cardio")

    # Group and reformat the data to split it by 'cardio'. Show the counts of each feature. You will have to rename one of the collumns for the catplot to work correctly.

    g = sns.catplot(data=df_cat, kind="count",  x="variable", hue="value", col="cardio")

    # Draw the catplot with 'sns.catplot()'
    fig = sns.catplot(
      x = 'variable',
      y = 'total',
      kind = 'bar',
      col = 'cardio',
      data = df_cat
    )

    # Do not modify the next two lines
    fig.savefig('catplot.png')
    return fig

# Draw Heat Map
def draw_heat_map():
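    # Clean the data: keep rows where ap_lo <= ap_hi and where height and
    # weight lie between their 2.5th and 97.5th percentiles. Each comparison
    # needs its own parentheses because & binds more tightly than >= and <=.
    df_heat = df[
        (df['ap_lo'] <= df['ap_hi']) &
        (df['height'] >= df['height'].quantile(0.025)) &
        (df['height'] <= df['height'].quantile(0.975)) &
        (df['weight'] >= df['weight'].quantile(0.025)) &
        (df['weight'] <= df['weight'].quantile(0.975))
    ]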

    # Calculate the correlation matrix
    # https://heartbeat.fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07
    # corr = sns.heatmap(df_heat.corr(), annot = True)
    corr = df_heat.corr()

    # Generate a mask for the upper triangle
    # np.triu() returns a copy of a matrix with the elements below the k-th diagonal zeroed.
    mask = np.triu(corr)

    # Set up the matplotlib figure
    fig, ax = plt.subplots(figsize=(9,9))

    # Draw the heatmap with 'sns.heatmap()'
    sns.heatmap(corr, linewidths=1, mask=mask, vmax=.3, center=0.09,square=True, cbar_kws = {'orientation' : 'horizontal'})

    # sns.catplot(data=df_cat, kind='count', x='variable', hue='value', col='cardio')

    # Do not modify the next two lines
    fig.savefig('heatmap.png')
    return fig
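
For the heat map part, here is a small self-contained sketch of the cleaning filter and the mask on random toy data (the numbers are invented and only four columns are used). Two things it tries to show: every comparison is wrapped in its own parentheses, since & binds more tightly than <= and >= in Python, and the mask is built as an explicit boolean array. Passing np.triu(corr) directly relies on the correlation values being non-zero, so the boolean version seems like the safer habit.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random toy data standing in for the medical data (illustration only).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "height": rng.normal(165, 8, 200),
    "weight": rng.normal(74, 14, 200),
    "ap_hi":  rng.normal(128, 15, 200),
    "ap_lo":  rng.normal(82, 10, 200),
})

# Keep rows with ap_lo <= ap_hi and height inside the 2.5th-97.5th
# percentile range. Each comparison has its own parentheses.
keep = (
    (toy["ap_lo"] <= toy["ap_hi"])
    & (toy["height"] >= toy["height"].quantile(0.025))
    & (toy["height"] <= toy["height"].quantile(0.975))
)
toy_clean = toy[keep]

corr = toy_clean.corr()

# True on and above the diagonal, so only the lower triangle is drawn.
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(5, 5))
sns.heatmap(corr, mask=mask, square=True, ax=ax)
```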
