Medical Data Visualizer challenge dataframe

I successfully recreated the catplot but I did one step in a way that seems a bit inefficient, and I’m wondering if there is a better way. After doing pd.melt I used groupby with count() to add the ‘total’ column representing the total of each feature. I used the as_index=False feature to do this but in doing so I essentially replaced the ‘value’ column (in the dataframe created by pd.melt) with ‘total’, and couldn’t figure out how to preserve the ‘value’ column except by manually making a new column and just typing in [0,1,0,1…] etc. Is there a better way to do this?

Try using the hue argument.

See the fourth example here.

I don’t think I was very clear. I got the graph to work. But my question generally is:
So after doing pd.melt, my data looks like this:
   cardio     variable  value
0       0  cholesterol      0
1       1  cholesterol      1
2       1  cholesterol      1
3       1  cholesterol      0
4       0  cholesterol      0
Then I did: df_cat = df_cat.groupby(['cardio', 'variable', 'value'], as_index=False)['value'].count()
And got this:
   cardio     variable  value
0       0       active   6378
1       0       active  28643
2       0         alco  33080
3       0         alco   1941
4       0  cholesterol  29330

But is there a way to preserve the original ‘value’ column (the 0’s and 1’s) and add a second column containing the total count? I realize I would have to rename it somehow. My solution was just to rename ‘value’ to ‘total’ and then add a new column called ‘value’ manually, like df_cat['value'] = [0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1] … I’m sure there’s a better way but I’m blanking on it.
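
One way to keep the 0/1 column, as a minimal sketch: group on all three columns, call size() to count the rows in each group, then use reset_index(name='total') to turn the group keys back into columns and name the count column. The small DataFrame below is made up for illustration and is not the challenge data.

```python
import pandas as pd

# Made-up stand-in for the melted frame (not the real challenge data)
df_cat = pd.DataFrame({
    "cardio":   [0, 1, 1, 1, 0, 0],
    "variable": ["cholesterol"] * 4 + ["active"] * 2,
    "value":    [0, 1, 1, 0, 1, 1],
})

# size() counts rows per (cardio, variable, value) group;
# reset_index(name="total") restores the keys as columns and names the counts
df_counts = (
    df_cat.groupby(["cardio", "variable", "value"])
    .size()
    .reset_index(name="total")
)
print(df_counts)
```

The original ‘value’ column survives because it is one of the grouping keys, so nothing has to be typed in by hand.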

Honestly, I have been jumping around between curricula for the last 6 months, exploring. But this is a snippet of my solution.

cardio = pd.melt(cardio, id_vars=['cardio'])
cardio.sort_values(by='variable', inplace=True)

I have since forgotten how it worked and would need to open a notebook and test it out. If you would like, try playing around with the id_vars= part. And I believe I let the catplot do the “counting”.
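
A rough sketch of what letting the catplot do the counting could look like, assuming seaborn's catplot with kind="count" and the hue argument mentioned earlier in the thread. The tiny DataFrame and the choice of columns are made up for illustration; this is not necessarily the exact solution from the snippet above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Made-up wide data standing in for the challenge DataFrame
df = pd.DataFrame({
    "cardio": [0, 1, 1, 0],
    "smoke":  [0, 1, 0, 0],
    "alco":   [1, 0, 0, 1],
})

# Long format: one row per (cardio, variable, value) observation
df_cat = pd.melt(df, id_vars=["cardio"], value_vars=["alco", "smoke"])

# kind="count" makes seaborn count the rows itself, so no 'total' column is needed;
# hue splits the bars by the 0/1 value and col gives one panel per cardio value
g = sns.catplot(data=df_cat, x="variable", hue="value", col="cardio", kind="count")
g.set_axis_labels("variable", "total")
```

With this approach the groupby step is skipped entirely, since the plot works from the raw melted rows.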

I created two datasets, with and without cardio:
dfc1 = dg[dg['cardio'] == 0]  # DF with cardio = 0
dfc1 = dfc1.drop(['cardio'], axis=1)
dfc2 = dg[dg['cardio'] == 1]  # DF with cardio = 1
dfc2 = dfc2.drop(['cardio'], axis=1)

And another two datasets for plotting. It takes more lines, but I think it runs efficiently (time-wise).
variables = ['active', 'alco', 'cholesterol', 'gluc', 'overweight', 'smoke']
total0 = []  # cardio = 0, value = 0
total1 = []  # cardio = 0, value = 1
total2 = []  # cardio = 1, value = 0
total3 = []  # cardio = 1, value = 1
for i in variables:
    total0.append(dfc1[dfc1[i] == 0][i].size)
    total1.append(dfc1[dfc1[i] == 1][i].size)
    total2.append(dfc2[dfc2[i] == 0][i].size)
    total3.append(dfc2[dfc2[i] == 1][i].size)

c0 = pd.DataFrame({'variables': variables, '0': total0, '1': total1})
c0f = c0.melt('variables', var_name='value', value_name='total')
c1 = pd.DataFrame({'variables': variables, '0': total2, '1': total3})
c1f = c1.melt('variables', var_name='value', value_name='total')

c0f and c1f are final dataframes for plotting
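
If a single frame is handier for plotting, the two could be stacked with a column marking which one each row came from. This is only a sketch: the frames below are made-up stand-ins with the same column names as c0f and c1f, and the name of the added cardio column is my own choice.

```python
import pandas as pd

# Made-up frames shaped like c0f and c1f (the totals are not real counts)
c0f = pd.DataFrame({"variables": ["active", "active"], "value": ["0", "1"], "total": [10, 20]})
c1f = pd.DataFrame({"variables": ["active", "active"], "value": ["0", "1"], "total": [15, 25]})

# Tag each frame with its cardio group, then stack them into one long frame
combined = pd.concat(
    [c0f.assign(cardio=0), c1f.assign(cardio=1)],
    ignore_index=True,
)
print(combined)
```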
