Demographic Data Analyzer Help

I was able to solve the question, “What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?”

But my code was very “spelled out” and as ugly as can be. I know the same tactics won’t work for the later problems so can someone help me dry up my code and maybe give me some pandas suggestions and tips? Thank you!

bachelors_count = df['education'].value_counts()['Bachelors']
masters_count = df['education'].value_counts()['Masters']
doctorate_count = df['education'].value_counts()['Doctorate']

bachelors_rich = df.loc[(df['education'] == 'Bachelors') & (df['salary'] == '>50K')].count()[1]
masters_rich = df.loc[(df['education'] == 'Masters') & (df['salary'] == '>50K')].count()[1]
doctorate_rich = df.loc[(df['education'] == 'Doctorate') & (df['salary'] == '>50K')].count()[1]

higher_education = (bachelors_rich + masters_rich + doctorate_rich)/(bachelors_count + masters_count + doctorate_count)

A condition applied to a dataframe column creates a boolean series.
Selecting from a dataframe with a boolean series returns only the rows where the series is “True”.
You can use len() to get the number of rows in the result.

len(df[(df["name"] == "Bob") | (df["name"] == "bob")])
With a column called “name”, this will give you the number of entries where the name was either “Bob” or “bob”.
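
If you want to try this out, here is a minimal sketch on a made-up dataframe. The "name" column and its values are invented purely for illustration and are not part of the challenge data.

```python
import pandas as pd

# Made-up data, only for illustration
people = pd.DataFrame({"name": ["Bob", "alice", "bob", "Carol"]})

# A condition on a column gives a boolean series, one entry per row.
# Combine conditions with | (Python's "or" raises an error on a series).
is_bob = (people["name"] == "Bob") | (people["name"] == "bob")

# Indexing with that series keeps only the rows marked True
bobs = people[is_bob]
bob_count = len(bobs)
print(bob_count)
```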

The challenge needs some fairly lengthy code, so I’d advise creating the “higher_education” dataframe first and then selecting the “rich” subset from it.
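
A rough sketch of that two-step approach, using a tiny made-up dataframe in place of the challenge data. The rows are invented; only the column names and the '>50K' label are taken from the code in the question, and "percentage" is just a name picked for this example.

```python
import pandas as pd

# Tiny made-up stand-in for the challenge dataframe
df = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "Doctorate", "Bachelors"],
    "salary": [">50K", "<=50K", "<=50K", ">50K", "<=50K"],
})

# Step 1: everyone with advanced education
higher_education = df[
    (df["education"] == "Bachelors")
    | (df["education"] == "Masters")
    | (df["education"] == "Doctorate")
]

# Step 2: the subset of those people who earn more than 50K
higher_education_rich = higher_education[higher_education["salary"] == ">50K"]

# Percentage of the advanced-education group that is "rich", to one decimal
percentage = round(len(higher_education_rich) / len(higher_education) * 100, 1)
print(percentage)
```

Keeping the first selection in its own variable means the denominator and the "rich" subset are both built from the same rows.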

Thank you! I was able to clean up my code and got the solution with:

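# educated_count is not shown in this post; it is assumed to hold the number of people with a Bachelors, Masters, or Doctorate
higher_education_rich = (((len(df.loc[(df['salary'] == '>50K') &
    ((df['education'] == 'Bachelors') | (df['education'] == 'Masters') | (df['education'] == 'Doctorate'))]))
    / educated_count) * 100).round(1)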

I’m sure it could be even tighter, but that helped a lot.

Instead of len(), you can also tack .count() onto the end and it will give you the answer.

I find it ‘neater’. Also, I went with the ‘education-num’ field for simplicity’s sake.

df['education-num'].loc[(df['education-num'] == 13) | (df['education-num'] == 14) | (df['education-num'] == 16)].count()

lol I never even looked at that column. Yeah, this makes it a bit nicer.
Personally I like len() because it just gives an int, while .count() creates a series unless applied to a single column. len() is also shorter to write.
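
To make that difference concrete, a small sketch on made-up data (the values, including the missing salary, are invented for illustration):

```python
import pandas as pd

# Made-up data with one missing value, only for illustration
df = pd.DataFrame({
    "education-num": [13, 9, 14, 16],
    "salary": [">50K", "<=50K", None, ">50K"],
})

subset = df[
    (df["education-num"] == 13)
    | (df["education-num"] == 14)
    | (df["education-num"] == 16)
]

rows = len(subset)                             # plain int: number of rows
per_column = subset.count()                    # series: non-null count per column
one_column = subset["education-num"].count()   # single number for one column

print(rows)
print(per_column)
print(one_column)
```

One thing to watch: .count() skips missing values, so it can come out lower than len() when the column has gaps.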

But while we are making things look neat, .isin([array]) makes it REALLY neat:

df['education-num'].loc[df['education-num'].isin([13, 14, 16])].count()
len(df.loc[df['education-num'].isin([13, 14, 16])])
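
As a quick sanity check, this sketch compares the .isin() mask with the chained | version on a few made-up values (the numbers in the dataframe are invented):

```python
import pandas as pd

# Made-up values, only for illustration
df = pd.DataFrame({"education-num": [13, 9, 14, 10, 16, 13]})

# The long way: one comparison per value, joined with |
chained = (
    (df["education-num"] == 13)
    | (df["education-num"] == 14)
    | (df["education-num"] == 16)
)

# The short way: one membership test
with_isin = df["education-num"].isin([13, 14, 16])

# Both masks should mark exactly the same rows
same_mask = bool((chained == with_isin).all())
count = len(df.loc[with_isin])
print(same_mask, count)
```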

That is neat! Thanks for sharing.