Demographic Data Analyzer Help

isaiahnguyenvazquez3 · February 11, 2021, 7:59am

I was able to solve the question, “What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?”

But my code was very “spelled out” and as ugly as can be. I know the same tactics won’t work for the later problems so can someone help me dry up my code and maybe give me some pandas suggestions and tips? Thank you!

bachelors_count = df['education'].value_counts()['Bachelors']
masters_count = df['education'].value_counts()['Masters']
doctorate_count = df['education'].value_counts()['Doctorate']

bachelors_rich = df.loc[(df['education'] == 'Bachelors') & (df['salary'] == '>50K')].count()[1]
masters_rich = df.loc[(df['education'] == 'Masters') & (df['salary'] == '>50K')].count()[1]
doctorate_rich = df.loc[(df['education'] == 'Doctorate') & (df['salary'] == '>50K')].count()[1]

higher_education = (bachelors_rich + masters_rich + doctorate_rich)/(bachelors_count + masters_count + doctorate_count)

Jagaya · February 11, 2021, 11:58am

A condition applied to a dataframe column creates a series of booleans.
Selecting from a dataframe with a boolean series returns only the rows where the series is “True”.
You can use len() to count the rows in the result.

len( df[ (df.name == "Bob") | (df.name == "bob") ] )
Will give you the number of entries where the name was either “Bob” or “bob”.
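
As a minimal sketch of the boolean selection described above, here is the same idea on a made-up four-row dataframe (the data is hypothetical, purely for illustration). The conditions are combined with `|` because Python's `or` raises a ValueError when applied to whole series.

```python
import pandas as pd

# Hypothetical toy data; only the column name "name" comes from the example above.
df = pd.DataFrame({"name": ["Bob", "bob", "Alice", "Bob"]})

# Each comparison yields a boolean Series; "|" combines them element-wise.
mask = (df["name"] == "Bob") | (df["name"] == "bob")

# Indexing with the mask keeps only the rows where it is True.
matches = df[mask]

# len() of the filtered dataframe is the number of matching rows.
n_matches = len(matches)
print(n_matches)
```

Each comparison needs its own parentheses, since `|` binds more tightly than `==`.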

The challenge involves a fair amount of code, so I’d advise creating the “higher_education” selection first and then picking the “rich” subset out of it.
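
One way to read that advice, sketched on a small made-up dataframe: build the higher-education subset once, then filter it again for the high earners. The column names `education` and `salary` and the `'>50K'` label follow the original post; the rows themselves are invented.

```python
import pandas as pd

# Hypothetical sample rows; column names and labels follow the original post.
df = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "Doctorate", "Bachelors", "HS-grad"],
    "salary": [">50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"],
})

# Step 1: everyone with an advanced degree.
higher_education = df[
    (df["education"] == "Bachelors")
    | (df["education"] == "Masters")
    | (df["education"] == "Doctorate")
]

# Step 2: the high earners within that subset.
higher_education_rich = higher_education[higher_education["salary"] == ">50K"]

# Percentage of the advanced-degree group earning more than 50K.
percentage = round(len(higher_education_rich) / len(higher_education) * 100, 1)
print(percentage)
```

Keeping the intermediate `higher_education` around also makes the matching "lower education" question a one-line change.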

isaiahnguyenvazquez3 · February 14, 2021, 6:38am

Thank you! I was able to clean up my code and got the solution with:

higher_education_rich = (len(df.loc[
    (df['salary'] == '>50K')
    & ((df['education'] == 'Bachelors') | (df['education'] == 'Masters')
       | (df['education'] == 'Doctorate'))
]) / educated_count * 100).round(1)

I’m sure it could be even tighter, but that helped a lot.

deusexgarnica · February 18, 2021, 12:11am

Instead of len, you can also tack .count() onto the end and it will give you the answer.

I find it ‘neater’. Also, I went with the ‘education-num’ field for simplicity’s sake.

df['education-num'].loc[(df['education-num'] == 13) | (df['education-num'] == 14) | (df['education-num'] == 16)].count()

Jagaya · February 18, 2021, 10:27am

lol I never even looked at that column. Yeah this makes it a bit nicer.
Personally I like len() because it just gives an int, while .count() creates a series unless applied to a single column - also makes it shorter.
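
To illustrate the difference, here is a small sketch on invented data. It also shows a detail not raised in the thread: `.count()` skips missing values, so the two can disagree when a column contains NaN.

```python
import pandas as pd

# Hypothetical rows; one age is missing on purpose.
df = pd.DataFrame({
    "education-num": [13, 9, 14, 16],
    "age": [39, 50, 38, None],
})

subset = df[df["education-num"] >= 13]

rows = len(subset)                 # plain Python int: number of rows
per_column = subset.count()        # Series: non-missing count for each column
single = subset["age"].count()     # scalar: non-missing count for one column

print(rows)
print(per_column)
print(single)
```

Here `rows` counts all three selected rows, while the count for `age` only sees the two non-missing values.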

But while we are at improvements to look neat, .isin([array]) makes it REALLY neat:

df['education-num'].loc[df['education-num'].isin([13, 14, 16])].count()
# or
len(df.loc[df['education-num'].isin([13, 14, 16])])
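
Going one step further than the thread, a possible sketch of the full percentage using `.isin()`: the mean of a boolean series is the fraction of True values, which saves dividing two counts by hand. That shortcut is my own suggestion rather than something from the posts above, and the rows are invented; the codes 13, 14 and 16 are the ones used above.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the thread.
df = pd.DataFrame({
    "education-num": [13, 9, 14, 16, 13, 10],
    "salary": [">50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"],
})

# Boolean masks for the two conditions.
higher = df["education-num"].isin([13, 14, 16])
rich = df["salary"] == ">50K"

# Restrict the "rich" mask to the higher-education rows, then take its mean:
# True counts as 1 and False as 0, so the mean is the share of high earners.
percentage = round(rich[higher].mean() * 100, 1)
print(percentage)
```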

deusexgarnica · February 18, 2021, 4:28pm

That is neat! Thanks for sharing.
