Demographic Data Analyzer

@kitanikita Sorry to pound you with my project queries. I have gone back to my 2nd assignment, where I had difficulty evaluating the final three items. I am trying to rewrite them and having a tough time. I can also see a lot of errors coming up.

Can you please assist me? I find the three things below a bit hard to work out:

  1. highest_earning_country
  2. highest_earning_country_percentage
  3. top_IN_occupation (most popular occupation in India for those who earn >50K)

import pandas as pd


def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv('adult.data.csv')
    
        
    # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df.groupby('race')['race'].count()

    # What is the average age of men?
    average_age_men = (df[df['sex'] == 'Male']['age'].mean()).round(1)

    # What is the percentage of people who have a Bachelor's degree?
    percentage_bachelors = round(len(df[df['education'] == 'Bachelors']) / len(df) * 100, 1)

    # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    # What percentage of people without advanced education make more than 50K?

    # with and without `Bachelors`, `Masters`, or `Doctorate`
    # Education columns with qualifications Y or N
    higher_education = df[df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
    lower_education = df[~df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]

    # percentage with salary >50K
    # round(number, ndigits)
    higher_education_rich = round(len(higher_education[higher_education['salary'] == '>50k']) / len(higher_education) * 100, 1)

    lower_education_rich = round(len(lower_education[lower_education['salary'] == '>50k']) / len(lower_education) * 100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    num_min_workers = len(df[df['hours-per-week'] == min_work_hours])

    rich_percentage = round(len(df[(df['hours-per-week'] == min_work_hours) & (df['salary'] == '>50K')]) / num_min_workers * 100, 1)

    # What country has the highest percentage of people that earn >50K?
    
    highest_earning_country = None
          
    highest_earning_country_percentage = None

    # Identify the most popular occupation for those who earn >50K in India.
    
    top_IN_occupation = None

It’s asking for the percentage of people with ‘salary’ >50K for each ‘native-country’. To get the percentage you need two pieces of data: the total count for each country, and the count of only the people with salary >50K from those same countries.
The “brute-force” method would be to create a new DataFrame, indexed by country, in which one column holds the total count for each country and another holds the count of only the people with salary >50K.
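
A minimal sketch of that two-column approach, in case it helps. The rows below are made up for illustration (the real data comes from adult.data.csv), and the names summary, total, rich and percentage are my own, not from the assignment.

```python
import pandas as pd

# Made-up rows for illustration only; the real data comes from adult.data.csv.
df = pd.DataFrame({
    'native-country': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
    'salary': ['>50K', '<=50K', '<=50K', '<=50K', '>50K', '<=50K', '<=50K'],
})

# Column 'total': number of people per country.
# Column 'rich': number of people per country earning >50K.
# Both Series are indexed by country, so pandas aligns them row by row.
summary = pd.DataFrame({
    'total': df['native-country'].value_counts(),
    'rich': df[df['salary'] == '>50K']['native-country'].value_counts(),
})

# A country with nobody earning >50K shows up as NaN in 'rich'; treat it as 0.
summary['rich'] = summary['rich'].fillna(0)
summary['percentage'] = (summary['rich'] / summary['total'] * 100).round(1)

print(summary)
```

With the percentage column in place, picking the top country is a matter of finding the row with the largest value.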

Without directly giving you the answer, you might find using something like

df['native-country'].value_counts()

useful for this. Try it out and see what the output of this line is.

You can use a conditional statement on the salary column and do the same as the above (similar to what you’ve already done for average_age_men). Following this, you can do some division to get the percentages and

something.idxmax()

to get the index (in this case the country) with the max value.


@kitanikita Just wanted to thank you for the hint

@kitanikita I keep getting "United States" as the highest_earning_country result, although I remember one of the posts saying the correct answer is Iran. Can you please check my code below? @sgarz thank you for your valuable input.

Also, for higher_education_rich, lower_education_rich and rich_percentage: I assume their code is okay? Any reason why I don’t get a value for those?

    # percentage with salary >50K
    # round(number, ndigits)
    higher_education_rich = round(len(higher_education[higher_education['salary'] == '>50k']) / len(higher_education) * 100, 1)

    lower_education_rich = round(len(lower_education[lower_education['salary'] == '>50k']) / len(lower_education) * 100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    num_min_workers = len(df[df['hours-per-week'] == min_work_hours])

    rich_percentage = round(len(df[(df['hours-per-week'] == min_work_hours) & (df['salary'] == '>50K')]) / num_min_workers * 100, 1)

    # What country has the highest percentage of people that earn >50K?

    highest_earning_country = (df[df['salary'] == '>50k']['native-country'].value_counts()/ df['native-country'].value_counts() * 100).sort_values(ascending=False).fillna(0).idxmax()

You should split it up into several lines, rather than having everything in one big line. That way you can test each step one at a time and pinpoint the issue.

The salary >50K needs to have a capital letter K. That will give you the right answer. Same issue for the other occurrences, I think.

Also, you don’t need .sort_values(ascending=False).fillna(0) at all here.
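
To illustrate why those two calls can go: as far as I know, idxmax skips NaN by default and finds the maximum without any sorting. A small sketch with made-up percentages (the labels A, B and C are placeholders, not real countries):

```python
import pandas as pd

# Made-up percentages; 'C' is NaN, as happens when a country has no >50K rows.
percentages = pd.Series([25.0, 50.0, float('nan')], index=['A', 'B', 'C'])

# idxmax ignores NaN by default and returns the label of the largest value.
direct = percentages.idxmax()

# The longer chain from the post gives the same label.
chained = percentages.sort_values(ascending=False).fillna(0).idxmax()

print(direct, chained)
```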


I didn’t catch the lowercase k at first, so I started troubleshooting by evaluating just the df[df['salary'] == '>50k']['native-country'].value_counts() part of your code, and I noticed that it was returning an empty Series. Then I realized that '>50k' should be written with an uppercase K to match the values in the CSV. All this was before reading kitanikita’s answer. It seems that was the only issue with your code; everything works fine now.
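
For anyone hitting the same thing, here is roughly how that check looks when done step by step. This is only a sketch on made-up rows; the variable names wrong and right are mine.

```python
import pandas as pd

# Made-up rows for illustration; in the thread the data comes from adult.data.csv.
df = pd.DataFrame({
    'native-country': ['A', 'A', 'B'],
    'salary': ['>50K', '<=50K', '>50K'],
})

# Step 1: look at the labels actually stored in the column.
print(df['salary'].unique())

# Step 2: evaluate the filter on its own before chaining anything onto it.
wrong = df[df['salary'] == '>50k']['native-country'].value_counts()  # lowercase k
right = df[df['salary'] == '>50K']['native-country'].value_counts()  # uppercase K

print(wrong.empty)  # True: no row matches the lowercase label
print(right.empty)  # False
```

String comparison in pandas is case-sensitive, so the lowercase version silently matches nothing instead of raising an error.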

I’m also getting ‘United States’ as the country with highest earning. I guess the error is from them.

So is the highest earning country the US or Iran? People have also mentioned that they are getting Iran as their answer. I am also getting US as the answer.

No, the correct answer is indeed “Iran”. We are looking for the country where the percentage of individuals earning >50K is the highest, not the raw number.
Take df[df['salary'] == '>50K']['native-country'].value_counts(), divide it by df['native-country'].value_counts(), and multiply by 100. Then use idxmax, which returns the index of the first occurrence of the maximum over the requested axis.
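
A small sketch of the difference between the highest count and the highest percentage, which may be where the "United States" answers come from. The rows are made up (country A is large, country B is small), and the variable names are my own.

```python
import pandas as pd

# Made-up data: 'A' has 6 people (2 earn >50K), 'B' has 2 people (1 earns >50K).
df = pd.DataFrame({
    'native-country': ['A'] * 6 + ['B'] * 2,
    'salary': ['>50K', '>50K', '<=50K', '<=50K', '<=50K', '<=50K',
               '>50K', '<=50K'],
})

rich_counts = df[df['salary'] == '>50K']['native-country'].value_counts()
totals = df['native-country'].value_counts()

# Dividing two Series aligns them on the country index.
percentages = rich_counts / totals * 100

most_rich_people = rich_counts.idxmax()        # largest raw count
highest_percentage = percentages.idxmax()      # largest share
highest_value = round(percentages.max(), 1)

print(most_rich_people, highest_percentage, highest_value)
```

Here the large country wins on raw count while the small one wins on percentage, and the assignment asks for the percentage.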