Demographic Data Analyzer

@kitanikita Sorry to keep bothering you with my project questions. I have gone back to my second assignment, where I had trouble working out the final three values, and I am having a tough time rewriting it. I can also see a lot of errors coming up.

Can you please assist me? I am finding the three items below hard to work out:

  1. highest_earning_country
  2. highest_earning_country_percentage
  3. top_IN_occupation (most popular occupation in India)
import pandas as pd


def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv('adult.data.csv')
    
        
    # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df.groupby('race')['race'].count()

        # What is the average age of men?
    
    average_age_men = (df[df['sex'] == 'Male']['age'].mean()).round(1)

    # What is the percentage of people who have a Bachelor's degree?
    percentage_bachelors = round(len(df[df['education'] == 'Bachelors']) / len(df) * 100, 1)

    # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    # What percentage of people without advanced education make more than 50K?

    # with and without `Bachelors`, `Masters`, or `Doctorate`
    # Education columns with qualifications Y or N
    higher_education = df[df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
    lower_education = df[~df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]

    # percentage with salary >50K
    # round(number, ndigits)
    higher_education_rich = round(len(higher_education[higher_education['salary'] == '>50k']) / len(higher_education) * 100, 1)

    lower_education_rich = round(len(lower_education[lower_education['salary'] == '>50k']) / len(lower_education) * 100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    num_min_workers = len(df[df['hours-per-week'] == min_work_hours])

    rich_percentage = round(len(df[(df['hours-per-week'] == min_work_hours) & (df['salary'] == '>50K')]) / num_min_workers * 100, 1)

    # What country has the highest percentage of people that earn >50K?
    
    highest_earning_country = None
          
    highest_earning_country_percentage = None

    # Identify the most popular occupation for those who earn >50K in India.
    
    top_IN_occupation = None

It’s asking for the percentage of people with ‘salary’ >50K for every ‘native-country’. To get that percentage you need two pieces of data: the total count for each country, and the count of only the people with salary >50K from those same countries.
The “brutish” method would be to create a new DataFrame in which one column is the total count for each country and another column is the count of only the people with salary >50K, all using the country as the index.
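
A minimal sketch of that brute-force table, assuming a tiny made-up DataFrame in place of adult.data.csv. The countries A, B, C and the names `total`, `rich` and `summary` are hypothetical and only there for illustration:

```python
import pandas as pd

# Made-up stand-in for adult.data.csv (not real data).
df = pd.DataFrame({
    "native-country": ["A", "A", "A", "B", "B", "C"],
    "salary": [">50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# Column 1: total number of people per country.
total = df["native-country"].value_counts()

# Column 2: number of people per country with salary >50K.
rich = df[df["salary"] == ">50K"]["native-country"].value_counts()

# Both Series are aligned on the country index.
# A country with no >50K rows gets NaN in "rich", so fill it with 0.
summary = pd.DataFrame({"total": total, "rich": rich}).fillna(0)
summary["percentage"] = (summary["rich"] / summary["total"] * 100).round(1)

print(summary)
```

With the table in place, the country you are after is the index label of the largest value in the percentage column.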

Without directly giving you the answer, you might find using something like

df['native-country'].value_counts()

useful for this. Try it out and see what the output of this line is.

You can apply a condition on the salary column and then do the same as above (similar to what you’ve already done for average_age_men). Following this, you can do some division to get the percentages and use

something.idxmax()

to get the index (in this case the country) with the max value.


@kitanikita Just wanted to thank you for the hint

@kitanikita I keep getting "United States" as the highest_earning_country result, although I remember a post saying the correct answer is Iran. Can you please check my code below? @sgarz thank you for your valuable input.

Also, I assume the code for higher_education_rich, lower_education_rich and rich_percentage is okay? Is there any reason why I don’t get a value for those?

    # percentage with salary >50K
    # round(number, ndigits)
    higher_education_rich = round(len(higher_education[higher_education['salary'] == '>50k']) / len(higher_education) * 100, 1)

    lower_education_rich = round(len(lower_education[lower_education['salary'] == '>50k']) / len(lower_education) * 100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    num_min_workers = len(df[df['hours-per-week'] == min_work_hours])

    rich_percentage = round(len(df[(df['hours-per-week'] == min_work_hours) & (df['salary'] == '>50K')]) / num_min_workers * 100, 1)

    # What country has the highest percentage of people that earn >50K?

    highest_earning_country = (df[df['salary'] == '>50k']['native-country'].value_counts()/ df['native-country'].value_counts() * 100).sort_values(ascending=False).fillna(0).idxmax()

You should split it up into several lines, rather than having everything in one big line. That way you can test each step one at a time and pinpoint the issue.

The salary value '>50K' needs a capital letter K; that will give you the right answer. I think the other occurrences have the same issue.

Also, you don’t need .sort_values(ascending=False).fillna(0) at all here.
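
A quick way to convince yourself of this: idxmax skips NaN values by default, so sorting and filling first should not change which label comes back. The Series below is made-up data, not taken from the CSV, and `pct`, `plain` and `padded` are hypothetical names:

```python
import numpy as np
import pandas as pd

# Made-up percentages; B is NaN, as happens for a country
# that has no rows with salary >50K.
pct = pd.Series({"A": 25.0, "B": np.nan, "C": 40.0})

# idxmax ignores NaN by default (skipna=True).
plain = pct.idxmax()

# The longer chain from the post returns the same label.
padded = pct.sort_values(ascending=False).fillna(0).idxmax()

print(plain, padded)
```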


I didn’t catch the lowercase k at first. I was troubleshooting by evaluating the df[df['salary'] == '>50k']['native-country'].value_counts() part of your code on its own, and I noticed that it returned an empty Series. Then I realized that '>50k' should be written with an uppercase K to match the values in the CSV. I worked this out before reading kitanikita’s answer. It seems that was the only issue with your code; everything works fine now.
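
That kind of check can be sketched as follows, on a made-up miniature of the data. The rows are invented and only the column names come from the assignment. Printing the distinct labels first shows the exact spelling to compare against:

```python
import pandas as pd

# Made-up rows; only the column names match the assignment.
df = pd.DataFrame({
    "native-country": ["A", "A", "B", "B"],
    "salary": [">50K", "<=50K", ">50K", "<=50K"],
})

# Step 1: look at the exact labels stored in the column.
labels = df["salary"].unique()
print(labels)

# Step 2: evaluate the filtered part on its own.
wrong = df[df["salary"] == ">50k"]["native-country"].value_counts()  # lowercase k
right = df[df["salary"] == ">50K"]["native-country"].value_counts()  # uppercase K

# The lowercase comparison matches no rows, so its result is empty.
print(len(wrong), len(right))
```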

I’m also getting ‘United States’ as the country with highest earning. I guess the error is from them.

So is the highest earning country the US or Iran? People have also mentioned that they are getting Iran as their answer. I am also getting the US as the answer.

No; the correct answer is indeed “Iran”. We are looking for the country where the percentage of individuals earning >50K is the highest, not the raw number of them.
Take df[df['salary'] == '>50K']['native-country'].value_counts(), divide it by df['native-country'].value_counts(), and of course multiply by 100. Then idxmax returns the index of the first occurrence of the maximum, which here is the country.
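
One possible reason for seeing the US instead, though this is only a guess, is taking the largest count rather than the largest percentage. The sketch below uses invented row counts (10 rows for United-States, 2 for Iran) purely to show how the two can disagree. These are not the real figures from the CSV:

```python
import pandas as pd

# Invented counts: 3 of 10 earn >50K in one country, 1 of 2 in the other.
rows = (
    [("United-States", ">50K")] * 3
    + [("United-States", "<=50K")] * 7
    + [("Iran", ">50K")] * 1
    + [("Iran", "<=50K")] * 1
)
df = pd.DataFrame(rows, columns=["native-country", "salary"])

rich_counts = df[df["salary"] == ">50K"]["native-country"].value_counts()
totals = df["native-country"].value_counts()

# Largest raw count of >50K earners: the bigger country wins.
by_count = rich_counts.idxmax()

# Largest share of >50K earners: division aligns on the country index.
by_percentage = (rich_counts / totals * 100).idxmax()

print(by_count, by_percentage)
```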


Hi. This works for me, but I get a list ordered by country and without headings. How do I get the max? Thank you.

import pandas as pd


def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv('adult.data.csv')
    # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df['race'].value_counts()
    # What is the average age of men?
    average_age_men = round(df.loc[df['sex'] == 'Male', 'age'].mean(),1)

    # What is the percentage of people who have a Bachelor's degree?
    percentage_bachelors = round(len(df[df['education'] == 'Bachelors'])/len(df)*100, 1)

    # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    # What percentage of people without advanced education make more than 50K?

    # with and without `Bachelors`, `Masters`, or `Doctorate`
    higher_education = df[df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
    lower_education = df[~df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]

    # percentage with salary >50K
    higher_education_rich = round(len(higher_education[higher_education['salary'] == '>50K'])/len(higher_education)*100, 1)
    lower_education_rich = round(len(lower_education[lower_education['salary'] == '>50K'])/len(lower_education)*100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    num_min_workers = len(df[df['hours-per-week']== min_work_hours])

    rich_percentage = round(len(df[(df['hours-per-week'] == min_work_hours) & (df['salary'] == '>50K')]) / num_min_workers * 100, 1)

    # What country has the highest percentage of people that earn >50K?
    highest_earning_country = (df[df['salary'] == '>50K']['native-country'].value_counts()/ df['native-country'].value_counts() * 100).sort_values(ascending=False).fillna(0).idxmax()
    highest_earning_country_percentage = round(len(df[(df['native-country'] == highest_earning_country) & (df['salary'] == '>50K')]) / len(df[(df['native-country'] == highest_earning_country)])*100,1)
    # Identify the most popular occupation for those who earn >50K in India.
    top_IN_occupation = df[(df['salary'] == ">50K") & (df['native-country'] == "India")]["occupation"].value_counts().index[0]

    # DO NOT MODIFY BELOW THIS LINE

    if print_data:
        print("Number of each race:\n", race_count) 
        print("Average age of men:", average_age_men)
        print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
        print(f"Percentage with higher education that earn >50K: {higher_education_rich}%")
        print(f"Percentage without higher education that earn >50K: {lower_education_rich}%")
        print(f"Min work time: {min_work_hours} hours/week")
        print(f"Percentage of rich among those who work fewest hours: {rich_percentage}%")
        print("Country with highest percentage of rich:", highest_earning_country)
        print(f"Highest percentage of rich people in country: {highest_earning_country_percentage}%")
        print("Top occupations in India:", top_IN_occupation)

    return {
        'race_count': race_count,
        'average_age_men': average_age_men,
        'percentage_bachelors': percentage_bachelors,
        'higher_education_rich': higher_education_rich,
        'lower_education_rich': lower_education_rich,
        'min_work_hours': min_work_hours,
        'rich_percentage': rich_percentage,
        'highest_earning_country': highest_earning_country,
        'highest_earning_country_percentage':
        highest_earning_country_percentage,
        'top_IN_occupation': top_IN_occupation
    }

I’ve edited your post for readability. When you enter a code block into a forum post, please precede it with a separate line of three backticks and follow it with a separate line of three backticks to make it easier to read.

You can also use the “preformatted text” tool in the editor (</>) to add backticks around text.

See this post to find the backtick on your keyboard.
Note: Backticks (`) are not single quotes (’).