Demographic Data Analyzer

rajgorakshay · August 8, 2020, 4:57am

@kitanikita Sorry to pound you with my project queries… I have gone back to my 2nd assignment, where I had difficulty evaluating the final three items. I am trying to rewrite them and having a tough time. Also, I can see there are a lot of errors coming up too…

Can you please assist me, as I find the three things below a bit hard to work out:

highest_earning_country
highest_earning_country_percentage
Top IN occupation (popular occupation in India)

import pandas as pd


def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv('adult.data.csv')
    
        
    # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df.groupby('race')['race'].count()

    # What is the average age of men?
    average_age_men = (df[df['sex'] == 'Male']['age'].mean()).round(1)

    # What is the percentage of people who have a Bachelor's degree?
    percentage_bachelors = round(len(df[df['education'] == 'Bachelors']) / len(df) * 100, 1)

    # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    # What percentage of people without advanced education make more than 50K?

    # with and without `Bachelors`, `Masters`, or `Doctorate`
    # Education columns with qualifications Y or N
    higher_education = df[df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
    lower_education = df[~df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]

    # percentage with salary >50K
    # round(number, ndigits)
    higher_education_rich = round(len(higher_education[higher_education['salary'] == '>50k']) / len(higher_education) * 100, 1)

    lower_education_rich = round(len(lower_education[lower_education['salary'] == '>50k']) / len(lower_education) * 100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    num_min_workers = len(df[df['hours-per-week'] == min_work_hours])

    rich_percentage = round(len(df[(df['hours-per-week'] == min_work_hours) & (df['salary'] == '>50K')]) / num_min_workers * 100, 1)

    # What country has the highest percentage of people that earn >50K?
    
    highest_earning_country = None
          
    highest_earning_country_percentage = None

    # Identify the most popular occupation for those who earn >50K in India.
    
    top_IN_occupation = None

sgarz · August 8, 2020, 7:30am

It’s asking for the percentage of people with ‘salary’ >50K for each ‘native-country’. To get the percentage you need two pieces of data: the total count for each country, and the count of only the people with salary >50K from those same countries.
The “brutish” method would be to create a new DataFrame in which one column is the total count for each country and another column is the count of only the people with salary >50K, both using the country as the index, something like this
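
A minimal sketch of that two-column approach, using a small made-up table in place of adult.data.csv. The toy rows and the names `counts`, `total`, `rich` and `rich_pct` are illustrative assumptions, not taken from the original post:

```python
import pandas as pd

# Toy stand-in for adult.data.csv (made-up rows, only the two relevant columns).
df = pd.DataFrame({
    'native-country': ['A', 'A', 'A', 'A', 'B', 'B', 'C'],
    'salary': ['>50K', '<=50K', '<=50K', '<=50K', '>50K', '<=50K', '<=50K'],
})

# One row per country: everyone, and only those earning >50K.
# Both series are indexed by country, so pandas aligns them automatically.
counts = pd.DataFrame({
    'total': df.groupby('native-country').size(),
    'rich': df[df['salary'] == '>50K'].groupby('native-country').size(),
})

# A country with no >50K earners is absent from the second series (NaN here),
# so treat it as a count of zero.
counts['rich'] = counts['rich'].fillna(0)
counts['rich_pct'] = counts['rich'] / counts['total'] * 100

print(counts)
print(counts['rich_pct'].idxmax())
```

Keeping both counts side by side makes it easy to eyeball whether the numerator and denominator look sensible before trusting the percentage.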

kitanikita · August 8, 2020, 12:19pm

Without directly giving you the answer, you might find using something like

df['native-country'].value_counts()

useful for this. Try it out and see what the output of this line is.

You can use a conditional statement on the salary column and then do the same as above (similar to what you’ve already done for average_age_men). Following this, you can do some division to get the percentages and

something.idxmax()

to get the index (in this case the country) with the max value.
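
On a toy table, the steps hinted at above might look like the following sketch. The rows are made up, and the variable names (`total_per_country`, `rich_per_country`, `rich_pct`) are illustrative assumptions rather than anything required by the project:

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    'native-country': ['X', 'X', 'X', 'Y', 'Y', 'Z', 'Z', 'Z', 'Z'],
    'salary': ['>50K', '<=50K', '<=50K', '>50K', '<=50K',
               '>50K', '>50K', '>50K', '<=50K'],
})

# Step 1: how many people per country.
total_per_country = df['native-country'].value_counts()

# Step 2: the same count, restricted to rows where salary is '>50K'.
rich_per_country = df[df['salary'] == '>50K']['native-country'].value_counts()

# Step 3: element-wise division; pandas matches the two series by country label.
rich_pct = rich_per_country / total_per_country * 100

# Step 4: idxmax() gives the label of the largest value, max() the value itself.
best_country = rich_pct.idxmax()
best_pct = round(rich_pct.max(), 1)

print(best_country, best_pct)
```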

ratid · August 8, 2020, 6:27pm

@kitanikita Just wanted to thank you for the hint

rajgorakshay · August 8, 2020, 9:47pm

@kitanikita I keep getting the highest_earning_country result as “United States”, although I remember one of the posts saying the correct answer is Iran. Can you please check my code below? @sgarz thank you for your valuable input.

Also, for higher_education_rich, lower_education_rich and rich_percentage, I assume the code is okay? Any reason why I don’t get a value for those?

    # percentage with salary >50K
    # round(number, ndigits)
    higher_education_rich = round(len(higher_education[higher_education['salary'] == '>50k']) / len(higher_education) * 100, 1)

    lower_education_rich = round(len(lower_education[lower_education['salary'] == '>50k']) / len(lower_education) * 100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    num_min_workers = len(df[df['hours-per-week'] == min_work_hours])

    rich_percentage = round(len(df[(df['hours-per-week'] == min_work_hours) & (df['salary'] == '>50K')]) / num_min_workers * 100, 1)

    # What country has the highest percentage of people that earn >50K?

    highest_earning_country = (df[df['salary'] == '>50k']['native-country'].value_counts()/ df['native-country'].value_counts() * 100).sort_values(ascending=False).fillna(0).idxmax()

kitanikita · August 9, 2020, 12:39am

You should split it up into several lines, rather than having everything in one big line. That way you can test each step one at a time and pinpoint the issue.

The salary value '>50K' needs to have a capital letter K; that will give you the right answer. Same issue for the other occurrences, I think.

Also, you don’t need .sort_values(ascending=False).fillna(0) at all here.
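
As a sketch of what testing one step at a time can look like, the snippet below uses made-up rows and illustrative variable names (`labels`, `wrong`, `ratio`, `right`) to show how the lowercase 'k' goes unnoticed when everything is chained together:

```python
import pandas as pd

# Made-up rows; the real file uses the same '>50K' / '<=50K' labels.
df = pd.DataFrame({
    'native-country': ['A', 'A', 'B', 'B', 'B'],
    'salary': ['>50K', '<=50K', '>50K', '>50K', '<=50K'],
})

# Step 1: look at the labels the column actually contains.
labels = df['salary'].unique()
print(labels)

# Step 2: the lowercase spelling matches no rows, so the counts are empty.
wrong = df[df['salary'] == '>50k']['native-country'].value_counts()
print(len(wrong))

# Step 3: dividing an empty series by the totals gives NaN for every country.
# A trailing .fillna(0) would turn these into zeros and hide the problem.
ratio = wrong / df['native-country'].value_counts() * 100
print(ratio)

# Step 4: with the capital K the filter finds the expected rows.
right = df[df['salary'] == '>50K']['native-country'].value_counts()
print(right)
```

Printing each intermediate result makes the empty series in step 2 obvious, which is much harder to spot inside a single long expression.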

sgarz · August 9, 2020, 5:07am

I didn’t catch the lowercase k at first. I was troubleshooting by evaluating the df[df['salary'] == '>50k']['native-country'].value_counts() part of your code on its own, and I noticed that it was returning an empty Series. Then I realized that '>50k' should be written with an uppercase K to match the values in the CSV. All this happened before I read kitanikita’s answer. It seems that was the only issue with your code; everything works fine now.

Rasheed · August 18, 2020, 6:06am

I’m also getting ‘United States’ as the country with highest earning. I guess the error is from them.

Sabretooth · September 15, 2020, 8:00am

So is the highest earning country the US or Iran? People have also mentioned that they are getting Iran as their answer. I am also getting the US as the answer.

rajgorakshay · September 15, 2020, 9:49pm

No; the correct answer is indeed “Iran”. We are looking for the country where the percentage of individuals earning >50K is the highest.
Take df[df['salary'] == '>50K']['native-country'].value_counts(), divide it by df['native-country'].value_counts(), and of course multiply by 100. Then call idxmax() on the result, which returns the index label of the first occurrence of the maximum.
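
The country with the most >50K earners and the country with the highest share of them are not necessarily the same. A small sketch on made-up data (the country names 'Big' and 'Small' and all variable names are hypothetical) shows the two answers diverging:

```python
import pandas as pd

# Made-up data: 'Big' has 3 of 10 people earning >50K, 'Small' has 1 of 2.
df = pd.DataFrame({
    'native-country': ['Big'] * 10 + ['Small'] * 2,
    'salary': ['>50K'] * 3 + ['<=50K'] * 7 + ['>50K'] + ['<=50K'],
})

rich = df[df['salary'] == '>50K']['native-country'].value_counts()
total = df['native-country'].value_counts()

# Largest raw count of >50K earners.
most_rich_people = rich.idxmax()

# Largest percentage of >50K earners, which is what the project asks for.
highest_share = (rich / total * 100).idxmax()

print(most_rich_people, highest_share)
```

Calling idxmax() on the raw counts instead of on the percentages favours countries with many rows in the dataset.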

mearistizabal · April 2, 2021, 6:23pm

Hi. This works for me, but I get a list ordered by country and without headings. How do I get the max? Thank you.

Giri12 · April 23, 2021, 2:46pm

import pandas as pd


def calculate_demographic_data(print_data=True):
    # Read data from file
    df = pd.read_csv('adult.data.csv')
    # How many of each race are represented in this dataset? This should be a Pandas series with race names as the index labels.
    race_count = df['race'].value_counts()
    # What is the average age of men?
    average_age_men = round(df.loc[df['sex'] == 'Male', 'age'].mean(),1)

    # What is the percentage of people who have a Bachelor's degree?
    percentage_bachelors = round(len(df[df['education'] == 'Bachelors'])/len(df)*100, 1)

    # What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
    # What percentage of people without advanced education make more than 50K?

    # with and without `Bachelors`, `Masters`, or `Doctorate`
    higher_education = df[df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]
    lower_education = df[~df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])]

    # percentage with salary >50K
    higher_education_rich = round(len(higher_education[higher_education['salary'] == '>50K'])/len(higher_education)*100, 1)
    lower_education_rich = round(len(lower_education[lower_education['salary'] == '>50K'])/len(lower_education)*100, 1)

    # What is the minimum number of hours a person works per week (hours-per-week feature)?
    min_work_hours = df['hours-per-week'].min()

    # What percentage of the people who work the minimum number of hours per week have a salary of >50K?
    num_min_workers = len(df[df['hours-per-week']== min_work_hours])

    rich_percentage = round(len(df[(df['hours-per-week'] == min_work_hours) & (df['salary'] == '>50K')]) / num_min_workers * 100, 1)

    # What country has the highest percentage of people that earn >50K?
    highest_earning_country = (df[df['salary'] == '>50K']['native-country'].value_counts()/ df['native-country'].value_counts() * 100).sort_values(ascending=False).fillna(0).idxmax()
    highest_earning_country_percentage = round(len(df[(df['native-country'] == highest_earning_country) & (df['salary'] == '>50K')]) / len(df[(df['native-country'] == highest_earning_country)])*100,1)
    # Identify the most popular occupation for those who earn >50K in India.
    top_IN_occupation = df[(df['salary'] == ">50K") & (df['native-country'] == "India")]["occupation"].value_counts().index[0]

    # DO NOT MODIFY BELOW THIS LINE

    if print_data:
        print("Number of each race:\n", race_count) 
        print("Average age of men:", average_age_men)
        print(f"Percentage with Bachelors degrees: {percentage_bachelors}%")
        print(f"Percentage with higher education that earn >50K: {higher_education_rich}%")
        print(f"Percentage without higher education that earn >50K: {lower_education_rich}%")
        print(f"Min work time: {min_work_hours} hours/week")
        print(f"Percentage of rich among those who work fewest hours: {rich_percentage}%")
        print("Country with highest percentage of rich:", highest_earning_country)
        print(f"Highest percentage of rich people in country: {highest_earning_country_percentage}%")
        print("Top occupations in India:", top_IN_occupation)

    return {
        'race_count': race_count,
        'average_age_men': average_age_men,
        'percentage_bachelors': percentage_bachelors,
        'higher_education_rich': higher_education_rich,
        'lower_education_rich': lower_education_rich,
        'min_work_hours': min_work_hours,
        'rich_percentage': rich_percentage,
        'highest_earning_country': highest_earning_country,
        'highest_earning_country_percentage':
        highest_earning_country_percentage,
        'top_IN_occupation': top_IN_occupation
    }

lasjorg · April 23, 2021, 2:50pm

I’ve edited your post for readability. When you enter a code block into a forum post, please precede it with a separate line of three backticks and follow it with a separate line of three backticks to make it easier to read.

You can also use the “preformatted text” tool in the editor (</>) to add backticks around text.

See this post to find the backtick on your keyboard.
Note: Backticks (`) are not single quotes (’).
