Much needed feedback | Demographic Data Analyzer

adria.lloreda · April 2, 2021, 8:08am

Hi mates!
Here’s my solution to the Demographic Data Analyzer project. Although the code works, I think it is a poor way to complete the project. I would like to improve my code, but I don’t know where to start.

Here’s my code

I suspect that it isn’t necessary to create a new df to answer each of the questions. If someone can take some time to read it and give me feedback, I would be so grateful.

Brain150 · April 2, 2021, 8:47am

Hey, if it works…

You are right: it is not necessary to create a new dataframe for each question. If you don’t know where to start, my suggestion would be to look into filtering data. There are tons of ways to do this, but you can only use them well once you understand what is going on.
Some info to start with can be found here: Python: 10 Ways to Filter Pandas DataFrame (listendata.com)
You are using isin in multiple places in your code. This is (in my opinion) a very good way of filtering and of writing short but readable code.
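To make the filtering idea a bit more concrete, here is a minimal sketch of four common ways to filter. The rows are made up; only the column names are meant to mirror the project's data, and the variable names are just for illustration.

```python
import pandas as pd

# Made-up sample rows; only the column names mirror the project's CSV.
df = pd.DataFrame({
    "age": [39, 50, 38, 28, 45],
    "sex": ["Male", "Male", "Female", "Female", "Male"],
    "education": ["Bachelors", "HS-grad", "Masters", "Doctorate", "Bachelors"],
    "salary": ["<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

# 1. Boolean mask: keep the rows where the condition is True.
men = df[df["sex"] == "Male"]

# 2. .loc: filter rows and pick a column in one step.
ages_of_men = df.loc[df["sex"] == "Male", "age"]

# 3. isin: test a column against several allowed values at once.
advanced = df[df["education"].isin(["Bachelors", "Masters", "Doctorate"])]

# 4. query: the same kind of condition written as a string.
rich = df.query("salary == '>50K'")

print(len(men), ages_of_men.tolist(), len(advanced), len(rich))
```

None of these changes the original dataframe; each one only returns the selected rows, so you can keep asking new questions of the same df.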

An example from your code:

    #Creating a new df with age men data
    data_men = data.loc[data["sex"] == "Male", ["sex", "age"]]

    #Setting index of new df with "sex"
    data_men = data_men.set_index(["sex"])

    #Calculating avg(mean) age of the df and rounding to the nearest tenth
    average_age_men = round(float(data_men.mean()), 1)

could be written as:

	# What is the average age of men?
	average_age_men = round(df[df["sex"] == "Male"].age.mean(), 1)

(Here ‘df’ is what you named ‘data’, i.e. the dataframe loaded from the .csv.)
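If you want to convince yourself that the short version gives the same number as the step-by-step one, a small check like the following should do. The rows are invented and `average_age_steps` is a name I made up for the comparison.

```python
import pandas as pd

# Made-up rows, only for illustration.
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Male", "Female"],
    "age": [39, 31, 50, 24, 42],
})

# Step by step: build a smaller dataframe first, then average its age column.
data_men = df.loc[df["sex"] == "Male", ["sex", "age"]]
average_age_steps = round(data_men["age"].mean(), 1)

# One line: filter, select the column and average in a single expression.
average_age_men = round(df[df["sex"] == "Male"].age.mean(), 1)

# Both approaches should agree.
print(average_age_steps == average_age_men)
```

Running both on a handful of rows is also an easy way to test a refactor before you delete the longer version.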

As a (personal) rule of thumb, when I have to apply a function (like ‘.mean()’ or ‘round()’ in the above example), I try NOT to do it on a separate line. Even if that means chaining 2 or 3 functions one after another, the code is usually still readable, since the complicated part sits in front of the functions:

    # Count total advanced educated
    higher_education = df[(df["education"] == "Bachelors") | (df["education"] == "Masters") | (df["education"] == "Doctorate")].education.count()

The complicated part here is in the ‘df[(df[education.....Doctorate)]’, not the ‘.count()’ that comes after it, so it is not necessary to write a new line for it in this case.
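Since isin came up earlier: the three chained comparisons can probably be collapsed into a single isin call. A sketch on made-up rows, where `levels`, `is_advanced` and `chained` are names I picked for the example:

```python
import pandas as pd

# Made-up rows, only for illustration.
df = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters",
                  "Doctorate", "Some-college", "Bachelors"],
})

# Chained comparisons, as in the example above.
chained = df[(df["education"] == "Bachelors")
             | (df["education"] == "Masters")
             | (df["education"] == "Doctorate")].education.count()

# Same filter with isin; `levels` and `is_advanced` are hypothetical names.
levels = ["Bachelors", "Masters", "Doctorate"]
is_advanced = df["education"].isin(levels)
higher_education = df[is_advanced].education.count()

# A boolean mask can also be summed directly: each True counts as 1.
print(chained, higher_education, is_advanced.sum())
```

Giving the mask a name is a reasonable middle ground: the filter stays on its own line where it is easy to think about, but no extra dataframe is created, and the same mask can be reused for later questions.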

Sorry that I don’t have a clear answer for your code specifically and haven’t rewritten it for you, but I hope you gained some insights and can now search and experiment with filtering and calculations in dataframes (as I do too every time I work on those nasty, but really practical dataframes).

adria.lloreda · April 2, 2021, 10:10am

Thanks, @Brain150 ! Of course, it works perfectly.

I wrote it on separate lines because it was “easier” for me to think about the different steps I had to complete to get the answer. But as you say, I should adopt your rule and try not to put each step on a separate line.

Thanks for your help! I will keep coding and improving my skills.
