Hey, if it works…
You are right: it is not necessary to create a new dataframe for each question. If you don’t know where to start, my suggestion would be to look into filtering of data. There are tons of ways to do this, but you can only use them well if you understand what is going on.
Some info to start with can be found here: Python : 10 Ways to Filter Pandas DataFrame (listendata.com)
You are using isin at multiple places in your code. This is (in my opinion) a very good way of filtering and writing short but readable code.
An example from your code:
#Creating a new df with age men data
data_men = data.loc[data["sex"] == "Male", ["sex", "age"]]
#Setting index of new df with "sex"
data_men = data_men.set_index(["sex"])
#Calculating avg(mean) age of the df and rounding to the nearest tenth
average_age_men = round(float(data_men.mean()), 1)
could be written as:
# What is the average age of men?
average_age_men = round(df[df["sex"] == "Male"].age.mean(), 1)
(Here ‘df’ is what you named ‘data’, so the dataframe loaded from the .csv.)
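To see what that one-liner does step by step, here is a small sketch you can run on its own. The four rows are made up by me purely for illustration (they are not your .csv), and the name ‘is_male’ is just something I picked:

```python
import pandas as pd

# Made-up sample data, only for illustration (not the real .csv)
df = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "age": [39, 28, 50, 37],
})

# Boolean mask: True for every row where sex is "Male"
is_male = df["sex"] == "Male"

# Keep only those rows, select the age column, take the mean, round to 1 decimal
average_age_men = round(df.loc[is_male, "age"].mean(), 1)
print(average_age_men)
```

The mask and the calculation are split over two lines here only to show the idea; in real code I would probably merge them again as in the one-liner above.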
As a (personal) rule of thumb, when I have to apply a function (like ‘.mean()’ or ‘round()’ in the above example), I try NOT to do it on a separate line. Even if that means there will be 2 or 3 functions after each other, the code is usually still readable, since the complicated part comes before the functions:
# Count total advanced educated
higher_education = df[(df["education"] == "Bachelors") | (df["education"] == "Masters") | (df["education"] == "Doctorate")].education.count()
The complicated part here is the df[(df[education.....Doctorate)] filter, not the .count() that comes after it, so it is not necessary to write a new line for it in this case.
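Since you already know isin, the same kind of count can probably be written a bit shorter with it. This is only a sketch on made-up data (the six rows and the list name ‘advanced’ are my own invention, not from your code), comparing the chained comparisons with the isin version:

```python
import pandas as pd

# Made-up sample data, only for illustration (not the real .csv)
df = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters",
                  "Doctorate", "Some-college", "Bachelors"],
})

# Hypothetical helper list with the levels counted as advanced education
advanced = ["Bachelors", "Masters", "Doctorate"]

# Version 1: three comparisons chained with | (or)
with_or = df[(df["education"] == "Bachelors")
             | (df["education"] == "Masters")
             | (df["education"] == "Doctorate")].education.count()

# Version 2: one isin() call with the list of accepted values
with_isin = df[df["education"].isin(advanced)].education.count()

print(with_or, with_isin)
```

Both versions select the same rows; which one reads better is a matter of taste, but the isin version stays short when the list of values grows.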
Sorry that I have no clear answer for your code specifically and did not rewrite it for you, but I hope you gained some insights and are able to search and experiment with filtering and calculations in dataframes with this (as I do too, every time I work on those nasty but really practical dataframes).