I have a CSV file that looks something like this:
age,sex,country,blue-eyes
23,male,US,yes
87,female,Germany,yes
11,male,Japan,no
…
I was asked to:
"Search for the percentage of Blue eyes for each Country"
The first thing I did was group by country to get the row count for each country (as a list, or perhaps a Series?):
import pandas as pd
df = pd.read_csv('people.csv')
group1 = df.groupby('country')
counter1 = group1['country'].count()
Then comes the messy part: I build a second counter for just the blue-eyed rows, then combine both counters into a new DataFrame, something like this:
only_blueeyes = df.loc[df['blue-eyes'] == 'yes']
group2 = only_blueeyes.groupby('country')
counter2 = group2['country'].count()
result_frame = pd.DataFrame({'Total': counter1, 'Only Blue Eyes': counter2})
result_frame['Percentage of Blue Eyes'] = (result_frame['Only Blue Eyes'] / result_frame['Total']) * 100
…
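For anyone who wants to try this without the file, here is a minimal self-contained sketch of the same steps. It assumes only the three sample rows shown at the top; the inline `csv_text` string and the `io.StringIO` wrapper are just stand-ins for people.csv, and the names `total` and `blue` are shortened versions of the counters above.

```python
import io

import pandas as pd

# Hypothetical stand-in for people.csv: only the three sample rows from the question.
csv_text = """age,sex,country,blue-eyes
23,male,US,yes
87,female,Germany,yes
11,male,Japan,no
"""
df = pd.read_csv(io.StringIO(csv_text))

# Total number of rows per country.
total = df.groupby('country')['country'].count()

# Number of blue-eyed rows per country (countries with none are absent here).
blue = df.loc[df['blue-eyes'] == 'yes'].groupby('country')['country'].count()

# Building the frame aligns both Series on the country index.
result_frame = pd.DataFrame({'Total': total, 'Only Blue Eyes': blue})
result_frame['Percentage of Blue Eyes'] = (
    result_frame['Only Blue Eyes'] / result_frame['Total']
) * 100

print(result_frame)
```

One thing this makes visible: a country with no blue-eyed rows (Japan in this sample) ends up with NaN rather than 0 in both the 'Only Blue Eyes' and 'Percentage of Blue Eyes' columns, because it is missing from the second counter.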
It does the job, but is there any way to simplify the whole operation and make it a little cleaner?
Edit: Corrected some variable names so they would match