More on groupby

Simple grouping and aggregation you already saw in an earlier chapter. In this lesson, we will apply groupby to solve more complex problems.

Imputation of missing values

sampviously, you learned how to address missing "age" value by substituting it with the mean or median age. While using the overall mean or median is a simple approach, a more refined method involves grouping passengers by their passenger class (p_class) and then calculating the mean age for each class to impute missing ages within those respective groups.

Let us first justify why a different mean is appropriate by computing the mean for each group

df.groupby('pclass')['age'].mean()

 age
pclass

1 38.233441
2 29.877630
3 25.140620

As you can see from the result, first class passengers had a much higher mean age than the others.

Now let us appripriately impute the missing 'age' value with the mean of that specific pclass. But first let us see a few NaN values for pclass '1'. First few values are displaced

df[(df.pclass == 1) & df.age.isna()]['age']

  age
31 NaN
55 NaN
64 NaN
166 NaN

Let us also check a few rows of pclass '2' and '3' as well

df[(df.pclass == 2) & df.age.isna()]['age'].head(2)

  age
17 NaN
181 NaN

df[(df.pclass == 3) & df.age.isna()]['age'].head(3)

  age
5 NaN
19 NaN
26 NaN

Now we will replace NaN's with mean age for that pclass group using transform function

df['age'] = df.groupby('pclass')['age'].transform(lambda x: x.fillna(x.mean()))

Let us verify the values for the ones that are displayed above

df.iloc[[31, 55, 64, 17, 5]][['age', 'pclass']]

  age   pclass
31 38.233441 1
55 38.233441 1
64 38.233441 1
17 29.877630 2
5 25.140620 3

As you can see from above, the mean values for the respective pclass are used for imputing

Applying multiple aggregate functions

Let us say we want to find the mean, the median and also the sum of fare in each pclass. You could do all three using the aggregate function

df.groupby('pclass')['fare'].aggregate(['sum', 'mean', 'median'])

  sum   mean   median
pclass

1 18177.4125 84.154687 60.2875
2 3801.8417 20.662183 14.2500
3 6714.6951 13.675550 8.0500

Filtering on groups

You can also apply filters on the group to get only those groups that match the filter.

Suppose you want to find the deck's where the fare is more than 20 for all passengers in that deck category, you could run a query like below

df.groupby('deck').filter(lambda x: all(x.fare > 20))['deck'].unique()

array(['C'], dtype=object)

It turns out only deck 'C' is where the fare is more than 20 for all passengers.

results matching ""

    No results matching ""