Queries
So far you have learned some fundamental operations on DataFrames. DataFrames are powerful largely because they let you run complex queries with an almost SQL-like syntax. In this lesson you will learn some of these queries.
In the previous lesson you loaded the titanic dataset and derived summary statistics. You will use the same data to derive more insight.
Simple Queries
Get Unique Values
Are you curious about where all the passengers embarked from? Here is the query to find out:
print(titanic_df['embark_town'].unique())
Output:
['Southampton' 'Cherbourg' 'Queenstown' nan]
To run this query we invoked the unique() function on the 'embark_town' column. Note that titanic_df['embark_town'] is a Series object; invoking unique() on the Series returns only the unique values in the 'embark_town' column. Notice that there is also a NaN value, which stands for "Not a Number" and is how pandas represents missing data.
There is another way of achieving the same result. You can use the dot operator on the DataFrame with the column name, instead of enclosing the column name within square brackets. Here is the revised code:
print(titanic_df.embark_town.unique())
Output:
['Southampton' 'Cherbourg' 'Queenstown' nan]
This also gives us the same result.
Get Unique Count
Using nunique() gives you the count of unique values, ignoring NaN values. So the statement below returns a value of 3:
titanic_df.embark_town.nunique()
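To see the difference between unique() and nunique() in isolation, here is a minimal sketch using a small hand-built Series (the toy data stands in for the actual titanic column):

```python
import numpy as np
import pandas as pd

# Toy column standing in for titanic_df['embark_town']
towns = pd.Series(['Southampton', 'Cherbourg', 'Queenstown',
                   'Southampton', np.nan])

print(towns.unique())       # ['Southampton' 'Cherbourg' 'Queenstown' nan]
print(len(towns.unique()))  # 4 -- NaN is counted
print(towns.nunique())      # 3 -- NaN is ignored
```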
Get Record Using Min/Max of a column
If you want to get the complete record of the passenger who paid the maximum fare, you could run a query like below:
titanic_df.loc[titanic_df.fare.idxmax()]
idxmax() is the equivalent of argmax in NumPy and gives you the index label of the row that contains the maximum value of the given column; idxmin() does the same for the minimum.
In the above example you could have also used iloc instead of loc, since this DataFrame has the default numerical index (with a custom index, idxmax() returns a label, and only loc would work).
If you want to get the complete records of both the maximum and minimum fare passengers you could do so like:
titanic_df.loc[[titanic_df.fare.idxmax(), titanic_df.fare.idxmin()]]
Notice the double square brackets. The inner brackets hold the list of row indexes; by leaving out the column indexes, you get all the columns.
But if you only want to see certain columns then you could add the column information as shown below:
titanic_df.loc[[titanic_df.fare.idxmax(), titanic_df.fare.idxmin()], 'fare']
This gets you two rows of only the 'fare' column. You can see multiple columns by passing the column names as a list, as shown below:
titanic_df.loc[[titanic_df.fare.idxmax(), titanic_df.fare.idxmin()], ['pclass', 'who', 'fare']]
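As a self-contained sketch, here is the same loc/idxmax pattern on a tiny hand-built DataFrame (the data and column values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'who': ['child', 'man', 'woman'],
                   'fare': [7.25, 512.33, 30.0]})

# Full record of the passenger who paid the maximum fare
print(df.loc[df.fare.idxmax()])

# Both extremes, restricted to a couple of columns
extremes = df.loc[[df.fare.idxmax(), df.fare.idxmin()], ['who', 'fare']]
print(extremes)
```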
Filter Queries
You can filter columns, rows, or values in a column using a number of techniques; here you will see a few.
Filter Columns
Select all columns that start with 's' and get a new dataframe of the selected columns.
df_new = df.filter(regex='^s', axis=1)
Changing to axis=0 applies the same regex to the row labels instead.
You can also use 'like' to match a partial string, as shown below:
df.filter(like='s', axis=1)
The above query returns a DataFrame with all columns whose names contain 's'.
Select all columns that are of a specific dtype
float_type_columns = df.select_dtypes(include='float64')
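Here is a minimal sketch of all three column-filtering approaches on a toy DataFrame (the column names mimic a few titanic columns):

```python
import pandas as pd

df = pd.DataFrame({'sex': ['male'], 'sibsp': [0], 'fare': [7.25]})

print(df.filter(regex='^s', axis=1).columns.tolist())        # ['sex', 'sibsp']
print(df.filter(like='s', axis=1).columns.tolist())          # ['sex', 'sibsp']
print(df.select_dtypes(include='float64').columns.tolist())  # ['fare']
```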
Filter Numerical Values - Range
If you want to get only the passengers aged between 22 and 24, you could run a query like the one below:
titanic_df[titanic_df.age.between(22, 24)]
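Note that between() is inclusive on both ends by default (this can be changed with its inclusive parameter). A quick sketch on a toy Series:

```python
import pandas as pd

ages = pd.Series([21, 22, 23, 24, 25])
# The bounds 22 and 24 themselves are included in the range
print(ages.between(22, 24).tolist())  # [False, True, True, True, False]
```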
Filter String Values
If you want to filter based on a specific value in a column, you can use expressions like those shown below:
titanic_df[titanic_df.embark_town == 'Cherbourg'] # gets all records where the value equals 'Cherbourg'
titanic_df[titanic_df.embark_town != 'Cherbourg'] # gets all records where the value does not equal 'Cherbourg'
You can use the contains() function on the str attribute to match full or partial occurrences of the given pattern.
titanic_df[titanic_df.embark_town.str.contains('amp')] # gets all records that contain 'amp' anywhere in the string value
titanic_df[~titanic_df.embark_town.str.contains('amp')] # gets all records that do not contain 'amp' anywhere in the string value
Note however that the above expressions will fail if the column contains NaN values. Ensure all the NaN values are imputed before running the above statements.
If you want to match a regex pattern, you can still use contains() on str by passing in the regex pattern:
titanic_df[titanic_df.embark_town.str.contains(r'ampt[h-p]n')]
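As an alternative to imputing NaN values first, contains() accepts an na parameter that controls how missing values are treated. A minimal sketch with toy data:

```python
import pandas as pd

towns = pd.Series(['Southampton', 'Cherbourg', None])

# na=False treats missing values as non-matches instead of
# producing NaN (which would break boolean indexing)
mask = towns.str.contains('amp', na=False)
print(mask.tolist())         # [True, False, False]
print(towns[mask].tolist())  # ['Southampton']
```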
Filter String Values using List
If you want to filter values in a categorical column, you could use the isin function on the column as shown below:
titanic_df[titanic_df.embark_town.isin(['Southampton', 'Cherbourg'])]
The above statement returns a DataFrame containing only the records of passengers who embarked from 'Southampton' or 'Cherbourg'.
Filter String Values by Exclusion using List
If you want to exclude the rows that belong to a particular set of values, you can apply the '~' operator to the above expression, as shown below:
titanic_df[~titanic_df.embark_town.isin(['Southampton', 'Cherbourg'])]
The above statement gets you all records that do not belong to 'Southampton' or 'Cherbourg'.
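Here is a compact sketch of isin() and its negation on a toy column (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'town': ['Southampton', 'Cherbourg', 'Queenstown']})

keep = df[df.town.isin(['Southampton', 'Cherbourg'])]
drop = df[~df.town.isin(['Southampton', 'Cherbourg'])]

print(keep.town.tolist())  # ['Southampton', 'Cherbourg']
print(drop.town.tolist())  # ['Queenstown']
```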
Using Logical AND Operators
In the last lesson you learnt that the youngest traveller was 0.42 years old. Are you curious to find out how many children below the age of 1 survived the tragedy? Let us build a query to find out:
df = titanic_df[
(titanic_df.age < 1)
& (titanic_df.survived == 1)
]
print(df)
Output:
     survived  pclass     sex   age  sibsp  parch      fare embarked   class  \
78          1       2    male  0.83      0      2   29.0000        S  Second
305         1       1    male  0.92      1      2  151.5500        S   First
469         1       3  female  0.75      2      1   19.2583        C   Third
644         1       3  female  0.75      2      1   19.2583        C   Third
755         1       2    male  0.67      1      1   14.5000        S  Second
803         1       3    male  0.42      0      1    8.5167        C   Third
831         1       2    male  0.83      1      1   18.7500        S  Second

       who  adult_male deck  embark_town alive  alone
78   child       False  NaN  Southampton   yes  False
305  child       False    C  Southampton   yes  False
469  child       False  NaN    Cherbourg   yes  False
644  child       False  NaN    Cherbourg   yes  False
755  child       False  NaN  Southampton   yes  False
803  child       False  NaN    Cherbourg   yes  False
831  child       False  NaN  Southampton   yes  False
The above query is quite intuitive: you chain the necessary conditions with the & operator. Since we want all children below one year, we start with the condition titanic_df.age < 1, and then add the next condition, titanic_df.survived == 1, to find out how many survived. We combine the two conditions with the AND operator &. The result shows that a total of 7 children survived the tragedy in this sample. Also note that the result of this query is itself a DataFrame, so you can run describe() or info() on it to find more information.
You can chain as many such conditions as you need; there is no limit.
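One detail worth calling out: the parentheses around each condition are required, because & binds more tightly than comparison operators in Python. A minimal sketch with toy data (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'age': [0.5, 0.9, 2.0], 'survived': [1, 0, 1]})

# Parentheses are mandatory: `df.age < 1 & df.survived == 1`
# would be parsed as `df.age < (1 & df.survived) == 1`
kids = df[(df.age < 1) & (df.survived == 1)]
print(len(kids))  # 1
```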
Complex Logical OR/AND Operations
Now let us find out how many of the survivors were either children under the age of 1 or seniors over the age of 60. Let us build a query to do that. But this time we do not want to see the full results, only the total count of the rows returned.
total_rows = titanic_df[
((titanic_df.age < 1) | (titanic_df.age > 60))
& (titanic_df.survived == 1)
].shape[0]
print(total_rows)
Output:
12
In this query we joined the two separate age conditions with the OR operator using the | symbol, then combined the result with the survived condition from the previous example using the AND operator &.
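pandas also offers a query() method that expresses the same filter as a string, where you can write and/or instead of &/|. A sketch on toy data (values illustrative):

```python
import pandas as pd

df = pd.DataFrame({'age': [0.5, 30, 70, 65],
                   'survived': [1, 1, 1, 0]})

mask_count = df[((df.age < 1) | (df.age > 60))
                & (df.survived == 1)].shape[0]
query_count = df.query('(age < 1 or age > 60) and survived == 1').shape[0]

print(mask_count, query_count)  # 2 2
```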
groupby function
Sometimes you want to group rows based on the values in a specific column. For example, let us say you want to "find the number of passengers from each embark town". That is when the groupby function comes in handy. Although the count function could get us the same result for this specific example, you will soon see in the advanced groupby lesson that you can extend this concept to solve more complex problems.
The groupby function groups rows that have the same values; we then apply the size() function to find the total size of each group, as shown below:
groups = titanic_df.groupby(['embark_town'])
print(type(groups))
print(groups.size())
Output:
<class 'pandas.core.groupby.DataFrameGroupBy'>
embark_town
Cherbourg      168
Queenstown      77
Southampton    644
dtype: int64
The groupby function collects all the records for each embark town and creates a new DataFrameGroupBy object.
If you are now curious to find the count of 'males' vs 'females' who came from each town, you would add the 'sex' column to your groupby. Here is the modified query:
groups = titanic_df.groupby(['embark_town', 'sex'])
print(groups.size())
Output:
embark_town  sex
Cherbourg    female     73
             male       95
Queenstown   female     36
             male       41
Southampton  female    203
             male      441
dtype: int64
To find the percentage of each group relative to the total sample, you could run the command below:
100 * groups.size() / len(titanic_df)
Output:
embark_town  sex
Cherbourg    female     8.193042
             male      10.662177
Queenstown   female     4.040404
             male       4.601571
Southampton  female    22.783389
             male      49.494949
dtype: float64
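The whole groupby flow above can be reproduced end to end on a small hand-built DataFrame (the town and sex values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'embark_town': ['S', 'C', 'S', 'Q', 'S'],
                   'sex': ['male', 'female', 'female', 'male', 'male']})

groups = df.groupby(['embark_town', 'sex'])
print(groups.size())                   # count per (town, sex) group
print(100 * groups.size() / len(df))   # percentage of the whole sample
```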
More advanced groupby functions are covered in another chapter.
Pandas Reference: http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.describe.html