Queries

So far you learned some fundamental basic operations on DataFrames. DataFrame is indeed very powerful due to the ability to run complex queries with semi SQL like syntax. In this lesson you will learn some of these queries.

In the previous lesson you loaded the titanic dataset and derived summary statistics. You will use the same data to derive more insight.

Simple Queries

Get Unique Values

Are you curious from where all the passengers embarked the sinking ship? Here is the query to find out:

print(titanic_df['embark_town'].unique())

Output:

['Southampton' 'Cherbourg' 'Queenstown' nan]

To run this query we invoked the unique() function on the 'embark_town' column values. Note that titanic['embark_town'] is a Series object. Invoking unique on the Series returned us only the unique values in 'embark_town' column. Notice that there is also a 'nan' value which stands for not a number.

There is another way of achieving the same result. You can use the dot operator on the DataFrame giving the column name instead of the enclosing the column name within the square bracket. Here is the revised code:

print(titanic_df.embark_town.unique())

Output:

['Southampton' 'Cherbourg' 'Queenstown' nan]

This also gives us the same result.

Get Unique Count

Using nunique() gives you the unique value count by ignoring the 'nan' values. So the below statement gives you a value of 3

titanic_df.embark_town.nunique()

Get Record Using Min/Max of a column

If you want to get the complete record of the passenger who paid the maximum fare, you could run a query like below:


titanic_df.loc[titanic_df.fare.idxmax()]

idxmin is equivalent of argmin in numpy and gives you the index number of the row that contains the minimum value of the given column

In the above example you could have also use iloc instead of loc as this dataframe has numerical index.

If you want to get the complete records of both the maximum and minimum fare passengers you could do so like:


titanic_df.loc[[titanic_df.fare.idxmax(), titanic_df.fare.idxmin()]]

Notice the double square brackets. The inner bracket references the row indexes. By leaving the columns indexes, gets you all the columns.

But if you only want to see certain columns then you could add the column information as shown below:


titanic_df.loc[[titanic_df.fare.idxmax(), titanic_df.fare.idxmin()], 'fare']

This gets you two rows of only the 'fare' columns. You can see multiple columns by adding the column values as an array as shown below:


titanic_df.loc[[titanic_df.fare.idxmax(), titanic_df.fare.idxmin()], ['pclass', 'who', 'fare']]

Filter Queries

You can filter columns, rows or values in a column using number of techniques and here you will see a few.

Filter Columns

Select all columns that start with 's' and get a new dataframe of the selected columns.


df_new = df.filter(regex='^s', axis=1)

Changing 'axis=0', applies the same regex to row labels.

You can also use 'like' to match partial string as shown below:


df.filter(like='s', axis=1)

Above query returns a dataframe with all columns that contain 's' in it.

Select all columns that are of a specific dtype


float_type_columns = df.select_dtypes(include='float64')

Filter Numerical Values - Range If you want to get only the passengers who are aged between 22 and 24 you could run a query like below:


titanic_df[titanic_df.age.between(22, 24)]

Filter String Values

If you want to filter based on a specific value contained in a column, you can use the expressions as shown below


titanic_df[titanic_df.embark_town == 'Cherbourg']   # gets all records that match value equal to Cherbourg
titanic_df[titanic_df.embark_town != 'Cherbourg']   # gets all records that does not match value equal to Cherbourg

You can use contains function on the str attribute to match full to partial matches of the given pattern.


df[df.embark_town.str.contains('amp')]  # gets all records that contain 'amp' anywhere in the string value
df[~df.embark_town.str.contains('amp')] # gets all records that does not contain 'amp' anywhere in the string value

Note: Note however that the above expression will fail for NaN values. Ensure all the NaN values are imputed before running the above statement

If you want to match a regex pattern, you can still use contains on str by passing in the regex pattern


df[df.embark_town.str.contains(r'ampt[h-p]n')]

Filter String Values using List

If you want to filter values in a categorical column, you could use the isin function on the column as shown below:


titanic_df[titanic_df.embark_town.isin(['Southampton', 'Cherbourg'])]

The above statement returns a dataframe that contains records of only of those passengers who embarked from 'Southampton' or 'Cherbourg'

Filter String Values by Exclusion using List

If you want to exclude the roles that belong to particular set of values, you could apply the '~' operator to the above expression as shown below:


titanic_df[~titanic_df.embark_town.isin(['Southampton', 'Cherbourg'])]

The above statement will get you all records that does not belong to Southampton or Cherbourg.

Using Logical AND Operators

In the last lesson you learnt that the youngest traveller was 0.42 years old. Are you curious to find out how many children below the age of 1 survived the tragedy? Let us build a query to find out:

df = titanic_df[
    (titanic_df.age < 1)
    & (titanic_df.survived == 1)
    ]
print(df)

Output:

     survived  pclass     sex   age  sibsp  parch      fare embarked   class  \
78          1       2    male  0.83      0      2   29.0000        S  Second   
305         1       1    male  0.92      1      2  151.5500        S   First   
469         1       3  female  0.75      2      1   19.2583        C   Third   
644         1       3  female  0.75      2      1   19.2583        C   Third   
755         1       2    male  0.67      1      1   14.5000        S  Second   
803         1       3    male  0.42      0      1    8.5167        C   Third   
831         1       2    male  0.83      1      1   18.7500        S  Second   

       who  adult_male deck  embark_town alive  alone  
78   child       False  NaN  Southampton   yes  False  
305  child       False    C  Southampton   yes  False  
469  child       False  NaN    Cherbourg   yes  False  
644  child       False  NaN    Cherbourg   yes  False  
755  child       False  NaN  Southampton   yes  False  
803  child       False  NaN    Cherbourg   yes  False  
831  child       False  NaN  Southampton   yes  False   

Above query is quite intuitive to understand; You chain the necessary conditions with the & operator. Since we want all children below one year we start with the query titanic.age < 1 and then add the next query to find out how many survived; titanic.survived == 1. We combine both of these two queries using the AND operator &. Results of this query shows us that there were a total of 7 children who survived the tragedy in this sample. Also note that the result of this query is also a DataFrame and hence you can run a describe() or an info() on it to find more information.

You can chain as many such queries as you want and there is no limit to that.

Complex Logical OR/AND Operations

Now let us find out how many of the travellers were children under the age of 1 and seniors over the age of 60. Let us build a query to do that. But this time we do not want to see the full results instead on the total count of the number of rows returned.

total_rows = titanic_df[
    ((titanic_df.age < 1)  | (titanic_df.age > 60))
    & (titanic_df.survived == 1)
    ].shape[0]
print(total_rows)

Output:

12

In this query we joined the two separate age queries with the OR operator using the | symbol. Then joined the combined age query with the survived query from the previous example with an AND operator.

groupby function

The groupby function groups rows that have the same values into summary rows, like "find the number of passengers from each embark town".

groups = titanic_df.groupby(['embark_town'])
print(type(groups))
print(groups.size())

Output:

<class 'pandas.core.groupby.DataFrameGroupBy'>
embark_town
Cherbourg      168
Queenstown      77
Southampton    644
dtype: int64

Groupby function will collect all the records in each embark town and creates a new DataFrameGroupBy object.

To find out the total count of each categories you can invoke a size() function on this object. If you are now curious to find the count of number of 'males' vs 'females' in who came from each town, you would add the 'sex' column to your groupby. Here is the modified query:

groups = titanic_df.groupby(['embark_town', 'sex'])
print(groups.size())

Output:

embark_town  sex   
Cherbourg    female     73
             male       95
Queenstown   female     36
             male       41
Southampton  female    203
             male      441
dtype: int64

To find the percentage of each group compared to the total of the sample, you could run the below command:

100 * groups.size() / len(titanic_df)

Output:

embark_town  sex   
Cherbourg    female     8.193042
             male      10.662177
Queenstown   female     4.040404
             male       4.601571
Southampton  female    22.783389
             male      49.494949
dtype: float64

Reference: http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.describe.html

results matching ""

    No results matching ""