Data Cleaning Commands for a Pandas DataFrame

Depending on the source of the data, a significant amount of time may be spent just cleaning it. Any data analysis done without cleaning the data thoroughly can lead to erroneous conclusions.

Sometimes cleaning and re-arranging columns and column names is required so that you can write queries which are more intuitive and less error prone, thereby improving productivity. In this lesson you will see some of the common dirty data scenarios and the techniques used to clean them.

Column names with spaces between words

Column names containing spaces are sometimes seen in data sets. The column names may be meaningful but have spaces between two or more words. These arbitrary spaces will stump you while writing queries: you waste time counting the spaces you missed while the query errors out with a message that such a column does not exist, even though you know very well that it does!

In this case it is better to replace all the spaces with one of the popular replacement characters: underscore (_) or hyphen (-). Here is the command you would use to fix the names with an underscore:


df.columns = df.columns.str.replace(' ', '_')

To replace multiple characters, you could use a regular expression instead. Here is an example which replaces parentheses, spaces and commas with an underscore, all in one statement:


df.columns = df.columns.str.replace(r"[() ,]", "_", regex=True)

Dropping extra columns you do not need

Sometimes your data may have columns which are of no value for your analysis, or you may have columns with duplicate data. In such cases you would like to delete those columns and keep a smaller data set. This not only saves memory but also makes it easier to focus on the important data. You can drop such columns with the commands below:

df.drop(['col1', 'col2', 'col3'], axis=1, inplace=True)
del df['col1']  # another variation
df.pop('col1')  # another variation
df = df.drop(columns="col1")  # another variation
  • The same drop method can be used to drop both columns and rows, although dropping columns is more common.
  • For dropping columns you set axis=1 and for dropping rows you set axis=0, as shown in the sketch after this list.
  • Using inplace=True ensures that the existing dataframe 'df' is modified in place instead of the method returning a new dataframe with the column dropped.
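
For instance, here is a minimal sketch (using hypothetical column names and default integer row labels) showing both variations:

df.drop(['col1', 'col2'], axis=1, inplace=True)  # drop two columns by label
df.drop([0, 1], axis=0, inplace=True)            # drop the rows with index labels 0 and 1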

Using filter

Instead of giving separate column names like above, you could also provide a filter with a regular expression to define the column names. Here is an example:

df.filter(regex="^C", axis=1)

If you run the above query, you will see that all the columns starting with 'C' are selected and returned as a new DataFrame.

A common error that students make is not using inplace=True, in which case the existing 'df' is not modified and a new dataframe with the dropped column is returned instead. If you do not assign the returned dataframe to a variable, the drop is not reflected anywhere.
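
For example, either of these two patterns (a sketch with a hypothetical 'col1') keeps the result of the drop:

df = df.drop(columns='col1')           # either: assign the returned dataframe back to df
df.drop(columns='col1', inplace=True)  # or: modify the existing dataframe in place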

Official reference

Inserting a new column

If you want to insert a new column at a specific index position in the columns, then you can use the insert method as shown below:


df.insert(4, 'fare-in-$', df['fare']/0.74)

The above method inserts a new column with the label 'fare-in-$' at index position 4 in the columns array, calculating the dollar value from the 'fare' column.

Selecting subset of columns

Instead of dropping columns, you could also select only a subset of columns and here is how you do that:


df2 = df[['col1', 'col2', 'col3']]

In the above example, a new dataframe df2 is created using only the col1, col2 and col3 columns of the 'df' dataframe.

Exercise


import pandas as pd

detroit_demolitions = pd.read_csv('https://storage.googleapis.com/mbcc/datasets/Detroit_Demolitions_withColumns.csv')
detroit_demolitions.drop(['Price'], axis=1)
detroit_demolitions.drop([0])

detroit_demolitions

Run the above code and you will notice that neither the 'Price' column nor the first row is dropped from the dataframe. How will you fix this?

There are two ways of fixing it; try both.

Converting column names to lower case

Many times column names will have upper and lower case letters interspersed. This will again stump you while writing queries as you have to be mindful of which letter is in which case. It is better to convert all letters to either upper case or lower case. Upper case is not very convenient as you have to hold an extra key on your keyboard to type it; lower case is a better option and is easier on the eyes too. Here are the commands to convert all column names to lowercase:


df.columns = df.columns.str.lower()  # preferred way
df.columns = map(str.lower, df.columns) # using a map function
df.columns = [x.lower() for x in df.columns]  # using a for loop

The above code block shows three ways of getting the same result. The first one is preferred as it is neater, although the performance difference is negligible for this small number of column names.

To time a statement's execution you can use the 'time' function of the 'time' module. Here are the revised statements, which print out the execution time after every statement:


import pandas as pd
import time

df = pd.read_csv('https://storage.googleapis.com/mbcc/datasets/Detroit_Demolitions_withColumns.csv')

a = time.time()
df.columns = df.columns.str.lower()  # preferred way
b = time.time()
print(b - a)
df.columns = map(str.lower, df.columns) # using a map function

c = time.time()
print(c - b)
df.columns = [x.lower() for x in df.columns]  # using a for loop

print(time.time() - c)

Note: The str accessor is available on any Series object. As all columns in a dataframe are Series, you can apply it to a column as well. Any string function can be applied through this accessor.

df['col1'].str.lower()

The above statement will convert all values in 'col1' of the dataframe to lower case.

Points to note

  • While we use the datetime module of Python for the most part, you can use the time module to clock statement executions as shown above. time.time() provides a Unix timestamp, expressed as a floating point number representing seconds since the Unix epoch.
  • Using 'str' is safe with null values: null values are ignored when string manipulation functions are applied. You can also use list-like functions such as slice and get on str (see the sketch after this list).
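
Here is a small sketch (with a made-up Series containing a None value) illustrating that the str accessor skips nulls and supports list-like operations such as slice:

import pandas as pd

s = pd.Series(['Alice', None, 'Bob'])
print(s.str.lower())      # None stays as NaN; the other values are lowercased
print(s.str.slice(0, 3))  # slicing works through the str accessor as well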

Using the %%time directive to time a block of code

If you want to time an entire code cell then you can use %%time at the top of the cell. Here is an example:

%%time

a=0
while(a<10):
    10*a
    a += 1

Output:

CPU times: user 2.31 ms, sys: 854 µs, total: 3.16 ms
Wall time: 2.72 ms

The above output shows the time it took for the entire code cell to execute. This block is not doing anything other than just looping a few times so we can measure the time.

Renaming a column label

Sometimes very long column labels are unwieldy to use in queries and you would like to rename such columns. Here are two examples that do just that:

df = df.rename(columns = {'data_value':'data'})
df.rename(columns={"column1":"col1"}, inplace=True)

Get subset of columns using difference

To get all columns except a given subset of columns, you could use difference:


df.columns.difference(['col1', 'col2'])

The above statement returns an Index object of all column names excluding col1 and col2.
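
Because the result is an Index, you can pass it straight back into the selection brackets to get the actual subset of columns; a minimal sketch:

df2 = df[df.columns.difference(['col1', 'col2'])]  # dataframe with every column except col1 and col2

Note that difference() returns the remaining labels in sorted order by default, so the column order of df2 may differ from df.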

Deduping

Deduplication, or deduping for short, is a data cleaning step in which you identify and drop duplicate rows. There may be instances when duplicate rows are real observations and hence legitimate. For example, if the time stamp is not noted and only the date is recorded for a patient's visit to a hospital, a patient may legitimately appear twice by visiting the hospital multiple times a day. However, in most scenarios duplicate records are erroneous information; the error could be due to human data entry or a system anomaly. In any case you have to ask the question and think through the possible scenarios for duplication. Once you are satisfied that the duplicate data should be removed, Pandas gives us handy methods on dataframes to help identify and remove duplicates.

Here are some examples. First we create a dataframe with duplicate rows and then check whether we can indeed identify the duplicates by calling the duplicated() method on the dataframe. This method returns a Series with a True value for each row that is a duplicate of an earlier row and False otherwise.


import pandas as pd
d = {'col1':[1, 1], 'col2':[2, 2]}
df = pd.DataFrame(data=d)

print(df.duplicated())

Output:
0 False
1 True
dtype: bool

In the output, the first row is marked 'False' (not a duplicate) while the second row is marked 'True' (a duplicate). To make it more intuitive you can add the returned series as a column, as shown below:


import pandas as pd
d = {'col1':[1, 1], 'col2':[2, 2]}
df = pd.DataFrame(data=d)

df['is_duplicate'] = df.duplicated()
df

Output:

    col1    col2    is_duplicate
0    1        2        False
1    1        2        True

In the next example we drop duplicates by calling the drop_duplicates method.


df.drop_duplicates(inplace=True)
df

Output:
col1 col2
0 1 2

Note: inplace=True drops the duplicates in the existing dataframe, which is the recommended way to conserve memory since no new dataframe is created for the result.
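
If you want to treat rows as duplicates based only on certain columns, or keep the last occurrence instead of the first, drop_duplicates also accepts subset and keep arguments. A minimal sketch with hypothetical column names:

df.drop_duplicates(subset=['col1'], keep='last', inplace=True)  # dedupe on col1 only, keeping the last occurrence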

Replace invalid characters in values

Many times you may have business-domain-related characters which come in the way of writing your queries. You might want to replace such characters so that you can convert the values to numbers for your analysis. For example:

  • having a $ symbol in the column values showing price.
  • having a comma for prices.
  • having a hyphen(-) for empty prices

You can replace such values using replace method:


df = df.replace('-', 0)  # replace all '-' values with 0 across the entire dataframe

Here is another example, which strips the $ symbol prepended to the values in the 'Price' column:


df['Price'] = df['Price'].str.lstrip('$')
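
Similarly, to remove the comma thousands separators mentioned above (assuming 'Price' is still a string column at this point), you could use str.replace:

df['Price'] = df['Price'].str.replace(',', '', regex=False)  # remove thousands separators such as in 1,200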

Convert Data From One Type to Another

astype method

You will have instances when you have to convert data from one type to another. In the above example, the Price column is of string type, so to convert it to float (so that you can run your numerical analysis) you invoke the astype method with the float type as the argument, as shown below:


df['Price'] = df['Price'].astype(float)

In fact you can do all of these in one statement as shown below:


df['Price'] = df['Price'].str.lstrip('$').astype(float)

Only a numeric string can be converted to float. You have to first strip out all the non-numeric characters before applying the astype(float) function. If the value still has any non-numeric character, you will get an error such as: ValueError: could not convert string to float: ...

to_numeric method

If a column or DataFrame has numbers for all valid values and non-numeric entries for invalid values, then you could use the to_numeric method instead. To ensure that it does not error out on the non-numeric entries you can pass the errors argument as shown below. All non-numeric entries get a NaN value and the rest of the values are converted to one of the numeric types based on the value:


pd.to_numeric(df['Price'], errors='coerce')

While the above call applies to an individual column, the statement below applies to_numeric to every column of the dataframe:


df = df.apply(pd.to_numeric, errors='ignore')

Official reference

converters parameter in read_csv

You can also clean up your data while reading the data file into a Pandas dataframe. Here is an example which strips the $ symbol from the Price column and converts the resulting value to float while reading the file into a DataFrame:


import pandas as pd
detroit_demolitions = pd.read_csv('https://storage.googleapis.com/mbcc/datasets/Detroit_Demolitions_withColumns.csv',
                                   converters={'Price': lambda s:float(s.lstrip('$'))})
print(detroit_demolitions.head(1))

Handling Missing Values

Missing values in data sets are very common. Missing entries are given the value NaN, short for "Not a Number", which has the float64 dtype for performance reasons. Before you start any analysis you might want to know whether any values are missing in the DataFrame. Here is the command to do just that:


df.isnull().values.any()

If there are NaN values, it returns True else False.

Another function which does exactly the same is isna(), which works just like isnull().

Complementary to isnull() and isna() is notnull().

Another way of finding the null values is to count them:

df.isnull().sum()

This gives you the count of null values in each column.

To get the total count across all columns you could use:

df.isnull().sum().values.sum()

Alternatively, calling the info method on the DataFrame lists the non-null count for every column along with the total number of rows.
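
For example:

df.info()  # prints the total number of rows plus the non-null count and dtype of every column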

Get all rows with missing values in a specific column

You can also view the entire rows that have missing values in a specific column. Here is a query which returns rows that have NaN values in the 'embark_town' column:


titanic_df[titanic_df['embark_town'].isna()].head(10)

Once you identify the missing values, you can either drop them if they are few in number or impute them with other values; here are some examples:

Note: In the Pandas world, missing values, na, null and the None type all mean the same thing and are represented by NaN.

Drop all missing values

You use the dropna() method to drop all rows that have a null or missing value in any field:


df.dropna(inplace=True)

With the above command, any row that has a missing value in any column is dropped.

However, if you only want to drop rows which have missing values in a specific column, you can still use the same function but restrict it to the column of interest with the subset argument. Here is an example:


df.dropna(subset=["col1"], inplace=True)

With the above command, you drop only those rows which have missing values in 'col1'

Variations: If you want to drop only those rows where every value is missing, use the keyword arguments how='all' and axis=0 (see the sketch below). Refer: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html
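
A minimal sketch of that variation:

df.dropna(how='all', axis=0, inplace=True)  # drop only the rows where every column is missing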

Fill with most common value across all columns

Sometimes it makes sense to simply fill in the missing values with the most commonly occurring value in each column, and here is a code snippet which does just that:


df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))

Note that the 'x' variable in the lambda function gets one column at a time.
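
An equivalent, arguably more readable alternative (a sketch, not from the lesson) uses mode(), whose first row holds the most frequent value of each column:

df = df.fillna(df.mode().iloc[0])  # fill each column's missing values with that column's most common value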

Fill with a default for all missing values

You use the fillna() method to fill all missing values with a default value. In the example below you replace all missing values with 0:


df.fillna(0, inplace=True)

You can also selectively fill missing values of a specific column:

df['col1'].fillna('empty', inplace=True)

However, if 'col1' is a categorical column and 'empty' is not one of its categories, you will get

ValueError: fill value must be in categories

To overcome this, you have to add the 'empty' category using the expression below

df['col1'] = df['col1'].cat.add_categories('empty')

and then run the fillna again.

In the above example, you fill the missing values with 'empty' string for only the column with the label 'col1'.

You can also impute with mean, median etc., and here is an example:

df['age'].fillna(df['age'].mean(), inplace=True)

To impute with the previous row value use the below statement:

df['price'].ffill(inplace=True)

The above statement is the same as:

df['price'].fillna(method='ffill', inplace=True)

Complementary to 'ffill' is 'bfill', which fills from the next valid value instead.
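
For example, to impute with the next row's value instead of the previous one:

df['price'].bfill(inplace=True)  # backward fill: copy the next valid observation upwards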

Reference: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html

Imputing with linear regression

If the missing values can be sensibly estimated from the neighbouring values, then you can impute them with linear interpolation using the function below:


df['gdp'].interpolate(method='linear', inplace=True)

The gdp column presumably contains a country's GDP values across many years in order, and linear interpolation fills each missing value based on the values around it.

The 'method' parameter can be changed to other values. Refer: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html#pandas.DataFrame.interpolate

Imputing missing values is a science by itself. There are many other scenarios of dirty data and the ways in which you can wrangle data for analysis. Please refer to the various methods as described here: https://ebooks.mobibootcamp.com/eda/datacleaning.html

Replacing one value with another

To replace a specific column value with another, you can use the replace function


df['embark_town'].replace('Queenstown', 'Kingstown', inplace=True)

All the values with 'Queenstown' in embark_town column will be replaced with 'Kingstown'. Of course you can also replace with a value of any other data type.

In the replace method you can also use regex by adding the keyword argument 'regex=True' and here is an example


df.replace(r'\s+', np.nan, inplace=True, regex=True)

The above regex expression matches one or more whitespace characters (spaces, tabs, newlines) and replaces such values with a NaN value.

Using regex to extract values into columns

You can also extract a regex group using str.extract. More here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html
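
As a minimal sketch (assuming a string 'Price' column with values such as '$12,500'), you could pull just the numeric part into a new, hypothetical 'price_number' column using a capture group:

df['price_number'] = df['Price'].str.extract(r'(\d[\d,]*)', expand=False)  # capture the digits (and commas) after the $ sign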

Replacing a subset of values

Let us say you want to update one or more values based on row and column indexes or labels. Then you could use the 'loc', 'iloc' or 'at' accessors. Here are some examples:


df.loc[2, 'deck'] = 'A'

The above statement selects row index number '2' and the column labeled 'deck' and updates the value in this cell to 'A'.


df.iloc[3, 11] = 'A'

The above statement selects row index number '3' and column index number '11' (which is the 'deck' column) and sets the cell value to 'A'. Column indexes begin at '0'.


df.loc[4:6, 'deck'] = 'A'

The above statement selects all the rows with index labels '4' through '6' (with loc the range is inclusive of both ends) and changes the 'deck' column value to 'A'.


df.loc[[2, 4], 'deck'] = 'A'

The above statement selects the rows with indexes 2 and 4 and changes their 'deck' column value to 'A'.


import numpy as np
df.loc[[2, 4], :] = np.nan

The above statement selects row indexes 2 and 4 and changes all the column values to NaN

at

You can replace 'loc' with 'at' when setting a single cell, as in the first example above ('at' accepts only a single row label and a single column label). Even then, it is preferable to use 'loc' instead of 'at', as in some edge cases 'at' will silently convert datatypes under the hood without warning. Here is an example:


df = pd.DataFrame([['a', 'b', 3], ['c', 'c', 1], ['e', 'f', 30]], columns=['A', 'B', 'C'])
df.info()

Output:

 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   A       3 non-null      object
 1   B       3 non-null      object
 2   C       3 non-null      int64 
 

As you can see, the 'C' column is assigned the int64 type. Now if you try to update the value of the first row, column 'C', to a decimal value with 'at', you will notice that the decimal value is truncated to fit the 'int64' type:


df.at[0, 'C'] = 10.70

However, this is not so when you use 'loc': it converts the entire column to 'float' when a float value is set in one of its rows.
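
A quick way to check the behaviour on your own pandas version (continuing the dataframe above) is to inspect the dtypes after each style of assignment; a sketch:

df.at[0, 'C'] = 10.70    # on some pandas versions the value is coerced to fit int64
print(df.dtypes)

df.loc[0, 'C'] = 10.70   # loc typically upcasts the 'C' column to float64
print(df.dtypes)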

Note:

  • If the given row index and column label or index does not exist, then both 'loc' and 'at' will create a new row and/or column to satisfy the set value. But 'iloc' cannot enlarge the dataset.
  • You can also pass in a query for the row or column list values as long as the query expression returns an Index or list. For example, df.loc[df[df.marks > 70].index, 'marks'] = 'A'. This expression can be applied to the dataframe example shown in the earlier chapter to see it working.
