DataFrames
A DataFrame is a composite data structure built on top of several Series objects, one per column. DataFrames are used to represent tabular data. Like a table, a DataFrame has columns, column names, and rows with row indexes. Here is a simple example:
import pandas as pd

df = pd.DataFrame({
    'marks': [70, 66, 100, 88],
    'age': [29, 32, 31, 28],
    'sex': ['F', 'M', 'F', 'F'],
    'name': ['Jane', 'John', 'Sally', 'Sandy'],
    'ssn': ['1234', '3456', '4567', '5678'],
})
print(df)
Output:
   age  marks   name sex   ssn
0   29     70   Jane   F  1234
1   32     66   John   M  3456
2   31    100  Sally   F  4567
3   28     88  Sandy   F  5678
In the above example, name, ssn, age, marks and sex make up the column names (labels). Rows have index numbers from 0 to 3. The output above comes from an older pandas release, which sorted the columns of a dict-built DataFrame alphabetically; from pandas 0.23 onwards (on Python 3.6+) the columns keep the order in which the dict keys were written. You can get the values of an entire row or column by label. Whenever you retrieve a single row or column, you receive a Series object with the respective values.
Get Column values from column name (label)
Using a column label, you can retrieve all the values of a column. Here is the code to get all the values of the 'age' column:
col_values = df['age']
print(type(col_values))
Output:
<class 'pandas.core.series.Series'>
Notice that the retrieved value is a Series object. Let us now print out the values of the series object.
print(col_values)
Output:
0    29
1    32
2    31
3    28
Name: age, dtype: int64
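Rows can be retrieved in a similar way. The following is a minimal sketch, assuming a trimmed-down version of the DataFrame above; the variable names first_row and last_row are only illustrative. It uses .loc to select a row by its index label and .iloc to select it by position, and in both cases the result is a Series whose labels are the column names.

```python
import pandas as pd

# A smaller version of the example DataFrame (illustrative data only).
df = pd.DataFrame({
    'marks': [70, 66, 100, 88],
    'age': [29, 32, 31, 28],
    'name': ['Jane', 'John', 'Sally', 'Sandy'],
})

# .loc selects a row by its index label, .iloc by its integer position.
first_row = df.loc[0]
last_row = df.iloc[-1]

print(type(first_row))     # a Series, just like a retrieved column
print(first_row['name'])   # individual values are looked up by column label
print(last_row['marks'])
```

With the default 0..n-1 index, the label and the position of a row coincide; once the index is replaced by another column, only .loc understands the new labels.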
nlargest and nsmallest
The nlargest and nsmallest methods that you used on Series objects work on DataFrames too!
Pass the number of rows you want, followed by the column (or columns) whose values should be ranked. Here is an example:
import pandas as pd
df = pd.DataFrame({'marks': [70, 66, 100, 100, 88],
                   'age': [29, 32, 30, 31, 28]})
# Print the three rows with the highest marks
print(df.nlargest(3, 'marks'))
Output:
   marks  age
2    100   30
3    100   31
4     88   28
You can specify multiple columns too. To rank on both the 'marks' and 'age' columns, pass a list of column names instead of a single name. Rows are ranked by the first column, and ties are broken by the next column in the list:
import pandas as pd
df = pd.DataFrame({'marks': [70, 66, 100, 100, 88],
                   'age': [29, 32, 30, 31, 28]})
# Rank by marks first, then by age where marks are equal
print(df.nlargest(3, ['marks', 'age']))
Output:
   marks  age
3    100   31
2    100   30
4     88   28
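The section mentions nsmallest without showing it. A brief sketch on the same data, where the variable names lowest and top are only illustrative: nsmallest mirrors nlargest, and the optional keep argument controls what happens to tied values (keep='all' returns every row tied for the last place, so the result can contain more rows than requested).

```python
import pandas as pd

df = pd.DataFrame({'marks': [70, 66, 100, 100, 88],
                   'age': [29, 32, 30, 31, 28]})

# The two rows with the lowest marks, smallest first.
lowest = df.nsmallest(2, 'marks')
print(lowest)

# Ask for one row, but keep every row tied for the top mark.
top = df.nlargest(1, 'marks', keep='all')
print(top)
```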
Change index
Another advantage of DataFrames is that you can easily change the current index, or reindex, to make things easier. In the first example (the DataFrame with the name, ssn, age, marks and sex columns), the rows carry index numbers 0 through 3. In real-life data, however, a table often has a natural primary key, and you may be constantly querying against specific values of that key. With a DataFrame you can replace the numerical index with the values of any column. Let us say we want to replace the row index of that DataFrame with the 'ssn' column. Here is the code to do just that:
df.set_index('ssn', inplace=True)
print(df)
Output:
      age  marks   name sex
ssn
1234   29     70   Jane   F
3456   32     66   John   M
4567   31    100  Sally   F
5678   28     88  Sandy   F
Notice that the existing index is replaced with the values of the ssn column, and that 'ssn' is no longer a regular column. If you want to keep the 'ssn' column as well, add the keyword argument drop=False. The inplace=True keyword argument makes the index change happen on the existing object; without it, set_index returns a new DataFrame and leaves the original unchanged.
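The two keyword arguments can be seen side by side in a small sketch. It assumes a two-row DataFrame with only the name and ssn columns, and the variable name indexed is illustrative. Here set_index is called without inplace=True and with drop=False, so a new DataFrame is returned, the original keeps its default index, and 'ssn' remains available as a column.

```python
import pandas as pd

# Illustrative two-row DataFrame.
df = pd.DataFrame({
    'name': ['Jane', 'John'],
    'ssn': ['1234', '3456'],
})

# Without inplace=True, set_index returns a new DataFrame and leaves df untouched.
indexed = df.set_index('ssn', drop=False)

print(indexed.loc['3456', 'name'])   # look up a row by its ssn label
print('ssn' in indexed.columns)      # the column is kept because drop=False
print(df.index.tolist())             # the original still has the default index
```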
Alternative way
You can achieve the same result with the statements below, starting again from the original DataFrame in which 'ssn' is still a column. First assign the 'ssn' column as the index, then drop the now-redundant 'ssn' column:
df.index = df['ssn']
df.drop('ssn', axis=1, inplace=True)
print(df)
Output:
      age  marks   name sex
ssn
1234   29     70   Jane   F
3456   32     66   John   M
4567   31    100  Sally   F
5678   28     88  Sandy   F
If you do not add the inplace=True argument to the drop command, you receive a new DataFrame object with the column dropped, and the original is left unchanged. If you add the argument, the existing DataFrame object is modified in place.
Replacing the existing index is one of the most commonly used operations in the data wrangling step.
Sorting
A dataframe can be sorted on index or on one or more columns.
Here are the methods:
- df.sort_index() - sorts the entire dataframe in ascending order based on index
- df.sort_values(by=['col1']) - sorts the entire dataframe in ascending order based on col1
- df.sort_values(by=['col1', 'col2']) - sorts the entire dataframe in ascending order based on col1, using col2 to order rows that have equal col1 values. You can add more columns to the list, separated by commas
You can add the optional keyword argument ascending=False to sort the dataframe in descending order. Adding inplace=True sorts the existing dataframe instead of returning a new one. There are a few more variations in the argument list. Please refer to the complete list here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values
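The sorting methods listed above can be sketched on the small marks/age DataFrame used earlier; the variable names by_marks and restored are illustrative. The example also passes a list to ascending, which sort_values accepts to set the direction per column.

```python
import pandas as pd

df = pd.DataFrame({'marks': [70, 66, 100, 100, 88],
                   'age': [29, 32, 30, 31, 28]})

# Highest marks first; among equal marks, youngest first.
by_marks = df.sort_values(by=['marks', 'age'], ascending=[False, True])
print(by_marks)

# sort_index puts the rows back in order of their index labels.
restored = by_marks.sort_index()
print(restored)
```

Neither call uses inplace=True, so df itself keeps its original row order.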
Change Column ordering
You can change the column positions using the reindex method with axis=1, as shown below.
df.reindex(df.columns.sort_values(), axis=1)
In the above example, you get a new dataframe in which all the columns are sorted in ascending order of their labels.
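When the desired order is not alphabetical, one common alternative is to select the columns with a list in the order you want. A small sketch, assuming a three-column DataFrame; the names reordered and alphabetical are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'marks': [70, 66],
                   'age': [29, 32],
                   'name': ['Jane', 'John']})

# Selecting with a list of labels returns the columns in exactly that order.
reordered = df[['name', 'age', 'marks']]
print(reordered.columns.tolist())

# The reindex call from the text, for comparison: labels sorted ascending.
alphabetical = df.reindex(df.columns.sort_values(), axis=1)
print(alphabetical.columns.tolist())
```

Both expressions return new DataFrames; assign the result back to df if you want to keep the new order.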
Note
- More reference: https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.set_index.html
- It is important to note that inplace=True is an argument accepted by many DataFrame methods; it makes the method modify the existing DataFrame instead of returning a new one.
- For very large data sets, avoid keeping unneeded, nearly identical DataFrame objects around, or you may run out of memory. Keep in mind that inplace=True may still create a temporary copy internally, so it should not be relied on as a guaranteed memory saving.
- An index can contain duplicate values. To find out whether it does, use df.index.duplicated(), which returns a boolean array with True for every index entry that repeats an earlier one. Refer: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.duplicated.html
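The duplicate check from the last note can be sketched as follows. The DataFrame and the repeated '1234' label are made up for illustration, and the variable names mask and deduped are hypothetical.

```python
import pandas as pd

# Illustrative data: the label '1234' appears twice in the index.
df = pd.DataFrame({'marks': [70, 66, 100]},
                  index=['1234', '3456', '1234'])

# True marks every entry that repeats an earlier index label.
mask = df.index.duplicated()
print(mask)

# is_unique is a quick yes/no check for the whole index.
print(df.index.is_unique)

# Keep only the first row for each index label.
deduped = df[~mask]
print(deduped)
```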