DataFrames

A DataFrame is a two-dimensional data structure built from several Series objects, one per column. DataFrames are used to represent tabular data. Like a table, a DataFrame has columns, column names, and rows with row indexes. Here is a simple example:

import pandas as pd

df = pd.DataFrame({
    'marks': [70, 66, 100, 88],
    'age': [29, 32, 31, 28],
    'sex': ['F', 'M', 'F', 'F'],
    'name': ['Jane', 'John', 'Sally', 'Sandy'],
    'ssn': ['1234', '3456', '4567', '5678'],
})
print(df)

Output:

   age  marks   name sex   ssn
0   29     70   Jane   F  1234
1   32     66   John   M  3456
2   31    100  Sally   F  4567
3   28     88  Sandy   F  5678

In the above example, name, ssn, age, marks and sex are the column names (labels), and the rows have index numbers 0 to 3. The output shows the columns in alphabetical order, which is how older pandas versions arranged columns built from a dict; pandas 0.23 and later (on Python 3.6+) keep the order in which the keys were written, so the same code prints marks, age, sex, name, ssn. You can retrieve the values of an entire row or column, and whenever you do, you receive a Series object holding those values.

Get Column values from column name (label)

Using a column label, you can retrieve all the values of that column. Here is the code to get all the values of the 'age' column:

col_values = df['age']
print(type(col_values))

Output:

<class 'pandas.core.series.Series'>

Notice that the retrieved value is a Series object. Let us now print the values of that Series object.

print(col_values)

Output:

0    29
1    32
2    31
3    28
Name: age, dtype: int64
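Rows come back as Series objects too. The sketch below is an illustration rather than part of the original example: it uses a trimmed-down copy of the DataFrame and the standard loc (label-based) and iloc (position-based) accessors to pull out single rows.

```python
import pandas as pd

# A trimmed-down copy of the example DataFrame (fewer columns for brevity).
df = pd.DataFrame({
    'marks': [70, 66, 100, 88],
    'age': [29, 32, 31, 28],
    'name': ['Jane', 'John', 'Sally', 'Sandy'],
})

# Select the row whose index label is 1. The result is a Series
# whose own index holds the column names.
row_by_label = df.loc[1]

# Select the first row by position, regardless of its label.
row_by_position = df.iloc[0]

print(type(row_by_label))
print(row_by_label['name'], row_by_position['name'])
```

With the default 0..n-1 index, labels and positions coincide; they differ once the index is replaced, as described later in this section.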

nlargest and nsmallest

The nlargest and nsmallest methods that you used on Series objects work on DataFrames too!

You provide the number of rows to return along with the column(s) whose values decide which rows are the largest. Here is an example:


import pandas as pd

df = pd.DataFrame({'marks': [70, 66, 100, 100, 88],
                   'age': [29, 32, 30, 31, 28]})

print(df.nlargest(3, 'marks'))

Output:


   marks  age
2    100   30
3    100   31
4     88   28

You can specify multiple columns too. Rows are then ranked by the first column, and ties are broken by the next column. To apply nlargest on both the 'marks' and 'age' columns, pass a list of column names as shown below:


import pandas as pd

df = pd.DataFrame({'marks': [70, 66, 100, 100, 88],
                   'age': [29, 32, 30, 31, 28]})

print(df.nlargest(3, ['marks', 'age']))

Output:


   marks  age
3    100   31
2    100   30
4     88   28
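nsmallest takes the same arguments and returns the rows with the lowest values instead. A minimal sketch on the same data (the variable names lowest_two and youngest_two are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'marks': [70, 66, 100, 100, 88],
                   'age': [29, 32, 30, 31, 28]})

# The two rows with the lowest marks, smallest first.
lowest_two = df.nsmallest(2, 'marks')

# The two rows with the lowest age, smallest first.
youngest_two = df.nsmallest(2, 'age')

print(lowest_two)
print(youngest_two)
```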

Change index

Another strength of DataFrames is that you can easily replace the current index, or reindex, to make things easier. In the first example (the DataFrame with the 'ssn' column), the rows carry index numbers 0 through 3. In real-life data, however, a table often has a natural primary key, and you may be constantly querying against specific values of that key. With a DataFrame you can replace the numerical index with the values of any column. Say we want to use the 'ssn' column as the row index in that example. Here is the code to do just that:

df.set_index('ssn', inplace=True)
print(df)

Output:

      age  marks   name sex
ssn                        
1234   29     70   Jane   F
3456   32     66   John   M
4567   31    100  Sally   F
5678   28     88  Sandy   F

Notice that the existing index is replaced with the values in the 'ssn' column and that 'ssn' is no longer a regular column. If you want to keep the 'ssn' column, add the keyword argument drop=False. The inplace=True keyword argument applies the change to the existing object; without it, set_index returns a new DataFrame and leaves the original unchanged.
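The two keyword arguments can be compared side by side. This is a small sketch on a cut-down DataFrame; the names indexed and indexed_keep are chosen here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Jane', 'John', 'Sally', 'Sandy'],
    'ssn': ['1234', '3456', '4567', '5678'],
})

# Without inplace=True, set_index returns a new DataFrame
# and leaves df itself unchanged.
indexed = df.set_index('ssn')

# drop=False keeps 'ssn' as a regular column in addition
# to using its values as the index.
indexed_keep = df.set_index('ssn', drop=False)

print(list(indexed.columns))
print(list(indexed_keep.columns))
print(list(df.columns))
```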

Alternative way

You can achieve the same result with the statements below, starting from the original DataFrame in which 'ssn' is still a column. First assign the 'ssn' column as the index, then drop the now redundant 'ssn' column:

df.index = df['ssn']
df.drop('ssn', axis=1, inplace=True)
print(df)

Output:

      age  marks   name sex
ssn                        
1234   29     70   Jane   F
3456   32     66   John   M
4567   31    100  Sally   F
5678   28     88  Sandy   F

If you do not add the inplace=True argument to the drop command, you receive a new DataFrame object with the column dropped, and the original is left as it was. If you add it, the existing DataFrame object is modified in place.

Replacing the existing index is one of the most common operations in the data wrangling step.
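Once a key column is the index, rows can be looked up directly by key value. The following is a minimal sketch using the loc accessor on a cut-down version of the example data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'marks': [70, 66, 100, 88],
    'name': ['Jane', 'John', 'Sally', 'Sandy'],
    'ssn': ['1234', '3456', '4567', '5678'],
}).set_index('ssn')

# Look up a whole row by its index label (an 'ssn' value).
john = df.loc['3456']

# Look up a single cell by index label and column name.
sally_marks = df.loc['4567', 'marks']

print(john['name'], sally_marks)
```

Note that the 'ssn' values here are strings, so the lookup key must be the string '3456', not the integer 3456.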

Sorting

A dataframe can be sorted on index or on one or more columns.

Here are the methods:

df.sort_index() - sorts the entire dataframe in ascending order based on index
df.sort_values(by=['col1']) - sorts the entire dataframe in ascending order based on col1
df.sort_values(by=['col1', 'col2']) - sorts the entire dataframe in ascending order based on col1, with ties broken by col2. You can add more columns separated by commas

You can add the optional keyword argument 'ascending=False' to sort the dataframe in reverse order. Adding 'inplace=True' sorts the existing dataframe instead of returning a new one. There are a few more variations in the argument list. Please refer to the complete list here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values
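The sorting methods listed above can be tried on a small DataFrame. In this sketch the index is deliberately shuffled so that sort_index has something to do; the data and variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'marks': [70, 66, 100, 100, 88],
                   'age': [29, 32, 30, 31, 28]},
                  index=[3, 1, 4, 0, 2])

# Sort rows by index label, ascending.
by_index = df.sort_index()

# Sort by marks, then age, both descending.
by_marks_desc = df.sort_values(by=['marks', 'age'], ascending=False)

# A list gives one direction per column:
# marks descending, age ascending within equal marks.
mixed = df.sort_values(by=['marks', 'age'], ascending=[False, True])

print(by_index)
print(by_marks_desc)
print(mixed)
```

All three calls return new DataFrames; df keeps its original row order.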

Change Column ordering

You can change the column positions using the reindex function with 'axis=1', as shown below.


df.reindex(df.columns.sort_values(), axis=1)

In the above example, you get a new dataframe in which all the columns are sorted in ascending order of their labels.
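reindex also accepts an explicit list of labels, which is handy when you want a specific order rather than a sorted one. A short sketch with made-up data, assuming a pandas version in which dict key order is preserved:

```python
import pandas as pd

df = pd.DataFrame({'marks': [70, 66],
                   'age': [29, 32],
                   'name': ['Jane', 'John']})

# Columns sorted alphabetically by label.
alphabetical = df.reindex(df.columns.sort_values(), axis=1)

# Columns in an explicit, hand-picked order.
custom = df.reindex(columns=['name', 'age', 'marks'])

print(list(alphabetical.columns))
print(list(custom.columns))
```

Both calls return new DataFrames and leave the column order of df untouched.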

Note

Further reference for set_index: https://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.set_index.html
inplace=True is an argument accepted by many DataFrame methods; it makes the method modify the existing DataFrame instead of returning a new one.
Do not rely on inplace=True to save memory on very large data sets: many operations still build a full copy internally before updating the existing object, so peak memory use is often about the same as with the default inplace=False.
An index can contain duplicate values. To check for them, use df.index.duplicated(), which returns a boolean array that is True for every index label that repeats an earlier one. Refer: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.duplicated.html
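A quick way to see the duplicate check in action is sketched below. The data is invented for illustration, and the step that keeps only the first occurrence of each label is one possible follow-up, not something the notes above prescribe.

```python
import pandas as pd

# The label 'a' appears twice in the index.
df = pd.DataFrame({'marks': [70, 66, 100]}, index=['a', 'b', 'a'])

# Boolean array: True where a label repeats an earlier one.
dup_mask = df.index.duplicated()

# is_unique is False as soon as any label repeats.
has_dups = not df.index.is_unique

# Keep only the first row for each index label.
deduped = df[~dup_mask]

print(dup_mask)
print(has_dups)
print(deduped)
```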
