Pandas
While NumPy became famous for its efficient N-dimensional arrays, Pandas library became famous for its DataFrame data structure - used for manipulating data in table format. DataFrames are not only efficient while working with large amounts of data but also provides useful and powerful functions to slice, dice and manipulate date. Pandas has been extensively used in financial applications. Pandas name is derived from the term 'panel data'.
Pandas uses NumPy structures internally. It uses its N-dimensional arrays and provides higher level, more powerful methods for manipulation.
Just like NumPy, Pandas is also an open source project and here is the GibHub link: https://github.com/pandas-dev/pandas
How exactly does Pandas help?
Pandas library solves some of the routine data manipulation tasks with easy to use and high performance api calls. Here are a few examples:
- It is easy to handle missing data (represented as NaN) in floating point as well as non-floating point data.
- DataFrame columns can be inserted and/or deleted by either returning a new DataFrame or manipulating the existing DataFrame in-place.
- Group by functionality on data sets helps in creating variety of sub collections on data sets. Very similar to functions seen in relational databases.
- Label-based slicing, custom indexing capabilities.
- Can merge and join data sets.
- API calls available for loading data from flat files (CSV and delimited), Excel files, databases etc..
- Variety of Time series-specific functionalities
To use Pandas module, you first have to install and then import in your notebook. Since you have downloaded Anaconda, you already have Pandas installed on your computer so you can straight away import the module with import pandas as pd
. Although you can import pandas with any other alias, it is a common practice to import as 'pd'.
Now let us understand the two main data structures of Pandas; Series and DataFrames.
Series
The Series is the primary building block of Pandas. A Series represents a one-dimensional labeled indexed array based on the NumPy ndarray. Therefore, Series has only one axis (axis == 0) called 'index'. Like an array, a Series can only hold homogeneous data. A Series can be created and initialized by passing values, a NumPy ndarray, a Python list, or a Python Dictionary as the parameter of the Series constructor. Here are examples of defining a Series:
Series built with list values
import pandas as pd
a = pd.Series([1, 25, 65])
print(a)
print(a[0])
Output
0 1
1 25
2 65
dtype: int64
1
In the above example, the index values for the Series are assigned values similar to NumPy array or Python list. Accessing the index value 0 will get you the value of the first element which is 1 in this example.
Note: If you construct a Series with heterogeneous data, the dtype of the resulting Series will be chosen to accommodate all of the data types involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and integers, the resulting array will be of float dtype. If only integers are involved, the dtype will be of type int
Series built with index values
You can also build a Series with your own specific index values. You can give any combination of alpha numeric characters as the index values.
b = pd.Series([1, 25, 65, pd.np.nan], index=['a', 'Bx', 'c', 'd'])
print(b)
print(b['Bx'])
Output
a 1.0
Bx 25.0
c 65.0
d NaN
dtype: float64
25.0
In this example note that we created a Series with NaN (not a number) type by referencing the NaN type from NumPy. Using pd namespace we accessed the np namespace and then referenced the nan type. Because we have a nan type, the datatype was automatically assigned as a float. To access the second element in the Series, we use the index value 'Bx' as we replaced the default index values with our own values by sending in the custom values as the second parameter in the Series constructor.
Slicing works similar to ndarray and list except you can also use your own custom index, Notice that the natural index, starting from '0' also works along with the custom index.
import pandas as pd
b = pd.Series([1, 25, 65, pd.np.nan], index=['a', 'Bx', 'c', 'd'])
print(b[0:2])
print(b['c':])
Output:
a 1.0
Bx 25.0
dtype: float64
c 65.0
d NaN
dtype: float64
Top or Bottom n entries
Using nlargest(n)
on the Series object will return the top 'n' largest values. 'n' can be any number. Complementary to this is nsmallest(n)
Sorting values and indexes
You can sort the values of a Series by using sort_values
and you can sort indexes by using sort_index
. Here is an example:
a = pd.Series([10, 2, 65, 35, 64], index=['c','aa','ac','a','e'])
sorted_index = a.sort_index()
sorted_values = a.sort_values()
Output:
a 35 aa 2 ac 65 c 10 e 64 dtype: int64 aa 2 c 10 a 35 e 64 ac 65 dtype: int64
By default the sort function is displayed in ascending order. You can change this by adding the parameter 'ascending=False'
Once sorted, you can also get the n largest or n smallest values by using slicing also. To get the top 3 and bottom values, you could use:
a.sort_values()[:3]
and a.sort_values()[-3:]
Note: By default the sort functions used above returns a new Series object with the values sorted.
If you want the original Series sorted, you have to use inplace=True
keyword argument.
Map Function on Series
The standard functions to calculate mean(), max(), min() etc., that you learnt for NumPy are also available for Pandas series. In addition a very handy function that you see quite often is map function. Maps are used for transforming data from one format to another format. The function map expects you to pass a single value from the Series to which a given lambda function is applied and then returns the new value. Let us take an example; suppose you want to strip the '$' sign present in the price value, you would use a map with a lambda function. The below example shows that:
import pandas as pd
prices = pd.Series(['$1.5', '$25', '$65'])
print(prices)
prices = prices.map(lambda price_value: price_value.lstrip('$'))
print(prices)
Output:
0 $1.5 1 $25 2 $65 dtype: object 0 1.5 1 25 2 65 dtype: object
Note however, that the map returns a new Series object and does not modify the existing object hence the variable 'prices' is being reassigned to the returned value from the map.
Although we used map for the above scenario, you indeed do not need to use map for such simple operations. You could directly manipulate the string value in this case by applying string functions. Here is the example showing the same functionality derived from directly applying the string function lstrip
import pandas as pd
prices = pd.Series(['$1.5', '$25', '$65'])
print(prices)
prices = prices.str.lstrip('$')
print(prices)
Output:
0 $1.5 1 $25 2 $65 dtype: object 0 1.5 1 25 2 65 dtype: object
map is typically used for transformations which are more complex.
Also, all the broadcasting rules that you learnt for NumPy are applicable to Series objects as well.
Queries
You can run queries to filter out certain values. Here are some examples.
You can use relational and logical operators to filter and get certain records.
In the below example, you are getting those values that are greater than or equal to 10
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s2 = s1[s1>=10]
print(s2, type(s2))
Note that the value returned is a Series object too.
In the below example, you get all the values that are greater than 14 and less than 50 using the '&' operator between the two relational operators.
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s2 = s1[(s1 > 14) & (s1 < 50)]
s2
Logical OR operator can be applied using '|' notation.
Note: It is better to enclose the relational operators within a pair of parentheses as shown in the example, to avoid unforeseen errors during computation
Selecting Elements
There are multiple ways you can select an element in a series and let us understand them one by one.
Using index number
series supports the standard indexing mechanism to get an element with '0' be the starting index. You can also go in the reverse direction starting from -1. These are very similar to Python lists or np arrays that you have dealt with earlier.
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s1[0] # gets you the first element of the Series
s1[-1] # gets you the last element of the Series
Note: For negative indexes to work, you should have a custom index defined. Only then it works. For the default indexing it does not work.
Using iloc
You can also use iloc function when you are dealing with using the index number. Here is an example;
a = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
a.iloc[0] # gets you the first element of the Series
Output:
100
Retrieving multiple values
You can also retrieve multiple values with iloc and here is an example
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s1.iloc[[0, 3]]
Output:
a 20
d 50
With iloc you can also get a range of values similar to slicing and here is an exmaple
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s1.iloc[0:3] # gets all values from index 0 to index 3 excluding 3
Traditional methods using positional index numbers
Of course the regular slice function and the regular retrieval using the positional index also works on Series
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s1[0:3] # This gives the same result as iloc[0:3]
s1[0] # this retrieves the first element of the Series
Updating values
Series object is mutable so you can update or delete any element and here are some examples of using the positional index and iloc to get your job done
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s1[0] = 100 # changes the first element to 100
s1.iloc[2] = 200 # updates the third element to 200
If iloc and regular index gives the same results which one should you use? Well, if you are only retrieving or updating 1 value, either one of them is fine but if you are retrieving or updating multiple values then .iloc would be better fit and here is an example
s1 = pd.Series([20, 14, 32, 50, 49, 100])
s1.iloc[[0, 3]] = 100, 200 # you can update both the first element and the fourth element with one statement using iloc
Using loc property
You can also use loc instead of iloc in all the above examples. However, if you have a non numerical index, then you have to use loc to retrieve values using the non numerical index. Here is an example
s1 = pd.Series([20, 14, 32, 50, 49, 100], index=['a', 'b', 'z', 'd', 'e', 'f'])
print(s1.loc[['a','d']])
print(s1.iloc[[0, 3]])
Both the above print statements retrieves the first and the fourth value in the series except iloc retrieves it using numerical index and loc uses the custom index labels.
If the custom index label is not provided then both loc and iloc behave the same and you can use the positional index to get the values.