Visualizations with Seaborn

What is Seaborn?

The end result of most data analysis is visualization of the findings with beautiful diagrams. Seaborn is an open source Python library used for visualizations. Seaborn is especially friendly with a Pandas DataFrame and as an analyst you will find working with Seaborn more easy compared to Matplotlib.

Seaborn is built on top of Matplotlib just like Pandas is built on Numpy. Just like Pandas, Seaborn provides us powerful higher level functions to create diagrams.

In this example we will import Seaborn to make use of its ready made data sets. Seaborn is also already installed with Anaconda. So you are all set to import the library when you are using Anaconda. It is also available in Colab environment. When you import the library, you get access to many datasets which are part of the library. One such dataset is the famous Titanic data. This is a sample data of the travellers in the Titanic ship disaster. If you are curious to know the other datasets that come with Seaborn, here is the github link: https://github.com/mwaskom/seaborn-data.

Let us now load the Titanic dataset in our notebook:

load built-in dataset

import seaborn as sns
titanic = sns.load_dataset('titanic')
print(type(titanic))

Output

<class 'pandas.core.frame.DataFrame'>

The standard alias namespace for Seaborn is sns. You will notice that load_dataset method on sns returns a Pandas DataFrame object.

Simple, single variant charts are really easy using Matplotlib. Pandas functions on DataFrames uses Matplotlib underneath it. But if you want to plot multi-variant charts or if you want to add more sophistication with few lines of code, Seaborn is the way to go.

While Seaborn improves your productivity, it comes at the cost of flexibility. Here is the chart which depicts the many different popular Visualization tools and where they stand with respect to Flexibility and Productivity

survival chart

In this lesson we will continue plotting on Titanic data but using Seaborn. For most of the seaborn functions we just feed the DataFrame object to function calls and set the various variable arguments and/or variable keyword arguments for tweaking the default graphs and charts.

Facetgrid

If you want to show the distribution of 'age' over multiple facets; survived and class columns with separate diagrams for each distinct value of each facet, then you would write just three lines of code in Seaborn as shown below:

grid = sns.FacetGrid(titanic, row='survived', col='class')
grid.map(sns.histplot, "age")

Output: survival chart

histplot function calculates the the distribution of age by counting the number of observations that fall within disrete bins.

To see a box plot instead of distplot, you can change the map parameter like the example below:


g = sns.FacetGrid(titanic, row='survived', col='class')
g.map(sns.boxplot, "age")

You also just plot the histogram referencing the 'hist' type from 'plt' as shown below:


import matplotlib.pyplot as plt
sns.set_style(style='whitegrid')
grid = sns.FacetGrid(titanic, col='survived', row='pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'age', alpha=.5, bins=20)

survival chart

From this you notice that the best chance of survival is in 1st class and the least chance of survival is in 3rd class.

To learn more on variable keyword arguments that are set refer: https://seaborn.pydata.org/generated/seaborn.FacetGrid.html#seaborn.FacetGrid

Applying just the distplot

Although drawing a histogram from Pandas is much easier, Seaborn also provides a wrapper to draw histogram of density plot as shown below:


sns.distplot(titanic_df['age'].dropna(), bins=10, color='g')

In the above example, we dropped all the na values before plotting. If na values are not dropped, then while computing the KDE, it will throw error, even though you may see the plot.

While 'g' stands for green, 'b' stands for blue etc., when set as a value for the color attribute, you can also use the standard hexadecimal colors. Here is an example of using a hex color:

sns.distplot(titanic_df['age'].dropna(), bins=10, color='#FF5733')

Bar chart

You can create a bar plot with pclass on x-axis and y-axis showing the mean survived value with a hue on 'sex' with just one line of code as below:

sns.barplot(x="pclass", y="survived", hue="sex", 
            data=titanic_df).set_title("Class wise survivors")

Output: survival chart

The bar height is equal to the average value of survived for the specific split. We note from the chart that females had better chance of survival in this tragedy, no matter which pclass they belonged to, compared to males in the same class. Females and males in 1st class had a better chance of survival compared to other classes. Hue option enables you to split and show data by an additional categorical value.

You also see the standard error bar (confidence interval) also in these charts. The error bars represents the variability of data and indicates the error or uncertainty in a reported measurement. Error bars can also be expressed in a plus-minus sign (±), representing the upper and lower limit of the error.

Since this is a sample dataset and not the entire population, having a confidence interval bar is appropriate. Pandas calculates that automatically by taking multiple samples to derive the line.

References:

However if this were the entire population data then you would want to remove this error line. To do that, you just add another parameter; errorbar=None and the revised code would be:

sns.barplot(x="pclass", y="survived", hue="sex", data=titanic_df, errorbar=None).set_title("Class wise survivors")

More on CI here: https://towardsdatascience.com/a-very-friendly-introduction-to-confidence-intervals-9add126e714

Cat Plot

To show the relationship between a numerical and one or more categorical variables you can use 'catplot'. To show the categories; 'sex', 'survived', 'pclass' categorical values, with the numerical value 'age', you would use a catplot as shown below:

fig = sns.catplot(x="sex", y="age",
            hue="survived", col="pclass",
            data=titanic_df, kind="bar", errorbar=None,
            height=4, aspect=.7)

fig.set_axis_labels("", "Mean Age")
fig.set_titles("{col_var} {col_name}")

Output: survival chart

You also see a few extra parameters here; height sets the height in inches for each facet; aspect sets the aspect ratio of each facet, so that aspect * size gives the width of each facet in inches. There are many more parameters you can set; reference: https://seaborn.pydata.org/generated/seaborn.catplot.html#seaborn.catplot

Heat Maps

Another very useful function in Seaborn is the ability to generate heat maps. Supposing you want to draw a heat map with correlation co-efficients between all numerical columns for your titanic data and see which of the variables are strongly or weakly correlated, then you would write the below code:


titanic_df = sns.load_dataset('titanic')
corr = pd.DataFrame(titanic_df).corr()
sns.heatmap(corr, vmin=-1, vmax=1, center=0, 
            cmap=sns.diverging_palette(200, 20, n=100))

Output: heatmap

From the heat map we see that there is a negative correlation between survived and pclass. And a slightly positive correlation between survived and fare. It is also interesting to see the other correlations.

Uses in ML

  • While this eBook does not cover ML models, it is still important to note that correlation matrix can be used to select the features to construct your ML model. Weakly correlated predictors to the target variable can be dropped and heavily correlated variables can be selected as features for your model

References:

LM Plot

LM plot is intended as a hybrid to fit regression models across conditional subsets of a dataset. It combines regplot and facetgrid

To see a the correlation between fare and age with a regression line, here is the code:

sns.lmplot(x='age', y='fare', data=titanic_df)

Output: lm plot

CountPlot

With countplot you can get a total count of individual category types. Here is an example. In this example you can get the count of all offense categories in one plot. Note that the labels are rotated and the font size is set to ensure that the labels do not overlap.


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

demolitions = pd.read_csv("https://storage.googleapis.com/mbcc/datasets/Detroit_Demolitions_withColumns.csv")

ax = sns.countplot(x='Contractor Name', 
                   data=demolitions[demolitions.Neighborhood == 'Joy Community'], 
                   palette='Set2');
ax.set(xlabel='Offense Category', ylabel='Frequency')
ax.set_xticklabels(ax.get_xticklabels(), rotation=30, fontsize=10)
ax.set_title("Frequency of Offense") 
ax.figure.set_size_inches(30, 10)

Output:

In the above plot, we have set a slew of parameters; rotation, fontsize etc.. We did this by first getting the handle of the subplot of the figure and then changing the default settings. You can also get the reference to the figure from the subplot. Once you have the figure object, you can change its size by setting the width and height in inches.

Notice that the palette parameter of the countplot is set with a value of 'Set2' which represents a type of colormap. There are more options;

Reference:

To plot pairwise relationships

To plot each pairwise relationships in one graph use pairplot: https://seaborn.pydata.org/generated/seaborn.pairplot.html

Official reference

results matching ""

    No results matching ""