Data-Driven Stories

So far, you have learned techniques for deriving metrics and visualizations in bite-sized solutions. You are also advised to read our EDA eBook for a basic understanding of the concepts of Exploratory Data Analysis (EDA). In this lesson you will learn the art of weaving a Data-Driven Story by putting it all together. Although the procedure is given as a series of steps, your data analysis may not align exactly with these steps. As long as you derive meaningful insights from your dataset, your analysis is complete, with or without the steps given below.

Start every step with an initial question. Plot self-explanatory diagrams with meaningful labels. Conclude the step with your own conclusion based on the results you observe.

Note: Steps 4 to 8 can be done in any order. You may omit some steps altogether if there is no relevant data for that step. Step 3, data cleaning, may be revisited again and again as you discover more dirty values during the analysis in other steps. You may also add more steps based on your dataset's unique makeup. If you have data from multiple sources, consider merging them first and then continuing with your analysis.

Step 1. Data Ingestion

For the most part, open datasets are available in a tabular format. You can readily load a .csv, .json, or any other file format using one of the Pandas reader functions, as described in the File Reading and Writing chapter.
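As a minimal sketch (the file name listings.csv is hypothetical), ingestion could look like this:

    import pandas as pd

    # Load a CSV file into a DataFrame (file name is hypothetical)
    df = pd.read_csv('listings.csv')

    # Other formats have analogous readers, e.g.:
    # df = pd.read_json('listings.json')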

Step 2. Bird's Eye View

When a dataset is given to you to derive insights, the first step is to get an overview of your data.

Some of the fundamental questions about tabular data are:

  • How many rows are present?
  • How many columns are present?
  • What are the datatypes of each column?

You can get the answers to all the above questions by applying the following methods on a Pandas DataFrame, as detailed in the chapter Summary Statistics:

  • info()
  • describe()
  • head()
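Continuing the sketch with the hypothetical df from Step 1, an overview could be obtained as follows:

    # Number of rows and columns
    print(df.shape)

    # Column names, non-null counts, and datatypes
    df.info()

    # Summary statistics of the numeric columns
    print(df.describe())

    # First five rows
    print(df.head())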

Step 3. Cleaning the Data

Now it is time to start one round of cleaning by fixing the column names, filling in null or empty values, etc., as detailed in the chapter Data Cleaning.
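A minimal cleaning sketch, assuming hypothetical column names such as city and name:

    # Normalize column names: trim, lowercase, replace spaces with underscores
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

    # Fill missing values in a categorical column with a placeholder
    df['city'] = df['city'].fillna('Unknown')

    # Drop rows that are still missing a key column
    df = df.dropna(subset=['name'])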

For the next set of steps, you can follow these guidelines:

  • Ask some initial questions (1-2 max) and start analyzing the data to answer those specific questions. Asking good questions is the hardest part; it comes with practice. Start by plotting diagrams based on the data types, and that may open doors for some questions.
  • Have some possible hunches in mind for your answers, then find the actual answers in the data. Look at multiple independent variables (these could be other column values, or you may have to gather additional data to supplement your initial data) that could be contributing to the answers to your questions.
  • Find some initial results, and assess how confident you are in them. What would make you more confident in the results? Think about whether there are lurking variables you did not consider.
  • Present your findings using single-variable (1d) and multiple-variable (2d) visualization charts.
  • Finally, reflect on your prior hunches versus the actual answers. In almost all cases, you will now have more questions even though you answered your initial ones.
  • Repeat the cycle until you have reasonably understood the data on hand and are able to show your findings through visualizations along the way.
  • List the assumptions you made along the way, the limitations of your findings, and any other extraneous factors that could influence your findings.

Step 4. Numerical Data Analysis

If your dataset has any numerical values, then this step is essential. Note, however, that some columns might be ingested into the DataFrame with a non-numeric data type even though they should have been numeric. In that case, go back to the previous step and convert those column values to numeric. For example, a price containing a '$' and/or ',' is ingested as a non-numeric type and needs to be converted to numeric so that you can perform arithmetic operations on it. If you don't have any numerical values to consider, move on to the next step.
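For instance, a hypothetical price column holding strings like '$1,250' could be converted as follows:

    # Strip '$' and ',' then convert to a numeric dtype;
    # errors='coerce' turns any leftover bad values into NaN
    df['price'] = pd.to_numeric(
        df['price'].str.replace('[$,]', '', regex=True),
        errors='coerce'
    )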

At this point you will have a fair idea of which numerical columns you would like to explore further. Find the outliers, mean, median, distribution, etc., by plotting a Box and Whisker plot and a Histogram, as detailed in the chapter Simple Visualization.
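A sketch using the Pandas plotting interface on the hypothetical price column:

    import matplotlib.pyplot as plt

    # Box and Whisker plot to spot the median, quartiles, and outliers
    df['price'].plot(kind='box', title='Price distribution')
    plt.show()

    # Histogram to see the shape of the distribution
    df['price'].plot(kind='hist', bins=30, title='Price histogram')
    plt.xlabel('Price')
    plt.show()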

You may also want to segment any continuous data by some criteria, if applicable, as detailed in the chapter Simple Queries.
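For example, the hypothetical price column could be segmented into labeled bands with pd.cut (the bin edges are illustrative):

    # Segment the continuous price column into labeled bands
    df['price_band'] = pd.cut(
        df['price'],
        bins=[0, 100, 500, 1000, float('inf')],
        labels=['budget', 'mid', 'premium', 'luxury']
    )
    print(df['price_band'].value_counts())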

Step 5. Categorical Data Analysis

If any of the column values are categorical, you can create bar charts using the Seaborn library. If the category labels are too long, consider a horizontal bar chart. If you see part-to-whole scenarios, consider a stacked bar chart. Refer to the chapter Visualization with Seaborn.
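A sketch with Seaborn, assuming a hypothetical categorical column named city:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Vertical bar chart of counts per category
    sns.countplot(data=df, x='city')
    plt.show()

    # Horizontal variant, better when the labels are long
    sns.countplot(data=df, y='city')
    plt.show()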

Step 6. Time Series Plots

If you see any date/time-related column values, plot a line chart as explained in the chapter Time Series.
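A sketch, assuming hypothetical date and price columns:

    # Parse the date column and use it as a sorted index
    df['date'] = pd.to_datetime(df['date'])
    ts = df.set_index('date').sort_index()

    # Resample to monthly averages and draw a line chart
    ts['price'].resample('M').mean().plot(title='Monthly average price')
    plt.ylabel('Price')
    plt.show()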

Step 7. Geospatial Diagram

If you see any latitude and longitude values, or county, state, or country values worth investigating, apply the techniques detailed in the chapter Geospatial Diagrams.
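Even without a mapping library, a minimal sketch with hypothetical latitude/longitude columns can reveal spatial clusters:

    # Quick spatial scatter; a proper base map needs a geospatial library
    df.plot(kind='scatter', x='longitude', y='latitude', alpha=0.3)
    plt.title('Observation locations')
    plt.show()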

Step 8. Multivariate Relationships

When the given tabular data contains multivariate observations, it is a good idea to look for possible correlation relationships between them. You can first take a subset of the columns of interest and create a correlation heat map, as explained in the chapter Visualization with Seaborn.
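A sketch, assuming a few hypothetical numeric columns of interest:

    # Correlation matrix over a subset of numeric columns
    corr = df[['price', 'rating', 'reviews']].corr()

    # Annotated heat map of the coefficients
    sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
    plt.show()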

Using the heat map as a guide, pick two variables showing a strong correlation and consider plotting a Seaborn lmplot or regplot to show the positively or negatively correlated relationship.
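For a pair the heat map flags as strongly correlated (column names again hypothetical):

    # Scatter plot of two correlated variables with a fitted regression line
    sns.regplot(data=df, x='rating', y='price')
    plt.show()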

Step 9. Insights!

Congratulations! If you made it up to this step, then you have tamed the beast!
By creating the many different diagrams and charts, you will now have a fair idea of your dataset and, hopefully, will have made some interesting discoveries about the data.

However, not all diagrams and analyses will be meaningful. Before each analysis, you started with a simple question and concluded with your observation. In this last step, explain only the meaningful insights you derived, leaving out the irrelevant ones.

In essence, you have arrived at the Explanatory Phase with the help of the Exploratory Phase. In this phase, provide a final conclusion summarizing the salient features of your findings, worthy of weaving into a story, thereby adding your own unique insights on the data!

In the event that nothing interesting emerges from the data, revisit the above steps by asking more questions, getting more data, and digging deeper into your analysis, which might take you down another unique discovery path. Repeat the steps as needed until you are satisfied with your findings and are ready to share them and/or your hypothesis with the whole world!

Remember, however, that many EDA exercises only lead to more analysis and more data collection, with which the hypothesis is then tested using inferential statistical techniques; only then should a business decision be made.

Examples

Every dataset is unique, and hence the steps you take for your analysis will be unique as well. Some datasets need little cleaning because the data is already clean, while others require significant data cleaning before the detailed analysis can begin.

Every analyst's train of thought, and the questions they come up with, is different as well; hence, even if two analysts are given the same dataset, they typically come up with their own unique charts and plots and derive their own unique insights. Even after one analyst completes an analysis, another may see opportunities to enhance it further and derive even more insights. A good analyst can almost always enhance any analysis!

Our students have created amazing analyses on several open datasets. You can find them at: Mbcc Talent Website

Open Datasets

Many governments all over the world are making their data 'open', thereby contributing to the public domain. Anyone can freely use and analyze 'open data' without any restrictions, except, in some cases, adding an attribution statement when publishing on the web. US Govt Data, UK Govt Data, and India Govt Data are some examples of open data.

The reasons for the open data movement are broadly twofold. First, government organizations that receive taxpayers' money are obligated, in many countries, to make non-sensitive information public. Second, people are realizing that there is a wealth of information hidden in the massive amounts of data that organizations have collected over the years; by making it public, innovative private companies and citizens can derive insights from the data and share them with the government, helping it make better-informed decisions for its citizens and the rest of the world.

Below you will find a few other open datasets that you can use for your analysis:

Other API References

Exercise

Challenge yourself to derive insights by applying all the techniques you have learned so far to any open dataset of your choice (one outside the list given above is also fine). Then take it a step further: refer to the internet, enhance your analysis by going above and beyond what you have learned in these eBooks, and come up with your own unique story!
