NumPy and Pandas Crash Course for Data Analysts
Introduction
In data analytics you rarely work with single numbers; you almost always work with a collection of data. In the simplest case, you will have a collection of numbers or strings. Python provides the list structure to hold collections. Python lists are heterogeneous by nature, which means every element of a Python list can be of a different data type. Although this feature adds flexibility, it comes at the cost of performance. If your collection of data is all of the same type, Python lists will not only take more space but also more time to perform simple operations like calculating the mean or standard deviation, compared to the data structures provided by Python libraries built for statistical operations.
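To make this concrete, here is a short sketch (the numbers are made up purely for illustration) showing a heterogeneous list, and the standard-library statistics module computing a mean and standard deviation over a plain homogeneous list:

```python
import statistics

# A Python list can mix types freely (heterogeneous)
mixed = [1, "two", 3.0, True]
print([type(x).__name__ for x in mixed])  # ['int', 'str', 'float', 'bool']

# With a homogeneous list of numbers, basic statistics are possible via
# the standard library, but every element is still a full Python object
nums = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.mean(nums))    # 5
print(statistics.pstdev(nums))  # 2.0 (population standard deviation)
```

This works, but each element carries Python-object overhead, which is exactly the cost the specialised libraries avoid.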
When you are dealing with extremely large amounts of data of the same type, Python lists are not a great choice. Instead you should use NumPy arrays or Pandas Series. A Pandas Series is also the building block of a Pandas DataFrame, which is meant for data represented in table format. Both NumPy and Pandas are open-source libraries written specifically with data analysis in mind. In this book you will learn some of the important aspects of using these modules. To succeed with this eBook you should have basic Python programming knowledge. If not, please follow our Python eBook before you start this one.
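As a quick sketch of the same statistics using NumPy and Pandas (the sample numbers are invented for illustration):

```python
import numpy as np
import pandas as pd

# A NumPy array stores elements of a single dtype in a contiguous block,
# so vectorised operations run in compiled code rather than a Python loop
arr = np.array([2, 4, 4, 4, 5, 5, 7, 9])
print(arr.mean())   # 5.0
print(arr.std())    # 2.0 (NumPy defaults to the population std, ddof=0)

# A Pandas Series wraps such an array and adds an index (labels)
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9], name="scores")
print(s.mean())     # 5.0
```

Note one difference worth remembering: `Series.std()` defaults to the sample standard deviation (ddof=1), while NumPy's `arr.std()` defaults to the population version.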
No analysis is complete without some form of diagram, be it a bar chart, scatter plot, histogram etc. In this eBook you will also learn how to create simple diagrams using the Matplotlib and Seaborn libraries.
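As a taste of what is to come, here is a minimal Matplotlib sketch. The data points are hypothetical, and the Agg backend is selected only so the script runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; renders to files, not a window
import matplotlib.pyplot as plt

# Hypothetical data purely for illustration
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 71, 80, 88]

fig, ax = plt.subplots()
ax.scatter(hours, scores)
ax.set_xlabel("Hours studied")
ax.set_ylabel("Test score")
ax.set_title("A simple scatter plot")
fig.savefig("scatter.png")  # writes the chart to an image file
```

In a Jupyter notebook you would normally skip the `matplotlib.use("Agg")` line and the chart would render inline below the cell.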
If you have followed our Python eBook, you will already have installed Anaconda on your machine. In that case, both NumPy and Pandas are already installed along with JupyterLab, and you can readily create a notebook and use the packages right off the bat.
The other option is to install Python, NumPy, Pandas, Seaborn and JupyterLab separately. The recommendation is to download Anaconda so that you have all the necessary packages already available and can focus on analysing data.
One other very convenient option is to create your notebook on Google's cloud platform for Jupyter notebooks, Colab. Details are on this page: https://ebooks.mobibootcamp.com/python/cloudnotebook.html
Google's Colab environment already has many of the libraries commonly used by data scientists and analysts. However, if you need a package which is not available in the Colab environment, you can install it using the command line directive '!' as shown below:
!apt install proj-bin libproj-dev libgeos-dev
!pip install https://github.com/matplotlib/basemap/archive/v1.1.0.tar.gz
Updating Libraries: If you need to update an already installed library, add -U to the install command. For example, if you want to update the Plotly library that is already installed on Colab, run the command below to get the latest version:
!pip install -U plotly
Using bash
If all the statements in a code cell have a command line directive, then you can use %%bash instead. Here is an example:
%%bash
ls
pwd
apt install proj-bin libproj-dev libgeos-dev
Package Managers pip and apt & Jupyter NB plugins
- pip is used to download and install packages directly from the PyPI repository. PyPI is hosted by the Python Software Foundation.
- apt is used to download and install packages from the Ubuntu repositories, which are hosted by Canonical. Canonical only provides packages for selected Python modules, whereas PyPI hosts a much broader range of Python modules.
- There are some cool plugins for Jupyter Notebook which help improve your productivity. Refer to this article: https://towardsdatascience.com/jupyter-notebook-extensions-517fa69d2231