NumPy
Over the years, statisticians and programmers have built many libraries that make routine calculations (such as the mean, median, and standard deviation of a set of numbers) and data operations (such as matrix addition and multiplication) not only simple but also highly scalable and efficient. One very popular library of this kind is NumPy.
NumPy Array
NumPy is best known for its N-dimensional array data structure and the ease with which complex operations can be performed on these arrays. It provides powerful functions that operate efficiently on the array object to calculate statistical and mathematical metrics.
By using these structures and their methods, you will not only eliminate the for-loops needed to iterate over a basic Python list but also be able to work on several times more data with the same hardware. In this lesson, you will learn some common operations that can be performed on them.
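As a small illustration of the point about for-loops, the sketch below computes the same mean twice, once with an explicit Python loop and once with an array method. The variable names are only illustrative, and the example says nothing about timing.

```python
import numpy as np

values = [15, 35, 55]

# Plain Python: accumulate the sum with an explicit loop.
total = 0
for v in values:
    total += v
loop_mean = total / len(values)

# NumPy: the iteration happens inside the library, not in your code.
array_mean = np.array(values).mean()

print(loop_mean)   # 35.0
print(array_mean)  # 35.0
```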
While Python lists can hold heterogeneous data, a NumPy array can hold only one type of data. Here is an example of constructing a 1-dimensional array:
import numpy as np
a = np.array([15, 35, 55])
print(type(a))
print(a)
Output:
<class 'numpy.ndarray'>
[15 35 55]
In the above example, you first import the NumPy module by adding the statement:
import numpy as np
This statement directs the Python interpreter to load the NumPy module, which must already be installed on your computer. It also tells the interpreter to make the module available under the alias np. Although you can choose any alias you like, the convention is to use np, and it is also conventional to import with an alias rather than writing a bare 'import numpy'.
Now that NumPy is imported, you can use the dot operator on np to call the array function, which creates a NumPy array that is then assigned to the variable a. If you print the type of a, you will see that it is numpy.ndarray. You can also print all the elements of an array by passing it to the print function, just as you would a Python list.
Important attributes of ndarray
- ndarray.ndim: The number of axes (dimensions) of the array.
- ndarray.shape: This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the number of axes, ndim.
- ndarray.size: The total number of elements of the array. This is equal to the product of the elements of shape.
- ndarray.dtype: The data type of the elements in the array. You can specify the standard Python data types or use the additional data types provided by NumPy, such as numpy.int32, numpy.int16, and numpy.float64.
print(a.dtype) # Prints the datatype of items stored
print(a.ndim) # Prints the array dimension
print(a.shape) # Prints the array shape
print(a.size) # Prints the array size
Output:
int64
1
(3,)
3
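To see how these attributes relate for a matrix, here is a minimal sketch using a hypothetical array m with 2 rows and 3 columns. It checks that the length of shape equals ndim and that size is the product of the entries of shape.

```python
import numpy as np

# A matrix with n = 2 rows and m = 3 columns.
m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(m.ndim)   # 2
print(m.shape)  # (2, 3)
print(m.size)   # 6

# The length of shape is ndim; size is the product of shape.
print(len(m.shape) == m.ndim)              # True
print(m.shape[0] * m.shape[1] == m.size)   # True
```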
Notice that NumPy chose int64 as the data type for these integers (the default integer type can differ by platform and NumPy version). To store the values as complex numbers instead, specify dtype=complex as shown below:
a = np.array([15, 35, 55], dtype=complex)
print(a)
Output:
[15.+0.j 35.+0.j 55.+0.j]
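The same dtype argument accepts the NumPy-specific types listed earlier. The sketch below requests numpy.int16 explicitly and then converts the array with astype, a method not covered in the text above and shown here only as one possible way to change the type of an existing array.

```python
import numpy as np

# Request a compact integer type instead of the default.
small = np.array([15, 35, 55], dtype=np.int16)
print(small.dtype)     # int16

# astype returns a converted copy; the original keeps its dtype.
as_float = small.astype(np.float64)
print(as_float.dtype)  # float64
print(as_float)        # [15. 35. 55.]
```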
Important functions for statistics
So far you have constructed an array, which is meant to hold numbers in less memory than a Python list, but you have not yet seen any functions that operate on it. Here are some useful statistical functions that you can apply to an array of numbers:
scores = np.array([92, 34, 88, 80, 73, 100, 100])
print(scores.mean()) # Prints the mean of the array elements
print(scores.max()) # Prints the max value in the array
print(scores.min()) # Prints the min value in the array
print(scores.argmin()) # Prints the index number of the min value
print(scores.std()) # Prints the standard deviation of the array elements
Output:
81.0
100
34
1
21.2670099987
If there are NaN values in the dataset, you can instead use 'np.nanmax(scores)', 'np.nanstd(scores)', etc., which ignore the NaN values while calculating the metrics. Otherwise, a single NaN value in the array is enough to make the result NaN.
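A minimal sketch of the difference, using a hypothetical array with one missing score:

```python
import numpy as np

scores_with_gap = np.array([92, 34, np.nan, 80])

# The regular method propagates the NaN.
print(scores_with_gap.max())        # nan

# The nan-aware functions skip it.
print(np.nanmax(scores_with_gap))   # 92.0
print(np.nanmin(scores_with_gap))   # 34.0
```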
Although you could have used Python's built-in functions to find the max and min values in a list, NumPy provides these as methods of the array itself, along with methods for the mean and standard deviation.
To find the median and mode, you can use the np.median function and SciPy's stats module. Here is an example:
np.median(scores)
from scipy import stats
stats.mode(scores)
Output:
88.0
ModeResult(mode=array([100]), count=array([2]))
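The exact formatting of ModeResult may differ between SciPy versions. If SciPy is not available, one possible way to get the mode with NumPy alone is np.unique with return_counts=True. This is a sketch rather than a replacement for stats.mode; when several values tie, it returns the smallest of them.

```python
import numpy as np

scores = np.array([92, 34, 88, 80, 73, 100, 100])

# Sorted distinct values and how often each occurs.
values, counts = np.unique(scores, return_counts=True)

# The most frequent value is the one with the largest count.
mode_value = values[counts.argmax()]
mode_count = counts.max()

print(mode_value, mode_count)  # 100 2
```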
Can you count how many loops we avoided by using NumPy? You will see more of these convenience methods, which eliminate multiple loops, in the next lesson.
Arrays of different data types
Here are some more examples of creating homogeneous arrays of different types.
print(np.array([1.0, 1.5, 2.0, 2.5]).dtype) # Float type
print(np.array([True, False, True]).dtype) # Boolean type
print(np.array(['AL', 'AK', 'AZ', 'AR', 'CA']).dtype) # Unicode 2 characters
Output:
float64
bool
<U2
Array accessing and slicing
Element access and slicing work on arrays much as they do on Python lists. Here are some examples:
countries = np.array([
"US", "CA", "MX"
])
print(countries[0]) # Prints out the first element of the array
print(countries[1:]) # Get elements from index 1 (included) to the end of the array
print(countries[:1]) # Get elements from the beginning to index 1 (excluded)
print(countries[0:2]) # Get elements from 0 index to 2 index (excluded)
print(countries[:]) # Get all elements
Output:
US
['CA' 'MX']
['US']
['US' 'CA']
['US' 'CA' 'MX']
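Because the syntax follows Python lists, negative indices and a step value should also work. The sketch below assumes the same countries array as above.

```python
import numpy as np

countries = np.array(["US", "CA", "MX"])

print(countries[-1])    # MX  (last element)
print(countries[::2])   # ['US' 'MX']  (every second element)
print(countries[::-1])  # ['MX' 'CA' 'US']  (reversed)
```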
Slicing in multidimensional arrays
You can convert a 1-dimensional array to a 2-dimensional one with the reshape function. Here is an example:
x = np.reshape(np.arange(0,12), (3,4))
x
Output:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
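You can also reshape an existing array in place by assigning to its shape attribute. The example below uses a separate variable, y, so that x keeps the 3-row, 4-column layout used in the slicing examples that follow:
y = np.array([2, 3, 5, 8, 0, 1])
y.shape = (3,2)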
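For comparison, the sketch below shows the function form of reshape next to the equivalent array method, and uses -1 to let NumPy infer one dimension from the total size. The method form and the -1 placeholder are not covered in the text above; the variable names are illustrative.

```python
import numpy as np

flat = np.arange(12)

by_function = np.reshape(flat, (3, 4))  # function form
by_method = flat.reshape(3, 4)          # equivalent method form
inferred = flat.reshape(3, -1)          # -1: infer the column count (12 / 3 = 4)

print(by_method.shape)  # (3, 4)
print(inferred.shape)   # (3, 4)
print(flat.shape)       # (12,)  reshape did not alter the original
```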
With a multidimensional array, you can slice by giving index values for each dimension. Here are some examples:
Example | Output | Comments |
---|---|---|
x[1] | array([4, 5, 6, 7]) | gets the second row completely |
x[1:] | array([[ 4, 5, 6, 7], [ 8, 9, 10, 11]]) | gets all the rows starting from the second |
x[:2] | array([[0, 1, 2, 3], [4, 5, 6, 7]]) | gets all the rows up to index 2, excluding 2 |
x[:,2] | array([ 2, 6, 10]) | gets all the rows of the third column |
x[:,(1,3)] | array([[ 1, 3], [ 5, 7], [ 9, 11]]) | gets all the rows of the two columns listed after the : |
x[(2):,(1,3)] | array([[ 9, 11]]) | gets the two given columns (1,3) of the third row (2) |
Note: Adding a comma after the ':' changes the selection completely! x[:2] selects the first two rows, whereas x[:,2] selects the third column of every row.
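The difference the comma makes can be checked directly. This sketch rebuilds the 3-row, 4-column array and uses a list for the column indices, which selects the same columns as the tuple form in the table.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

first_two_rows = x[:2]      # rows 0 and 1, all columns
third_column = x[:, 2]      # every row, column 2 only
two_columns = x[:, [1, 3]]  # every row, columns 1 and 3

print(first_two_rows.shape)  # (2, 4)
print(third_column)          # [ 2  6 10]
print(two_columns.shape)     # (3, 2)
```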
Using ellipsis to select all dimensions except the innermost one
In NumPy you can use an ellipsis (...) to take full slices (:) of all the outer dimensions and select only specific indices of the innermost dimension, given after the ellipsis. Here is an example:
a = np.arange(16).reshape(2, 2, 2, 2)
a[...,0]
Output:
array([[[ 0,  2],
        [ 4,  6]],

       [[ 8, 10],
        [12, 14]]])
The above slice selects all indices of the outer dimensions and only the 0th index of the innermost dimension.
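Put differently, the ellipsis stands in for as many full slices as are needed. The following sketch confirms that, for this 4-dimensional array, a[..., 0] matches the form with every ':' written out.

```python
import numpy as np

a = np.arange(16).reshape(2, 2, 2, 2)

with_ellipsis = a[..., 0]
with_slices = a[:, :, :, 0]  # the same selection spelled out

print(np.array_equal(with_ellipsis, with_slices))  # True
print(with_ellipsis.shape)                         # (2, 2, 2)
```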
Logical Operators
You can apply logical expressions to array data. Here are some examples, where x is the 3-row, 4-column array created with np.arange above:
Example | Output | Comments |
---|---|---|
x[x>9] | array([10, 11]) | gets all array elements that satisfy the given expression |
np.any([x>2]) | True | True if at least one element anywhere in the array satisfies the expression |
np.all([x>2], axis=1) | array([[False, False, False, True]]) | the list brackets add a leading axis of length 1, so axis=1 runs over the rows of x and gives one result per column |
np.any([x>2], axis=0) | array([[False, False, False, True], [ True, True, True, True], [ True, True, True, True]]) | axis=0 is the length-1 axis added by the list brackets, so the result is the element-wise mask x>2 |
x[np.logical_and(x > 2, x < 9)] | array([3, 4, 5, 6, 7, 8]) | element-wise 'and' of two conditions, used to select elements |
np.logical_or(x < 4, x > 6) | array([[ True, True, True, True], [False, False, False, True], [ True, True, True, True]]) | element-wise 'or' that returns a boolean for each element |
Similarly, logical_not and logical_xor apply the not and xor operators respectively.
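The sketch below passes the boolean mask to np.all and np.any directly, without the surrounding list brackets used in the table, so the axis numbers refer to the axes of x itself. It also tries logical_not and logical_xor on the same 3-row, 4-column array. The variable names are illustrative.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)
mask = x > 2

# axis=0 collapses the rows: one result per column.
per_column = np.all(mask, axis=0)
# axis=1 collapses the columns: one result per row.
per_row = np.any(mask, axis=1)

print(per_column)  # [False False False  True]
print(per_row)     # [ True  True  True]

# logical_not flips every boolean in the mask.
print(x[np.logical_not(mask)])  # [0 1 2]

# logical_xor is True where exactly one condition holds;
# only the value 3 satisfies both x < 4 and x > 2.
only_one = np.logical_xor(x < 4, x > 2)
print(x[only_one])  # [ 0  1  2  4  5  6  7  8  9 10 11]
```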
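Note:
You cannot combine element-wise conditions with Python's 'or' and 'and' keywords, as in the example given below
x[(x>9) or (x<2)]
This raises a ValueError because 'or' needs a single truth value, and NumPy cannot tell whether an array of booleans should count as true when any element is true or only when all of them are. For an element-wise result, use np.logical_or as in the table above. If you want a single boolean for the whole array, use 'any' or 'all' as shown below: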
np.any([x>9,x<2])
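To select the matching elements rather than get one boolean, the conditions can be combined element-wise. This sketch uses the | operator on parenthesized conditions, which gives the same mask as np.logical_or.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)

# Element-wise "or": parentheses are required around each condition.
mask = (x > 9) | (x < 2)
selected = x[mask]

print(selected)      # [ 0  1 10 11]
print(np.any(mask))  # True  (single boolean for the whole array)
```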
NumPy arrays can only be homogeneous
Unlike a Python list, a NumPy array cannot hold heterogeneous data. Here is an attempt at constructing one; notice how the data types are converted:
mixed_array = np.array(['1', 2, np.nan])
print(mixed_array)
print(type(mixed_array[1]))
Output:
['1' '2' 'nan']
<class 'numpy.str_'>
Note that all values are converted to strings.
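The same kind of conversion happens with purely numeric input: NumPy picks one type that can represent every value. The sketch below shows integers being promoted to floats, and, as an aside not covered above, that dtype=object keeps the original Python objects at the cost of losing the homogeneous numeric storage.

```python
import numpy as np

# Integers mixed with a float are all stored as floats.
ints_and_floats = np.array([1, 2.5])
print(ints_and_floats.dtype)  # float64

# np.nan is a float, so it also forces a float array.
with_nan = np.array([1, 2, np.nan])
print(with_nan.dtype)         # float64

# dtype=object stores references to the original Python objects.
as_objects = np.array(['1', 2, np.nan], dtype=object)
print(type(as_objects[1]))    # <class 'int'>
```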
Convenience functions commonly used
Example | Output | Comments |
---|---|---|
np.arange(2) | array([0, 1]) | similar to Python range, except that you get an ndarray |
np.arange(0, 1, 0.3) | array([0. , 0.3, 0.6, 0.9]) | same as above, but with explicit start, stop, and step values; the stop value itself is excluded |
np.zeros(2) | array([0., 0.]) | gets an array of zeroes of the given size |
np.zeros((2,3)) | array([[0., 0., 0.], [0., 0., 0.]]) | gets a multidimensional array of zeroes of the given shape |
np.ones(2) | array([1., 1.]) | same as zeros, but filled with 1's; a shape tuple creates a multidimensional array here too |
x = np.array([3, 4]); np.any(x>0) | True | returns True if the expression is true for any element |
np.random.default_rng().random(2) | array([0.xxxxxxx, 0.xxxxxx]) | returns an ndarray of the given size with uniformly distributed floats from 0 (included) to 1 (excluded) |
np.random.default_rng().integers(2, 10) | x | returns 'x', a random integer from 2 (included) to 10 (excluded) |
np.random.default_rng().standard_normal(10) | array(10 floats) | returns an ndarray of the given size with floats drawn from the standard normal distribution (mean 0, standard deviation 1) |
Note: The legacy functions in np.random, such as np.random.random, should be avoided in favor of a generator created with default_rng. Refer: https://numpy.org/doc/stable/reference/random/index.html
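In practice you would usually create the generator once and reuse it. The sketch below does that and passes a seed, which is an assumption added here to make the run repeatable; the seed value 42 is arbitrary.

```python
import numpy as np

# One generator, reused for every draw; the seed makes runs repeatable.
rng = np.random.default_rng(seed=42)

uniform = rng.random(2)           # floats in [0, 1)
die = rng.integers(2, 10)         # one integer in [2, 10)
normal = rng.standard_normal(10)  # mean 0, standard deviation 1

print(uniform.shape)  # (2,)
print(normal.shape)   # (10,)

# A second generator with the same seed repeats the same first draw.
repeat = np.random.default_rng(seed=42).random(2)
print(np.array_equal(uniform, repeat))  # True
```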
Reference
- Reference for more math functions: https://numpy.org/doc/stable/reference/routines.math.html