Lecture 4
NumPy (https://numpy.org/) is one of the fundamental Python libraries for scientific computing. In essence, its aim is to make vector and array processing in Python much more efficient. Therefore, it would be your go-to for (numerical) data processing.
Numerical data processing with NumPy can, most often that not, be magnitudes faster than what you can write in Python, even if the operations are the same. This is because NumPy is partly written in C.
For example, if we want to compute the matrix multiplication of two arrays:
A = [[1, 4], [9, 5]] # 2 dimensional 'matrices' A and B
B = [[1, 2], [3, 4]]
C = [[0, 0], [0, 0]] # our result 'pre-allocated' with zeros
for i in range(len(A)):
for j in range(len(B)):
for k in range(len(B)):
C[i][j] += A[i][k] * B[k][j]
The previous example is quite un-weidly. We have to manually loop through the matrices and apply the computation for each element. This can be very slow in Python. NumPy provides a much cleaner and quicker interface:
import numpy as np
A = np.array([[1, 4], [9, 5]])
B = np.array([[1, 2], [3, 4]])
C = A @ B # or np.matmul(A, B)
print(C)
Results:
# => [[13 18]
# => [24 38]]
Before we can use NumPy, we must first install it if its not already. NumPy can easily be installed with one of your package managers of choice. For example, if you want to install via conda:
or with pip:
As we’ve seen previously, we use np.array
to create a numpy array from a Python data type
Results:
# => [[1 2 3]
# => [4 5 6]
# => [7 8 9]]
We’ve created a 3x3 matrix of integers. Note that, out-of-the-box, NumPy doesn’t support ragged arrays (matrices that are not rectangular), so this will not work as you expect:
A numpy array has various attributes that are useful for our numerical computing. Some of these include:
A = np.array([[1, 4], [9, 5]])
print(A.shape) # the shape of the array
print(A.size) # number of elements
print(A.ndim) # number of dimensions
print(A.nbytes) # storage used
print(A.dtype) # data type of elements
Results:
# => (2, 2)
# => 4
# => 2
# => 32
# => int64
In the previous example, the elements in the array we int64
. But normally, we will see float64
. However, there are many other available data types, where each of the different data types affects how much memory is used to represent the data.
https://numpy.org/doc/stable/user/basics.types.html https://numpy.org/doc/stable/reference/arrays.dtypes.html
When creating a NumPy array, NumPy will select what it thinks to be the most appropriate data type. However, we can tell NumPy explicitly what the data type should be with the dtype
argument.
A = np.array([[1, 2], [9, 5]], dtype=np.int8)
print(A)
print(A.dtype)
A = np.array([[1, 2], [9, 5]], dtype=np.float)
print(A)
print(A.dtype)
Results:
# => [[1 2]
# => [9 5]]
# => int8
# => [[1. 2.]
# => [9. 5.]]
# => float64
In some cases, we wish to change the data type of arrays after its creation. For this we use the .astype()
method. This method takes a single argument: the data type you wish to change the array to.
int64
We could change it to a float64
array:
float64
Or float32
:
float32
NumPy also provides us with a number of different functions to create arrays. Instead of doing this:
We could instead use the np.zeros
function, passing a tuple where each element of the tuple describes how many elements should be made in each dimension:
A = np.zeros((2,)) # 1 dimensional
A = np.zeros((2, 2)) # 2 dimensional
A = np.zeros((2, 5, 5)) # 3 dimensional
Another commonly used array creation function is the np.random.randn
function. This creates an array where elements are sampled from a normal distribution.
Results:
# => [[-0.68213848 -0.44274759]
# => [ 0.6748596 0.64244208]]
Note the interface is a little different than .zeros
, where instead of passing a tuple, we pass multiple arguments to the function.
It is also convenient to create arrays with ranges of elements.
Results:
# => [5 6 7 8 9]
Results:
# => [ 5. 5.26315789 5.52631579 5.78947368 6.05263158 6.31578947
# => 6.57894737 6.84210526 7.10526316 7.36842105 7.63157895 7.89473684
# => 8.15789474 8.42105263 8.68421053 8.94736842 9.21052632 9.47368421
# => 9.73684211 10. ]
There are many more ways to create arrays. Some include:
np.ones
- a matrix of 1’snp.eye
- an identity matrixnp.diag
- create a matrix with supplied elements across the diagonalnp.fromfunction
- load elements from the return of a functionnp.fromfile
- load elements from a data fileThough, the best resource for understanding is NumPy’s own documentation on the subject: https://numpy.org/doc/stable/user/basics.creation.html
In native Python, when we have a ‘matrix’ like data structure (just a list of lists), and we want to access a particular element from this matrix, we have to do something like:
Results:
# => 3
However, in NumPy, we seperate the indexes by comma:
Results:
# => 3
If we wanted to get all elements from the 2nd column we would use the :
notation. For example:
Results:
# => [2 4]
Likewise, all elements from the 2nd row:
Results:
# => [3 4]
Note that when we slice an array, we are not copying the elements:
Results:
# => [[ 1 10]
# => [ 3 4]]
Any modification you make to the b
variable will also affect A
. For that we must use .copy()
If we have a multi-dimensional array, to which we wish to index on the final dimension, one way to achieve this is by doing the following:
This can get pretty tedious the more that the number of dimensions increases. But! we have one syntactical shortcut at our disposal: the ellipses ‘…’. Using the ellipses in place of the many ‘:’ slices on each dimension, we’re telling NumPy to just take all elements from the prior dimensions. For example:
NumPy arrays can also be composed of boolean elements
Results:
# => [[ True False]
# => [False True]]
And we can also use boolean elements to help with indexing:
Results:
# => [1 5]
Therefore we can apply computations to only part of the array using this indexing feature:
Results:
# => [[11 -1]
# => [ 0 15]]
After an array has been created, we can modify its structure/shape using various functions. The first we shall look at is .reshape
. For example, let us create a vector of 4 elements and then reshape it into an array of 2x2 elements. Of course, the new shape of the array must be proportional to the original number of elements: 2x2 elements = 4 elements.
A = np.arange(1, 5)
mat_A = A.reshape(2, 2)
print(mat_A)
print(A) # A is not changed! No need for copy
Results:
# => [[1 2]
# => [3 4]]
# => [1 2 3 4]
If we wanted to take a 2d array and reshape it into a vector, we could of course use the .reshape
function again. But we could also use .flatten
.
Results:
# => [1 2 3 4]
When specifying the new dimensionality of the reshaped array, -1
is a shortcut to specify the dimensionality to allow reshaping to occur correctly. For example:
Results:
# => [1 2 3 4]
# => [[1 2]
# => [3 4]]
We’re telling NumPy to create an array with 2 elements on the 1st dimension, and then however many elements on the second dimension.
We can add and remove dimensions using .expand_dims
and .squeeze
, respectively.
Results:
# => [1 2 3 4]
# => [[1]
# => [2]
# => [3]
# => [4]]
We are taking a vector and adding a dimension. Note that we have to use np.expand_dims
passing the object we want to expand and not A.expand_dims
.
We can use an indexing trick with None
to do the expansion in just the same way:
Results:
# => [1 2 3 4]
# => [[1]
# => [2]
# => [3]
# => [4]]
Where None
indicates to NumPy where we want to add the new dimension.
If we want to instead remove a dimension, we can use .squeeze()
Results:
# => [1 2 3 4]
We are removing the 2nd dimension, but note that the elements must be singletons. So you cannot squeeze a 2x2 array.
Another useful feature is the matrix transpose:
Results:
# => [[1 2]
# => [3 4]]
# => [[1 3]
# => [2 4]]
or even:
Results:
# => [[1 3]
# => [2 4]]
If we have multiple arrays we want to ‘join’ together, we can use np.hstack
for horizontally joining, or np.vstack
for vertically joining arrays. Note the dimensions must match in the direction your stacking.
[1 2 3 4 5 6]
Results:
# => [[1 2 3]
# => [4 5 6]]
hstack
and vstack
can be useful when the required output shape is simply defined. However, there is a more general function - np.concatenate
- that will be more often useful to us.
[1 2 3 4 5 6]
Here we see that we can achieve the same result as np.hstack
using concatenate
. Notice also that there is a second argument to the concatenate
function: the dimension upon which the concatenation will take place.
https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html
We have already seen some basic examples of arithmetic operations in NumPy. But its worth looking at these in detail.
One of the best reasons to use NumPy is that the computations are vectorized and can be broadcast. We’ll see examples of what these mean.
Results:
# => [[ 1 4 9]
# => [ 4 10 18]]
We can perform vector and matrix arithmetic using Python’s infix operators like +
, *
, etc.
When we perform arithmetic operations, NumPy will convert the data into arrays for us. While this can help, its not best practice for vectors and matrices, for scalars it will be fine.
Results:
# => [[ 1 4 9]
# => [ 4 10 18]]
When we are working with singletons or scalar values, NumPy will automatically perform the broadcasting for us. So for example, if we want to double each element of an array:
Results:
# => [[ 2 4 6]
# => [ 8 10 12]]
NumPy will automatically broadcast the scalar 2
to every element of the shape and size of B
.
NumPy provides, in many cases, both infix and function operations.
Operation | Infix | Function |
---|---|---|
Addition | + | np.add |
Subtraction | - | np.subtract |
Multiplication | * | np.multiply |
Division | / | np.divide |
Matrix Multiplication | @ | np.matmul |
Power | ** | np.power |
Cos/Tan/Sin | np.cos, np.tan, np.sin | |
Square root | np.sqrt | |
Exponential, Logarithm | np.exp, np.log |
There are a number of different operations one can perform on a matrix. Such as the dot product of two matrices:
Results:
# => [ 7 10]
The inner product:
Results:
# => [ 5 11]
One mystical function is the einsum
function. This function can effectively replace other functions like dot
and inner
but it takes some understanding on how it works. einsum
is the application of Einstein Summation, a succinct method of describing the multiplication between matrices. Lets first look at an example of the outer product:
Results:
# => [ 7 10]
Or the inner product:
Results:
# => [ 5 11]
In einsum
we are giving a letter for each dimension of each array we pass to the function.
So with: 'i,ij->j'
for the inner product of matrices A and B, we are saying that the first dimension of A (its only dimension) is labelled i, while for B the dimensions are labelled as i and j respectively. The labels that exist in both sequences are summed over.
Einsum can take a little time to fully understand and appreciate, but it can be a very powerful function with a very succinct syntax.
https://www.youtube.com/watch?v=CLrTj7D2fLM - Khan Academy - Einstein Summation Convention
Lets say you have some function that computes the square of a number:
Results:
# => 16
As the function is simple, it takes one argument and returns one argument, we can pass a NumPy array and will get the correct result.
Results:
# => [ 1 4 9 16 25 36 49 64 81]
However, if the function is more complicated, it will not work.
def myfunc(a, b):
"Return a-b if a>b, otherwise return a+b"
if a > b:
return a - b
else:
return a + b
print(myfunc(A, 2))
Results:
# => Traceback (most recent call last):
# => File "<stdin>", line 1, in <module>
# => File "/tmp/pyqVNaN0", line 3, in <module>
# => File "/tmp/babel-jHhWMz/python-nKlyRH", line 8, in <module>
# => print(myfunc(A, 2))
# => File "/tmp/babel-jHhWMz/python-nKlyRH", line 3, in myfunc
# => if a > b:
# => ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
To allow us to use this function over an array, we can use the np.vectorize
function to create a new function, which applies myfunc
over each element.
Results:
# => [3 4 1 2 3 4 5 6 7]
Here we pass the function we want to vectorize myfunc
to the np.vectorize
function. The return of this function is another function!
We’ve only scratched the surface of what NumPy can offer us! One of the best starting points for learning about NumPy is NumPy’s own user guide on the web: https://numpy.org/doc/stable/user/index.html