There are three open-source modules, not built into Python, that are often referred to as the "scientific software stack": numpy, matplotlib, and scikit-learn.
Numpy provides a class for representing vectors, matrices and higher-order tensors, and functions for performing linear algebra operations.
Many algorithms and models can be represented using matrix operations.
Matrix operations are easily vectorisable and thus scale well. Representing problems using matrices might be unintuitive at the beginning, but once you get the hang of it, you can apply your skills to various problems.
The convention is to import numpy under the shorthand name np:
import numpy as np
All tensors use the ndarray class:
x = np.array([1, 2, 3])
x
x + 2
We could just use Python lists of (lists of ...) numbers to represent tensors. But this would be slow:
Numpy stores tensors as arrays in C, and uses highly-optimized BLAS libraries for tensor operations. Anything performance-critical (especially loops!) is implemented in C rather than Python. So it is as fast as C, with the convenience of Python.
Let's compare performance:
x = range(10000)
%timeit [v + 2 for v in x]
x = np.arange(10000)
%timeit x + 2
Of course this requires us to express our algorithms without resorting to for loops. Looping over a numpy array is as bad as looping over a list:
x = np.arange(10000)
%timeit np.array([v + 2 for v in x]) # bad idea
Let's start with a simple vector again:
x = np.array([1, 2, 3])
Every numpy array has a dimensionality, shape and data type:
x.ndim
x.shape
x.dtype
Indexing works as for lists and strings:
x
x[0]
x[::-1]
A matrix can be constructed from a list of rows:
x = np.array([[1, 2, 1], [2, 3, 2]])
x
print(x.ndim, x.shape, x.dtype)
Indexing can now take a specification for each dimension:
x[0, 0]
First row:
x[0]
First column (: is shorthand for "everything"):
x[:, 0]
Transposing a matrix:
x.T
We can also create tensors just by giving their shape:
np.zeros((1, 2))
np.ones((2, 2, 2))
np.random.rand(3, 2)
Transposing a higher-order tensor:
x = np.zeros((3, 4, 5))
x = x.transpose(0, 2, 1) # new order of the three dimensions
x.shape
You may have noted that sometimes, numbers end with a dot (.) -- this denotes floating point numbers.
np.array([1, 1])
np.array([1, 1], dtype=np.float32)
The data type can be specified in any array-constructing function. Important data types include np.float32, np.float64 (a.k.a. np.double), np.int32, np.int64, np.uint8 and bool.
Conversions are done using .astype (which creates a copy even if the type stays the same) or np.asarray (which only copies if needed):
x = np.zeros(3, dtype=bool)  # np.bool was removed in recent numpy; use the builtin bool
x
x.astype(np.double)
np.asarray(x, dtype=int)  # likewise, use the builtin int instead of the removed np.int
Shapes can be changed if the total number of elements stays the same.
x = np.arange(6).reshape(2, 3)
x
x.ravel()
x.reshape(3, 2)
x.reshape(-1, 2, 1) # -1 is inferred to match
Slicing, reshaping and transposition create views into the same underlying memory, not copies of the data. Changing data through a view affects the original array:
x = np.zeros(5)
y = x[::2] # a view of every second element
print(y)
y += 3
print(x)
(If it helps: A numpy array stores a pointer to the first item in a block of memory, the size of each dimension, and the stride of each dimension telling how far apart entries in that dimension are stored in memory. You can access them as x.data, x.shape, and x.strides. Reshaping, taking subtensors, transposing and reversing can all be expressed by changing the pointer, size or stride, without copying any memory. In general, if no memory copy is needed, numpy will avoid it.)
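To see this in action, we can inspect the strides (given in bytes):
x = np.zeros((3, 4))
x.strides  # (32, 8): rows are 32 bytes apart, neighboring elements 8
x.T.strides  # (8, 32): transposing just swaps the strides, no data is copied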
We can also select elements from an array using a boolean mask, or an array of indices. This is called fancy indexing. It creates a copy, not a view:
x = np.arange(10)
y = x[x > 5]
print(y)
y *= 0
print(x)
However, by assigning directly to the fancy-indexed expression, we can modify the selected elements in place:
print(x)
x[x > 5] = 1
print(x)
(If it helps: This statement calls x.__setitem__; it does not construct x[x > 5] and then assign to it.)
Most operations on arrays are applied elementwise:
x = np.arange(6).reshape(2, 3)
x * 5
They work as long as the two operands have compatible shapes. That is, either the same shape:
x + np.ones((2, 3))
Or the same shape, but with some dimensions of size 1:
x + np.arange(3).reshape(1, 3)
x + np.arange(2).reshape(2, 1)
Dimensions of size 1 ("singleton dimensions") are implicitly replicated as needed. This is called broadcasting. If the dimensionality of the operands is not the same, dimensions of size 1 are prepended on the left.
x
y = np.array([10, 20, 30])
x + y # broadcasting: add y to every row
(If it helps anybody, this is the same as using bsxfun in Matlab/Octave.)
Since missing dimensions are always added on the left, a one-dimensional array is always interpreted as a row vector. To get a column vector, we need to make it two-dimensional.
y = np.array([10, 20])
y[:, np.newaxis] # view of y with a new dimension at the end
x + y[:, np.newaxis] # add column vector to every column
We can also add a row vector and a column vector to get a matrix:
x = np.arange(3)
y = np.arange(2)
x + y[:, np.newaxis]
We saw that * just does an elementwise multiplication. For a dot product, we use np.dot or the @ operator:
x = np.ones((2, 3))
y = np.ones((4, 3))
x @ y.T
Often we need to compute sums, means or other statistics over parts of an array. This is called a reduction operation, and can be applied to a subset of dimensions:
x = np.random.randn(10, 20, 30)
y = x.mean(axis=0)
y.shape
Negative axes are counted from the end, as in array indexing:
y = x.mean(axis=-1)
y.shape
We can keep the dimension we reduce over as a singleton dimension:
y = x.mean(axis=1, keepdims=True)
y.shape
This is especially useful when combined with broadcasting.
x -= x.mean(axis=(1, 2), keepdims=True) # subtract mean over last two dims
Other reduction operations include std, var, sum, median, min, and max. Most can be accessed both as methods of an array (x.min()) and as module functions (np.min(x)); some are only available as the latter.
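For example, median is available only as a module function:
x = np.arange(6)
np.median(x)
# x.median() would raise an AttributeError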
To fully benefit from the speed of numpy, we must express everything as vectorized operations. For example, let's take the following matrix:
x = np.arange(15).reshape(3, 5)
x
And assume that from each row, we want to subtract the smallest element. It is tempting to write this as:
for row in x:
    m = row.min()
    row -= m
print(x)
But this can be done entirely without a Python loop. We can compute the minima of all rows in one go:
x = np.arange(15).reshape(3, 5)
m = x.min(axis=1, keepdims=True)
print(m)
And then subtract them from the corresponding rows, thanks to broadcasting:
x -= m
print(x)
It takes a while to get the hang of array-oriented programming, but then it is very powerful.
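As another small exercise in the same style, here is row-wise normalization so that each row sums to 1 (assuming no row is all zeros):
x = np.random.rand(3, 5)
x /= x.sum(axis=1, keepdims=True)  # broadcasting: divide each row by its sum
x.sum(axis=1)  # all approximately 1.0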
The native format for saving a numpy array is a .npy file:
x = np.arange(15).reshape(3, 5)
np.save('smallmatrix.npy', x)
It can be loaded again with:
x = np.load('smallmatrix.npy')
x
For .npy files too large to fit into your RAM, you can map them directly from disk to memory:
x = np.load('smallmatrix.npy', mmap_mode='r') # 'r' for readonly
x
Multiple arrays can be combined into a .npz file, which is a disguised ZIP archive of .npy files:
np.savez('matrices.npz', x=np.arange(5), y=np.arange(10))
They are also loaded with np.load, giving access to all arrays stored within:
data = np.load('matrices.npz')
print(data.files)
data['x']
data['y']
You can also read text files with np.loadtxt() and write them with np.savetxt(), and read and write Matlab files with the scipy.io.matlab module.
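A minimal text-file round trip (the file name is just an example):
x = np.arange(15).reshape(3, 5)
np.savetxt('smallmatrix.txt', x)
np.loadtxt('smallmatrix.txt')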
The most widespread module for plotting data and creating figures for papers in Python is matplotlib. It nicely integrates with notebooks via the %matplotlib inline magic function. Let's enable that and import it:
%matplotlib inline
import matplotlib.pyplot as plt
pyplot is an easy interface for matplotlib, imported as plt by convention. It relies on numpy for the data representation.
x = np.linspace(0, 10, 100)
y = np.cos(x)
plt.plot(x, y, label='low')
plt.plot(x, 2 * y, color='r', linestyle=':', label='high')
plt.legend(loc='upper right')
data = {'apples': 2, 'pasta': 0, 'fennel': 30}
xpos = [0, 1, 2]
plt.bar(x=xpos, height=list(data.values()))
plt.xticks(xpos, list(data.keys()))
plt.suptitle('remaining stock')
x = np.random.randn(40, 50)
x += np.arange(50)
plt.matshow(x)
plt.colorbar()
The matplotlib tutorial includes a guide on choosing color maps for a particular purpose.
fig, axs = plt.subplots(1, 2, sharey=True, figsize=(8, 3))
axs[0].plot(20 - np.random.rand(20) - np.arange(20))
axs[1].plot(20 - np.random.rand(20) - np.arange(20))
axs[0].set_title('orange dataset')
axs[1].set_title('apple dataset')
plt.plot(np.random.rand(20))
plt.suptitle('randomness')
plt.savefig('myplot.pdf', transparent=True, bbox_inches='tight')
If you start the interactive console with ipython --matplotlib, you will get interactive plotting: the first plotting command opens a figure in a separate window, and all subsequent plotting commands update it.
In a Python script, call plt.show() whenever you want to display the figure(s) you created. Or just call plt.savefig() and do not show anything on screen.
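A minimal script might look like this (the file name is just an example):
import matplotlib.pyplot as plt
plt.plot([1, 2, 4, 8])
plt.savefig('growth.png')  # write to disk...
plt.show()  # ...or display on screen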
These were just the basics -- see the matplotlib gallery or the matplotlib reference for more plotting goodness.
If you want to dig deeper, also look at pandas for data analysis and visualization.
scikit-learn is a Python package of machine learning algorithms.
Installation with pip:
sudo pip3 install scikit-learn
Installation with conda:
conda install scikit-learn
Check if it can be imported:
import sklearn
Datasets are represented as numpy arrays of input features, such as:
rng = np.random.RandomState(23) # reproducible random generator
x = rng.randn(100, 2) # 100 data points of 2 features
x.shape
And, in a supervised setting, corresponding target classes or values:
y = (np.linalg.norm(x, axis=-1) > 1).astype(int)  # builtin int instead of the removed np.int
y.shape
scikit-learn provides classes for preprocessing data, and for classification, regression, clustering and dimensionality reduction.
Generally, these classes provide a fit() method for training, and predict() or transform() for application to new data.
In addition, scikit-learn provides utilities for model selection, evaluation metrics and dataset loading.
Above, we created a toy dataset of 2D points. Visualized:
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='Set2')
We'll do a train/test split:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
x, y, test_size=0.20, random_state=42)
To obtain a validation set, you'd further split the training data.
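For example (the split ratio is arbitrary):
x_train2, x_val, y_train2, y_val = train_test_split(
    x_train, y_train, test_size=0.25, random_state=42)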
Let's fit a classifier to it. Scikit-learn includes many of those. We'll try Gaussian Naive Bayes:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
Hyperparameters would be passed to the constructor.
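For example, GaussianNB has a var_smoothing hyperparameter (we keep the default gnb from above):
gnb_smooth = GaussianNB(var_smoothing=1e-8)  # default is 1e-9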
We can now train the classifier:
gnb.fit(x_train, y_train)
And apply it to our test set:
y_pred = gnb.predict(x_test)
Let's plot the result (along with the known ground truth decision boundary):
plt.scatter(x_test[:, 0], x_test[:, 1], c=y_pred, cmap='Set2')
plt.gca().add_artist(plt.Circle((0, 0), 1.0, fill=False, ec='r', ls=':'))
scikit-learn also includes evaluation metrics for classification:
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred)
metrics.confusion_matrix(y_test, y_pred)
print(metrics.classification_report(y_test, y_pred,
target_names=['inside', 'outside']))
Instead of a single train/test or train/validation split, we can do cross-validation:
from sklearn.model_selection import cross_val_score
gnb = GaussianNB()
scores = cross_val_score(gnb, x_train, y_train, cv=5) # 5 folds
print(scores)
print('%.3f ± %.3f' % (scores.mean(), scores.std()))
This will use the default scoring function of the model (here, accuracy). See the documentation for more examples.
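For example, to score with F1 instead of accuracy:
f1_scores = cross_val_score(gnb, x_train, y_train, cv=5, scoring='f1')
scikit-learn also provides preprocessing transformers. For example, StandardScaler standardizes each feature to zero mean and unit variance: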
from sklearn.preprocessing import StandardScaler
scl = StandardScaler()
scl.fit(x_train)
x_train_normalized = scl.transform(x_train)
x_test_normalized = scl.transform(x_test)
We can plot this again:
plt.scatter(x_train_normalized[:, 0], x_train_normalized[:, 1],
            c=y_train, cmap='Set2')
Again, see the documentation for more examples.
Transformers and estimators can be chained into a Pipeline, so that the whole chain behaves like a single model:
from sklearn.pipeline import Pipeline
scl = StandardScaler()
gnb = GaussianNB()
pipeline = Pipeline([('transformer', scl), ('estimator', gnb)])
Then use this as if it were a plain model:
scores = cross_val_score(pipeline, x_train, y_train, cv=5) # 5 folds
print(scores)
print('%.3f ± %.3f' % (scores.mean(), scores.std()))
The scores are exactly the same as before, because this particular model estimates the mean and standard deviation of the data anyway, so standardizing the inputs does not change its predictions.
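A quick sanity check of that claim, reusing the scaler output from above (the predictions should agree, up to numerical effects):
pred_plain = GaussianNB().fit(x_train, y_train).predict(x_test)
pred_scaled = GaussianNB().fit(x_train_normalized, y_train).predict(x_test_normalized)
(pred_plain == pred_scaled).all()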
For more examples and documentation, see the scikit-learn user guide and API reference.