scikit-learn is a standard library for machine learning in Python. It describes itself like this:

Machine Learning in Python
•Simple and efficient tools for data mining and data analysis
•Accessible to everybody, and reusable in various contexts
•Built on NumPy, SciPy, and matplotlib
•Open source, commercially usable - BSD license

scikit-learn offers a wide selection of machine learning algorithms, as well as validation methods and tools for hyperparameter optimization and feature selection.

scikit-learn can be installed with pip: `pip install scikit-learn`. In code, the library is imported under the name `sklearn`. Different algorithms are located in different modules.
For example, the k-NN classifier is located in the `neighbors` module:

from sklearn.neighbors import KNeighborsClassifier

General

Every classifier (and every regressor) has
- a `fit(X, y)` function to fit the model to the data
- a `predict(X)` function that delivers an array of predictions

The `fit(X, y)` function expects a matrix with the examples as rows and the attributes as columns, as well as a vector that contains the label of each example.

The `predict(X)` function only expects the attribute matrix.

Passing a pandas DataFrame as `X` usually works fine, but it occasionally causes problems, in which case you need to supply a NumPy array instead. You can get one from a DataFrame via the `values` attribute: `matrix = df.values` (newer pandas versions recommend the equivalent `df.to_numpy()`).
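As a minimal sketch of the `fit`/`predict` interface described above, the following uses a tiny made-up dataset (the column names `height` and `width` and all values are assumptions chosen only for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy data, invented for illustration: 6 examples with 2 attributes each
df = pd.DataFrame({
    "height": [1.0, 1.2, 0.9, 3.0, 3.2, 2.9],
    "width": [1.1, 0.8, 1.0, 3.1, 2.9, 3.0],
})
y = np.array(["small", "small", "small", "large", "large", "large"])

# Convert the DataFrame to a plain NumPy array (rows = examples, columns = attributes)
X = df.to_numpy()
print(X.shape)

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)  # fit() takes the attribute matrix and the label vector

X_new = np.array([[1.0, 1.0], [3.0, 3.0]])
pred = clf.predict(X_new)  # predict() takes only the attribute matrix
print(pred)
```

The same two calls work for any other classifier or regressor in scikit-learn, which is what makes swapping models easy.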

Validation with a Test Set

from sklearn.model_selection import train_test_split

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)

Stratified sampling

When sampling data like we are doing, it is possible that, by random chance, the classes in the test set are distributed differently than those in the training set, especially when we are dealing with small sample sizes. We would like the training and test data to have similar distributions so that we can make good estimates of the performance.

In Scikit-Learn, instead of regular sampling, we can do stratified sampling by passing `stratify=y`. Scikit-Learn then makes sure that the distribution of classes is (as close as possible to) the same in both sets.
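The effect of stratification can be checked with a small sketch. The labels below are made up (80 examples of class 0 and 20 of class 1), and the attribute matrix is just a placeholder:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented, imbalanced labels: 80% class 0, 20% class 1
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)  # placeholder attribute matrix

# Regular random split: class proportions may differ between the two sets
_, _, y_tr_plain, y_te_plain = train_test_split(
    X, y, test_size=0.4, random_state=42
)
print("plain:     ", y_tr_plain.mean(), y_te_plain.mean())

# Stratified split: class proportions are preserved in both sets
_, _, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
train_frac = y_tr.mean()  # fraction of class 1 in the training set
test_frac = y_te.mean()   # fraction of class 1 in the test set
print("stratified:", train_frac, test_frac)
```

Because the labels are 0 and 1, the mean of each label vector is simply the share of class 1 in that set.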

Fit and predict

clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_pred[:5])
The result is a NumPy array. Scikit-Learn works with NumPy under the hood: although it accepts pandas objects as input, functions like `predict` still return NumPy arrays.

Computing the Accuracy

The accuracy is the fraction of correctly classified examples. Like the other performance measures, it is implemented in the `metrics` submodule:

from sklearn import metrics

accuracy = metrics.accuracy_score(y_test, y_pred)
print(accuracy)
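To make the definition concrete, here is a small sketch that computes the accuracy by hand and compares it with `accuracy_score`. The two label vectors are invented for illustration:

```python
import numpy as np
from sklearn import metrics

# Invented true labels and predictions: 4 out of 5 are correct
y_true = np.array([0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 0, 0, 1])

# Fraction of positions where prediction and true label agree
manual_accuracy = np.mean(y_true == y_hat)
library_accuracy = metrics.accuracy_score(y_true, y_hat)

print(manual_accuracy, library_accuracy)
```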

Interpreting the Accuracy (Baseline)

The value above is the accuracy, i.e., the fraction of correctly classified examples. We aim for a high accuracy (good predictions!).

The highest possible accuracy is 1, the lowest is 0. Is our result good?
•That depends on the number of classes and the class distribution. For example, if one class makes up 80% of the data and you always predict that class, you already reach an accuracy of 0.8.

Therefore, create a baseline: compare your model against a default classifier that serves as a very simple baseline.

For example:
from sklearn.dummy import DummyClassifier
# DummyClassifier always predicts majority class (= most frequent class)
dummy = DummyClassifier(strategy='most_frequent')

# used in same way as the k-NN classifier
dummy.fit(X_train, y_train)
dy_pred = dummy.predict(X_test)
print(metrics.accuracy_score(y_test, dy_pred))

Confusion Matrix

The accuracy alone can be deceiving, so what else can we do? The confusion matrix shows us what kind of mistakes the model made (which classes have been confused with each other):

cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)

Here the rows denote the true classes, and the columns the predicted classes. Let's make it a bit more readable by putting it in a DataFrame and setting the row and column index:

(The clf object has an attribute `classes_` that contains the class labels seen during fitting.)

import pandas as pd

cm_df = pd.DataFrame(cm, index=clf.classes_, columns=clf.classes_)
cm_df
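How to read the rows and columns can be seen in a tiny sketch with invented labels ("cat" and "dog" are placeholders, not classes from the dataset used here):

```python
import pandas as pd
from sklearn import metrics

# Invented labels for illustration
y_true = ["cat", "cat", "dog", "dog", "dog"]
y_hat = ["cat", "dog", "dog", "dog", "cat"]
labels = ["cat", "dog"]

# Rows are true classes, columns are predicted classes, both in the order of `labels`
toy_cm = metrics.confusion_matrix(y_true, y_hat, labels=labels)
toy_cm_df = pd.DataFrame(toy_cm, index=labels, columns=labels)
print(toy_cm_df)
```

In this toy case, the row "dog" reads: of the three true dogs, one was predicted as "cat" and two as "dog". The diagonal holds the correct predictions.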

If you have many classes, a heat map is often a nice visualization. Let's pretend that we have a lot of classes and use the Seaborn implementation:

import seaborn as sns
sns.heatmap(cm_df, annot=True, cbar=False, square=True)

Precision, Recall

•Recall of a class X: the fraction of examples from class X that are classified correctly. "What percentage of class X do you find?"
•Precision of a class X: the fraction of examples predicted as X that really belong to X. "If you predict X, how certain are you that it really is an X?"
•There is a tradeoff between the two: predicting X more readily typically raises recall but lowers precision.

https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c
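Both measures are available in the `metrics` submodule. The sketch below uses invented binary labels in which class 1 plays the role of "class X":

```python
from sklearn import metrics

# Invented labels: three true examples of class 1, five of class 0
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 1, 1, 0, 0, 0]

# For class 1: 2 true positives, 1 false negative, 2 false positives
recall = metrics.recall_score(y_true, y_hat, pos_label=1)        # 2 / (2 + 1)
precision = metrics.precision_score(y_true, y_hat, pos_label=1)  # 2 / (2 + 2)
print(recall, precision)

# Per-class overview of precision and recall in one table
print(metrics.classification_report(y_true, y_hat))
```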

Overfitting

If all went well, you probably noticed that setting k too low resulted in relatively poor performance. This is due to overfitting.
Overfitting means that your model is capturing noise in the training data, and not generalizing well to new data (your test data). This is the result of a model that is too complex/flexible, making it very easy for the model to fit to noise in the training data. The opposite of this problem is underfitting, which means the model is not flexible/complex enough to capture the pattern in the data.
Different kinds of models have different levels of flexibility/complexity. But within a model, the level of complexity can usually be influenced by the hyperparameter settings.
In the case of k-NN, the parameter k (n_neighbors) controls the complexity:

lower k: more flexibility, and therefore higher chance of overfitting
higher k: less flexibility, and therefore a higher chance of underfitting

Getting the best model is a matter of finding the sweet spot between the two that results in the best fit.
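One way to look for that sweet spot is to compare training and test accuracy for several values of k. The sketch below assumes the iris data bundled with scikit-learn (`load_iris`) and an arbitrary list of k values; neither choice comes from the text above:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

scores = {}
for k in [1, 5, 15, 50]:  # arbitrary values, chosen for illustration
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # score() returns the accuracy on the given data
    scores[k] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for k, (train_acc, test_acc) in scores.items():
    print(f"k={k:>2}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A large gap between training and test accuracy hints at overfitting, while low accuracy on both hints at underfitting.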

Preprocessing

We discussed k-NN having problems with "uneven" axes, but we didn't deal with that problem yet. If we don't do anything about it, then features with higher values will have more influence on the results. For example, if we were modeling people and had information about their age and their yearly income, then yearly income would have a much bigger influence on the result than age, because the values are much bigger.
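The age and income example can be made concrete with a few invented people. The numbers are assumptions chosen only to show the effect on Euclidean distance:

```python
import numpy as np

# Hypothetical people: (age in years, yearly income in euros)
a = np.array([25, 40_000])
b = np.array([65, 40_500])  # very different age, similar income
c = np.array([26, 45_000])  # almost the same age, different income

dist_ab = np.linalg.norm(a - b)
dist_ac = np.linalg.norm(a - c)
print(dist_ab, dist_ac)
```

Here person b ends up much "closer" to a than c does, even though b is 40 years older: the income axis dominates the distance.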

StandardScaler

In Scikit-Learn, the StandardScaler standardizes your features by subtracting their mean and dividing by their standard deviation. This alleviates the problem that k-NN can have with "uneven" axes. It is a common preprocessing step for many algorithms.
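The standardization can be reproduced by hand with NumPy, as in this sketch on a small invented matrix (columns: age, yearly income):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented data: columns are age and yearly income
X = np.array([
    [25, 40_000],
    [65, 40_500],
    [26, 45_000],
    [40, 60_000],
], dtype=float)

# Subtract the column mean and divide by the column standard deviation
# (np.std uses the population standard deviation by default, as StandardScaler does)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```

After scaling, every column has mean 0 and standard deviation 1, so age and income contribute on a comparable scale.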

Transformers

The StandardScaler is an example of a transformer in sklearn.

Similar to predictors, transformers have a fit() function that learns the transformation parameters. For the StandardScaler, these are the mean and standard deviation of every column.
To actually perform the transformation, the transform() function is used.
The fit_transform() function is a shortcut for first calling fit() and then transform() on the same data.

A small inconvenience is that transform() returns a NumPy array instead of a pandas DataFrame.
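A possible workaround, sketched here on an invented DataFrame, is to wrap the returned array in a new DataFrame that reuses the original column names and index. The learned parameters can be inspected through the `mean_` and `scale_` attributes of the fitted scaler:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented data for illustration
df = pd.DataFrame({
    "age": [25.0, 65.0, 26.0, 40.0],
    "income": [40_000.0, 40_500.0, 45_000.0, 60_000.0],
})

scaler = StandardScaler()
scaler.fit(df)                      # learns mean and standard deviation per column
print(scaler.mean_, scaler.scale_)  # the learned transformation parameters

arr = scaler.transform(df)          # a NumPy array by default
df_scaled = pd.DataFrame(arr, columns=df.columns, index=df.index)
print(df_scaled)
```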

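import pandas as pd
from sklearn.preprocessing import StandardScaler

# load the data and separate the attributes from the label
df_iris = pd.read_csv("../data/iris.csv")
X = df_iris.drop("species", axis=1)
y = df_iris.species

# learn mean and standard deviation per column, then scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)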

Information Leaks in Split Validation

Do not split an already transformed dataset into a training set and a test set.

•In split validation (and also in cross-validation, covered later) we need to be very careful about information leaking from the test data into the training data.
•Normalizing the whole dataset uses all examples to calculate the scaling parameters, which effectively leaks information from the test set into the training set.
•Solution: calculate the standardization parameters on the training part only.
•Best practice: use pipelines (covered later).

So the solution is to first fit the scaler on the training set, and then apply it to the test set *(just like you would do with a predictive model)*.

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
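The pipeline approach mentioned as best practice could look roughly like the sketch below. It assumes the iris data bundled with scikit-learn (`load_iris`) rather than the CSV file used earlier, and uses `make_pipeline` to chain the scaler and the classifier:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only,
# then applies the same scaling before every prediction
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print(test_accuracy)

# The scaler inside the pipeline only ever saw the training set
fitted_mean = pipe.named_steps["standardscaler"].mean_
print(np.allclose(fitted_mean, X_train.mean(axis=0)))
```

Because scaling and classification are bundled in one object, it is hard to accidentally fit the scaler on test data.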

Python Knowledge Center

Monday, 1 April 2019

Scikit-Learn