This article was originally posted by Derrick Mwiti on neptune.ml/blog where you can find more in-depth articles for machine learning practitioners.
Keras metrics are functions that are used to evaluate the performance of your deep learning model. Choosing a good metric for your problem is usually a difficult task:
- you need to understand which metrics are already available in Keras and tf.keras and how to use them,
- in many situations you need to define your own custom metric because the metric you are looking for doesn’t ship with Keras,
- sometimes you want to monitor model performance by looking at charts like the ROC curve or the confusion matrix after every epoch.
Lucky for you, this article explains all that!
Keras metrics 101
In Keras, metrics are passed during the compile stage as shown below. You can pass several metrics at once by putting them in a list.
from keras import metrics
model.compile(loss='mean_squared_error', optimizer='sgd',
              metrics=[metrics.mae,
                       metrics.categorical_accuracy])
How should you choose those evaluation metrics?
Some of them are available in Keras, others in tf.keras. Sometimes you need to implement your own custom metrics.
Let’s go over all of those situations.
Which metrics are available in Keras?
Keras provides a rich pool of inbuilt metrics. Depending on your problem, you’ll use different ones.
Let’s look at some of the problems you may be working on.
Binary classification
Binary classification metrics are used on computations that involve just two classes. A good example is building a deep learning model that tells cats from dogs. We have two classes to predict, and the threshold determines the point of separation between them. binary_accuracy and accuracy are two such functions in Keras.
binary_accuracy, for example, computes the mean accuracy rate across all predictions for binary classification problems.
keras.metrics.binary_accuracy(y_true, y_pred, threshold=0.5)
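To make the role of the threshold concrete, here is a minimal plain-Python sketch of the same computation. It is not the Keras implementation; the function and the sample values are illustrative assumptions.

```python
def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    # Turn each predicted probability into a hard 0/1 decision.
    decisions = [1 if p > threshold else 0 for p in y_pred]
    # Compare every decision with its true label.
    matches = [int(t == d) for t, d in zip(y_true, decisions)]
    return sum(matches) / len(matches)


y_true = [1, 0, 1, 1]
y_pred = [0.9, 0.4, 0.3, 0.6]

acc = binary_accuracy(y_true, y_pred)
print(acc)  # 0.75: the third sample (0.3) falls below the threshold
```

Changing the threshold changes which predictions count as positive, so the same model can score differently at different thresholds.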
The accuracy metric computes the accuracy rate across all predictions. y_true represents the true labels while y_pred represents the predicted ones.
keras.metrics.accuracy(y_true, y_pred)
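The confusion matrix is a table showing the true positives, true negatives, false positives, and false negatives. Keras does not ship a confusion_matrix function in keras.metrics; a common choice is the one from scikit-learn, which expects predicted class labels rather than probabilities.
sklearn.metrics.confusion_matrix(y_test, y_pred)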
In the above confusion matrix, the model made 3305 + 375 correct predictions and 106 + 714 wrong predictions.
You can also visualize it as a matplotlib chart which we will cover later.
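The four cells of a binary confusion matrix can be counted by hand. The sketch below uses plain Python and made-up labels, purely to show how correct and wrong predictions are read off the table.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 class labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
correct = counts["tp"] + counts["tn"]  # the diagonal of the matrix
wrong = counts["fp"] + counts["fn"]    # the off-diagonal cells
print(counts, correct, wrong)
```

The correct predictions sit on the diagonal of the matrix and the wrong ones off the diagonal, which is how the totals quoted above are obtained.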
Multiclass classification
These metrics are used for classification problems involving more than two classes. Extending our animal classification example, you can have three animals: cats, dogs, and bears. Since we are classifying more than two animals, this is a multiclass classification problem.
When the labels are stored as integer class ids, the shape of y_true is the number of entries by 1, that is (n, 1), while the shape of y_pred is the number of entries by the number of classes, (n, c). When the labels are one-hot encoded, y_true has the same shape as y_pred.
The categorical_accuracy metric computes the mean accuracy rate across all predictions.
keras.metrics.categorical_accuracy(y_true, y_pred)
sparse_categorical_accuracy is similar to categorical_accuracy but is used when the targets are sparse, that is, stored as integer class indices instead of one-hot vectors. A great example of this is working with text in deep learning problems such as word2vec. In this case, one works with thousands of classes with the aim of predicting the next word. One-hot encoding y_true here would produce a huge matrix that is almost all zeros, so storing only the index of the correct class is far more compact.
keras.metrics.sparse_categorical_accuracy(y_true, y_pred)
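The difference between the two metrics is only in how the labels are encoded. The following plain-Python sketch (not the Keras code; names and values are illustrative) assumes that sparse targets are integer class ids and shows that both variants give the same score.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])


def categorical_accuracy(y_true_one_hot, y_pred):
    # One-hot labels: the true class is the position of the 1.
    hits = [argmax(t) == argmax(p) for t, p in zip(y_true_one_hot, y_pred)]
    return sum(hits) / len(hits)


def sparse_categorical_accuracy(y_true_ids, y_pred):
    # Integer labels: the true class is the id itself.
    hits = [t == argmax(p) for t, p in zip(y_true_ids, y_pred)]
    return sum(hits) / len(hits)


y_pred = [[0.7, 0.2, 0.1],
          [0.1, 0.3, 0.6],
          [0.2, 0.5, 0.3]]
one_hot_labels = [[1, 0, 0], [0, 1, 0], [0, 1, 0]]
integer_labels = [0, 1, 1]

acc_one_hot = categorical_accuracy(one_hot_labels, y_pred)
acc_sparse = sparse_categorical_accuracy(integer_labels, y_pred)
print(acc_one_hot, acc_sparse)  # both are 2/3: the second sample is misclassified
```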
top_k_categorical_accuracy computes the top-k categorical accuracy rate. We take the k classes with the highest predicted scores and check whether the correct class is among them. If it is, we say that our model was correct.
keras.metrics.top_k_categorical_accuracy(y_true, y_pred, k=5)
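A small plain-Python sketch of the idea, using made-up scores and one-hot labels, shows how the score grows with k. It is an illustration only, not the Keras implementation.

```python
def top_k_categorical_accuracy(y_true_one_hot, y_pred, k=5):
    """Share of samples whose true class is among the k highest scores."""
    hits = 0
    for truth, scores in zip(y_true_one_hot, y_pred):
        true_class = truth.index(1)
        # Class ids ordered from the highest score to the lowest.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if true_class in ranked[:k]:
            hits += 1
    return hits / len(y_pred)


y_true = [[0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
y_pred = [[0.1, 0.5, 0.3, 0.1],
          [0.6, 0.2, 0.1, 0.1],
          [0.4, 0.1, 0.3, 0.2]]

top1 = top_k_categorical_accuracy(y_true, y_pred, k=1)
top2 = top_k_categorical_accuracy(y_true, y_pred, k=2)
print(top1, top2)  # 1/3 and 2/3
```

With k equal to the number of classes the metric is always 1.0, so k is usually kept much smaller than the number of classes.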
Regression
The metrics used in regression problems include Mean Squared Error, Mean Absolute Error, and Mean Absolute Percentage Error. These metrics are used when predicting numerical values such as sales and prices of houses. Check out this resource for a complete guide on regression metrics.
from keras import metrics
model.compile(loss='mse', optimizer='adam',
              metrics=[metrics.mean_squared_error,
                       metrics.mean_absolute_error,
                       metrics.mean_absolute_percentage_error])
How to create a custom metric in Keras?
As mentioned earlier, Keras also allows you to define your own custom metrics.
The function you define has to take y_true and y_pred as arguments and must return a single tensor value. Both arguments are tensors with the float32 data type. Their shape is the number of rows by 1. For example, if you have 4,500 entries the shape will be (4500, 1).
You can use the function by passing it at the compilation stage of your deep learning model.
model.compile(..., metrics=[your_custom_metric])
How to calculate F1 score in Keras (precision, and recall as a bonus)?
Let’s see how you can compute the F1 score, precision, and recall in Keras. We will create it for the multiclass scenario but you can also use it for binary classification.
The F1 score is the harmonic mean of precision and recall. So to calculate F1 we need to create functions that calculate precision and recall first. Note that in the multiclass scenario you need to look at all classes, not just the positive class (which is the case for binary classification).
from keras import backend as K
# Both functions assume y_true is binary or one-hot encoded and y_pred holds probabilities.
def recall(y_true, y_pred):
    # K.round turns the probabilities into hard 0/1 predictions
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    all_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    # K.epsilon() guards against division by zero
    return true_positives / (all_positives + K.epsilon())
def precision(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())
def f1_score(y_true, y_pred):
    # local names differ from the function names so the functions are not shadowed
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    return 2 * ((p * r) / (p + r + K.epsilon()))
The next step is to use these functions at the compilation stage of our deep learning model. We are also adding the Keras accuracy metric that is available by default.
model.compile(..., metrics=['accuracy', f1_score, precision, recall])
Let’s now fit the model on the training set.
model.fit(x_train, y_train, epochs=5)
Now you can evaluate your model and access the metrics you have just created.
# the result names differ from the metric functions so those are not overwritten
(loss,
 accuracy,
 f1, prec, rec) = model.evaluate(x_test, y_test, verbose=1)
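As a sanity check on the numbers that evaluate returns, the relationship between the three metrics can be verified by hand. The sketch below is plain Python with made-up counts of true positives, false positives, and false negatives; the helper name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean from raw counts."""
    p = tp / (tp + fp)          # how many predicted positives are right
    r = tp / (tp + fn)          # how many actual positives are found
    f1 = 2 * p * r / (p + r)    # harmonic mean of the two
    return p, r, f1


p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
arithmetic_mean = (p + r) / 2
print(p, r, f1, arithmetic_mean)
```

Here precision is 0.8 and recall is 0.5, and F1 comes out at about 0.615, below the arithmetic mean of 0.65. The harmonic mean is pulled towards the lower of the two values, which is why F1 penalizes a model that is strong on only one of them.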
Great, you now know how to create custom metrics in Keras.
That said, sometimes you can use something that is already there, just in a different library like tf.keras 🙂
Which metrics are available in tf.keras?
Recently Keras has become a standard API in TensorFlow and there are a lot of useful metrics that you can use.
Let’s look at some of them. Unlike in Keras where you just call the metrics using keras.metrics functions, in tf.keras you have to instantiate a Metric class.
For example:
tf.keras.metrics.Accuracy()
There is quite a bit of overlap between keras metrics and tf.keras. However, there are some metrics that you can only find in tf.keras.
Let’s take a look at those.
tf.keras Classification Metrics
tf.keras.metrics.AUC computes the approximate AUC (Area under the curve) for ROC curve via the Riemann sum.
model.compile('sgd', loss='mse', metrics=[tf.keras.metrics.AUC()])
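To see what approximating the area with a sum over thresholds means, here is a minimal plain-Python sketch. It evaluates the ROC curve on an evenly spaced grid of thresholds and adds up trapezoids between neighbouring points. The function, the grid size, and the sample data are assumptions for illustration; this is not the tf.keras code.

```python
def roc_auc(y_true, y_score, num_thresholds=200):
    """Approximate the area under the ROC curve on a grid of thresholds."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for i in range(num_thresholds + 1):
        threshold = i / num_thresholds
        # Samples scored at or above the threshold are predicted positive.
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))  # (FPR, TPR)
    points.sort()
    # Sum the trapezoids between consecutive points of the curve.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_score)
print(auc)  # approximately 0.75
```

A finer grid of thresholds follows the true curve more closely at the cost of more computation.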
You can use precision and recall that we have implemented before, out of the box in tf.keras.
model.compile('sgd', loss='mse',
              metrics=[tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall()])
tf.keras Segmentation Metrics
tf.keras.metrics.MeanIoU: Mean Intersection-Over-Union is a metric used for the evaluation of semantic image segmentation models. We first calculate the IoU for each class as true_positives / (true_positives + false_positives + false_negatives), and then average the result over all classes.
model.compile(..., metrics=[tf.keras.metrics.MeanIoU(num_classes=2)])
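The per-class calculation can be sketched in a few lines of plain Python. This is an illustration with made-up labels, assuming each class IoU is true positives divided by the sum of true positives, false positives, and false negatives; it is not the tf.keras implementation.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Average the intersection-over-union of every class that occurs."""
    ious = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        union = tp + fp + fn
        if union > 0:  # skip classes that never appear
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 0, 1]

miou = mean_iou(y_true, y_pred, num_classes=2)
print(miou)  # each class has IoU 1/3, so the mean is 1/3
```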
tf.keras Regression Metrics
Just like Keras, tf.keras has similar regression metrics. We won’t dwell on them much but there is an interesting metric to highlight called MeanRelativeError.
MeanRelativeError takes the absolute error for an observation and divides it by a constant. This constant, the normalizer, can be the same for all observations or different for each sample.
Therefore, the mean relative error is the average of the relative errors.
tf.keras.metrics.MeanRelativeError(normalizer=[1, 3, 2, 3])
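The arithmetic behind the metric fits in a short plain-Python sketch. The function name and the sample values are illustrative; the normalizer is the same [1, 3, 2, 3] as above.

```python
def mean_relative_error(y_true, y_pred, normalizer):
    """Average of |y_true - y_pred| / normalizer over all samples."""
    relative = [abs(t - p) / n for t, p, n in zip(y_true, y_pred, normalizer)]
    return sum(relative) / len(relative)


y_true = [1, 3, 2, 3]
y_pred = [2, 4, 6, 8]
normalizer = [1, 3, 2, 3]

mre = mean_relative_error(y_true, y_pred, normalizer)
print(mre)  # absolute errors 1, 1, 4, 5 become 1, 1/3, 2, 5/3; their mean is 1.25
```

Using y_true itself as the normalizer, as done here, expresses every error as a fraction of the true value.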
How to create a custom metric in tf.keras?
In tf.keras you can create a custom metric by extending the keras.metrics.Metric class. To do so you have to override the update_state, result, and reset_state functions:
- update_state() does all the updates to state variables and calculates the metric,
- result() returns the value for the metric from state variables,
- reset_state() sets the metric value at the beginning of each epoch to a predefined constant (typically 0)
class MulticlassTruePositives(tf.keras.metrics.Metric):
    def __init__(self, name='multiclass_true_positives', **kwargs):
        super(MulticlassTruePositives, self).__init__(name=name, **kwargs)
        # state variable that accumulates the count across batches
        self.true_positives = self.add_weight(name='tp', initializer='zeros')
    def update_state(self, y_true, y_pred, sample_weight=None):
        # pick the most probable class for every sample
        y_pred = tf.reshape(tf.argmax(y_pred, axis=1), shape=(-1, 1))
        values = tf.cast(y_true, 'int32') == tf.cast(y_pred, 'int32')
        values = tf.cast(values, 'float32')
        if sample_weight is not None:
            sample_weight = tf.cast(sample_weight, 'float32')
            values = tf.multiply(values, sample_weight)
        self.true_positives.assign_add(tf.reduce_sum(values))
    def result(self):
        return self.true_positives
    def reset_state(self):
        # The state of the metric will be reset at the start of each epoch.
        # Older TensorFlow releases name this method reset_states.
        self.true_positives.assign(0.)
Then we simply pass it at the compile stage:
model.compile(..., metrics=[MulticlassTruePositives()])
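The life cycle of such a stateful metric is easier to follow without the framework. The sketch below is a plain-Python stand-in (the class name and data are hypothetical) that mimics how update_state is called once per batch, result is read at the end of the epoch, and reset_state clears the state before the next one.

```python
class RunningTruePositives:
    """Framework-free stand-in that mirrors the three metric methods."""

    def __init__(self):
        self.true_positives = 0.0

    def update_state(self, y_true, y_pred):
        # y_true holds integer class ids, y_pred holds one row of scores per sample.
        for truth, scores in zip(y_true, y_pred):
            predicted = max(range(len(scores)), key=lambda i: scores[i])
            if predicted == truth:
                self.true_positives += 1.0

    def result(self):
        return self.true_positives

    def reset_state(self):
        self.true_positives = 0.0


metric = RunningTruePositives()
metric.update_state([0, 1], [[0.9, 0.1], [0.2, 0.8]])  # batch 1: both correct
metric.update_state([1, 0], [[0.6, 0.4], [0.7, 0.3]])  # batch 2: one correct
epoch_total = metric.result()

metric.reset_state()  # what happens before the next epoch starts
after_reset = metric.result()
print(epoch_total, after_reset)  # 3.0 0.0
```

Keeping the running total in state is what lets the metric be exact over a whole epoch instead of being an average of per-batch values.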
Performance charts: ROC curve and Confusion Matrix in Keras
Sometimes the performance cannot be represented as one number but rather as a performance chart. Examples of such charts are ROC curve or confusion matrix. In those cases, you may want to log those charts somewhere for further inspection.
To do it you need to create a callback that will track the performance of your model on every epoch end. Then, you can take a look at the improvement in a folder or an experiment tracking tool. So let’s do that.
First, we need a callback that creates ROC curve and confusion matrix at the end of each epoch.
import os
from keras.callbacks import Callback
import matplotlib.pyplot as plt
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc
class PerformanceVisualizationCallback(Callback):
    def __init__(self, model, validation_data, image_dir):
        super().__init__()
        self.model = model
        self.validation_data = validation_data
        os.makedirs(image_dir, exist_ok=True)
        self.image_dir = image_dir
    def on_epoch_end(self, epoch, logs=None):
        # class probabilities for every validation sample
        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]
        y_pred_class = np.argmax(y_pred, axis=1)
        # plot and save confusion matrix
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'confusion_matrix_epoch_{epoch}'))
        plt.close(fig)  # free the figure so memory does not grow every epoch
        # plot and save roc curve
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'roc_curve_epoch_{epoch}'))
        plt.close(fig)
Now we simply pass it to the model.fit() callbacks argument.
performance_cbk = PerformanceVisualizationCallback(
    model=model,
    validation_data=validation_data,
    image_dir='performance_visualizations')
history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[performance_cbk])
You can have multiple callbacks if you want to.
Now you will be able to look at those visualizations as your model trains:
Note:
If you want to log everything to an experiment tracking tool like Neptune, your callback would look a bit different:
from keras.callbacks import Callback
import neptune
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc
import matplotlib.pyplot as plt
neptune.init('jakub-czakon/examples')
neptune.create_experiment('keras-metrics')
class NeptuneLoggerCallback(Callback):
    def __init__(self, model, validation_data):
        super().__init__()
        self.model = model
        self.validation_data = validation_data
    def on_batch_end(self, batch, logs=None):
        for log_name, log_value in (logs or {}).items():
            neptune.log_metric(f'batch_{log_name}', log_value)
    def on_epoch_end(self, epoch, logs=None):
        for log_name, log_value in (logs or {}).items():
            neptune.log_metric(f'epoch_{log_name}', log_value)
        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]
        y_pred_class = np.argmax(y_pred, axis=1)
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        neptune.log_image('confusion_matrix', fig)
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        neptune.log_image('roc_curve', fig)
Notice that you don’t need to create folders for images as the charts will be sent to your tool directly. On the flip side you have to create an experiment to start tracking your runs. Once you have that it is business as usual.
neptune_logger = NeptuneLoggerCallback(model=model,
                                       validation_data=validation_data)
history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[neptune_logger])
You can explore metrics and performance charts in the app.
How to plot Keras history object?
Whenever fit() is called, it returns a History object that can be used to visualize the training history. It contains a dictionary with loss and metric values at each epoch calculated both for training and validation datasets.
For example, let’s extract the 'accuracy' metric and use matplotlib to plot it.
import matplotlib.pyplot as plt
history = model.fit(x_train, y_train,
                    validation_split=0.25,
                    epochs=50, batch_size=16, verbose=1)
# Plot training & validation accuracy values
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
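Since history.history is an ordinary dictionary of lists, it can also be inspected without plotting. The sketch below uses a hand-written dictionary with made-up values in place of a real training run, and a hypothetical helper that finds the epoch with the best validation accuracy.

```python
# Stand-in for history.history: one list per metric, one entry per epoch.
# The numbers are invented for illustration only.
history_dict = {
    "loss": [0.90, 0.60, 0.50, 0.45],
    "accuracy": [0.70, 0.80, 0.84, 0.86],
    "val_loss": [0.80, 0.65, 0.66, 0.70],
    "val_accuracy": [0.72, 0.79, 0.80, 0.78],
}


def best_epoch(history_dict, key="val_accuracy"):
    """Return the 1-based epoch with the highest value of a metric."""
    values = history_dict[key]
    best_index = max(range(len(values)), key=lambda i: values[i])
    return best_index + 1, values[best_index]


epoch, value = best_epoch(history_dict)
print(epoch, value)  # 3 0.8
```

In this made-up run the training accuracy keeps rising while the validation accuracy peaks at epoch 3, the kind of gap the plot above makes visible at a glance.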
Keras Metrics Example
Ok, so you’ve come a long way and learned a bunch. To refresh your memory let’s put it all together in a single example. We’ll start by taking the MNIST dataset and creating a simple fully connected model:
import tensorflow as tf
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# scale pixel values to the [0, 1] range
x_train, x_test = x_train / 255.0, x_test / 255.0
validation_data = x_test, y_test
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
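We’ll create a custom metric, a multiclass F1 score in Keras. Because the MNIST labels are integer class ids, a small helper one-hot encodes y_true first:
from tensorflow.keras import backend as K
def to_one_hot(y_true, y_pred):
    # helper added for this example: integer class ids become one-hot rows as wide as y_pred
    labels = tf.cast(tf.reshape(y_true, [-1]), 'int32')
    return tf.one_hot(labels, depth=tf.shape(y_pred)[-1])
def recall(y_true, y_pred):
    y_true = to_one_hot(y_true, y_pred)
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    all_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (all_positives + K.epsilon())
def precision(y_true, y_pred):
    y_true = to_one_hot(y_true, y_pred)
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())
def f1_score(y_true, y_pred):
    # local names differ from the function names so the functions are not shadowed
    p = precision(y_true, y_pred)
    r = recall(y_true, y_pred)
    return 2 * ((p * r) / (p + r + K.epsilon()))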
We’ll create a custom tf.keras metric, MulticlassTruePositives to be exact:
class MulticlassTruePositives(tf.keras.metrics.Metric):
    def __init__(self, name='multiclass_true_positives', **kwargs):
        super(MulticlassTruePositives, self).__init__(name=name, **kwargs)
        self.true_positives = self.add_weight(name='tp', initializer='zeros')
    def update_state(self, y_true, y_pred, sample_weight=None):
        y_pred = tf.reshape(tf.argmax(y_pred, axis=1), shape=(-1, 1))
        values = tf.cast(y_true, 'int32') == tf.cast(y_pred, 'int32')
        values = tf.cast(values, 'float32')
        if sample_weight is not None:
            sample_weight = tf.cast(sample_weight, 'float32')
            values = tf.multiply(values, sample_weight)
        self.true_positives.assign_add(tf.reduce_sum(values))
    def result(self):
        return self.true_positives
    def reset_state(self):
        # The state of the metric will be reset at the start of each epoch.
        # Older TensorFlow releases name this method reset_states.
        self.true_positives.assign(0.)
We’ll compile the Keras model with our metrics. The labels are integer class ids, so we use the sparse variants of the built-in metrics:
model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy',
                       tf.keras.metrics.sparse_categorical_accuracy,
                       f1_score,
                       recall,
                       precision,
                       tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5),
                       MulticlassTruePositives()])
We’ll implement a Keras callback that plots the ROC curve and the confusion matrix to a folder:
import os
from tensorflow.keras.callbacks import Callback
import matplotlib.pyplot as plt
import numpy as np
from scikitplot.metrics import plot_confusion_matrix, plot_roc
class PerformanceVisualizationCallback(Callback):
    def __init__(self, model, validation_data, image_dir):
        super().__init__()
        self.model = model
        self.validation_data = validation_data
        os.makedirs(image_dir, exist_ok=True)
        self.image_dir = image_dir
    def on_epoch_end(self, epoch, logs=None):
        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]
        y_pred_class = np.argmax(y_pred, axis=1)
        # plot and save confusion matrix
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'confusion_matrix_epoch_{epoch}'))
        plt.close(fig)  # free the figure so memory does not grow every epoch
        # plot and save roc curve
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        fig.savefig(os.path.join(self.image_dir, f'roc_curve_epoch_{epoch}'))
        plt.close(fig)
performance_viz_cbk = PerformanceVisualizationCallback(
    model=model,
    validation_data=validation_data,
    image_dir='performance_charts')
We’ll run training and monitor the performance:
history = model.fit(x=x_train,
                    y=y_train,
                    epochs=5,
                    validation_data=validation_data,
                    callbacks=[performance_viz_cbk])
We’ll visualize metrics from the Keras history object:
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Test'], loc='upper left')
plt.show()
You can monitor and explore your experiments in a tool like TensorBoard or Neptune. You just need to add another callback or modify the one you have created before:
TensorBoard
from tensorflow.keras.callbacks import TensorBoard
tensorboard_cbk = TensorBoard(log_dir="logs/training-example/")
history = model.fit(..., callbacks=[performance_viz_cbk,
                                    tensorboard_cbk])
With TensorBoard you need to start a local server and explore your runs in the browser.
tensorboard --logdir logs/training-example/
Neptune
import neptune
neptune.init('jakub-czakon/examples')
neptune.create_experiment('keras-metrics')
class NeptuneLoggerCallback(Callback):
    def __init__(self, model, validation_data):
        super().__init__()
        self.model = model
        self.validation_data = validation_data
    def on_batch_end(self, batch, logs=None):
        for log_name, log_value in (logs or {}).items():
            neptune.log_metric(f'batch_{log_name}', log_value)
    def on_epoch_end(self, epoch, logs=None):
        for log_name, log_value in (logs or {}).items():
            neptune.log_metric(f'epoch_{log_name}', log_value)
        y_pred = np.asarray(self.model.predict(self.validation_data[0]))
        y_true = self.validation_data[1]
        y_pred_class = np.argmax(y_pred, axis=1)
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_confusion_matrix(y_true, y_pred_class, ax=ax)
        neptune.log_image('confusion_matrix', fig)
        fig, ax = plt.subplots(figsize=(16, 12))
        plot_roc(y_true, y_pred, ax=ax)
        neptune.log_image('roc_curve', fig)
neptune_logger = NeptuneLoggerCallback(model=model,
                                       validation_data=validation_data)
history = model.fit(..., callbacks=[neptune_logger])
Check this example experiment run if you are interested:
Final Thoughts
Hopefully, this article gave you some background on model evaluation techniques in Keras.
We’ve covered:
- built-in metrics in Keras and tf.keras,
- implementation of your own custom metrics,
- how you can visualize custom performance charts as your model is training.
For more information check out the Keras Repository and TensorFlow Metrics documentation.
Happy training!