Evaluating and Iterating in Model Development

Malavika Unnikrishnan · Published in The Startup · Jun 23, 2020

Imagine you have a remote that controls your home theatre system. Normally, the remote has several different buttons, each controlling a specific function: one to increase or decrease the volume, one to switch the sound output from the direct speakers to surround sound, another to switch to radio, and maybe one to tune the radio. What if this weren't the case, and each button affected several different functions at once, say increasing the volume by 0.8x, shifting the radio's tuning by 0.6x, and turning the bass up to the maximum? This would prevent us from tuning the different factors that contribute to a perfect home theatre experience, or at least make it very tough.

The same idea applies to machine learning, where it is called orthogonalization: different factors are modified separately for optimum results.

Orthogonalization can be defined as the art of making logical modifications to different parameters in order to obtain a desired result.

In machine learning, we have a chain of assumptions that we aim to satisfy, treating each separately.

Assumption 1 — Fit the training set well on the cost function — make sure the model performs well on the training set. Performance on the training set is usually expected to be close to human level (since the model's parameters are learned from its values).

If the model does not fit the training set well on the cost function, we try to fix this by -

  1. Training a bigger neural network
  2. Switching to a better optimization algorithm (Adam, etc.)

Assumption 2 — Fit the dev set well on the cost function — if the model performs well on the training set, we then check whether it performs well on the dev set too.

If the model performs well on the training set but does not fit the dev set well on the cost function, we can say that it has overfitted the training set. The possible solutions are -

  1. Get a bigger training set
  2. Implement different types of regularization based on specifics of the problem

Assumption 3 - Fit the test set well on the cost function - after the model satisfies both assumptions above, it is then expected to perform well on the test set.

If the model performs poorly on the test set, it may be overfitted to the dev set; increasing the size of the dev set may help.

Assumption 4 - Performs well in real life - after the model has performed well on the train, dev and test (TDT) data, the final test is to observe whether it does well in real-life applications.

If the model performs poorly at this final stage, either the dev set distribution does not reflect real-world data and needs to be altered, or the cost function is measuring the wrong values.
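As a rough sketch of this checklist, the chain of assumptions can be turned into a simple diagnostic that compares the errors at each stage and suggests which knob to turn next. The error values and the 2% acceptable gap below are made-up numbers, purely for illustration.

```python
# A minimal sketch of the orthogonalization checklist above.
# All error values and the acceptable gap are assumed, illustrative numbers.

human_level_error = 0.01   # proxy for the best achievable (Bayes) error
train_error = 0.08
dev_error = 0.10
test_error = 0.11

GAP = 0.02  # assumed acceptable gap between consecutive stages

if train_error - human_level_error > GAP:
    print("Fix bias: train a bigger network or switch to a better optimizer (e.g. Adam).")
elif dev_error - train_error > GAP:
    print("Fix variance: get more training data or add regularization.")
elif test_error - dev_error > GAP:
    print("Overfitting the dev set: use a bigger dev set.")
else:
    print("If it still fails in the real world, change the dev set distribution or the cost function.")
```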

Applied machine learning is a highly iterative process, and requires us to run through the Idea-Code-Experiment cycle many times before perfecting a solution. This iteration becomes much harder when one has to prioritize between several different potential evaluation metrics, and hence the best and fastest iteration requires a single value to be used as the model's evaluation metric.

Consider a model that predicts whether or not a person has a particular disease, say cancer.

We will have to deal with four different types of predictions —

  • False Positives (FP): patient incorrectly diagnosed with cancer
  • False Negatives (FN): patient incorrectly diagnosed as cancer-free
  • True Positives (TP): patient correctly diagnosed with cancer
  • True Negatives (TN): patient correctly diagnosed as cancer-free

Here, consider a model that predicts with an accuracy of 96%.

Accuracy = (TP + TN)/(TP + FP + TN + FN)

and hence

Error = (FP + FN)/(TP + FP + TN + FN)
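To make the four outcomes and the accuracy and error formulas concrete, here is a small sketch; the label lists are invented toy data, not from any real diagnosis model.

```python
# Toy sketch: count TP, FP, TN, FN for a binary "has cancer" classifier
# and compute accuracy and error as defined above.
# y_true / y_pred are made-up labels (1 = has cancer, 0 = cancer-free).

y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + fp + tn + fn)
error = (fp + fn) / (tp + fp + tn + fn)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}  accuracy={accuracy:.2f}  error={error:.2f}")
```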

Here, despite the fact that we have a model with good accuracy, we can’t necessarily say that the model is good enough to put to use.

Why?

Consider a case where FP = 0 and all the misclassified cases are FN.

This means that for every 100 patients, 4 (40 in every 1000, and so on) are INCORRECTLY diagnosed as cancer-free, which is highly dangerous and can easily prove fatal, as the patient will not receive the treatment she needs.

Hence, it is important to define a metric that shows us how many of the patients with cancer were correctly diagnosed. This quantity is called recall, and we aim to maximize recall and hence minimize the 'miss' (1 - Recall).

Recall = TP/(TP + FN)

What if instead FN = 0 and all the misclassified cases are FP, with the same error?

This means that for every 100 patients, 4 are misdiagnosed as having cancer, and are likely to spend lakhs on treatment for no reason at all, with the chemotherapy even proving fatal in some cases.

Since we aim to minimize this too, we are required to define another quantity — Precision — which tells us what percentage of the patients diagnosed with cancer actually had the disease, and is given by

Precision = TP/(TP + FP)
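Continuing the toy sketch above (the counts below are the same made-up numbers), precision, recall and the miss rate can be computed directly; scikit-learn's metrics give the same values if that library is available.

```python
# Precision, recall and miss rate from confusion counts.
# tp, fp, fn are assumed to come from the toy example above.

tp, fp, fn = 3, 1, 1

recall = tp / (tp + fn)      # fraction of actual cancer cases we caught
precision = tp / (tp + fp)   # fraction of positive diagnoses that were correct
miss = 1 - recall            # fraction of cancer cases we missed

print(f"recall={recall:.2f}  precision={precision:.2f}  miss={miss:.2f}")

# With scikit-learn, the same quantities come from:
# from sklearn.metrics import precision_score, recall_score
# precision_score(y_true, y_pred), recall_score(y_true, y_pred)
```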

Consider two classifiers A and B:

+---+-----------+--------+------------+
|   | Precision | Recall | F1 Score   |
+---+-----------+--------+------------+
| A | 94%       | 92%    | (apx.) 93% |
+---+-----------+--------+------------+
| B | 93%       | 95%    | (apx.) 94% |
+---+-----------+--------+------------+

[Graph: precision-recall tradeoff (source: machinelearning-blog.com)]

The above graph portrays the precision-recall tradeoff, i.e. normally, when precision increases, recall decreases and vice versa. Why does this happen?

Say you are highly intent on increasing the precision, i.e. you want to be very sure that every person your algorithm diagnoses with cancer really does have cancer. To do so, you would probably impose a more stringent threshold in the classification algorithm, much like an Ivy League college only accepting students that meet its high standards, to ensure that as many of the students it admits as possible make it through college. Here, diagnosing a patient with cancer is analogous to accepting a student into the college, and the patient actually having cancer can be compared to an accepted student making it to graduation.

But on raising its 'standards', the Ivy League college may actually be missing out on several students who, if given the opportunity, would be able to graduate. Similarly, increasing precision makes it more likely for us to wrongly clear a patient and predict that she doesn't have cancer, thus decreasing recall.

Similarly, lowering the threshold to increase the recall will increase the number of false positives and hence decrease the precision.
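This threshold intuition can be sketched in a few lines of code. The predicted probabilities below are invented, but sweeping the decision threshold over them shows precision and recall moving in opposite directions.

```python
# Sketch of the precision-recall tradeoff: sweep the decision threshold on
# made-up predicted cancer probabilities and watch the two metrics move.

y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0]
probs  = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.35, 0.30, 0.20, 0.10]

for threshold in (0.3, 0.5, 0.7, 0.9):
    y_pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

Raising the threshold pushes precision towards 100% while recall falls, and lowering it does the opposite.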

Any good model has both a high precision and a high recall. The optimum value (highest possible) is at the point of intersection of the lines depicted in the graph above.

But while carrying out iterations, we want a single value that will immediately give us an intuition about the direction in which we are to proceed, and here (in the case of the precision-recall tradeoff), this single metric is the F1 score. The F1 score is the harmonic mean of the precision and recall.

F1 Score = harmonic mean of Precision and Recall = 2/(1/Precision + 1/Recall)
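Plugging the precision and recall of classifiers A and B from the table above into this formula reproduces their approximate F1 scores:

```python
# F1 score as the harmonic mean of precision and recall,
# applied to the two classifiers from the table above.

def f1(precision, recall):
    return 2 / (1 / precision + 1 / recall)

print(f"A: {f1(0.94, 0.92):.3f}")  # ~0.93
print(f"B: {f1(0.93, 0.95):.3f}")  # ~0.94
```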

Precision and recall are measured on the dev set — hence a good dev set is important, and so is a single evaluation metric, which here is the F1 score. In the case where the different models are tested over several different data sets, the average error across those sets is also a good single metric.

When you have n metrics to choose from while evaluating the performance of a model, it is advisable to choose one metric to optimize — i.e. find the best possible value for it — and treat the others as satisficing metrics — i.e. make sure they pass a certain threshold condition.

Consider the case of the cancer diagnosis model. Say we, like every developer, want our program to run fast and take up minimal memory, apart from having as high an accuracy as possible.

[Image source: Coursera, Andrew Ng]

Hence we are evaluating our model based on three metrics — Test accuracy, Runtime and Memory size. It is now time for us to look at our metrics, and draw out our expectations from each.

Firstly, we want the accuracy to be as high as possible, i.e. we want optimum accuracy. Hence, while iterating, we must aim to optimize the test accuracy, which makes it an optimizing metric.

Now, let us look at the runtime and the memory size, which are similar in nature, as we only require them to pass a threshold. Here, we want both the memory size and the runtime to be low, so we set a maximum acceptable value for each metric and aim to ensure that the model's values stay below those maximums. These two are therefore treated as satisficing metrics.
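A minimal sketch of this selection rule follows; the candidate models, their numbers and the two thresholds are all invented for illustration. Keep only the models whose runtime and memory pass the satisficing thresholds, then pick the one with the best accuracy.

```python
# Sketch: choose a model with one optimizing metric (accuracy) and two
# satisficing metrics (runtime, memory). All numbers and thresholds are assumed.

candidates = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80,  "memory_mb": 300},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95,  "memory_mb": 900},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 150, "memory_mb": 1200},
]

MAX_RUNTIME_MS = 100   # satisficing thresholds (assumed)
MAX_MEMORY_MB = 1000

feasible = [m for m in candidates
            if m["runtime_ms"] <= MAX_RUNTIME_MS and m["memory_mb"] <= MAX_MEMORY_MB]
best = max(feasible, key=lambda m: m["accuracy"])  # optimize accuracy
print(best["name"])  # "B": the most accurate model that satisfies both thresholds
```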

A final tip is to ensure that the dev and test sets come from the same distribution, as the aim of the dev set is to help us find a model that fits the cost function well on the test set, and finally in real-life application.
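One simple way to guarantee this is to pool all the examples drawn from the distribution you expect to see in real life, shuffle them, and only then split them into dev and test sets. The snippet below is a sketch with placeholder data.

```python
# Sketch: shuffle the pooled examples before splitting, so the dev and test
# sets come from the same distribution. `examples` is placeholder data standing
# in for (input, label) pairs drawn from the target distribution.

import random

examples = list(range(10_000))
random.seed(0)
random.shuffle(examples)

dev_set = examples[:5_000]
test_set = examples[5_000:]
```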
