
images from the users, each of which is distributed differently. We also know that we want
to perform as well as possible on the data uploaded by users, although the 10,000 uploaded
images are too few to serve as an entire training set.
One way we could adjust our sets would be to combine the original and uploaded data into
a single large array, shuffle it, and break it into training, development, and test sets.
Suppose we set our training set size to 205,000 and our development and test set sizes each
to 2,500. If we were to split the data randomly, then we would expect only around 119 of
the 2,500 images in each of the development and test sets to come from the data uploaded
by users.
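The 119 figure is just the user-data fraction of the combined pool applied to the development set size. A quick back-of-envelope check, assuming the combined pool is 200,000 original images plus the 10,000 uploaded ones (the sizes implied by a 205,000 / 2,500 / 2,500 split):

```python
# Expected number of user-uploaded images in a randomly drawn dev set.
# Assumes 200,000 original images plus the 10,000 uploaded user images.
total = 200_000 + 10_000
dev_size = 2_500

user_fraction = 10_000 / total          # about 4.8% of the pool is user data
expected_user_images = user_fraction * dev_size
print(round(expected_user_images))      # roughly 119
```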
Instead, since our primary goal is to perform well on the data uploaded by users, we could
again use sizes of 205,000 for the training set and 2,500 each for the development and test
sets, but this time make up the development and test sets entirely from the uploaded user
data. While the data sources are no longer distributed evenly across the sets, we tend to
prefer this split because it lets us focus on optimizing the right goal.
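This splitting strategy can be sketched as a small helper. The function name and list-of-examples representation are hypothetical; the point is that dev and test draw only from the uploaded data, and everything left over joins the training set:

```python
import random

def split_by_goal(original, uploaded, dev_size=2_500, test_size=2_500, seed=0):
    """Build dev/test sets entirely from user-uploaded data; all original
    data plus any leftover uploads become the training set.
    `original` and `uploaded` are lists of examples (hypothetical format)."""
    rng = random.Random(seed)
    uploads = uploaded[:]          # copy so we do not mutate the caller's list
    rng.shuffle(uploads)
    dev = uploads[:dev_size]
    test = uploads[dev_size:dev_size + test_size]
    train = original + uploads[dev_size + test_size:]
    return train, dev, test
```

With 10,000 uploads and 2,500 apiece for dev and test, the remaining 5,000 uploaded images land in the training set alongside the original data.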
4.5.1 Bias and variance with mismatched data distributions
When the training set and development set are drawn from different distributions, we may
see a model with a low training error and a relatively high development error even without
having a variance problem. Instead of comparing results directly against the development
set, we introduce a training-development set, which has the same distribution as the
training set but whose data we do not actually train on. Now we can look at three error
rates: the training set error rate, the training-development set error rate, and the
development set error rate. We attribute variance to the gap between the training set error
rate and the training-development set error rate. If both the training set error rate and
the training-development set error rate are low while the development set error rate is
high, then we say we have a data mismatch problem.
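The comparisons above can be captured in a small diagnostic function. This is a sketch of the reasoning, not a standard tool; the 2% gap threshold is an illustrative choice, not a fixed rule:

```python
def diagnose(train_err, train_dev_err, dev_err, threshold=0.02):
    """Apply the error-rate comparisons described above.
    Error rates are fractions (0.03 means 3%); the threshold for calling
    a gap 'large' is an illustrative assumption."""
    issues = []
    # Variance: gap between training error and training-dev error.
    if train_dev_err - train_err > threshold:
        issues.append("variance")
    # Data mismatch: gap between training-dev error and dev error.
    if dev_err - train_dev_err > threshold:
        issues.append("data mismatch")
    return issues or ["none"]
```

For example, a model with 1% training error, 1.5% training-development error, and 10% development error would be flagged for data mismatch rather than variance.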
4.5.2 Addressing data mismatch
With a data mismatch problem, it may first be beneficial to look at where the errors in
the development set are coming from. If we find, for example, that our errors are due to
blurry images, then we may want more training examples of blurry images. Instead of
actually going out and collecting more data, we could use artificial data synthesis, which
in this case would take crisp images and apply a blurring filter to them.
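As a minimal sketch of such synthesis, assuming images are NumPy arrays of grayscale pixel values, a simple box blur averages each pixel with its neighborhood (a real pipeline might use a Gaussian blur from an image library instead):

```python
import numpy as np

def box_blur(image, k=3):
    """Synthesize a 'blurry' training example from a crisp grayscale image
    by averaging each pixel over its k x k neighborhood (a simple box blur)."""
    h, w = image.shape
    pad = k // 2
    # Edge padding keeps the output the same size as the input.
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros_like(image, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)
```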
For another example, suppose we are trying to create a trigger word detector and we have a
collection of low-noise voice recordings, but we find that users' audio once our model is
deployed tends to contain a lot of noise; we can then artificially add car noise to our
recordings. One thing to keep in mind here is that if we have far more voice recordings
than we have car noise, adding the same car noise to every audio clip may cause our model
to overfit to that particular noise.
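One common way to reduce that overfitting risk is to sample a different segment of the noise recording for each clip rather than reusing one fixed clip. A sketch, assuming audio is represented as NumPy sample arrays and the mixing level is a hypothetical scale factor:

```python
import numpy as np

def add_noise(clip, noise, rng, noise_scale=0.1):
    """Mix a randomly chosen segment of a (longer) noise recording into a
    voice clip. Sampling a fresh segment per clip varies the added noise,
    rather than reusing one noise clip everywhere."""
    start = rng.integers(0, len(noise) - len(clip) + 1)
    segment = noise[start:start + len(clip)]
    return clip + noise_scale * segment
```

Even so, if the underlying noise recording is short relative to the amount of speech data, the model can still effectively memorize it; ideally the noise data itself should be varied.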
4.6 Learning from multiple tasks
4.6.1 Transfer learning
Transfer learning is the process of taking a neural network that has been trained for one
task and using it as a starting point for training on another. For instance, if we have
trained an image classification system, we may be able to transfer the low-level features
learned by the network, which ideally detect edges and curves, and reuse them for radiology
diagnosis.
When changing the last layer of a neural network as shown in Figure 4.6.1, we should
initialize the new weight and bias terms randomly. We are also able to add more layers to
the network if necessary and are not restricted to simply swapping out the last layer when
using transfer learning.
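The last-layer swap can be sketched with plain NumPy, representing the network as a list of (weight, bias) pairs; the representation and function name are illustrative, not a particular framework's API:

```python
import numpy as np

def swap_last_layer(pretrained_weights, new_output_dim, rng):
    """Keep every layer of a pretrained network except the last, replacing
    the final weight matrix and bias with randomly initialized values (as
    noted above, the new layer's parameters should not be copied over)."""
    *kept, (W_last, b_last) = pretrained_weights
    in_dim = W_last.shape[0]
    # Small random weights and zero biases for the new output layer.
    W_new = rng.normal(0.0, 0.01, size=(in_dim, new_output_dim))
    b_new = np.zeros(new_output_dim)
    return kept + [(W_new, b_new)]
```

The retained layers can then be frozen or fine-tuned during training on the new task, depending on how much data is available.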