Data Augmentation Automation Explained

Abidi Ghofrane
4 min read · Sep 23, 2021


How do I get more data if I don’t have “more data”? And how can I automate it?

How do we fix the problem of having a small dataset on hand, and why does it matter?

Ever wondered what to do when your dataset isn’t big enough? Data augmentation is one of the techniques that can help with that. But to fully understand what it is, I’d first ask: why bother in the first place? Why care so much about having a huge dataset? Don’t they all say size doesn’t matter? Well, not in machine learning.

We’ve all been there at least once. You have a project idea, you plan on developing the model, you try to gather the data, and you end up with a dataset of only a few hundred images. So why is it better to have more data?

“A neural network is only as robust as the data fed to it”

Well, when you train a machine learning model, what you’re really doing is tuning its parameters so that it can map a particular input (an image, for example) to some output. The optimization goal is to chase that sweet spot of high accuracy where the model’s loss is low, which happens when the parameters are tuned in the right way. Naturally, if you have a lot of parameters, you need to show your model a proportionate number of examples to learn from before it can reach good performance.

Data augmentation is the technique of artificially expanding labeled training datasets. Today, it is used as a secret sauce in nearly every state-of-the-art model for image classification, and it is becoming increasingly common in other modalities such as natural language understanding as well.

How do I get more data, if I don’t have “more data”?

Spoiler alert: by tricking your neural network into thinking that some alterations of your existing dataset are actually just more labeled data. The easiest method: image transformations.

A convolutional neural network that can classify objects even when they are placed in different orientations is said to have a property called invariance. More specifically, a CNN can be invariant to translation, viewpoint, size or illumination (or a combination of them).

This is essentially the base that data augmentation relies on. In a real-world scenario, our target application may have to cope with a variety of conditions, such as different orientations, locations, scales, brightness, etc. We account for these situations by training our neural network with additional, synthetically modified data.

When do I use data augmentation in my machine learning project’s life cycle?

You have two options. Either you enlarge your dataset up front, performing all the necessary transformations before feeding it to the model; this is called offline augmentation. Or you perform the transformations on each mini-batch just before feeding it to your machine learning model; this is called online augmentation. The online method is preferred for larger datasets, where you can’t afford the explosive increase in storage size: instead of saving new images, you transform the mini-batches on the fly as they go into the model, as in the sketch below.
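Here is a minimal sketch of the online option, assuming PyTorch and torchvision (the framework choice, folder layout and parameter values are all illustrative assumptions). The transforms run every time a sample is fetched, so each epoch sees slightly different versions of the same images while the dataset on disk never grows.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Transformations applied on the fly, sample by sample, as batches are built
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),      # random left-right flip
    transforms.RandomRotation(degrees=15),  # small random rotation
    transforms.ToTensor(),
])

# Hypothetical folder layout: data/train/<class_name>/*.jpg
train_set = datasets.ImageFolder("data/train", transform=train_transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

for images, labels in train_loader:
    # `images` is a freshly augmented mini-batch; feed it to your model here
    pass
```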

Augmentation techniques

Heuristic augmentation

These are some popular heuristic augmentation techniques (a sketch of how to compose them follows the list):

flip

crop

change hue

rotate

scale
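As mentioned above, here is one way these heuristic transforms could be composed, sketched with torchvision (an assumed library; Albumentations, Keras preprocessing layers and others offer equivalents, and the parameter values are only illustrative).

```python
from torchvision import transforms

heuristic_augment = transforms.Compose([
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # crop + scale
    transforms.RandomHorizontalFlip(p=0.5),                    # flip
    transforms.ColorJitter(hue=0.1),                           # change hue
    transforms.RandomRotation(degrees=20),                     # rotate
    transforms.ToTensor(),
])

# Apply it to any PIL image, e.g. inside a Dataset's __getitem__:
# augmented_tensor = heuristic_augment(pil_image)
```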

When chosen carefully, data augmentation schemes tuned by experts can improve a model’s performance. However, in practice such heuristic strategies can cause large variance in end-model performance, and may not produce the augmentations needed for state-of-the-art models. So, how can we design learnable algorithms that explore the space of transformation functions efficiently and effectively, and find augmentation strategies that outperform human-designed heuristics?

Automating the augmentation

So far, there have been roughly two approaches. The first is the ‘AI-model approach’, which attempts to search through a large space of augmentation policies to find an optimal one using reinforcement learning or GANs. With it, we must train a whole GAN, a process requiring a tricky implementation and nontrivial computing resources. Not so good if our only GPU power comes from Kaggle kernels and Colab notebooks.

The other search strategy, the ‘randomness-based approach’, reduces the search space (by using fewer parameters) and randomly samples policies. Sacrificing flexibility for speed, this approach, incarnated in the RandAugment algorithm, yielded performance competitive with the AI-model approach… as of a couple of years ago.
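If you just want to try the randomness-based approach, torchvision ships a RandAugment transform (available in recent versions, roughly 0.11 and later; treat the exact version and the values below as assumptions). Only two knobs are exposed: how many operations to apply per image and how strong they are.

```python
from torchvision import transforms

# RandAugment samples `num_ops` transformations at random for every image,
# each applied at the given global `magnitude` (0-30 scale by default).
randaugment_transform = transforms.Compose([
    transforms.RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
])
```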

So the automation of augmentation is still a big research field, but for us humble developers, a smaller, manual process is often enough to work around a small dataset.

Process

  1. Establish the required number of new data samples. Answer the question: from the available samples, how many new ones should be generated?
  2. Establish the desired complexity of the data augmentation process.
  3. Perform the image processing. You can use basic image-processing techniques, or you can try generating new images with advanced data augmentation techniques based on Generative Adversarial Networks (GANs). A minimal sketch of the basic, offline version of this process is shown below.
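As a rough illustration of the basic (non-GAN) version of this process, here is an offline-augmentation sketch using Pillow and torchvision; the paths, the chosen transforms and the samples_per_image count are hypothetical and should be adapted to your own dataset.

```python
from pathlib import Path
from PIL import Image
from torchvision import transforms

samples_per_image = 5  # step 1: how many new samples per original image

# step 2: keep the pipeline deliberately simple here
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=20),
    transforms.ColorJitter(brightness=0.2, hue=0.1),
])

# step 3: generate and save the new images, mirroring the source folder layout
src_dir, dst_dir = Path("data/train"), Path("data/train_augmented")
for img_path in src_dir.rglob("*.jpg"):
    image = Image.open(img_path).convert("RGB")
    out_dir = dst_dir / img_path.parent.relative_to(src_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(samples_per_image):
        augment(image).save(out_dir / f"{img_path.stem}_aug{i}.jpg")
```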


Abidi Ghofrane

Software engineering student at Holberton School Tunis