Deep learning and machine learning are big today. People are looking for ways to get their hands on this futuristic technology. However, one common question frequently asked by newbies and learners is: how many images are required for deep learning classification?
Honestly, I hate to say it, but there is no single clear answer to this question.
I mean, I remember back in the day when Google Brain was in its initial phases, the tech giant used over 15,000 images to train its algorithms; today, however, the same may be done with a few hundred images or even less.
The honest answer is that it depends on many factors, including the complexity of the problem and the complexity of your algorithm.
So, while I can’t answer the question of “how many images you may need for deep learning classification” outright, I can certainly give you some ways to think about it.
In this blog, I have put together some steps that can help you arrive at the right number of images for your deep learning classification task.
So, let’s dive into it.
What’s the Reason Behind this Question?
You may not realize it, but to get an answer about the size of the training dataset you need, it is important to know why you want to know this, as your answer may influence the next steps.
For example:
You may have a large amount of data.
If this is your situation, you may consider developing some learning curves, or you may decide to go with a big-data framework so that you can use the maximum amount of data.
You may have very scarce data.
If that’s your reason for the question, you may want to consider other options for collecting data, or opt for data augmentation methods, which can artificially enlarge the size of the training data (see the sketch after this list).
You may not have started collecting data yet. In this case, you may want to collect some data initially and see if that’s enough for the algorithms. If data collection is expensive for you, talk to a domain expert for a specific answer.
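Since data augmentation comes up in the second scenario above, here is a minimal sketch of the idea, assuming PyTorch’s torchvision is available and a hypothetical folder layout of data/train/<class>/*.jpg; the specific transforms and parameters are illustrative choices, not prescriptions.

```python
# Minimal data-augmentation sketch using torchvision (assumed available).
# Random flips, rotations, crops and colour jitter are applied on the fly,
# so every epoch sees a slightly different version of each original image.
from torchvision import transforms, datasets

# Hypothetical folder layout: data/train/<class_name>/*.jpg
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                # mirror left/right
    transforms.RandomRotation(degrees=15),                 # small random rotations
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop + resize
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # lighting variation
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("data/train", transform=augment)
# This acts like a larger training set without collecting new photos,
# though it cannot add genuinely new information to the data.
```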
Ok, now let’s see how many images you may need for deep learning classification.
1) It Depends
As I said earlier, there is no one-size-fits-all answer here, so I doubt anyone can give you a generic figure without understanding your specific predictive modeling problem.
The answer to this question can only be found through your own empirical investigation. Some of the factors that may influence the amount of data required for training include:
The complexity of the problem; and
The complexity of the algorithm
And this will be your starting point.
2) Reason by Analogy
Thousands and thousands of programmers and data scientists have worked on deep learning models before you, and many of them have published their studies, which in most cases are freely available to read.
So, before asking anyone else about the dataset size required for training, it is better to go through similar studies done previously to get a better estimate of the amount of training data you may require for classification.
Also, many studies have been done to estimate how algorithm performance scales with the size of the training data. Such studies can greatly help you predict the right amount of data required for a specific algorithm.
In fact, you may want to average the findings of multiple studies to get a good estimate.
3) Use Domain Expertise
To train the algorithm, you need to carve out a data sample that is representative of the problem you are looking to solve.
It is important to remember that you want the algorithm to learn a function that maps input data to output data. How well the algorithm learns this mapping function will depend on the quality and quantity of the training data you feed into the model.
This also means that to train your model to higher performance levels, you need enough training data for the model to see, learn, and map all the different relationships that may exist in the data.
For this reason, you may want to consult a domain expert (for the problem you are trying to solve) to understand all the features and relationships the model will need to learn, and to fully understand the complexity of the problem.
4) Statistical Heuristic
If you haven’t heard of statistical heuristic methods, these are rules of thumb that can be used to estimate an appropriate training data size.
The best part about statistical heuristic methods is that most of them are designed for classification problems, so they may come in handy in your case as well. While some of these heuristics are comprehensive and robust, others may at best be described as ad hoc.
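To make this concrete, the sketch below codes up a few commonly quoted rules of thumb (roughly 1,000 images per class, about 10 examples per input feature, about 10 examples per model parameter); these multipliers are ad hoc guidelines chosen for illustration, not figures taken from a specific study.

```python
# Rough rule-of-thumb estimates for training-set size.
# The multipliers are widely quoted ad hoc guidelines, not guarantees.

def heuristic_estimates(num_classes, num_features, num_model_params):
    """Return several back-of-the-envelope dataset-size estimates."""
    return {
        "per_class_rule": num_classes * 1000,     # ~1,000 examples per class
        "feature_rule": num_features * 10,        # ~10 examples per input feature
        "parameter_rule": num_model_params * 10,  # ~10 examples per learnable parameter
    }

# Example: a 5-class problem on 64x64 RGB images with a small CNN (~100k parameters)
print(heuristic_estimates(num_classes=5,
                          num_features=64 * 64 * 3,
                          num_model_params=100_000))
# The spread between these numbers is itself informative: treat them as
# starting points to refine with learning curves, not as final answers.
```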
5) Dataset Size vs Model Proficiency
It’s not uncommon in data science to study how an algorithm’s performance scales with the size of the training data, or with the complexity of the problem.
Unfortunately, not all of these studies are published and available for review, and those that are published may not relate well to the type of problem you are looking to solve.
For that reason, I often suggest that aspiring data scientists run their own study with all the data that’s available to them and any available learning algorithm.
Essentially, you would be performing your own study of how model performance scales with the size of the training data.
The results of the study may be plotted with model skill on the y-axis and the size of the training data on the x-axis. This will give you a fair estimate of how the size of the data affects model skill for your specific problem.
The plotted graph is called a “learning curve”.
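Here is a minimal sketch of such a study, assuming scikit-learn and matplotlib are available; it uses the small built-in digits image dataset and a simple classifier as stand-ins for your own images and model, but the same procedure applies to a deep network trained on growing subsets of your data.

```python
# Learning-curve sketch: train on increasing fractions of the data and plot
# model skill (accuracy) against training-set size.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)  # 8x8 grayscale digit images, flattened

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),  # train on 10% ... 100% of the data
    cv=5,                                  # 5-fold cross-validation at each size
)

plt.plot(train_sizes, val_scores.mean(axis=1), marker="o", label="validation accuracy")
plt.xlabel("Number of training images")  # x-axis: training-set size
plt.ylabel("Model skill (accuracy)")     # y-axis: model skill
plt.legend()
plt.show()
# Where the curve flattens out is a rough indication that adding more images
# is no longer buying much accuracy for this particular problem.
```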
Start Growing with Folio3 AI Today
Connect with us for more information at [email protected]