Active learning machine learning: What it is and how it works

September 27, 2020

This article was originally published on Algorithmia’s website. The company was acquired by DataRobot in 2021. This article may not be entirely up to date and may refer to products and offerings that no longer exist. Find out more about DataRobot MLOps here.

Active learning is the subset of machine learning in which a learning algorithm can query a user interactively to label data with the desired outputs.

A growing problem in machine learning is the large amount of unlabeled data. As data becomes cheaper to collect and store, data scientists end up with more of it than they are able to label and analyze. That’s where active learning comes in.

What is active learning?

Rather than passively receiving a fixed labeled training set, an active learning algorithm proactively selects which examples from the pool of unlabeled data should be labeled next. The underlying hypothesis is that an ML algorithm can reach higher accuracy with fewer training labels if it is allowed to choose the data it learns from.

Active learners are therefore allowed to pose queries interactively during training. A query usually takes the form of an unlabeled data instance that a human annotator is asked to label. This makes active learning part of the human-in-the-loop paradigm, of which it is one of the most successful examples.

How does active learning work?

Active learning can be set up in a few different ways. In each of them, the decision to query a specific label depends on whether the expected gain from knowing that label is greater than the cost of obtaining it. In practice, this decision can take a few different forms depending on the data scientist’s labeling budget and other factors.
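The gain-versus-cost decision can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: it assumes a binary classifier, uses predictive entropy as a stand-in for the gain, and assumes the cost of a label is expressed on the same 0-to-1 scale. The names `binary_entropy`, `should_query`, and `label_cost` are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy (in bits) of a binary prediction with P(class 1) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully confident prediction carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def should_query(p_positive: float, label_cost: float) -> bool:
    """Query only when the estimated gain exceeds the cost of the label.

    Assumption: predictive entropy approximates the gain, and label_cost
    is given on the same 0-to-1 scale.
    """
    return binary_entropy(p_positive) > label_cost


print(should_query(0.5, label_cost=0.6))   # maximally uncertain prediction
print(should_query(0.99, label_cost=0.6))  # confident prediction
```

With these assumed numbers, the uncertain instance is worth querying and the confident one is not. Other proxies for the gain, such as the margin between the top two classes, would slot into the same comparison.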

The three categories of active learning are:


Stream-based selective sampling

In this scenario, the algorithm decides, for one unlabeled instance at a time, whether querying its label would be beneficial enough. While the model is being trained, it is presented with a data instance and immediately decides whether to query the label. The natural disadvantage of this approach is that there is no guarantee the data scientist will stay within budget.
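As a rough sketch of stream-based selective sampling, the Python below processes 1-D points one at a time and queries only those that fall in the region the model is still unsure about. The threshold model, the `oracle` function standing in for a human annotator, and the hard `budget` cap are all assumptions made for illustration. The cap in particular is an addition, since the basic scenario itself gives no budget guarantee.

```python
def stream_selective_sampling(stream, oracle, budget):
    """Label a stream of 1-D points, querying only inside the uncertain region.

    Hypothetical model: a single threshold. Points at or below `lo` are known
    to be class 0, points at or above `hi` are known to be class 1, and points
    strictly between them are uncertain.
    """
    lo, hi = float("-inf"), float("inf")
    queried = []
    for x in stream:
        if len(queried) >= budget:
            break  # hard cap added here so spending cannot exceed the budget
        if lo < x < hi:  # uncertain: ask the annotator right away
            label = oracle(x)
            queried.append((x, label))
            if label == 0:
                lo = x
            else:
                hi = x
        # otherwise the label is already implied, so the point is skipped
    return lo, hi, queried


def oracle(x):
    """Stand-in for a human annotator; the true threshold is assumed to be 0.6."""
    return int(x >= 0.6)


stream = [0.1, 0.9, 0.05, 0.5, 0.95, 0.7, 0.3, 0.65, 0.55]
lo, hi, queried = stream_selective_sampling(stream, oracle, budget=10)
print(lo, hi, len(queried))  # 0.55 0.65 6
```

In this run, six of the nine instances are queried, and the three whose labels were already implied are skipped without cost.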

Pool-based sampling 

This is the best-known scenario for active learning. In this sampling method, the algorithm evaluates the entire pool of unlabeled data before it selects the best query or set of queries. The active learner is often trained initially on a fully labeled part of the data, and the resulting model is then used to determine which instances would be most beneficial to add to the training set in the next active learning loop. The downside of this method is the amount of memory it can require.
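The pool-based loop can be illustrated with a toy example. The sketch below assumes a 1-D threshold classifier, a small labeled seed set, and uncertainty sampling (choosing the pool item closest to the current decision boundary) as the query strategy. The names `fit_threshold`, `pool_based_active_learning`, and `oracle` are hypothetical, and real projects would use a proper model and library.

```python
def fit_threshold(labeled):
    """Fit a 1-D threshold classifier: the midpoint between the two classes."""
    max_neg = max(x for x, y in labeled if y == 0)
    min_pos = min(x for x, y in labeled if y == 1)
    return (max_neg + min_pos) / 2.0


def pool_based_active_learning(pool, seed, oracle, n_queries):
    """Repeatedly score the whole pool and query the least certain item."""
    labeled = list(seed)
    seed_points = {x for x, _ in seed}
    unlabeled = [x for x in pool if x not in seed_points]
    for _ in range(n_queries):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)
        # Every remaining pool item is scored; the one nearest the boundary wins.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))
    return fit_threshold(labeled), labeled


def oracle(x):
    """Stand-in for a human annotator; the true boundary is assumed to be 62."""
    return int(x >= 62)


pool = list(range(101))        # 101 unlabeled integer points, 0 to 100
seed = [(0, 0), (100, 1)]      # small fully labeled starting set
threshold, labeled = pool_based_active_learning(pool, seed, oracle, n_queries=7)
print(threshold, len(labeled))  # 61.5 9
```

Here nine labels out of 101 are enough to separate the classes exactly. Scoring every pool item in every round is what makes the approach expensive on large pools.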

Membership query synthesis

This scenario does not apply to every case, because it involves generating synthetic data. Here the active learner is allowed to create its own examples for labeling. The method suits problems where it is easy to generate a data instance.
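A minimal sketch of membership query synthesis, assuming the task is to learn a threshold on the interval from 0 to 1: rather than picking from existing data, the learner generates each query point itself (here, the midpoint of the interval it is still unsure about). The function names and the `oracle` annotator stand-in are hypothetical.

```python
def synthesize_queries(oracle, low=0.0, high=1.0, n_queries=10):
    """Learn a 1-D threshold by generating query points.

    Each query is a synthetic instance (the midpoint of the current
    uncertainty interval), not a sample drawn from an existing dataset.
    """
    queries = []
    for _ in range(n_queries):
        x = (low + high) / 2.0  # synthesized instance
        label = oracle(x)
        queries.append((x, label))
        if label == 1:
            high = x  # the threshold lies at or below x
        else:
            low = x   # the threshold lies above x
    return (low + high) / 2.0, queries


def oracle(x):
    """Stand-in for a human annotator; the true threshold is assumed to be 0.3."""
    return int(x >= 0.3)


estimate, queries = synthesize_queries(oracle)
print(round(estimate, 3), len(queries))
```

Each synthesized query halves the uncertain interval, so ten labels place the estimate within about 0.0005 of the assumed threshold. This works because generating a valid 1-D point is trivial; for data such as images or text, a synthesized instance may be hard for a human to label.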

How is active learning different from reinforcement learning?

Reinforcement learning and active learning can both reduce the number of labels required for models, but they are different concepts.

Reinforcement learning


Reinforcement learning is a goal-oriented approach, inspired by behavioral psychology, in which an agent learns from the feedback it receives from its environment. This means the agent keeps learning and improving while it is in use, much as we humans learn from our mistakes. There is no separate training phase on a labeled dataset. Instead, the agent learns through trial and error, guided by a predetermined reward system that signals how good a specific action was. This type of learning does not need to be fed data, because it generates its own as it goes.
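To make the trial-and-error idea concrete, here is a small sketch of an epsilon-greedy agent on a two-action problem. Epsilon-greedy is one simple strategy chosen for illustration, not something the article prescribes, and the reward probabilities, step count, and function name are assumptions.

```python
import random


def epsilon_greedy_bandit(reward_probs, steps=2000, epsilon=0.1, seed=0):
    """Learn action values purely from rewards received while acting."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    values = [0.0] * len(reward_probs)  # running mean reward per action
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore at random
        else:
            # exploit the action with the highest estimated value so far
            action = max(range(len(values)), key=values.__getitem__)
        # The environment returns a reward of 1 or 0; no labels are involved.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]
    return values, counts


values, counts = epsilon_greedy_bandit([0.2, 0.8])
print([round(v, 2) for v in values], counts)
```

The agent is never handed a dataset. Its value estimates come entirely from the rewards its own actions produce, which is the key contrast with an active learner that asks an annotator for labels.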

Active learning

Active learning is closer to traditional supervised learning. It is a type of semi-supervised learning, meaning models are trained using both labeled and unlabeled data. The idea behind semi-supervised learning is that labeling just a small sample of the data might yield the same accuracy as a fully labeled training set, or better. The challenge is determining which sample to label. Active learning labels data dynamically and incrementally during the training phase, so that the algorithm can identify which labels would be most beneficial for it to learn from.

About the author
DataRobot
