This article was originally published at Algorithimia’s website. The company was acquired by DataRobot in 2021. This article may not be entirely up-to-date or refer to products and offerings no longer in existence. Find out more about DataRobot MLOps here.
Semi-supervised learning is the type of machine learning that uses a combination of a small amount of labeled data and a large amount of unlabeled data to train models. This approach to machine learning is a combination of supervised machine learning, which uses labeled training data, and unsupervised learning, which uses unlabeled training data.
What is supervised and unsupervised learning?
In order to understand semi-supervised learning, it helps to first understand supervised and unsupervised learning.
Every machine learning model or algorithm needs to learn from data. For supervised learning, models are trained with labeled datasets, but labeled data can be hard to find. The reason labeled data is used is so that when the algorithm predicts the label, the difference between the prediction and the label can be calculated and then minimized for accuracy.
Unsupervised learning doesn’t require labeled data, because unsupervised models learn to identify patterns and trends or categorize data without labeling it. This means that there is more data available in the world to use for unsupervised learning, since most data isn’t labeled.
What is semi-supervised machine learning?
Semi-supervised machine learning is a combination of supervised and unsupervised learning. It uses a small amount of labeled data and a large amount of unlabeled data, which provides the benefits of both unsupervised and supervised learning while avoiding the challenges of finding a large amount of labeled data. That means you can train a model to label data without having to use as much labeled training data.
What is semi-supervised clustering?
Cluster analysis is a method that seeks to partition a dataset into homogenous subgroups, meaning grouping similar data together with the data in each group being different from the other groups. Clustering is conventionally done using unsupervised methods. Since the goal is to identify similarities and differences between data points, it doesn’t require any given information about the relationships within the data.
However, there are situations where some of the cluster labels, outcome variables, or information about relationships within the data are known. This is where semi-supervised clustering comes in. Semi supervised clustering uses some known cluster information in order to classify other unlabeled data, meaning it uses both labeled and unlabeled data just like semi supervised machine learning.
Is reinforcement learning semi-supervised?
Reinforcement learning is not the same as semi-supervised learning. Reinforcement learning is a method where there are reward values attached to the different steps that the model is supposed to go through. So the algorithm’s goal is to accumulate as many reward points as possible and eventually get to an end goal. An easy way to understand reinforcement learning is by thinking about it like a video game. Just like how in video games the player’s goal is to figure out the next step that will earn a reward and take them to the next level in the game, a reinforcement learning algorithm’s goal is to figure out the next correct answer that will take it to the next step of the process.
A common example of an application of semi-supervised learning is a text document classifier. This is the type of situation where semi-supervised learning is ideal because it would be nearly impossible to find a large amount of labeled text documents. This is simply because it is not time efficient to have a person read through entire text documents just to assign it a simple classification.
So, semi-supervised learning allows for the algorithm to learn from a small amount of labeled text documents while still classifying a large amount of unlabeled text documents in the training data.
How semi-supervised learning works
The way that semi-supervised learning manages to train the model with less labeled training data than supervised learning is by using pseudo labeling. This can combine many neural network models and training methods. Here’s how it works:
Train the model with the small amount of labeled training data just like you would in supervised learning, until it gives you good results.
Then use it with the unlabeled training dataset to predict the outputs, which are pseudo labels since they may not be quite accurate.
Link the labels from the labeled training data with the pseudo labels created in the previous step.
Link the data inputs in the labeled training data with the inputs in the unlabeled data.
Then, train the model the same way as you did with the labeled set in the beginning in order to decrease the error and improve the model’s accuracy.
DataRobot is the leader in Value-Driven AI – a unique and collaborative approach to AI that combines our open AI platform, deep AI expertise and broad use-case implementation to improve how customers run, grow and optimize their business. The DataRobot AI Platform is the only complete AI lifecycle platform that interoperates with your existing investments in data, applications and business processes, and can be deployed on-prem or in any cloud environment. DataRobot and our partners have a decade of world-class AI expertise collaborating with AI teams (data scientists, business and IT), removing common blockers and developing best practices to successfully navigate projects that result in faster time to value, increased revenue and reduced costs. DataRobot customers include 40% of the Fortune 50, 8 of top 10 US banks, 7 of the top 10 pharmaceutical companies, 7 of the top 10 telcos, 5 of top 10 global manufacturers.