To optimize data curation for AI, Lightly turns to self-supervised learning


All machine learning models are tied to a critical factor: the quality of the data on which the model is trained.

The challenge of curating data to improve machine learning and AI models is well understood. A 2021 MIT study found systemic issues in the way training data was labeled, leading to inaccurate results in AI systems. A study in the journal Quantitative Science Studies that analyzed 141 previous studies on data labeling found that 41% of models used datasets labeled by humans.

One of the vendors trying to rise to the challenge of optimizing data curation for AI is a Swiss startup, Lightly. Founded in 2019, the company announced this week that it has raised $3 million in a seed funding round. Lightly is not aiming to be a data labeling vendor, however. Instead, the company wants to help curate data using a self-supervised machine learning model that could one day reduce the need for data labeling altogether.

“I continue to be amazed at how much of the work in machine learning is manual, very tedious, and not automated at all,” Lightly co-founder Matthias Heller told VentureBeat. “People always believe that with machine learning everything is so advanced, but machine learning and deep learning in particular is such a young technology and a lot of the tooling and infrastructure is just now being made available.”

A growing market for data curation and data labeling

There is no shortage of money or vendors in the market to help optimize data for machine learning, be it data curation or data labeling.

For example, Defined.ai, which was known as DefinedCrowd before its rebranding in 2021, has raised $78 million to date to advance its vision of data governance.

Grand View Research has projected that the data labeling market will reach $8.2 billion by 2028, growing at a compound annual rate of 24.6% between 2021 and 2028. VentureBeat’s own list of the best data labeling software providers includes Appen’s Figure Eight, Amazon SageMaker Ground Truth, SuperAnnotate, Dataloop and V7’s Darwin.

Other popular vendors include Labelbox and the open-source Label Studio, both of which integrate with Lightly’s technology. Overall, Lightly plans an open approach so that users can use the company’s technology with any labeling vendor.

How the self-supervised model works

Three years ago, Heller and his co-founder Igor Susmelj were working on a machine learning project that involved labeling their data.

“We always wondered if the data we were labeling would actually help improve the model,” Heller said.

That led to Lightly, which includes a series of open source projects. The primary project is the Lightly library, which provides a self-supervised approach to machine learning on images.

There are multiple approaches to training machine learning models, Heller explained. In a supervised approach, such as in computer vision, an image and an associated label are used in combination to teach a model, with a human doing the labeling.

Unsupervised learning is the opposite: no human interaction is required. The self-supervised model that Lightly enables falls somewhere in the middle and requires minimal human interaction.

“You can use the self-supervised model to manage data because the model learns certain information, certain similarities, what belongs together and what is different,” Heller said.
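To make the idea of learned similarities concrete, here is a minimal Python sketch. It is an illustration only and not Lightly's implementation: it assumes each image has already been turned into an embedding vector by some model, and the function names, the tiny example vectors, and the 0.95 threshold are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def near_duplicates(embeddings, threshold=0.95):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Hypothetical 3-dimensional embeddings standing in for real model output.
embeddings = [
    [1.0, 0.0, 0.0],    # image 0
    [0.99, 0.05, 0.0],  # image 1, nearly the same direction as image 0
    [0.0, 1.0, 0.0],    # image 2, unrelated to the others
]
duplicates = near_duplicates(embeddings)
```

Pairs flagged this way are candidates for "what belongs together"; a curation step could keep one image from each pair and skip labeling the rest.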

From open source to commercial solution

While Lightly can be used for free as an open source technology, users still need to do a lot of the work to set up the right environment and manage the configuration.

Lightly’s commercial service provides a managed offering with the infrastructure, tuned algorithms, and learning framework, all configured for users.

“Our main competition today is in-house tooling,” Heller said. “We use self-supervised learning to tell you which 1% of the data to label and use for model training.”
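The article does not say how Lightly picks that 1%. One common way to choose a small, diverse subset from embedding vectors is greedy farthest-point sampling, sketched below in Python as an assumption rather than as the company's method. The function names, the `fraction` parameter, and the sample points are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_diverse(embeddings, fraction=0.01):
    """Greedily pick indices of a diverse subset covering `fraction` of the data.

    Starts from index 0, then repeatedly adds the point that is farthest
    from everything selected so far.
    """
    n = len(embeddings)
    if n == 0:
        return []
    budget = min(n, max(1, round(n * fraction)))
    selected = [0]
    # Distance from every point to its nearest selected point.
    min_dist = [euclidean(e, embeddings[0]) for e in embeddings]
    while len(selected) < budget:
        candidate = max(range(n), key=lambda i: min_dist[i])
        if min_dist[candidate] == 0.0:
            break  # every remaining point duplicates a selected one
        selected.append(candidate)
        for i in range(n):
            d = euclidean(embeddings[i], embeddings[candidate])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected

# Hypothetical 1-dimensional embeddings: three clustered points, two outliers.
points = [[0.0], [0.1], [0.2], [10.0], [5.0]]
picked = select_diverse(points, fraction=0.6)
```

In this toy example the selection skips the two points clustered near the first one and picks the outliers instead, which is the behavior a labeling budget would want: spend labels on samples that add new information.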

Looking ahead, Heller provocatively predicts that data labeling may one day be unnecessary as unsupervised machine learning continues to improve.

“I think the need for labels will decrease significantly in the coming years,” Heller said. “Maybe we won’t need labels in the future.”

