Snorkel AI dives into hot market of data curation

  • Enterprises want specially trained AI models

  • Companies need labeled data to fine-tune AI effectively

  • Players like Snorkel AI are tackling this issue in the increasingly important data curation market

If you’re an enterprise looking to adopt generative artificial intelligence (GenAI), you probably want a model that’s been fine-tuned to meet the needs of your specific vertical or – better still – your specific company. That specialized training requires data, and lots of it. But how do you know which files in your vast trove of storage are the right ones to train the model to do what you want it to?

Enter data curation.

Companies like Snorkel AI, Snowflake, Databricks, Appen, Scale AI and Labelbox are all staking a claim in what is poised to become a very important market.

“This is a hot area, an important area,” AvidThink’s Roy Chua told Fierce.

“All of it comes down to how do you manage the ModelOps problem and also the DataOps or MLOps, which is how do you clean the data, how do you prepare the data, how do you manage data for AI and machine learning in general.”

Zooming in

For its part, Snorkel is solving the data curation problem with software. Co-founder Paroma Varma told Fierce the idea is to take what is mostly a tedious manual process today and transform it into something that can be done programmatically.

The way it works is that Snorkel takes domain knowledge from subject matter experts and encodes that into a series of rules. The software – known as Snorkel Flow – then uses an algorithm to decide which rule should be applied to the data to categorize it. Once all the data is labeled, it is then fed into a model. The model’s responses are then checked for errors, and if any are discovered Snorkel can trace the source of the problem and correct it.
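To make the idea of rule-based labeling concrete, here is a minimal Python sketch. It is an illustration only, not Snorkel Flow's actual code or algorithm: the rule functions, the "complaint" and "praise" labels, and the simple majority vote used to reconcile conflicting rules are all assumptions made for the example.

```python
from collections import Counter

ABSTAIN = None  # a rule returns this when it has no opinion (illustrative convention)

# Hypothetical rules a subject matter expert might write for support tickets.
def rule_mentions_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN

def rule_mentions_broken(text):
    return "complaint" if "broken" in text.lower() else ABSTAIN

def rule_says_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN

RULES = [rule_mentions_refund, rule_mentions_broken, rule_says_thanks]

def label(text, rules=RULES):
    """Label one document by majority vote over the rules that fire.

    Majority vote is a stand-in here; a production system would likely
    weigh the rules in a more sophisticated way.
    """
    votes = [v for v in (rule(text) for rule in rules) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN  # no rule fired, so leave the document unlabeled
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the rules disagree evenly, so do not guess
    return ranked[0][0]

if __name__ == "__main__":
    tickets = [
        "The item arrived broken, I want a refund",
        "Thanks, works great",
        "Where is my order?",
    ]
    for ticket in tickets:
        print(ticket, "->", label(ticket))
```

Documents that no rule covers, or on which the rules split evenly, stay unlabeled in this sketch. Those gaps are the kind of blind spot an expert would close by adding or adjusting a rule rather than by relabeling by hand.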

“I think anyone who’s working with AI today needs to develop their data,” Varma said. “If you use your model out of the box what we’re seeing is for these specialized enterprise use cases, these models rarely work. They’ll give you accuracies in the 30%, 20% range, just not ready for production yet. And then the only lever is your data, so you want to be able to develop it fast…so that you can continue your AI development process.”

Snorkel, she said, can condense the data development process from something like six months down to a few hours. And, Varma added, AI deployments aren’t a one-and-done process. The data used to train the models constantly needs to be updated so that it continues to meet business objectives. If there’s an error or a blind spot, enterprises don’t want that to take a year to fix.

Earlier this month, Snorkel released its second product – Snorkel Custom – which offers customers the chance to partner with the company’s researchers and engineers to build LLM pipelines. In a nutshell, with Snorkel Custom the company is owning the end-to-end process rather than leaving the last bit to enterprises.

Varma said Snorkel AI’s technology was originally developed as a project in the Stanford AI lab in 2015. It later became its own company in 2019. To date, Snorkel has raised more than $135 million and as of 2021 had a valuation of $1 billion.

Diving into the future

Chua said he expects Snorkel will soon “get snapped up by someone,” be that a hyperscaler or even a competitor. The caveat is that it first needs to demonstrate that it is gaining traction in the market, he said. But Snorkel already has.

Varma told Fierce that Snorkel is working with many Fortune 500 companies and seven of the top 10 U.S. banks. It also has customers in the healthcare, e-commerce (hello again, Wayfair!) and insurance verticals.

Earlier this year, the investment arm of insurance group QBE said it invested an undisclosed sum in the company. QBE said at the time that its North American division has been using Snorkel Flow to aid predictive analytics.

“The immediate value has come from reducing the friction involved in converting vast amounts of previously locked-up corporate data to improve the outcomes of ML solutions being applied to claims and underwriting business processes,” the company wrote. “We invested to help accelerate the evolution of Snorkel AI as it pushes further into Generative AI.”