Build Better Models Through Data-Centric Machine Learning Development With Snorkel AI

Summary

Machine learning is a data-hungry activity, and the quality of the resulting model depends heavily on the quality of the inputs it receives. Generating sufficient quantities of high-quality labeled data is an expensive and time-consuming process. To reduce that time and cost, Alex Ratner and his team at Snorkel AI have built a system for data-centric machine learning development. In this episode he explains how the Snorkel platform lets domain experts write labeling functions that translate their expertise into reusable logic, dramatically reducing the time needed to build training data sets and driving down the total cost.
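To make the idea concrete, here is a minimal, stdlib-only sketch of programmatic labeling. The function names and the simple majority-vote aggregation are hypothetical simplifications for illustration; the actual Snorkel library aggregates labeling-function outputs with a learned label model that estimates each function's accuracy and correlations, rather than voting.

```python
# Illustrative sketch of programmatic data labeling (the idea behind
# Snorkel's labeling functions). All names here are hypothetical.
from collections import Counter

SPAM, HAM, ABSTAIN = 1, 0, -1

# Each "labeling function" encodes one piece of domain expertise as a
# noisy heuristic that labels an example or abstains.
def lf_contains_offer(text):
    return SPAM if "free offer" in text.lower() else ABSTAIN

def lf_has_greeting(text):
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 3 else ABSTAIN

LFS = [lf_contains_offer, lf_has_greeting, lf_many_exclamations]

def weak_label(text):
    """Aggregate non-abstaining votes by majority; ABSTAIN if none fire."""
    votes = [lf(text) for lf in LFS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

print(weak_label("FREE OFFER!!! Click now!!!"))     # -> 1 (SPAM)
print(weak_label("Hello, see you at the meeting"))  # -> 0 (HAM)
```

The payoff discussed in the episode is that a handful of such functions can label an arbitrarily large unlabeled corpus, and refining a function updates every affected label at once instead of requiring re-annotation by hand.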




Announcements

  • Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
  • Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open-source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model’s behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to get started!
  • Data powers machine learning, but poor data quality is the largest impediment to effective ML today. Galileo is a collaborative data bench for data scientists building Natural Language Processing (NLP) models to programmatically inspect, fix and track their data across the ML workflow (pre-training, post-training and post-production) – no more Excel sheets or ad hoc Python scripts. Get meaningful gains in your model performance fast, dramatically reduce data labeling and procurement costs, while seeing 10x faster ML iterations. Galileo is offering listeners a free 30-day trial and a 30% discount on the product thereafter. This offer is available until Aug 31, so go to themachinelearningpodcast.com/galileo and request a demo today!
  • Predibase is a low-code ML platform without low-code limits. Built on top of our open source foundations of Ludwig and Horovod, our platform allows you to train state-of-the-art ML and deep learning models on your datasets at scale. Our platform works on text, images, tabular, audio and multi-modal data using our novel compositional model architecture. We allow users to operationalize models on top of the modern data stack, through REST and PQL – an extension of SQL that puts predictive power in the hands of data practitioners. Go to themachinelearningpodcast.com/predibase today to learn more and try it out!
  • Your host is Tobias Macey and today I’m interviewing Alex Ratner about Snorkel AI, a platform for data-centric machine learning workflows powered by programmatic data labeling techniques.

Interview

  • Introduction
  • How did you get involved in machine learning?
  • Can you describe what Snorkel AI is and the story behind it?
  • What are the problems that you are focused on solving?
    • Which pieces of the ML lifecycle are you focused on?
  • How did your experience building the open source Snorkel project and working with the community inform your product direction for Snorkel AI?
    • How has the underlying Snorkel project evolved over the past 4 years?
  • What are the deciding factors that an organization or ML team needs to consider when evaluating existing labeling strategies against the programmatic approach that you provide?
    • What are the features that Snorkel provides over and above managing code execution across the source data set?
  • Can you describe what you have built at Snorkel AI and how it is implemented?
    • What are some of the notable developments of the ML ecosystem that had a meaningful impact on your overall product vision/viability?
  • Can you describe the workflow for an individual or team who is using Snorkel for generating their training data set?
    • How does Snorkel integrate with the experimentation process to track how changes to labeling logic correlate with the performance of the resulting model?
  • What are some of the complexities involved in designing and testing the labeling logic?
    • How do you handle complex data formats such as audio, video, images, etc. that might require their own ML models to generate labels? (e.g. object detection for bounding boxes)
  • With the increased scale and quality of labeled data that Snorkel AI offers, how does that impact the viability of AutoML toolchains for generating useful models?
  • How are you managing the governance and feature boundaries between the open source Snorkel project and the business that you have built around it?
  • What are the most interesting, innovative, or unexpected ways that you have seen Snorkel AI used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on Snorkel AI?
  • When is Snorkel AI the wrong choice?
  • What do you have planned for the future of Snorkel AI?

Contact Info

Parting Question

  • From your perspective, what is the biggest barrier to adoption of machine learning today?

Closing Announcements

  • Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
  • Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
  • If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
  • To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Links

The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
