Summary Version control has become table stakes for any software team, but for machine learning projects there has been no good answer for tracking all of the data that goes into building and training models, and the output of the models themselves. To address that need Dmitry Petrov built the Data Version Control project known as DVC. In this episode he explains how it simplifies communication between data scientists, reduces duplicated effort, and simplifies concerns around reproducing and rebuilding models at different stages of the projects lifecycle. If you work as part of a team that is building machine learning models or other data intensive analysis then make sure to give this a listen and then start using DVC today. Announcements Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show! Bots and automation are taking over whole categories of online interaction. Discover.bot is an online community designed to serve as a platform-agnostic digital space for bot developers and enthusiasts of all skill levels to learn from one another, share their stories, and move the conversation forward together. They regularly publish guides and resources to help you learn about topics such as bot development, using them for business, and the latest in chatbot news. For newcomers to the space they have the Beginners Guide To Bots that will teach you the basics of how bots work, what they can do, and where they are developed and published. To help you choose the right framework and avoid the confusion about which NLU features and platform APIs you will need they have compiled a list of the major options and how they compare. Go to pythonpodcast.com/discoverbot today to get started and thank them for their support of the show. You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register. Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com) To help other people find the show please leave a review on iTunes and tell your friends and co-workers Join the community in the new Zulip chat workspace at pythonpodcast.com/chat Your host as usual is Tobias Macey and today I’m interviewing Dmitry Petrov about DVC, an open source version control system for machine learning projects Interview Introductions How did you get introduced to Python? Can you start by explaining what DVC is and how it got started? How do the needs of machine learning projects differ from other software applications in terms of version control? Can you walk through the workflow of a project that uses DVC? What are some of the main ways that it differs from your experience building machine learning projects without DVC? In addition to the data that is used for training, the code that generates the model, and the end result there are other aspects such as the feature definitions and hyperparameters that are used. Can you discuss how those factor into the final model and any facilities in DVC to track the values used? In addition to version control for software applications, there are a number of other pieces of tooling that are useful for building and maintaining healthy projects such as linting and unit tests. What are some of the adjacent concerns that should be considered when building machine learning projects? What types of metrics do you track in DVC and how are they collected? Are there specific problem domains or model types that require tracking different metric formats? In the documentation it mentions that the data files live outside of git and can be managed in external storage systems. I’m wondering if there are any plans to integrate with systems such as Quilt or Pachyderm that provide versioning of data natively and what would be involved in adding that support? What was your motivation for implementing this system in Python? If you were to start over today what would you do differently? Being a venture backed startup that is producing open source products, what is the value equation that makes it worthwile for your investors? What have been some of the most interesting, challenging, or unexpected aspects of building DVC? What do you have planned for the future of DVC? Keep In Touch dmpetrov on GitHub Blog @fullstackml on Twitter LinkedIn Picks Tobias Otter.ai Dmitry Go outside and get some fresh air Links DVC Iterative.ai Linear Regression Logistic Regression C++ Perl Git Version Control System Uber Michaelangelo Domino Data Lab Git LFS AUC == Area Under Curve metric for evaluating machine learning model performance Wes McKinney Interview PyTorch Podcast Interview Tensorflow TensorBoard MLFlow Quilt Data Data Engineering Podcast Episode Pachyderm Data Engineering Podcast Episode Apache Airflow Podcast Interview The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA