Summary

Every machine learning model has to start with feature engineering. This is the process of combining input variables into a more meaningful signal for the problem that you are trying to solve. It often leads to duplicated code from previous projects, or to technical debt in the form of poorly maintained feature pipelines. To make the practice more manageable, Soledad Galli created the feature-engine library. In this episode she explains how it has helped her and others build reusable transformations that can be applied in a composable manner with your scikit-learn projects. She also discusses the importance of understanding the data that you are working with, and the domain in which your model will be used, to ensure that you are selecting the right features.

Announcements

Hello and welcome to Podcast.__init__, the podcast about Python’s role in data and science.
When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform, it’s easy to get started with the next generation of deployment and scaling, powered by the battle-tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Your host as usual is Tobias Macey and today I’m interviewing Soledad Galli about feature-engine, a Python library to engineer features for use in machine learning models.

Interview

Introductions
How did you get introduced to Python?
Can you describe what feature-engine is and the story behind it?
What are the complexities that are inherent to feature engineering?
What are the problems that are introduced due to incidental complexity and technical debt?
What was missing in the available set of libraries/frameworks/toolkits for feature engineering that you are solving for with feature-engine?
What are some examples of the types of domain knowledge that are needed to effectively build features for an ML model?
Given the fact that features are constructed through methods such as normalizing data distributions, imputing missing values, combining attributes, etc., what are some of the potential risks that are introduced by incorrectly applied transformations or invalid assumptions about the impact of these manipulations?
Can you describe how feature-engine is implemented?
How have the design and goals of the project changed or evolved since you started working on it?
What (if any) difference exists in the feature engineering process for frameworks like scikit-learn as compared to deep learning approaches using PyTorch, TensorFlow, etc.?
Can you describe the workflow of identifying and generating useful features during model development?
What are the tools that are available for testing and debugging of the feature pipelines?
What do you see as the potential benefits or drawbacks of integrating feature-engine with a feature store such as Feast or Tecton?
What are the most interesting, innovative, or unexpected ways that you have seen feature-engine used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on feature-engine?
When is feature-engine the wrong choice?
What do you have planned for the future of feature-engine?

Keep In Touch

LinkedIn
@Soledad_Galli on Twitter
solegalli on GitHub

Picks

Tobias
  Dune Movie
  Dune Series
Soledad
  The Social Dilemma
  Don’t Be Evil by Rana Foroohar

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast, for the latest on modern data management.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
To help other people find the show, please leave a review on iTunes and tell your friends and co-workers.

Links

feature-engine
Feature Engineering
Python Feature Engineering Cookbook
scikit-learn
Feature Stores (Podcast Episode)
Pandas (Podcast Episode)
PyTorch (Podcast Episode)
TensorFlow
Feast
Tecton (Data Engineering Podcast Episode)
Kaggle
Dask (Data Engineering Podcast Episode)

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA