Summary

The speed of Python is a subject of constant debate, but there is no denying that for compute-heavy work it is not the optimal tool. Rather than forcing you to rewrite or rearchitect your data-oriented applications, the team at Bodo wrote a compiler that does the optimization for you. In this episode Ehsan Totoni explains how they are able to translate pure Python into massively parallel processes that are optimized for high performance compute systems.

Announcements

- Hello and welcome to Podcast.__init__, the podcast about Python's role in data and science.
- When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With the launch of their managed Kubernetes platform it's easy to get started with the next generation of deployment and scaling, powered by the battle tested Linode platform, including simple pricing, node balancers, 40Gbit networking, dedicated CPU and GPU instances, and worldwide data centers. Go to pythonpodcast.com/linode and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show!
- Your host as usual is Tobias Macey and today I'm interviewing Ehsan Totoni about Bodo, an inferential compiler for Python that automatically parallelizes your data-oriented projects.

Interview

- Introductions
- How did you get introduced to Python?
- Can you describe what Bodo is and the story behind it?
- What are some of the use cases that it is being applied to?
- What are the motivating factors for choosing something like Dask or Ray as compared to Bodo?
- What are the software patterns that contribute to slowdowns in data processing code?
- What are some of the ways that the compiler is able to optimize those operations?
- Can you describe how Bodo is implemented?
- How does Bodo process the Python code for compiling to the optimized form?
- What are the compilation techniques for understanding the semantics of the code being processed?
- How do you manage packages that rely on C extensions?
- What do you use as an intermediate representation for translating into the optimized output?
- What is the workflow for applying Bodo to a Python project?
- What debugging utilities does it provide for identifying any errors that occur due to the added parallelism?
- What kind of support does Bodo have for optimizing a machine learning project? (e.g. using PyTorch/TensorFlow/MXNet/etc.)
- When working with a workflow orchestrator such as Dagster or Airflow, what would the integration process look like for taking advantage of the optimized Bodo output?
- What are the most interesting, innovative, or unexpected ways that you have seen Bodo used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bodo?
- When is Bodo the wrong choice?
- What do you have planned for the future of Bodo?

Keep In Touch

- LinkedIn
- @EhsanTn on Twitter
- ehsantn on GitHub

Picks

Tobias
- Paracord Crafts

Ehsan

Closing Announcements

- Thank you for listening! Don't forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers.

Links

- Bodo
  - Data Engineering Podcast Episode
- University of Illinois Urbana-Champaign
- HPC
- MPI
- Elastic Fabric Adapter
- All-to-All Communication
- Dask
  - Data Engineering Podcast Episode
- Ray
  - Podcast Episode
- Pandas Extension Arrays
  - Podcast Episode
- GeoPandas
- Numba
- LLVM
- scikit-learn
- Horovod
- Dagster
  - Podcast.__init__ Episode
  - Data Engineering Podcast Episode
- Airflow
  - Podcast Episode
- IPython Parallel
- Parquet

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA