Summary

The ecosystem of tools and libraries in Python for data manipulation and analytics is truly impressive, and it continues to grow. There are, however, gaps in their utility that can be filled by the capabilities of a data warehouse. In this episode Robert Hodges discusses how the PyData suite of tools can be paired with a data warehouse for an analytics pipeline that is more robust than either can provide on its own. This is a great introduction to what differentiates a data warehouse from a relational database and ways that you can think differently about running your analytical workloads for larger volumes of data.

Announcements

Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API, you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Taking a look at recent trends in the data science and analytics landscape, it’s becoming increasingly advantageous to have a deep understanding of both SQL and Python. A hybrid model of analytics can achieve a more harmonious relationship between the two languages. Read more about the Python and SQL Intersection in Analytics at mode.com/init. Specifically, we’re going to be focusing on their similarities, rather than their differences.
You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host as usual is Tobias Macey and today I’m interviewing Robert Hodges about how the PyData ecosystem can play nicely with data warehouses.

Interview

Introductions
How did you get introduced to Python?
To start with, can you give a quick overview of what a data warehouse is and how it differs from a "regular" database for anyone who isn’t familiar with them?
What are the cases where a data warehouse would be preferable and when are they the wrong choice?
What capabilities does a data warehouse add to the PyData ecosystem?
For someone who doesn’t yet have a warehouse, what are some of the differentiating factors among the systems that are available?
Once you have a data warehouse deployed, how does it get populated and how does Python fit into that workflow?
For an analyst or data scientist, how might they interact with the data warehouse and what tools would they use to do so?
What are some potential bottlenecks when dealing with the volumes of data that can be contained in a warehouse within Python?
What are some ways that you have found to scale beyond those bottlenecks?
How does the data warehouse fit into the workflow for a machine learning or artificial intelligence project?
What are some of the limitations of data warehouses in the context of the Python ecosystem?
What are some of the trends that you see going forward for the integration of the PyData stack with data warehouses?
What are some challenges that you anticipate the industry running into in the process?
What are some useful references that you would recommend for anyone who wants to dig deeper into this topic?

Keep In Touch

LinkedIn
hodgesrm on GitHub

Picks

Tobias: Foundations Of Architecting Data Solutions: Managing Successful Data Projects by Ted Malaska & Jonathan Seidman
Robert: Reading old academic papers such as CStore
Robert: Python Machine Learning by Sebastian Raschka

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast, for the latest on modern data management.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

Altinity
ClickHouse (Data Engineering Podcast Interview)
MySQL
Data Warehouse
Column Oriented Database
SIMD == Single Instruction Multiple Data
PostgreSQL (Data Engineering Podcast Episode)
Microsoft SQL Server
Pandas
NumPy
TensorFlow
Jupyter
Data Sampling
Dask (Data Engineering Podcast)
Ray
Map/Reduce
Vertica
Sharding
Hadoop
SnowflakeDB
Delta Lake (Data Engineering Podcast Episode)
BigQuery
Redshift
Snowflake Data Sharing
OracleDB
Kubernetes
DBT (Data Engineering Podcast Episode)
CSV
Parquet (Data Engineering Podcast Episode)
Kafka
UC Davis
Web Scraping
ClickHouse Python Driver
SQLAlchemy
Altinity Blog Post
Materialized View
PyTorch (Podcast Interview)
scikit-learn
Spark (Data Engineering Podcast Interview)
BigQuery ML
Apache Arrow
Wes McKinney (Podcast Interview)
User Defined Function
KDB
CStore Paper by Dr. Michael Stonebraker, et al
Kinetica
MapD/OmniSci

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA