Summary

Being able to understand the context of a piece of text is generally thought to be the domain of human intelligence. However, topic modeling and semantic analysis can be used to allow a computer to determine whether different messages and articles are about the same thing. This week we spoke with Radim Řehůřek about his work on Gensim, which is a Python library for performing unsupervised analysis of unstructured text and applying machine learning models to the problem of natural language understanding.

Brief Introduction

Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
I would like to thank everyone who has donated to the show. Your contributions help us make the show sustainable. For details on how to support the show you can visit our site at pythonpodcast.com.
Linode is sponsoring us this week. Check them out at linode.com/podcastinit and get a $20 credit to try out their fast and reliable Linux virtual servers for your next project.
We are also sponsored by Sentry this week. Stop hoping your users will report bugs. Sentry’s real-time tracking gives you insight into production deployments and information to reproduce and fix crashes. Check them out at getsentry.com and use the code podcastinit at signup to get a $50 credit on your account.
Visit our site to subscribe to our show, sign up for our newsletter, read the show notes, and get in touch.
To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers.
Join our community! Visit discourse.pythonpodcast.com for your opportunity to find out about upcoming guests, suggest questions, and propose show ideas.
Your hosts as usual are Tobias Macey and Chris Patti.
Today we’re interviewing Radim Řehůřek about Gensim, a library for topic modeling and semantic analysis of natural language.

Interview with Radim Řehůřek

Introductions
How did you get introduced to Python? – Chris
Can you start by giving us an explanation of topic modeling and semantic analysis? – Tobias
What is Gensim and what inspired you to create it? – Tobias
What facilities does Gensim provide to simplify the work of this kind of language analysis? – Tobias
Can you describe the features that set it apart from other projects such as NLTK or spaCy? – Tobias
What are some of the practical applications that Gensim can be used for? – Tobias
One of the features that stuck out to me is the fact that Gensim can process corpora on disk that would be too large to fit into memory. Can you explain some of the algorithmic work that was necessary to make this streaming process possible? – Tobias
Given that it can handle streams of data, could it also be used in the context of something like Spark? – Tobias
Gensim also supports unsupervised model building. What kinds of limitations does this have and when would you need a human in the loop? – Tobias
Once a model has been trained, how does it get saved and reloaded for subsequent use? – Tobias
What are some of the more unorthodox or interesting uses people have put Gensim to that you’ve heard about? – Chris
In addition to your work on Gensim, and partly due to its popularity, you have started a consultancy for customers who are interested in improving their data analysis capabilities. How does that feed back into Gensim? – Tobias
Are there any improvements in Gensim or other libraries that you have made available as a result of issues that have come up during client engagements? – Tobias
Is it difficult to find contributors to Gensim because of its advanced nature? – Tobias
Are there any resources you’d like to recommend our listeners explore to get a more in-depth understanding of topic modeling and related techniques? – Chris

Keep In Touch

RaRe Technologies
Twitter
Email
GitHub
Mailing List

Picks

Tobias: Dark Matter and the Dinosaurs by Lisa Randall
Chris: m-cli
Radim: 1177 BC: The Year Civilization Collapsed

Links

Nadia Eghbal
Gensim
SQL Addict
NLTK
spaCy
Latent Dirichlet Allocation (LDA)
LSI
Keynote in Italy on distributed processing
Google Scholar references for Gensim
Stylometric analysis
On Writing Well
Student Incubator
Wikipedia on topic modeling

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA