Summary

Investigative reporters face the challenging task of identifying complex networks of people, places, and events gleaned from a mixed collection of sources. Turning those various documents, electronic records, and research into a searchable and actionable collection of facts is an interesting and difficult technical challenge. Friedrich Lindenberg created the Aleph project to address this issue, and in this episode he explains how it works, why he built it, and how it is being used. He also discusses his hopes for the future of the project and other ways that the system could be used.

Preface

- Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
- When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API, you’ve got everything you need to scale up. Go to podcastinit.com/linode today to get a $20 credit and launch a new server in under a minute.
- Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com.
- To help other people find the show please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at podcastinit.com/chat
- Registration for PyCon US, the largest annual gathering across the community, is open now. Don’t forget to get your ticket and I’ll see you there!
- Your host as usual is Tobias Macey and today I’m interviewing Friedrich Lindenberg about Aleph, a tool to perform entity extraction across documents and structured data.

Interview

- Introductions
- How did you get introduced to Python?
- Can you start by explaining what Aleph is and how the project got started?
- What is investigative journalism? How does Aleph fit into that workflow?
- What are some other tools that would be used alongside Aleph?
- What are some ways that Aleph could be useful outside of investigative journalism?
- How is Aleph architected and how has it evolved since you first started working on it?
- What are the major components of Aleph?
- What are the types of documents and data formats that Aleph supports?
- Can you describe the steps involved in entity extraction? (see the illustrative sketch after this outline)
- What are the most challenging aspects of identifying and resolving entities in the documents stored in Aleph?
- Can you describe the flow of data through the system from a document being uploaded through to it being displayed as part of a search query?
- What is involved in deploying and managing an installation of Aleph?
- What have been some of the most interesting or unexpected aspects of building Aleph?
- Are there any particularly noteworthy uses of Aleph that you are aware of?
- What are your plans for the future of Aleph?
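Entity extraction comes up repeatedly in the conversation, so as a rough point of reference, here is a minimal sketch of what named-entity recognition over a piece of text can look like using spaCy (one of the libraries linked below). This is an illustration only, not Aleph’s actual pipeline; the model name and sample text are assumptions.

```python
# A rough illustration of named-entity extraction over raw text using spaCy.
# Not Aleph's pipeline; model choice and sample text are placeholders.
import spacy

# Assumes the small English model has been installed beforehand with:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    """Return (entity text, label) pairs such as PERSON, ORG, and GPE."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

if __name__ == "__main__":
    sample = (
        "Friedrich Lindenberg started the Aleph project while working "
        "with the Organized Crime and Corruption Reporting Project."
    )
    for entity, label in extract_entities(sample):
        print(f"{entity}\t{label}")
```

spaCy ships pretrained statistical models per language, which is why a language-specific model has to be downloaded separately before the sketch will run.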
Keep In Touch

- Website
- @pudo on Twitter
- pudo on GitHub

Picks

- Tobias
  - MechanicalSoup
- Friedrich
  - phonenumbers – because it’s useful
  - PyICU – super nerdy but amazing
  - SQLAlchemy – my all-time favorite Python package

Links

- Aleph
- Organized Crime and Corruption Reporting Project
- OCR (Optical Character Recognition)
- Jorge Luis Borges
- Buenos Aires
- Investigative Journalism
- Azerbaijan
- Signal
- OpenCorporates
- OpenRefine
- Money Laundering
- E-Discovery
- CSV
- SQL
- Entity Extraction (Named Entity Recognition)
- Apache Tika
- Polyglot
- spaCy (Podcast.__init__ Episode)
- LibreOffice
- Tesseract
- followthemoney
- Elasticsearch
- Knowledge Graph
- Neo4j
- Gephi
- Edward Snowden
- DocumentCloud
- Overview Project
- VeraCrypt
- Qubes OS
- i2 Analyst’s Notebook

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA