Friday lunchtime lecture: Bad Data (and how to fix it)

Open Data Institute Podcasts - A podcast by The Open Data Institute

Bad data is everywhere: a CSV that doesn’t load, a badly formatted spreadsheet, a date column that mixes several formats, and so on. A lot of time is spent fixing these issues instead of actually analysing the data.

In this talk, you’ll hear about Goodtables, a tabular data validator that checks for issues such as: all rows have the same number of columns; there are no duplicate rows; the data types are correct (e.g. a numeric column has only numbers, a date column has only dates in a specific format, etc.). It also allows writing custom checks using Python.

Goodtables is useful both for data publishers, by helping them increase data quality and make their data easier to reuse, and for data users, by giving them a quick way to check data for errors. It can be run locally or via https://goodtables.io, a continuous tabular data validation service.

You’ll also learn how Frictionless Data’s Data Package and Table Schema specifications can help you describe and load datasets.

About the speaker

Vitor Baptista is the engineering lead at Open Knowledge International. Since joining in 2012, he has worked on a range of projects related to open data, such as building data portals using CKAN, improving fiscal transparency with OpenSpending, and aggregating and releasing clinical trial data with OpenTrials. His main interests are in how we can use data and data visualization to make better decisions to improve the world. He is currently based in Birmingham, UK.
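
To make the three checks named in the talk description concrete (consistent column counts, no duplicate rows, correct data types), here is a minimal sketch using only the Python standard library. It is not the Goodtables API or implementation; the function name `validate_rows`, the `column_types` mapping, and the sample data are all hypothetical and exist only for illustration.

```python
import csv
import io
from datetime import datetime


def validate_rows(text, column_types, date_format="%Y-%m-%d"):
    """Return a list of (row_number, message) tuples for the problems found.

    Hypothetical helper for illustration; not part of Goodtables.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    errors = []
    seen = {}  # maps row contents to the first row number where they appeared
    for row_number, row in enumerate(reader, start=2):  # the header is row 1
        # Check 1: every row has the same number of columns as the header.
        if len(row) != len(header):
            errors.append(
                (row_number, f"expected {len(header)} columns, found {len(row)}")
            )
            continue
        # Check 2: no row repeats an earlier row exactly.
        key = tuple(row)
        if key in seen:
            errors.append((row_number, f"duplicate of row {seen[key]}"))
            continue
        seen[key] = row_number
        # Check 3: each value parses as the type declared for its column.
        for name, value in zip(header, row):
            kind = column_types.get(name, "string")
            try:
                if kind == "number":
                    float(value)
                elif kind == "date":
                    datetime.strptime(value, date_format)
            except ValueError:
                errors.append(
                    (row_number, f"column {name!r}: {value!r} is not a valid {kind}")
                )
    return errors


# Made-up sample data containing one problem of each kind.
SAMPLE = """id,amount,paid_on
1,10.50,2018-01-31
2,ten,2018-02-01
3,7.25,01/02/2018
1,10.50,2018-01-31
4,3.00
"""

TYPES = {"id": "number", "amount": "number", "paid_on": "date"}

for row_number, message in validate_rows(SAMPLE, TYPES):
    print(f"row {row_number}: {message}")
```

The `column_types` mapping plays a role loosely similar to the field types in a Table Schema descriptor, though the real specification is considerably richer than this sketch.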