Title: Understanding Data Cleaning Pipeline for Development Data
Advisor: Richard Anderson
Supervisory Committee: Richard Anderson (Chair), C Leigh Anderson (GSR, Evan's School), Kurtis Heimerl, and Abraham Flaxman (Global Health)
Abstract: The developing world is relying more and more on data driven policies. There are numerous development agencies that have pushed for on-ground data collection to support the development work they are pursuing. Many governments have launched their own efforts for more frequent information gathering. Overall, the amount of data collected is tremendous, yet we face significant issues in doing useful analysis. Most of these barriers are around data cleaning and merging, and they require a data engineer to support some parts of the analysis. In this thesis, we aim to understand the pain points of cleaning the development data and propose solutions that harness the thinking of a data engineer and reduce the manual workload of this tedious process. To achieve this goal, I am currently focused on two research directions: (1) understanding current data usage patterns and build a taxonomy of data cleaning in developing world; and (2) Building algorithms to support automated data cleaning, targeting selected problems in the space including matching transliterated names. The objective is to empower regular data users to easily do the necessary data cleaning and scrubbing for analysis.