Databases around the world are filled with all kinds of data. Crime reports, public tax records, health statistics, financial transactions, and more are stored and kept up to date. Governments and other organizations often publish large datasets on the Internet and encourage anyone to download and use them. Businesses and individuals then use that data to make important decisions. Tools that make it easy to analyze the data and find important insights are becoming more valuable, but an accurate analysis depends on the quality of that data.
Data sets can be filled with inaccurate information. Misspellings, misclassifications, and missing values are common data entry errors. A single error is unlikely to cause a major problem when analyzing data, but many errors add up quickly. Data that is full of inaccurate information is often called ‘dirty data’. It is rare for a large data set to be completely free of errors, but those with very few errors are said to be ‘clean data’.
Efforts to reduce errors usually start at the point of data entry. If the electronic entry form checks for mistakes and omissions before adding a record to the database, then fewer problems enter the system. But even the best filters cannot eliminate all errors. When a problem does surface, the entry process is often upgraded to keep it from recurring, but that does nothing for the bad records already in the database.
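To illustrate the idea, here is a minimal Python sketch of that kind of entry-point check. It is not any particular entry system; the record layout and the validate_record() helper are made up for the example.

```python
# Hypothetical entry-point check: reject a record with missing or invalid
# fields before it is ever added to the database.
VALID_STATES = {
    "Alabama", "Alaska", "Arizona", "Arkansas", "California",
    "Connecticut", "Mississippi", "New York", "Texas", "Wyoming",
    # ... remaining states omitted for brevity
}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems found; an empty list means the record passes."""
    problems = []
    if not record.get("name", "").strip():
        problems.append("missing name")
    if record.get("state") not in VALID_STATES:
        problems.append(f"unrecognized state: {record.get('state')!r}")
    return problems

errors = validate_record({"name": "Jane Doe", "state": "Conecticut"})
if errors:
    print("rejected:", errors)   # rejected: ["unrecognized state: 'Conecticut'"]
```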
Finding and fixing problems in large data sets can be very tedious. Even though governments are often required to publish their data, the bureaucrats who maintain it often have very little incentive to spend much time making sure everything is accurate. Errors can persist for years simply because no one in charge wants to bother fixing them.
My ‘Didgets’ system does many things with data. Along with finding insights within large data sets, finding and fixing data anomalies is probably one of its strongest features. Relational tables with millions of rows and thousands of columns can be loaded with ease, and a scan of each column quickly points out values that occur only rarely.
For example, if a table includes US customer addresses, then a scan of the ‘state’ column will show how many customers are in each state. There are only 50 states in the US, so if the data entry program did not check for a valid state name, then ‘Misissippi’ or ‘Conecticut’ might be among the values entered. Unless the person entering the data misspelled the state name every time, there should be only a few instances where it was entered wrong, so the misspellings stand out. The Didget tool lets the user find and correct these errors very quickly and easily.
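Outside of Didgets, the same kind of column scan can be sketched in a few lines of pandas; the file name and the ‘state’ column below are assumptions made just for the example.

```python
import pandas as pd

# Load a hypothetical customer table that has a 'state' column.
df = pd.read_csv("customers.csv")

# Count how many rows carry each distinct value in the column.
counts = df["state"].value_counts(dropna=False)

# Values that appear only a handful of times are more likely to be typos
# ('Misissippi', 'Conecticut') than legitimate categories.
print(counts[counts < 5])

# Once a misspelling is confirmed, fixing it is a single bulk replace.
df["state"] = df["state"].replace({"Misissippi": "Mississippi",
                                   "Conecticut": "Connecticut"})
```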
Other problems are harder to find and fix. For example, one of the data sets I like to analyze and demonstrate to potential users is the ‘Chicago Crime Data’. The city has published all the crimes reported over the past 22 years. One of the data points is the city block where the crime occurred. The last two digits of the street address are crossed off so that many addresses map to the same block (e.g. 013XX E 47TH ST applies to every crime reported on the 1300 block of that street).
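As a rough illustration of that masking (my guess at the transformation, not the city's actual code), a full address reduces to a block like this:

```python
import re

def to_block(address: str) -> str:
    """Mask the last two digits of the street number and zero-pad it,
    e.g. '1347 E 47TH ST' -> '013XX E 47TH ST'."""
    match = re.match(r"(\d+)\s+(.*)", address)
    number, street = match.group(1), match.group(2)
    return number[:-2].zfill(3) + "XX " + street

print(to_block("1347 E 47TH ST"))   # 013XX E 47TH ST
```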
Sometime in 2002, they decided to change the way some streets are named in the database. Before that, the block was specified as ‘013XX E 47 ST’, but after the change they added the ‘TH’ to the end of ‘47’. At the time of the change, they did not go back and update any entries that used the old spelling; so now there are entries for the same block with different spellings. They did the same for other streets: ‘E 101 Place’ was changed to ‘E 101ST Place’, ‘W 102 AVE’ was changed to ‘W 102ND AVE’, and so on.
Since the change came early on, far fewer crime reports use the old spelling than the new one. Had they changed the street names just 10 years ago, nearly half of all crimes on those streets would carry one spelling or the other. Either way, any analysis of crimes on an affected street has to check both spellings in order to get an accurate count.
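For anyone working with the raw CSV rather than a tool, a rough pandas sketch of that dual-spelling check (and of pinning down when the change happened) might look like this; the file name and the ‘Block’ and ‘Year’ columns are assumptions based on the published data, and only the E 47th St example is shown.

```python
import pandas as pd

crimes = pd.read_csv("chicago_crimes.csv")

# Keep every row on the example blocks, whichever spelling they use.
on_47th = crimes[crimes["Block"].str.contains(r"E 47(?:TH)? ST$", na=False)].copy()
print("Total crimes on E 47th St blocks:", len(on_47th))

# Tag each row as old style ('47 ST') or new style ('47TH ST') ...
on_47th["style"] = on_47th["Block"].str.contains("47TH", na=False) \
                         .map({True: "new (47TH ST)", False: "old (47 ST)"})

# ... then count by year and style; the year where the 'old' column drops
# to zero is the year the street names were changed.
pivot = on_47th.pivot_table(index="Year", columns="style",
                            values="Block", aggfunc="count", fill_value=0)
print(pivot)
```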
With the ‘Pivot Table’ feature of Didgets, it was very easy to find which streets were changed and in what year that occurred. Getting the data fixed is another story. Since I have no control over the actual data set that is published, the Chicago government would need to fix its own database. I contacted the site that publishes the data and pointed out some of the anomalies my tool detected. I suggested that they use my tool to find and fix the data themselves (they could be a great beta site).
Instead of thanking me for finding the problems and assuring me they would be fixed, I got a bureaucratic response of ‘we will look into it’ along with a stern warning about trying to sell software to the city without the proper lobbying registrations. It didn’t matter that my software was free to download and try. I doubt anyone took my concerns seriously. So any analysis of Chicago crime data has to be taken with a grain of salt, and that is unlikely to change anytime soon.
While we cannot control how data is entered and cleaned by third parties, a tool that makes it easy to find and fix anomalies in our own data sets can be very valuable. After all, as with looking through a window, you only get a clear view of the insights when the data is not too ‘dirty’.