There is an old saying about lies, damned lies, and statistics. One of the most deceptive forms of communication is to cherry-pick data to make it appear that your favorite narrative is supported by the evidence when it is not. We see this happen regularly in politics, in business, and in the news media. Instead of taking a careful, unbiased look at what the data actually indicates in order to reach a valid conclusion, too many people decide what conclusion they want to promote and then go looking for data that supports their position while ignoring anything to the contrary.
Such practices seem to be common in this era of ‘disinformation’ and are not limited to just one side of the political spectrum. While it could be argued that the deception is not evenly distributed among all parties, no one is immune to it. What is truly amazing is that this keeps happening even as public access to more and more data sets has become commonplace.
Local, state, and federal governments are often required to publish public records that are incredibly detailed. Average citizens have access to things like crime reports, price lists, anonymized health records, tax receipts, and voting results. It should be easier than ever to determine if what you are hearing on the evening news is supported by real data.
Likewise, businesses often collect and store vast amounts of data on their supply chains, their customers, and their internal processes. Data is available to be analyzed on a daily, hourly, or even more frequent basis. It can be ‘sliced and diced’ to generate detailed reports that could be used to shift resources, streamline processes, or try new business approaches.
Unfortunately, analyzing that data is often a time-consuming task that requires special skills. Nearly anyone can quickly load a few thousand rows of data into a spreadsheet and run a few formulas, but trying to find patterns in data sets with many millions of rows can be overwhelming. Highly trained database experts and data scientists deal with such data sets all the time, but they often lack the domain knowledge needed to recognize the most relevant items. Expertise within a given field is often required to separate the wheat from the chaff.
For example, the database administrator for a police department can run all kinds of reports that analyze crime statistics, but usually only trained law enforcement officers know whether a sudden uptick in thefts or assaults in a certain part of the city is significant. Likewise, a sales manager is best positioned to know whether the sales data for a certain product line indicates a successful marketing campaign. Trained doctors are most likely to know whether changing data points in health records indicate a community health problem.
My Didgets data management system is designed to do many things with data. One of its most developed features so far is the ability to quickly load large data sets and let average users, with minimal training on the software, analyze those data sets for relevant insights.
A doctor with very limited database experience can create a table with just a few clicks of the mouse and see important patterns across millions of health records. A police captain could load the city’s crime reports and, in just a few minutes, figure out the best places to assign new officers to fight crime. A newspaper reporter could load real estate information and determine whether a claim the mayor made at a news conference holds up.
The CEO of a company is often reliant on reports generated regularly by the IT team. They might meet to discuss what information a particular report needs. A database expert then works with the data to extract that information and generate a ‘preliminary’ report. The CEO reviews it and may request several changes until it has everything needed. This back-and-forth fine-tuning may take days or even weeks, and if something significant changes in the business, the report may no longer be accurate and the process must start over.
The person with the domain knowledge (e.g. the CEO) is often left wondering whether something important has been overlooked in a report. Has everything been classified correctly? Did a mistake in a query result in significant under-reporting or over-reporting of some statistic? Is the data being manipulated by someone with a personal interest or agenda? Is an important business decision going unmade simply because the right questions were never asked?
A tool that helps analyze data quickly and in many different ways can be extremely important for finding answers to those questions. Anomalies in the data can be quickly discovered and corrected. Graphs can be used to visualize patterns. Pivot tables can show data in a whole new light.
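Didgets does all of this through its GUI, but for readers who want a concrete picture of what such an analysis involves under the hood, here is a minimal sketch using Python and pandas. This is not Didgets code; the file name and column names (crime_reports.csv, district, offense, report_id) are hypothetical stand-ins for whatever data set is being examined.

```python
# A sketch of a pivot-table analysis with a crude anomaly check.
# All names here are hypothetical placeholders for a real data set.
import pandas as pd

df = pd.read_csv("crime_reports.csv")

# Pivot: one row per district, one column per offense type,
# counting incident reports in each cell.
pivot = df.pivot_table(index="district", columns="offense",
                       values="report_id", aggfunc="count",
                       fill_value=0)
print(pivot)

# Flag district/offense cells that sit more than three standard
# deviations above that offense's average count across districts.
flags = pivot[pivot > pivot.mean() + 3 * pivot.std()]
print(flags.dropna(how="all"))
```

The point of a tool like Didgets is that a domain expert gets the same pivot and the same outliers without writing any of this code.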
Using Didgets, a person can analyze a completely unknown data set downloaded from the Internet and find insights in just a few minutes. While testing its features, I would often load data sets that I found on Kaggle or other data sites and just start playing with the data to see what it told me. There were data sets of voting results from the last election, of reported COVID cases, of city employee payrolls, and of public transportation usage.
With just a few clicks of the mouse, I could see which counties had the highest mail-in ballot rates. I could see what age groups had the highest hospitalization rates. I could see which departments within a city had the most employees and the lowest salaries. I could tell which bus routes were the most crowded in the morning.
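In Didgets those answers came from pointing and clicking, but the first question makes a good worked example of what such a tool computes behind the scenes. Here is a minimal pandas sketch, assuming a hypothetical election_results.csv with county, mail_in_votes, and total_votes columns; real election files will differ.

```python
# A sketch of the mail-in ballot question: rate per county,
# sorted to surface the highest. All names are hypothetical.
import pandas as pd

votes = pd.read_csv("election_results.csv")

# Total the vote counts per county, then compute the mail-in rate.
by_county = votes.groupby("county")[["mail_in_votes", "total_votes"]].sum()
by_county["mail_in_rate"] = by_county["mail_in_votes"] / by_county["total_votes"]

print(by_county.sort_values("mail_in_rate", ascending=False).head(10))
```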
I could draw my own conclusions about what was happening without being totally dependent on whatever newscast wanted to spoon-feed me its storyline. Business leaders could ‘sanity check’ all their reports. This is the power of using the best tools to dig through mountains of data and find out what is real and what is not. A governmental ‘Ministry of Truth’ cannot easily cherry-pick the data in order to deceive a well-informed public.
A mathematician, engineer, and statistician are interviewed for the same job.
Each of them is individually asked only one question.
The mathematician is interviewed first and is asked, “What is 2 + 2?” to which he answers, “A simple problem in which we must prove the identity of the addition operator. I can show you the proof; I'll start with L'Hospital's theorem. Can I borrow your chalk and blackboard?”
The engineer is second and is asked the same question. “Your problem provides only a single significant digit; I'd have to say the answer is 4, plus or minus 1.”
The statistician is interviewed third and is asked, “What's 2 + 2?” He gets up from his chair, closes the window blinds, then closes the door, and slyly asks the interviewers, “What do YOU want the answer to be?”