Anyone who has studied the history of scientific discoveries and inventions knows that a great many advances were the product of fortuitous accidents. Serendipity occurs when an accident or ‘trigger event’ causes someone to take notice and discover something significant about what caused it.
Safety glass was discovered when its inventor accidentally dropped a glass flask that he had previously filled with cellulose nitrate, a liquid plastic. Instead of shattering as expected, it broke but maintained its shape. Penicillin was discovered when some samples of bacteria were left out accidentally and some mold formed on one of them. The bacteria had failed to grow in areas near the mold.
Even though a trigger event might be caused by an accident, it still requires a trained mind to investigate and understand what caused the event and think of a useful application for the discovery. In fact, events like these might have occurred many times in the past, but no witness realized what caused them or how they might be exploited. The ‘eureka moment’ often occurs when someone with the right experience happens to be in the right place at the right time.
Many discoveries also happened because the trigger event exposed an anomaly. Often the answer (or at least strong evidence) had been staring everyone in the face all along, but the anomaly had been dismissed as just ‘background noise’. The discovery came after someone noticed the anomaly and investigated why it was different from all the rest. Recognizing and focusing on the ‘outlier’ among all the various data points turned out to make all the difference.
Modern computing systems have the ability to gather, store, and analyze vast amounts of data. Companies of all sizes are collecting data on nearly every aspect of their business. Sales figures, supply chains, customer profiles, and manufacturing processes are all measured and recorded on a frequent basis. It is not just summary information that is being maintained, but also individual data points at a very granular level.
Data mining is a term used to describe the practice of sifting through large amounts of data to discover insights that can be used to make business decisions. Extremely large data sets often require special software tools and specially trained data scientists to examine the data in an effective manner. But often significant insights can be discovered with much smaller data sets. Petabytes of data do not necessarily need to be analyzed to find meaningful patterns or discover significant anomalies.
The amount of data a business can collect does not necessarily correlate to the size of the company. While it is a general rule that the bigger the business becomes, the more data it collects, some small businesses on tight budgets can collect large amounts of data. These businesses need simple tools that can sift through some hefty data sets in order to find the needed insights.
There is currently a pretty big gap between spreadsheets (the bread and butter of many small businesses) and the huge analytic processes of major companies. The largest businesses can often afford to employ expensive data scientists who use powerful, distributed computing environments to comb through vast amounts of data. They then use the analysis to generate many reports that can be shown to people with the expertise necessary to understand what the reports mean.
The skills needed to accomplish all this are generally beyond the capabilities of many small to medium-sized businesses. What they really need is a tool that is about as easy to use as a spreadsheet, but is powerful enough to quickly analyze data sets with many millions of data points. There are a number of offerings that are trying to fill this gap, but none have become as ubiquitous as the spreadsheet.
My Didget Management System, along with its accompanying data analytics tool, is designed to help fill that gap. A user can drop a CSV or JSON file into a window within the tool and have a relational table built in just a few clicks of the mouse. The table can then be analyzed quickly, and the user can use their experience and domain knowledge to understand which items are significant. The tool can also help find and clean up any errors in the data so that the analysis is more accurate.
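To give a sense of what that step replaces, here is a rough sketch in Python (pandas plus SQLite) of the manual work of turning a CSV file into a queryable relational table. This is not the Didget tool or its API; the file name and column names are hypothetical examples.

```python
# Minimal sketch (not the Didget API): load a CSV into a relational
# table and run a quick query against it. File and column names are
# hypothetical.
import sqlite3
import pandas as pd

# Read the raw CSV; pandas infers column types on load.
df = pd.read_csv("sales.csv")

# Basic cleanup: drop exact duplicate rows and rows missing a key field.
df = df.drop_duplicates()
df = df.dropna(subset=["region"])

# Store the cleaned data as a relational table in a local SQLite file.
con = sqlite3.connect("sales.db")
df.to_sql("sales", con, if_exists="replace", index=False)

# Query the table like any other relational database.
totals = pd.read_sql_query(
    "SELECT region, COUNT(*) AS row_count, SUM(amount) AS total "
    "FROM sales GROUP BY region ORDER BY total DESC",
    con,
)
print(totals)
con.close()
```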
It is not enough for a tool to just perform various queries quickly and easily. It also needs to help the user intuitively find the most common data points and spot the outliers. Our analysis tool can instantly show you the number of rows associated with each column value just by clicking on the column header. It then allows a subset of the table (created by selecting some of the most common or the least common values) to be analyzed separately.
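For comparison, here is a small Python sketch of the equivalent operations done by hand with pandas: counting rows per column value and then carving out subsets built from the most common and the least common values. Again, this is only an illustration of the idea, not the Didget tool itself, and the file and column names are hypothetical.

```python
# Minimal sketch (not the Didget UI): count rows per column value and
# build subsets from the most and least common values.
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical file from the sketch above

# Equivalent of clicking a column header: rows per distinct value.
counts = df["region"].value_counts()
print(counts)

# Subset built from the three most common values...
top_values = counts.head(3).index
common = df[df["region"].isin(top_values)]

# ...or from the rarest values (the 'outliers') for separate analysis.
rare_values = counts.tail(3).index
outliers = df[df["region"].isin(rare_values)]

print(f"{len(common)} rows in the common subset, {len(outliers)} in the rare one")
```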
Here is a short demo video that shows how our tool can analyze a data set that I downloaded off the Internet (the COVID-19 death data from the CDC) using queries, charting, and pivot tables:
The software is currently in an open beta and is available for free download on our website. Feel free to try it out on any data set you can find, large or small, to see what insights it can give you. https://didgets.com/download