Musings from an old programmer
A few of my thoughts on data management
I might be old, but I am not ancient. I never swapped out any of those big tape reels you see in documentaries about the Apollo space program. I never fed a program I wrote into a mainframe using a stack of punch cards. But I came of age at the dawn of the ‘Personal Computing’ revolution, so I am definitely old compared to those young programmers just getting started. I wrote my first program (in BASIC) on an Apple II computer (I think it was the Plus model) that my high school bought for the math department back in 1980 (or was it 1981?).
Since computing resources were very scarce (and expensive), I learned to program when the size and efficiency of the running code were paramount. Memory was measured in kilobytes. CPUs barely ran in the MHz range. Floppy disks were the norm. It took several minutes to boot your computer. Unlike today, when few programmers pay much attention to optimizing every line of code, I would often spend hours fine-tuning a single function.
It feels like efficiency is a lost art in many of today’s programs. With so much cheap memory, multi-core CPUs, and fast storage devices, almost all the emphasis is placed on getting a program finished quickly instead of worrying much about how efficiently it makes use of those resources. If a program passes all of the functional tests, it is often declared finished and ready to ship.
I understand the market pressures that drive this kind of development. First-to-market is very important. Scale-out web services in the cloud hide a lot of inefficiencies in code. If the service starts bogging down when a million customers come online, just spin up a few more servers and the problem is solved, right?
It is often only when the bills for cloud services (or the electric bill, if they host their own hardware) start reaching astronomical heights that many companies really start paying attention to code efficiency. When the mounting effects of slow code, bloated software, and inefficient data storage methods start hitting the bean counters in the pocketbook, management finally (and often reluctantly) starts to focus on optimization.
In spite of all the amazing advancements in computer hardware, data can always outgrow the capacity to handle it. It is simply much easier to generate and store more and more data than it is to double or triple the hardware capacity. Even when some hardware can be upgraded easily and cheaply, there may still be bottlenecks that can bog things down.
One area where this is most noticeable is with hard disk drives. Drives now have capacities that are about 1,000 times bigger than they were just 20 years ago. But they are only about 5-10 times faster at reading and writing the data. That means it can take over 100 times as long to fill a drive (or restore it from backup) today as it did in the recent past. SSDs are much faster than hard drives, but they are still much more expensive for storing large amounts of data.
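To make that concrete, here is a quick back-of-the-envelope sketch in Python. The drive sizes and transfer rates below are my own rough assumptions for illustration, not measurements of any particular model:

```python
# Rough, assumed figures for illustration only -- not benchmarks of real drives.
# ~20 years ago: a 40 GB drive writing around 50 MB/s.
# Today: a 20 TB drive writing around 250 MB/s sequentially.

def hours_to_fill(capacity_gb, throughput_mb_per_s):
    """Time to write the entire drive from end to end, in hours."""
    seconds = (capacity_gb * 1000) / throughput_mb_per_s
    return seconds / 3600

old = hours_to_fill(40, 50)         # about 0.2 hours (roughly 13 minutes)
new = hours_to_fill(20_000, 250)    # about 22 hours (most of a day)
print(f"then: {old:.1f} h, now: {new:.1f} h, ratio: {new / old:.0f}x")
```

The exact numbers are not the point; the point is that capacity has grown far faster than throughput, so the time it takes to touch every byte on a drive keeps getting longer.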
Databases now have tables with hundreds of millions of rows. CPUs sometimes have to process billions of data points to generate a single report. Users often have to search through tens of millions of files to find the one they misplaced. With the ever growing volume of data, there is a great need for better and more efficient ways of managing it all.
Whenever I talk about efficiency in data management, I often refer to Einstein’s famous equation… E = mc². This equation describes the relationship between matter and energy. When an atomic bomb explodes, an enormous amount of energy is released even though only a small amount of fissionable material is converted within the bomb. This is because the ‘c’ in the equation is a really big number (the speed of light) and it is squared. Even a small increase in the amount of matter being destroyed will result in a big difference in the amount of energy released.
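If you plug in the numbers for converting just one gram of matter (this is nothing more than the textbook values dropped into the formula), you get E = 0.001 kg x (3 x 10⁸ m/s)² ≈ 9 x 10¹³ joules, which is roughly the yield of an early atomic bomb. One gram in, a city-flattening amount of energy out.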
The same concept applies to the storage and processing of data. If time, energy, or storage space gets multiplied by some really big number, the effect can be tremendous. Code is often written and tested with smaller numbers in mind, but data can grow to billions or even trillions of individual pieces.
For example, if you have an inefficient function that takes just one millisecond more than it should, you might not even notice if it is called for a few thousand data points. But if that same function is applied against a billion data points, that inefficiency can result in an extra 1 million seconds (1 billion x .001 second). If you don’t have your calculator handy… that is over 277 hours, or about 11.5 days. Likewise, 10 wasted bytes in a data record may not seem like much if you only have 100,000 records (barely a MB). But if you have to store 100 billion records, that wasted space amounts to a whole TB of storage.
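If you would rather let the computer do the multiplication, here are those same two calculations as a few lines of Python (the numbers are exactly the ones quoted above):

```python
# One extra millisecond per call, applied to a billion data points.
extra_seconds = 1_000_000_000 * 0.001
print(extra_seconds / 3600)      # ~277.8 hours
print(extra_seconds / 86_400)    # ~11.6 days

# Ten wasted bytes per record.
print(10 * 100_000 / 1e6)            # 100,000 records     -> 1.0 MB
print(10 * 100_000_000_000 / 1e12)   # 100 billion records -> 1.0 TB
```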
Many of the software architectures used to manage data (file systems, databases, etc.) were created decades ago when data sets were significantly smaller than today. Some new architectures have emerged for sure, but too many of the old systems are still in use today.
My passion is finding new and innovative ways to store and manage large amounts of data. I will use this newsletter to share my insights into this area.
