Big Data
The definition keeps changing.
I still remember seeing an advertisement for the first 1TB storage system that I had heard about. It was in the latter half of the 1990s, and I think the biggest hard drive being shipped at the time was 2GB. The system was about the size of a couple of refrigerators into which the vendor had packed 500 of those disk drives, along with special controllers and software to make it all look like one big TB drive to whatever computer system used it. I remember that it cost more than $1M.
At the time, a TB felt like ‘Big Data’ to nearly everyone. Only companies like airlines and banks could even afford the system. They could pay all the people needed to manage the system and they had enough data on millions of customers to actually fill it up. In those early days of computing, the storage was not only expensive, but was also fairly unreliable. Proper backup procedures are still important, but restore operations due to hardware failures are much less frequent now.
Today, a TB of storage costs about the same as a decent pizza ($20). It has been almost 10 years since a drive last failed on me. The largest-capacity hard drives generally command a premium per TB ($25 to $30), and I saw an ad yesterday for a 20TB drive for under $500. Solid State Drives (SSDs) are faster, but more expensive, than hard drives; even so, you can now buy them in TB capacities for under $100. These days, average users have no trouble generating a TB of pictures, videos, documents, and other stuff they create or download off the Internet.
The definition of ‘Big Data’ today seems to center more around the processing of data than the actual storage of it. Even small to medium-sized businesses can usually afford to buy enough storage to house a PB of data (1PB = 1000TB). The problem generally comes when someone needs to process all that data, or just to find a defined subset of data within it. ‘Big Data’ is now generally defined as data that is too big to process within a reasonable amount of time without special measures being taken.
While the quantity of data is certainly important when processing it to find meaningful insights, the structure of that data may be far more important. Cloud architectures are often centered around ‘clusters’ of computers that can each process a portion of the data in concert with the others. These massively parallel systems can process huge amounts of data, but they require the data to be structured in a way that makes this possible. (See my post ‘The Power of Parallel Processing’ for further insights in this area.)
Big data sets must be structured so that they can be broken up into smaller chunks to be processed independently. Some relational databases can break up large tables into separate ‘partitions’ that can each be managed by a separate server. A billion-row table could, therefore, be broken into 10 different table sections, each with 100 million rows in it.
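The row-range scheme described above can be sketched in a few lines. This is an illustrative sketch, not the mechanism of any particular database; the names `partition_for`, `NUM_PARTITIONS`, and `ROWS_PER_PARTITION` are invented for the example.

```python
# Range partitioning sketch: assign each row id to one of N partitions,
# so each partition can be processed by a separate server.

NUM_PARTITIONS = 10
TOTAL_ROWS = 1_000_000_000
ROWS_PER_PARTITION = TOTAL_ROWS // NUM_PARTITIONS  # 100 million rows each

def partition_for(row_id: int) -> int:
    """Map a row id to the partition that owns it."""
    return row_id // ROWS_PER_PARTITION

# Rows 0..99,999,999 land in partition 0, the next 100M in partition 1, etc.
print(partition_for(0))            # 0
print(partition_for(250_000_000))  # 2
print(partition_for(999_999_999))  # 9
```

Because the mapping is a pure function of the row id, every server can compute it locally, with no coordination needed to decide who owns which rows.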
Another issue with big data is disk access speeds. It is far easier for disk makers to double the capacity of a drive than it is to double its data transfer rates. When I bought my very first hard drive (10 MB) back in 1986, it was very slow by today’s standards (although it was ‘lightning fast’ compared to my floppy drive). I wrote a program to see how long it would take to fill it up. The program ran for about 7 minutes before the drive was full. A decade later, I bought a 1GB drive (100x bigger) that took about 35 minutes to fill up (5x as long). My first TB drive had a transfer speed of about 100 MB/s. Today’s 20TB drives (20x bigger) are only about 2.5x faster.
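The trend above is easy to see if you divide capacity by transfer rate for each generation. The figures below are the ones from the text; the rates for the first two drives are simply implied by their fill times, not measured values.

```python
# Fill time = capacity / transfer rate. Capacity has grown far faster than
# transfer rate, so reading or writing a full drive takes longer with each
# generation of hardware.

drives = [
    ("10 MB (1986)", 10e6,  10e6 / (7 * 60)),   # rate implied by 7-min fill
    ("1 GB (1996)",  1e9,   1e9 / (35 * 60)),   # rate implied by 35-min fill
    ("1 TB",         1e12,  100e6),             # 100 MB/s, from the text
    ("20 TB",        20e12, 250e6),             # ~2.5x faster, from the text
]

for name, capacity, rate in drives:
    minutes = capacity / rate / 60
    print(f"{name}: {minutes:,.0f} minutes to read/write in full")
```

Run it and the 20TB drive comes out to well over 20 hours for a full scan, which is why sequentially reading an entire modern drive is rarely an option.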
The same goes for modern SSDs. It is easier to double the amount of flash memory on them than it is to increase their speeds dramatically. My 1TB SSD has a 7 GB/s read speed. I could have splurged for the 2TB model, but it would have had about the same access speeds.
What this means is that you not only have to break up big data into smaller chunks, but you have to distribute it across multiple storage devices. It doesn’t do you any good to break up a 1PB database into 10 or 100 sections if they are all stored within the same RAID array. If multiple servers processing the data have to share the same storage device, all the read and write operations get bottlenecked.
In order to really maximize performance, you not only need to break up data into smaller chunks and store them on separate devices, but you have to do it in an intelligent way. Doing so will greatly increase the chances that you can query or process the data without needing to read or write all of it.
If you were looking for a set of books on the Civil War in your public library, you would not have to look on every shelf if the computerized catalog was offline. This is because the library separates out books based on their content. All the fiction books are in a different section than the non-fiction books. If the library is big enough, it will be divided further into much more granular sections. Chances are, all the Civil War books are located together in just a couple bookcases in one corner of the library. You wouldn’t have to inspect all the books in the library to find the few you were looking for.
The same could be true of large data sets. If you had all the census data for the United States stored in one big database, it would be wise to break up the data by year and by location. Someone searching for data for someone who lived in Colorado in 1980 would not have to read in all the data. The system could limit its search to the section that holds just the data for that state in that census year.
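The library analogy and the census example both describe partition pruning: if data is stored in partitions keyed by the attributes people search on, a query only has to touch the partitions that match. Below is a minimal in-memory sketch of that idea; the census records and the `add_record`/`query` helpers are invented for illustration.

```python
# Partition pruning sketch: records are grouped by (year, state), so a
# query reads only the one partition that matches, never the whole set.

from collections import defaultdict

partitions = defaultdict(list)  # (year, state) -> list of records

def add_record(year: int, state: str, record: dict) -> None:
    partitions[(year, state)].append(record)

def query(year: int, state: str) -> list:
    # Only the matching partition is examined; every other partition
    # is skipped without being read at all.
    return partitions[(year, state)]

add_record(1980, "CO", {"name": "Alice"})
add_record(1980, "TX", {"name": "Bob"})
add_record(1990, "CO", {"name": "Carol"})

print(query(1980, "CO"))  # only the Colorado-1980 partition is touched
```

The same principle scales up: with the partitions on separate drives or servers, a pruned query also avoids the disk bandwidth bottleneck described earlier.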
My Didgets system is designed to break up large data sets into smaller chunks and store the data intelligently within them. That is one of the reasons why my system processes database queries so much faster than other databases. If you are querying a huge table of customers for the rows where the customer lives in Texas, it may only need to read in and process a small portion of the table to satisfy the query.
In my last post about file systems, I talked about how the Didget system stores each object’s metadata in a record that is only 64 bytes in size (compared to 256 bytes or even 4096 bytes for some other popular file systems). The post drew a number of comments on Slashdot claiming that this doesn’t matter at all. But let’s do the math:
If you have a 20TB drive with 200 million files on it, you have to read over 800GB from disk just for the file table if those files are stored within the NTFS file system (the default for Microsoft Windows). If you read it at the drive’s maximum read speed, that would take about 3,267 seconds, or almost 55 minutes. The same operation on Didgets takes only 1/64th of that time, or less than 1 minute. The amount of RAM needed to cache all that metadata is also cut to 1/64th of what NTFS requires. Other file systems may not be as wasteful as NTFS, but they still require several times the space and time of Didgets.
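The arithmetic above can be checked directly. The 4096-byte per-file record size for NTFS and the roughly 250 MB/s read speed are the figures implied by the text, not measurements.

```python
# Reproducing the metadata-size arithmetic from the paragraph above.

files = 200_000_000
ntfs_record = 4096    # bytes per file record (figure implied by the text)
didget_record = 64    # bytes per file record

ntfs_total = files * ntfs_record      # ~819 GB of metadata
didget_total = files * didget_record  # ~12.8 GB of metadata

read_speed = 250e6  # bytes/sec, roughly a 20TB drive's max sequential rate

print(f"NTFS:    {ntfs_total / 1e9:.1f} GB, ~{ntfs_total / read_speed / 60:.0f} min to read")
print(f"Didgets: {didget_total / 1e9:.1f} GB, ~{didget_total / read_speed:.0f} s to read")
print(f"Ratio:   {ntfs_record // didget_record}x")  # 64x
```

Note that the 64x factor comes entirely from the record-size ratio (4096 / 64); both the read time and the RAM needed to cache the metadata scale by the same factor.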
Didgets was designed with ‘Big Data’ in mind. We are still looking for more beta testers at www.Didgets.com
