It has now been over a decade since I wrote the first line of code for my Didgets project (a new kind of general-purpose data management system). Before that, several years had already passed since I first started formulating the ideas and the design. Through the years, the project changed quite a bit as I added new features, tweaked the design, and focused on different aspects. Even so, the initial design was solid enough that it remains very recognizable through all the changes.
In recent years, I have focused mainly on the database features. The system is able to import structured and semi-structured data from files or external databases to create relational tables that can be queried and analyzed quickly and easily. Trying to gain some more traction in the market, I often tout those features as the most prominent ones, but the file system features are really where it all began.
Didgets started out as a potential replacement for conventional file systems! Hundreds of file systems have been invented over the past 60 years. Each one has a few unique features, but all of them conform to the same basic architecture. Didgets was designed to be a radical departure from that architecture: data is stored in a set of specialized, intelligent data objects that are managed in a unique way.
The meta-data structures I designed to support tagging for files turned out to be very useful for efficiently handling highly-structured and semi-structured data, so the scope of the project expanded into areas normally handled by relational databases and NoSQL solutions. While those areas are very important for other kinds of data management, I wanted to focus this post on the root purpose of Didgets, which was an efficient way to handle large amounts of unstructured data (i.e. files).
When I started designing Didgets, the biggest hard drives were still under a single terabyte (TB). To get extremely large file system volumes, you had to combine several smaller drives into an array of disks using hardware and software emulation to make them look like one big drive. I still remember seeing the first advertisement for a 1TB system back in the late ’90s. The vendor placed five hundred 2GB drives in a couple of enclosures the size of refrigerators and was selling the whole system for over $1M.
Just 10 years later you could buy the same capacity in a single drive for just over $400. Today, that same money will buy you a drive with 20x the capacity (20TB) and even larger drives are expected soon. It is now possible at low cost to build file system volumes that can hold hundreds of millions of files. Storage capacity is no longer the biggest issue; the challenge now is managing and processing huge amounts of data in a timely manner.
The original architecture for file systems was created decades ago when the biggest disks could, at best, hold only a few thousand files. As volumes grew in size, file systems had to be updated in order to handle a much larger set of files, but they still do many tasks the same way they have always done them. Once file systems grew big enough for a million decent-sized files, searching through all of them looking for a file or a set of files proved to be a challenge. It can now take hours or even days to find files using just file system calls. Finding better ways to store and cache the data is critical.
File systems were just never designed to perform searches quickly and efficiently. Operating system builders recognized this and created file indexing services that would crawl the file system and put file metadata in a special kind of database for faster lookups. macOS has Spotlight. Microsoft built Windows Search. Many Linux distributions use locate. A variety of other third-party ‘file indexers’ are also available. While all use various file system calls to gather data, none of them is an integral part of a file system.
My Didget system was designed to replace the archaic file system architecture with something better. It was built to handle hundreds of millions of files, attach a variety of meta-data tags to each file, and perform very fast queries based on them. Finding a single file or set of files quickly does not depend on a separate indexing service.
Didgets does have a few things in common with conventional file systems. Like them, it works with block storage devices like hard drives and SSDs. It stores each Didget’s content in a ‘data stream’ made of a collection of blocks. It has a sophisticated method of allocating and tracking blocks that is similar to other systems. Each Didget has a fixed-sized meta-data record that is stored in a system table and a unique identifier or ‘Key’ to distinguish it from all the other Didgets. Files can be arranged in a hierarchical folder structure like file systems use, but do not need to be.
The way the Didget system organizes and manages its meta-data is truly unique. A Didget record is small and contains information not found in other systems. This makes it easier and more efficient to find things. A variety of tags can be attached to each and every Didget and are designed to be searchable with minimal disk reads. A Didget can have one or more name tags, or it can have none.
On some file systems, the file record can be as big as 4096 bytes, the size of a whole block on modern hard drives. If a volume contains 200 million files, this means the file table can be 800GB in size and a file’s record could be located anywhere within that table. Finding a single file might require reading in the whole table, which can take a couple of minutes on the fastest SSD or more than 25 times as long on the fastest hard drive. Caching that whole table in memory so that many different queries can be fast can tax even the most expensive computers.
A Didget record, on the other hand, is only 64 bytes so the table for 200 million Didgets requires less than 13GB. This can be read in under 2 seconds on a fast SSD and less than a minute on a hard drive. Even many budget computers have enough memory to cache the whole table so many queries can occur without swapping out to disk. The table itself can be partitioned into smaller segments and only certain types of Didgets stored within each segment. For example, all the records for photos might be within the same segment. If only a few million photos were among the 200 million Didgets, all of them could be found by reading in a small part of the table.
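The arithmetic behind those two table sizes can be checked with a rough sketch like the one below. The record sizes and file count come from the text above; the throughput figures (about 7 GB/s for a fast SSD, about 0.25 GB/s for a fast hard drive) are my own assumptions for illustration, not measurements, and the function names are made up for this example.

```python
# Back-of-the-envelope numbers for the file-table comparison.
# Throughput figures are assumptions, not measurements.

FILE_COUNT = 200_000_000
GB = 10**9  # decimal gigabytes, as drive vendors count them

SSD_GB_PER_S = 7.0   # assumed: fast NVMe SSD, sequential read
HDD_GB_PER_S = 0.25  # assumed: fast hard drive, sequential read


def table_size_gb(record_bytes: int, count: int = FILE_COUNT) -> float:
    """Size of a table of fixed-size records, in decimal GB."""
    return record_bytes * count / GB


def scan_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to read the whole table once, sequentially."""
    return size_gb / throughput_gb_per_s


for label, record_bytes in (("4096-byte record", 4096), ("64-byte record", 64)):
    size = table_size_gb(record_bytes)
    print(
        f"{label}: {size:.1f} GB, "
        f"SSD scan {scan_seconds(size, SSD_GB_PER_S):.1f} s, "
        f"HDD scan {scan_seconds(size, HDD_GB_PER_S):.0f} s"
    )
```

Under these assumptions the 4096-byte table comes to roughly 819 GB and the 64-byte table to 12.8 GB, a 64x difference that carries straight through to scan time and cache footprint.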
File records in file systems generally do not have a way to classify files. In order to determine if a file is a photo or a document, the extension of the file name, which is stored separately, must be examined. This can be a slow, tedious process in conventional systems. A Didget record is different in that it has a small bit field that specifies the type and subtype for the Didget. Photos can be distinguished from documents or other kinds of files without any string operations. A million photos can be found among 200 million files in just a second or two by examining just the table.
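To illustrate the idea of classifying by a bit field instead of a file extension, here is a minimal sketch. The record layout (an 8-byte ID, a 2-byte type/subtype field, and padding up to 64 bytes), the type codes, and the function names are all hypothetical; the real Didget record format is not described in this post.

```python
import struct

# Hypothetical 64-byte record layout (NOT the real Didget format):
#   8-byte ID, 2-byte type/subtype bit field, 54 bytes of padding.
RECORD = struct.Struct("<QH54x")
assert RECORD.size == 64

# Type and subtype codes invented for this example.
TYPE_DOCUMENT, TYPE_PHOTO = 1, 2
SUBTYPE_NONE, SUBTYPE_JPEG, SUBTYPE_PNG = 0, 1, 2


def pack_kind(type_code: int, subtype: int) -> int:
    """Put the type in the high byte and the subtype in the low byte."""
    return (type_code << 8) | subtype


def find_by_type(table: bytes, type_code: int) -> list[int]:
    """Return the IDs of all records of one type, using integer compares only."""
    matches = []
    for didget_id, kind in RECORD.iter_unpack(table):
        if kind >> 8 == type_code:
            matches.append(didget_id)
    return matches


# A tiny three-record table standing in for the 200-million-record one.
table = b"".join(
    RECORD.pack(didget_id, pack_kind(type_code, subtype))
    for didget_id, type_code, subtype in [
        (100, TYPE_PHOTO, SUBTYPE_JPEG),
        (101, TYPE_DOCUMENT, SUBTYPE_NONE),
        (102, TYPE_PHOTO, SUBTYPE_PNG),
    ]
)
photo_ids = find_by_type(table, TYPE_PHOTO)
```

The scan never touches a file name: each record is decided by a shift and an integer comparison, which is why a pass over the whole table can stay close to raw read speed.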
File systems also do not have a way to distinguish between classes of files. Didgets uses a four-class system (public, private, semi-private, and system) to distinguish each Didget. ‘Public’ Didgets contain data that the user downloaded or copied from an external source. Software, photos, or documents you get off the Internet will have this designation. Pictures that you take, documents you write, or videos you record will have a ‘Private’ designation. ‘Semi-Private’ data was created on the user’s machine, but by an application to store things like configuration or state. Browser cookies or bookmarks would fall into this category. ‘System’ Didgets contain internal data used to manage the system.
The classification system allows operations like backup or security to be much more efficient. You can back up all your private and semi-private data without using extra space for all the applications and data you downloaded, which could easily be re-copied. If you are looking for photos, you can distinguish between the ones you took with your own camera or phone and the ones that your operating system uses for icons, buttons, or other graphics.
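A backup that honors the four classes could look something like the sketch below. The class names come from the text; the `Record` type, its fields, and `select_for_backup` are hypothetical names made up for this example.

```python
from dataclasses import dataclass
from enum import Enum


class DidgetClass(Enum):
    PUBLIC = "public"
    PRIVATE = "private"
    SEMI_PRIVATE = "semi-private"
    SYSTEM = "system"


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a Didget record; fields are illustrative."""
    didget_id: int
    didget_class: DidgetClass
    size_bytes: int


# Only data the user (or the user's applications) created is worth backing up;
# public data can be downloaded again.
BACKUP_CLASSES = frozenset({DidgetClass.PRIVATE, DidgetClass.SEMI_PRIVATE})


def select_for_backup(records: list[Record]) -> list[Record]:
    """Keep only the records whose class is in the backup set."""
    return [r for r in records if r.didget_class in BACKUP_CLASSES]


records = [
    Record(1, DidgetClass.PUBLIC, 4_000_000_000),    # downloaded installer
    Record(2, DidgetClass.PRIVATE, 5_000_000),       # photo the user took
    Record(3, DidgetClass.SEMI_PRIVATE, 20_000),     # browser bookmarks
    Record(4, DidgetClass.SYSTEM, 1_000_000),        # internal bookkeeping
]
to_back_up = select_for_backup(records)
backup_bytes = sum(r.size_bytes for r in to_back_up)
```

In this toy data set the backup skips the 4 GB download entirely and copies only the two records the user could not get back from anywhere else.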
For file systems, each file’s unique identifier is its full path name. This is a string of characters that represents the names of each folder in the hierarchical tree and the file name itself. Any change to the file’s name or one of its folder names (such as moving the file to a different folder) will affect the unique identifier (ID). This includes translating any names to a different language. If you have stored that ID in a file or database, it becomes invalid after such a change and will fail the next time you try to use it. Like broken links on the Internet, it is a source of frustration when a ‘file not found’ message comes up.
Didgets, on the other hand, use a 64-bit number as the ID for each Didget. This ID never changes over the life of the Didget. Moving the file to a different folder or changing its name (a name is just another tag) does not change its ID. A set of Didget IDs can be stored and will always be valid unless a Didget is deleted.
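The difference between the two kinds of identifier can be shown with a pair of toy in-memory stores. Both classes are hypothetical illustrations and not the Didgets API; the IDs here are plain Python integers rather than real 64-bit values.

```python
from itertools import count


class PathStore:
    """Toy file system: the full path is the only identifier."""

    def __init__(self):
        self._files = {}

    def create(self, path: str, content: bytes) -> str:
        self._files[path] = content
        return path  # the path doubles as the ID

    def rename(self, old_path: str, new_path: str) -> None:
        self._files[new_path] = self._files.pop(old_path)

    def read(self, path: str) -> bytes:
        return self._files[path]  # raises KeyError if the saved path is stale


class IdStore:
    """Toy ID-based store: a number identifies the object, the name is just a tag."""

    def __init__(self):
        self._ids = count(1)
        self._content = {}
        self._names = {}

    def create(self, name: str, content: bytes) -> int:
        didget_id = next(self._ids)
        self._content[didget_id] = content
        self._names[didget_id] = name
        return didget_id

    def rename(self, didget_id: int, new_name: str) -> None:
        self._names[didget_id] = new_name  # the ID is untouched

    def read(self, didget_id: int) -> bytes:
        return self._content[didget_id]


# A saved path goes stale as soon as the file is moved.
paths = PathStore()
saved_path = paths.create("/photos/2012/beach.jpg", b"pixels")
paths.rename(saved_path, "/vacation/beach.jpg")
try:
    paths.read(saved_path)
    path_still_valid = True
except KeyError:
    path_still_valid = False

# A saved ID survives any number of renames.
ids = IdStore()
saved_id = ids.create("beach.jpg", b"pixels")
ids.rename(saved_id, "strand.jpg")
id_still_valid = ids.read(saved_id) == b"pixels"
```

After the rename, the stored path raises the toy equivalent of ‘file not found’, while the stored ID still reads back the same content.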
Finally, tagging files with extra meta-data to enable faster and more intelligent search has always been a challenge. Some file systems have ‘extended attributes’ for this purpose, although they are still not enabled by default on many systems. These attributes are generally stored within the file record, which bloats its size, and they require a lot of processing to find specific tags. For these reasons, tagging files is not as popular as it should be.
Didgets has a robust tagging system that is fast and easy. Up to 255 tags can be attached to any Didget. A tag is a simple Key-Value pair that is stored within special ‘Tag Didgets’. Each tag is like a column in a relational database table. Just as a database such as Postgres can run “SELECT * FROM <table> WHERE state = ‘Texas’;” and return a million rows out of a 100 million row table very quickly, Didgets can run “PHOTOS WHERE .device.Camera = ‘Canon EOS’ and .event.Wedding LIKE ‘%Jack%’;” to find in seconds all of Jack’s wedding pictures taken with that camera.
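The column-like nature of tags can be sketched with a toy in-memory store that keeps one ID-to-value map per tag name and answers a query by intersecting ID sets. This is only an illustration under my own assumptions: the `TagStore` class, its methods, and the sample data are hypothetical, and the real ‘Tag Didgets’ storage format is not described here.

```python
from collections import defaultdict


class TagStore:
    """Toy column-style tag store: one {didget_id: value} map per tag name."""

    def __init__(self):
        self._columns = defaultdict(dict)

    def tag(self, didget_id: int, name: str, value: str) -> None:
        self._columns[name][didget_id] = value

    def equals(self, name: str, value: str) -> set[int]:
        """IDs whose tag `name` is exactly `value`."""
        return {i for i, v in self._columns.get(name, {}).items() if v == value}

    def contains(self, name: str, fragment: str) -> set[int]:
        """Rough stand-in for SQL LIKE '%fragment%'."""
        return {i for i, v in self._columns.get(name, {}).items() if fragment in v}


tags = TagStore()
photo_ids = {1, 2, 3}  # assumed to come from a type scan of the record table

tags.tag(1, ".device.Camera", "Canon EOS")
tags.tag(1, ".event.Wedding", "Jack and Jill")
tags.tag(2, ".device.Camera", "Canon EOS")
tags.tag(2, ".event.Wedding", "Sam and Pat")
tags.tag(3, ".device.Camera", "Phone")
tags.tag(3, ".event.Wedding", "Jack and Jill")

# PHOTOS WHERE .device.Camera = 'Canon EOS' and .event.Wedding LIKE '%Jack%'
result = (
    photo_ids
    & tags.equals(".device.Camera", "Canon EOS")
    & tags.contains(".event.Wedding", "Jack")
)
```

Each condition only reads the one tag it names, so Didgets that lack that tag cost nothing, and the final answer is a cheap set intersection on IDs.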
These are just a few of the main differences between Didgets and conventional file systems. There are many other improvements built into Didgets, too numerous to explain in detail in this post. For those curious, we have a demo video:
While improvements continue to be made to make the database features more appealing to a wide audience, we also need beta testers who will load in large numbers of files to test out the file system features. An open beta is available at our website