If you have ever forgotten where you stored a file in a file system, you know that finding it can be a real challenge. File systems let you store any file in any folder, regardless of whether the folder path is appropriate for the file. File systems were built to mimic old filing systems that had drawers with folders in them. Just as a secretary could accidentally put the contract for XYZ company in the folder for ABC company, someone using a file system can save a document to their pictures folder or a downloaded program to their music folder.
If you can’t remember where you put a file, a brute-force search using the file name or some other pattern may be required. It can take a long time to search through a large folder tree looking for a file, even if you remember something specific about it with certainty. If you are looking for just one file, the search stops when it finds it. If you are looking for all documents, the search must look through every folder on the system before it finishes.
Documents, photos, and other common types of files often come in several different formats (each with its own file extension), so searching by file extension may or may not turn up the file(s) you were looking for. Searching for all files of a specific type (e.g. photos) will only find them all if you specify every extension used by that type.
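To make the cost concrete, here is a minimal sketch in C++17 (using std::filesystem) of the brute-force search an application is forced to do today: walk the whole directory tree and compare every file name against a hand-maintained list of photo extensions. The extension list and the starting directory are just example values.

```cpp
#include <cctype>
#include <filesystem>
#include <iostream>
#include <set>
#include <string>

namespace fs = std::filesystem;

int main(int argc, char** argv) {
    // Example extension list -- any photo format missing from it is silently skipped.
    const std::set<std::string> photo_exts = {".jpg", ".jpeg", ".png", ".gif",
                                              ".tif", ".tiff", ".heic"};
    const fs::path root = (argc > 1) ? argv[1] : ".";
    std::size_t found = 0;

    // Walk the entire directory tree, one entry at a time.
    for (const auto& entry : fs::recursive_directory_iterator(
             root, fs::directory_options::skip_permission_denied)) {
        if (!entry.is_regular_file()) continue;
        std::string ext = entry.path().extension().string();
        for (auto& c : ext)                      // normalize ".JPG" to ".jpg"
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        if (photo_exts.count(ext)) ++found;
    }
    std::cout << "Photos found: " << found << "\n";
}
```

Every file on the volume gets visited whether it is a photo or not, so the run time grows with the size of the haystack rather than with the number of matches.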
File systems are also designed to know only about the files in their managed volume, so even if you have multiple volumes of the same file system type (e.g. NTFS or Ext4) on the same physical hard disk drive, you must search each volume independently.
The time it takes to find a file often depends on how many other files are present in the system. Like finding a needle in a haystack, how long the search takes depends largely on the size of the haystack.
File systems were invented decades ago, when the largest physical drives were measured in megabytes and could only store a few hundred files. Today’s hard disk drives (HDDs) and solid state drives (SSDs) are measured in terabytes (millions of times bigger). HDD manufacturers have recently announced drives that will hold more than 20TB. If the average size of a file is 100,000 bytes, that means you can store about 200 million files before the drive is full. That is one big haystack!
Numerous improvements have been made to file systems over the years, but they are all still based on the same basic architecture from decades ago. They were not designed to classify files or to quickly search for a single file or a group of files. Applications that help with searching still have to do a hierarchical tree traversal using sequential enumeration functions (e.g. FindNextFile or readdir), which are slow by nature.
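Underneath any search tool, the primitives look roughly like this POSIX sketch: each directory must be opened and its entries handed back one at a time, and the traversal has to recurse into every subdirectory it finds. (The FindFirstFile/FindNextFile pair on Windows works the same way.) This is an illustration rather than production code; for instance, it assumes the file system reports entry types through d_type.

```cpp
#include <dirent.h>
#include <iostream>
#include <string>

// Count files by reading each directory sequentially, one entry per readdir() call.
// There is no "give me every photo on the volume" primitive, only "next entry".
static long walk(const std::string& dir) {
    DIR* dp = opendir(dir.c_str());
    if (!dp) return 0;                       // unreadable directory: skip it
    long count = 0;
    while (dirent* de = readdir(dp)) {
        std::string name = de->d_name;
        if (name == "." || name == "..") continue;
        if (de->d_type == DT_DIR)            // assumes d_type is filled in
            count += walk(dir + "/" + name); // descend into the subdirectory
        else
            ++count;
    }
    closedir(dp);
    return count;
}

int main(int argc, char** argv) {
    std::cout << walk(argc > 1 ? argv[1] : ".") << " files\n";
}
```

Because each call returns a single entry, answering any question about the whole volume means touching every directory, which is exactly why separate indexing services exist.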
A separate indexing service such as Microsoft’s Windows Search or Apple’s Spotlight can greatly speed up searching, but these are not an integral part of the file system, so they have to store their indexing information in a separate database. It is easy for that database to fall out of sync with the file system. Also, to speed up the indexing process, users often index only a portion of the file system, so using the index might not turn up the file(s) you were looking for.
Fast searching is just one of the problems that plague today’s file systems. As someone who has worked with file systems and databases since the 1980s, I have come up with a long laundry list of things a data storage system should do better. Many of the problems cannot be solved with just minor changes to the existing file system architecture. I believe that the time has come to completely replace file systems with something better!
Rather than list all the problems in this article, for brevity I will limit it to my ‘Top 5’. I have designed and developed a new system called ‘Didgets’ that I think solves these problems and many others. Didgets (short for data widgets) are intelligent data objects that can efficiently manage large amounts of unstructured or structured data. Whether file systems are replaced by Didgets or by some other similar system, these problems will persist until the following issues are addressed:
The fixed-size metadata record for each file is too big. Reading in and caching the entire file table takes too long and uses too much memory.
The metadata record does not have a file classification system. To determine what is in a file, the file name or the data stream must be examined.
File systems do not have a uniform tagging system that is easily and quickly searchable by applications.
Each file’s unique identifier is its full path name. If the name changes or the file is moved to a different folder, any stored references to it become invalid.
Files cannot be protected against malicious code. Virus detection software must examine every single file to ensure the system is safe.
Every file system has a file table that stores a record for each file. The size of this record in popular file systems ranges from 256 bytes (Ext4) to 4096 bytes (NTFS). This means that if you have 200 million records in the table, between 50GB and 800GB of data must be read in and cached if you want to do lots of fast searches. Disk transfer speeds have certainly increased (especially for SSDs) and memory is cheaper than ever, but that is still a lot of data. With Didgets, each record is only 64 bytes, which means a table with 200 million records is less than 13GB in total, which is much more manageable.
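(The arithmetic: 200 million × 256 bytes ≈ 51GB, 200 million × 4,096 bytes ≈ 819GB, and 200 million × 64 bytes ≈ 12.8GB.)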
Each Didget has a small field in its metadata record that tells whether it holds a photo, a document, a video, or some other type of data. Searches can be exceptionally fast when only the table records must be examined. On my development machine, I can find all 20 million photos (out of 200 million files) in under 2 seconds. Fast searches like this are impossible if you have to compare file name extensions.
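As a simplified sketch of the idea (this is not the actual Didget record layout), imagine a small fixed-size record that carries its own classification; finding every photo then becomes one linear pass over an in-memory table, with no name parsing and no tree traversal:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 64-byte metadata record -- an illustration, not the real format.
enum class DataType : uint8_t { Unknown, Photo, Document, Video, Audio };

struct Record {
    uint64_t id;            // permanent 64-bit identifier
    DataType type;          // classification stored in the record itself
    uint8_t  flags;
    uint8_t  reserved[54];  // padding out to a compact fixed size
};
static_assert(sizeof(Record) == 64, "records stay small and cache friendly");

// One pass over the table answers "how many photos are there?"
std::size_t count_photos(const std::vector<Record>& table) {
    std::size_t n = 0;
    for (const auto& r : table)
        if (r.type == DataType::Photo) ++n;
    return n;
}
```

Scanning 200 million 64-byte records is roughly 13GB of sequential memory traffic, which at typical memory bandwidth fits comfortably inside the 2-second query times mentioned above.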
File systems use folder names as a general way to organize data. Some file systems also support features like extended attributes that let you attach tags to your files, but none of them make finding files based on metadata tags quick and easy. Didgets let you attach up to 255 tags to any Didget and find all the Didgets that share a common tag in seconds.
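For comparison, here is roughly what tagging looks like today using extended attributes on Linux (the file name and tag value are made-up examples): you can attach a key/value pair to an individual file, but there is no call that asks the file system for every file carrying a given tag, so answering that question still means walking the whole tree.

```cpp
#include <sys/types.h>
#include <sys/xattr.h>
#include <cstdio>
#include <cstring>

int main() {
    const char* path = "vacation.jpg";     // made-up example file
    const char* value = "hawaii-2023";     // made-up example tag value

    // Attach a tag to this one file (user-namespace xattr on Linux).
    if (setxattr(path, "user.trip", value, std::strlen(value), 0) != 0)
        std::perror("setxattr");

    // Read the tag back -- but only for this one file. There is no
    // "find every file where user.trip == hawaii-2023" operation.
    char buf[256] = {};
    ssize_t len = getxattr(path, "user.trip", buf, sizeof(buf) - 1);
    if (len >= 0)
        std::printf("user.trip = %.*s\n", static_cast<int>(len), buf);
}
```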
The unique identifier for a Didget is a 64-bit number. It remains constant throughout the life of that Didget. It does not change when you assign the Didget a different name (names are just another tag) or put it in a different list (e.g. folder). Any stored references to the Didget remain valid until it is deleted.
Unlike a file, a Didget’s data stream can be made permanently immutable. The file system read-only attribute is just a suggestion to applications, which can ignore it. For Didgets, the immutable attribute is enforced by the system with no way around it. No application can modify the contents of a read-only Didget, no matter what user permissions it might have. Malware cannot simply inject malicious code into a critical system file. If that file needs to be updated, a new version of it must be created (with its own unique ID) and the operating system must then use the new Didget.
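As a simplified sketch (again, not the actual Didgets implementation), the key difference is that the refusal lives in the storage layer’s write path rather than in an attribute the application is trusted to honor:

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical sketch -- immutability is enforced by the storage layer itself,
// not requested politely by the application.
struct Object {
    uint64_t id;
    bool immutable;
    std::vector<uint8_t> data;
};

class Store {
    std::vector<Object> objects_;
    uint64_t next_id_ = 1;
public:
    uint64_t create(std::vector<uint8_t> data, bool immutable) {
        objects_.push_back({next_id_, immutable, std::move(data)});
        return next_id_++;
    }
    // Writes to an immutable object fail no matter who the caller is.
    void write(uint64_t id, const std::vector<uint8_t>& data) {
        for (auto& obj : objects_) {
            if (obj.id != id) continue;
            if (obj.immutable)
                throw std::runtime_error("object is immutable; create a new version instead");
            obj.data = data;
            return;
        }
        throw std::runtime_error("no such object");
    }
};
```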
If HDD and SSD capacities continue on their current trajectory, storage systems may exceed 100TB for the average user within the next decade. Files will be kept forever, and finding a single file or a group of them will become harder and more time consuming if file systems are not replaced with something better.
Watch a 4 minute demo video: