For the past couple of days, I’ve been thinking and dreaming about hashes, file hashes to be exact. My fascination with hashes started with an idea to host a universal hash database that could be used to identify every file in the world. If you think that wasn’t bold enough, then I think I might have just come up with a system that in hindsight, abstracts data distribution across networks and protocols.
The main reason I’m publishing this idea, and not ringing a patent lawyer, is because I want as many eyeballs scrutinizing this idea as possible. Even though I’ve already put a lot of thought into it and got some talented people to help validate it, nothing would be worse than blindly driving down a dead end. On the other hand if the idea holds up, I really need the motivation and support of the wider community to realize it.
Now to cut to the chase, my idea can be said to be a mashup of many existing technologies, some have been technically documented and others have even been implemented in existing distribution methods. But, I believe the idea as a whole – and can only work as a whole – has never been publically discussed or realized.
As a starting point, it is my understanding that currently the most popular file distribution methods – notably HTTP and BitTorrent among others – are all arguably “filename”-based. For example, to distribute a file from a HTTP server, you specify a Uniform Resource Locator (URL) such as “
http://server.com/file.txt“, and using BitTorrent, you create a .torrent file which specifies the filename “
file.txt” among other things.
A disadvantage to the “filename”-based distribution is that it does not consistently identify unique files. For example, a simple text file can have as many names as an appropriate file system supports. On that note, this is where my idea begins.
My good friend the hash, and specifically those created by the SHA-1 hashing algorithm (or SHA-512) can be used to identify unique files. If you hash a file using a hash function, the hash you generate will not change unless the contents of that file changes. Why this is useful for file distribution is that it normalizes the identifier of the file. After all, all we really want is the contents of a file; it doesn’t really matter what it is called. Of course, you could argue that the file extension is useful, if not important, in which case you’d be glad to know these are not lost in the endgame.
The skeptical will be thinking, how do you prevent hash collisions? Well I can’t. By definition it’s part of a good hash design that the chances of collisions are minimized and so far SHA-1 has lived up to expectations. In the event that SHA-1 is compromised, the system could very well accommodate any new hashing algorithm that becomes the standard, or multiple hashing algorithms simultaneously. Update: Another practical idea to reduce the chance of collisions even further is to concatenate two hashes generated by two different hashing algorithms.
Owning up to prior art however, the aforementioned idea is not new to the world of file distribution. Magnet links, as they are called, is a Uniform Resource Identifier (URI) scheme which does exactly what I described. Most popularly used for peer-to-peer distribution networks and notably eDonkey, it creates an “URL” that is identified by a hash.
For example, “
magnet:?xt=urn:fakehash:123ABC&dn=file.txt” describes a file whose hash using the “
fakehash” hash function is “
123ABC” and is suggested to be named “
Now with a method of identifying unique files via hashes, the next step is to figure out how the file transfer transaction actually occurs. If I may say so myself, this is where the idea really shines and in true Web 2.0 mashup-spirit too.
To answer the question with a question if you will, “what are the best methods to transfer files today?” And the answer to that question is that there is no best way. Notably different types of networks (server-client, peer-to-peer) and their respective protocols (HTTP, FTP, BitTorrent, etc) are all unique, and each has a set of advantages and disadvantages that complement each other. Using these in synergy would maximize the speed and redundancy the infrastructure has to offer.
To get all these protocols to work in unity, this is where a central web service comes into play. (For the sake of simplicity, assume central does not mean one server at a single URL that is prone to failure, but a distributed system with redundancy.)
What this web service will do is abstract the resources on different networks. To put this into practice, let’s assume you are after a particular movie trailer video file; you come across a website which includes a magnet link to a high-definition 1080p version of the trailer. In this magnet link is among other things, the hash of the file and a suggested filename.
Now, using a client which associated itself with the magnet link, the application will launch and then send the hash to the web service. The web service will return you a set of claimed resources with the original content. This could be on Apple’s HTTP server as “
http://apple.com/trailers/movie_trailer.mp4“, on a movie trailer website as “
http://trailer.com/movies/19428.mp4“, on a FTP server as “
ftp://files.com/the_fantastic_movie.mp4“, and available on a specific BitTorrent tracker as “
A particular client may only support a small subset of the file transfer protocols available so the web service will only send back a list of resources that is applicable. Assuming this client supports HTTP and BitTorrent, it will now begin to download the segments of the file concurrently. How each client handles the actual data transfer is ultimately up to the application developer, allowing room for continued competition and innovation.
You might be asking, “how does the web service know where the resources for a hash are located?” It will ultimately be up to the users of this system – either manually or systematically – contribute to the resource “pool”.
Luckily, it just happens one part of the Magnet URI specification allows you to append an “alternative” resource link. In practice, it would mean different sites would append different alternate resources, which clients then send to the web service over time, taking advantage of existing infrastructures.
For example, the Apple site might have the magnet link “
magnet:?xt=urn:fakehash:123ABC&dn=file.txt&as=http://apple.com/trailers/movie_trailer.mp4“. The movie trailer website might have “
magnet:?xt=urn:fakehash:123ABC&dn=file.txt&as=http://trailer.com/movies/19428.mp4“. The BitTorrent site might have “
The pool will not be empty to begin with since one person must intend a file to be shared for it to exist in the web service database. Therefore it will have at least one resource location available. Over time the distribution becomes decentralized as more resources become available.
Of course as we all know, people can be evil little buggers. How do we trust that all resources in the pool are for the same hash? How do we ensure the resources have not tampered with? Also, it’s a reality that resources can disappear without warning, so what happens if they no longer exist?. In all of these cases, we treat the resource as an improper resource.
So how do you validate the proper resource from improper ones? By hashes of course! Granted, if the resource is a big download like a high-definition video file, it wouldn’t be the best user experience to find out a resource was improper after hashing the complete file. This is why we need to segment the files in order to avoid the situation.
The size of these segments will be calculated algorithmically so that it is consistent across the system. Assuming a 1GB file is divided into 10MB pieces, we then hash the individual 10MB piece. It turns out BitTorrent does this too, to ensure integrity at the block level. However to deliver one step beyond BitTorrent, we can use a hash tree to ensure that not only is there integrity at the block level and at the whole-file level, but that we also ensure integrity between these levels. That is to say, we can verify the individual block hashes do in fact add up to the root hash. Update: These block-level hashes will be stored in the web service database that can be pulled if or when a client requests them.
The paranoid ones will probably ask, what happens if a client mistakenly or with malicious-intent claim a resource is improper when it is proper, or vice versa? In response to this, I propose a karma-based system which ranks the “quality” of unique user’s claims in a specific time period. Other clients who verify a claim in turn boosts the karma, and others who disagree with a claim will reduces karma of that user.
I haven’t worked out the specifics of this but I welcome suggestions.
So that’s pretty much it for the technical implementation, but as we all know, technicalities is not all that matters. To implement such a system, transition is critical. Getting users and most importantly entities which distribute files to switch to such a system could admittedly be considered impossible without some sort of a bridge.
But as previously mentioned, since magnet links have native support for “alternate” links, links which work without any new clients or infrastructure, means that there is always a fallback option. A current HTTP download client could continue to support just HTTP and the same for BitTorrent. At this level, the clients can take advantage of multiple resources on the same protocol, sort of mirroring-on-steroids. And when the clients are ready to support multiple protocols, then they can simply take advantage of the new resources returned by the web service.
Preemptive frequently asked questions
To conclude, I want to preemptively answer a few burning questions some people might have.
What you’re suggesting would require figuratively a “centralized” web service which can handle the file distribution requests of a potentially mindblowing number of clients at any one time. Where would you find the resources?
It is true that this system will only work at maximum potential when there is a centralized database of the resources, but centralized does not necessarily mean one server either. DNS is a server-client system which depends on a centralized database, and technologies like distributed databases and scalable web services hosted on Windows Azure make it possible.
Would this mean that the web service could potentially log (and trace) every download?
Short answer, no. The web service only coordinates requests of resources; intent to seek potential resources of a specified hash is not proof of data transfer. However, this won’t stop individual resources such as those hosted on a HTTP server to log every request a client may subsequently make, or a BitTorrent tracker which records the peers of a file.
Are there any other applications for this beyond file downloads?
Yes. If you think about it, downloads does not necessarily have to be a user clicking a link in a browser, it is whenever a file is transferred between two clients on the Internet. Of course the benefit of this system is exponentially decreased the smaller the file is. However, as content becomes higher in quality, it is a trend that files will continue to be larger and harder to distribute faster and more reliably.
Will this prevent malicious files from being distributed?
Unfortunately, no. Because the system only knows hashes, it has no idea about the original contents of the hashed file. However, this is not to say a third party cannot identify hashes which are known to be of malicious files and flag them as such (which I propose in my hashDB project), but this is outside the scope of this system.
What’s your business model?
I’ll consult the Twitter guys. Just kidding :). Well because such a system would benefit both parties of a distribution, distributors and users, I’m sure there is some cost benefit somewhere that one can tap into to fund the system.
Of course if you’re an investor and are interested in helping me realize this, shoot me an email at firstname.lastname@example.org.
Ultimately if you’re looking for something, you no longer have to care what it’s called, where it’s hosted, if all the resources are available and safe, or even how the bits are transferred to you. You just get what the other party intended you to have.