July 10, 2009Long Zheng

Reimagining file distribution: universal downloads

Foreword

For the past couple of days, I’ve been thinking and dreaming about hashes, file hashes to be exact. My fascination with hashes started with an idea to host a universal hash database that could be used to identify every file in the world. If you think that wasn’t bold enough, then I think I might have just come up with a system that in hindsight, abstracts data distribution across networks and protocols.

The main reason I’m publishing this idea, and not ringing a patent lawyer, is because I want as many eyeballs scrutinizing this idea as possible. Even though I’ve already put a lot of thought into it and got some talented people to help validate it, nothing would be worse than blindly driving down a dead end. On the other hand if the idea holds up, I really need the motivation and support of the wider community to realize it.

Background

Now to cut to the chase, my idea can be said to be a mashup of many existing technologies, some have been technically documented and others have even been implemented in existing distribution methods. But, I believe the idea as a whole – and can only work as a whole – has never been publically discussed or realized.

As a starting point, it is my understanding that currently the most popular file distribution methods – notably HTTP and BitTorrent among others – are all arguably “filename”-based. For example, to distribute a file from a HTTP server, you specify a Uniform Resource Locator (URL) such as “http://server.com/file.txt“, and using BitTorrent, you create a .torrent file which specifies the filename “file.txt” among other things.

A disadvantage to the “filename”-based distribution is that it does not consistently identify unique files. For example, a simple text file can have as many names as an appropriate file system supports. On that note, this is where my idea begins.

Hashes

My good friend the hash, and specifically those created by the SHA-1 hashing algorithm (or SHA-512) can be used to identify unique files. If you hash a file using a hash function, the hash you generate will not change unless the contents of that file changes. Why this is useful for file distribution is that it normalizes the identifier of the file. After all, all we really want is the contents of a file; it doesn’t really matter what it is called. Of course, you could argue that the file extension is useful, if not important, in which case you’d be glad to know these are not lost in the endgame.

The skeptical will be thinking, how do you prevent hash collisions? Well I can’t. By definition it’s part of a good hash design that the chances of collisions are minimized and so far SHA-1 has lived up to expectations. In the event that SHA-1 is compromised, the system could very well accommodate any new hashing algorithm that becomes the standard, or multiple hashing algorithms simultaneously. Update: Another practical idea to reduce the chance of collisions even further is to concatenate two hashes generated by two different hashing algorithms.

Magnet links

Owning up to prior art however, the aforementioned idea is not new to the world of file distribution. Magnet links, as they are called, is a Uniform Resource Identifier (URI) scheme which does exactly what I described. Most popularly used for peer-to-peer distribution networks and notably eDonkey, it creates an “URL” that is identified by a hash.

For example, “magnet:?xt=urn:fakehash:123ABC&dn=file.txt” describes a file whose hash using the “fakehash” hash function is “123ABC” and is suggested to be named “file.txt“.

Transfer

Now with a method of identifying unique files via hashes, the next step is to figure out how the file transfer transaction actually occurs. If I may say so myself, this is where the idea really shines and in true Web 2.0 mashup-spirit too.

To answer the question with a question if you will, “what are the best methods to transfer files today?” And the answer to that question is that there is no best way. Notably different types of networks (server-client, peer-to-peer) and their respective protocols (HTTP, FTP, BitTorrent, etc) are all unique, and each has a set of advantages and disadvantages that complement each other. Using these in synergy would maximize the speed and redundancy the infrastructure has to offer.

Web service

To get all these protocols to work in unity, this is where a central web service comes into play. (For the sake of simplicity, assume central does not mean one server at a single URL that is prone to failure, but a distributed system with redundancy.)

What this web service will do is abstract the resources on different networks. To put this into practice, let’s assume you are after a particular movie trailer video file; you come across a website which includes a magnet link to a high-definition 1080p version of the trailer. In this magnet link is among other things, the hash of the file and a suggested filename.

Clients

Now, using a client which associated itself with the magnet link, the application will launch and then send the hash to the web service. The web service will return you a set of claimed resources with the original content. This could be on Apple’s HTTP server as “http://apple.com/trailers/movie_trailer.mp4“, on a movie trailer website as “http://trailer.com/movies/19428.mp4“, on a FTP server as “ftp://files.com/the_fantastic_movie.mp4“, and available on a specific BitTorrent tracker as “awesome-trailer.mp4“.

A particular client may only support a small subset of the file transfer protocols available so the web service will only send back a list of resources that is applicable. Assuming this client supports HTTP and BitTorrent, it will now begin to download the segments of the file concurrently. How each client handles the actual data transfer is ultimately up to the application developer, allowing room for continued competition and innovation.

Resources

You might be asking, “how does the web service know where the resources for a hash are located?” It will ultimately be up to the users of this system – either manually or systematically – contribute to the resource “pool”.

Luckily, it just happens one part of the Magnet URI specification allows you to append an “alternative” resource link. In practice, it would mean different sites would append different alternate resources, which clients then send to the web service over time, taking advantage of existing infrastructures.

For example, the Apple site might have the magnet link “magnet:?xt=urn:fakehash:123ABC&dn=file.txt&as=http://apple.com/trailers/movie_trailer.mp4“. The movie trailer website might have “magnet:?xt=urn:fakehash:123ABC&dn=file.txt&as=http://trailer.com/movies/19428.mp4“. The BitTorrent site might have “magnet:?xt=urn:fakehash:123ABC&dn=file.txt&as=http://torrentsite.com/t/cool_movie_trailer.torrent“.

The pool will not be empty to begin with since one person must intend a file to be shared for it to exist in the web service database. Therefore it will have at least one resource location available. Over time the distribution becomes decentralized as more resources become available.

Resource integrity

Of course as we all know, people can be evil little buggers. How do we trust that all resources in the pool are for the same hash? How do we ensure the resources have not tampered with? Also, it’s a reality that resources can disappear without warning, so what happens if they no longer exist?. In all of these cases, we treat the resource as an improper resource.

So how do you validate the proper resource from improper ones? By hashes of course! Granted, if the resource is a big download like a high-definition video file, it wouldn’t be the best user experience to find out a resource was improper after hashing the complete file. This is why we need to segment the files in order to avoid the situation.

The size of these segments will be calculated algorithmically so that it is consistent across the system. Assuming a 1GB file is divided into 10MB pieces, we then hash the individual 10MB piece. It turns out BitTorrent does this too, to ensure integrity at the block level. However to deliver one step beyond BitTorrent, we can use a hash tree to ensure that not only is there integrity at the block level and at the whole-file level, but that we also ensure integrity between these levels. That is to say, we can verify the individual block hashes do in fact add up to the root hash. Update: These block-level hashes will be stored in the web service database that can be pulled if or when a client requests them.

Trust

The paranoid ones will probably ask, what happens if a client mistakenly or with malicious-intent claim a resource is improper when it is proper, or vice versa? In response to this, I propose a karma-based system which ranks the “quality” of unique user’s claims in a specific time period. Other clients who verify a claim in turn boosts the karma, and others who disagree with a claim will reduces karma of that user.

I haven’t worked out the specifics of this but I welcome suggestions.

Transition

So that’s pretty much it for the technical implementation, but as we all know, technicalities is not all that matters. To implement such a system, transition is critical. Getting users and most importantly entities which distribute files to switch to such a system could admittedly be considered impossible without some sort of a bridge.

But as previously mentioned, since magnet links have native support for “alternate” links, links which work without any new clients or infrastructure, means that there is always a fallback option. A current HTTP download client could continue to support just HTTP and the same for BitTorrent. At this level, the clients can take advantage of multiple resources on the same protocol, sort of mirroring-on-steroids. And when the clients are ready to support multiple protocols, then they can simply take advantage of the new resources returned by the web service.

Preemptive frequently asked questions

To conclude, I want to preemptively answer a few burning questions some people might have.

What you’re suggesting would require figuratively a “centralized” web service which can handle the file distribution requests of a potentially mindblowing number of clients at any one time. Where would you find the resources?

It is true that this system will only work at maximum potential when there is a centralized database of the resources, but centralized does not necessarily mean one server either. DNS is a server-client system which depends on a centralized database, and technologies like distributed databases and scalable web services hosted on Windows Azure make it possible.

Would this mean that the web service could potentially log (and trace) every download?

Short answer, no. The web service only coordinates requests of resources; intent to seek potential resources of a specified hash is not proof of data transfer. However, this won’t stop individual resources such as those hosted on a HTTP server to log every request a client may subsequently make, or a BitTorrent tracker which records the peers of a file.

Are there any other applications for this beyond file downloads?

Yes. If you think about it, downloads does not necessarily have to be a user clicking a link in a browser, it is whenever a file is transferred between two clients on the Internet. Of course the benefit of this system is exponentially decreased the smaller the file is. However, as content becomes higher in quality, it is a trend that files will continue to be larger and harder to distribute faster and more reliably.

Will this prevent malicious files from being distributed?

Unfortunately, no. Because the system only knows hashes, it has no idea about the original contents of the hashed file. However, this is not to say a third party cannot identify hashes which are known to be of malicious files and flag them as such (which I propose in my hashDB project), but this is outside the scope of this system.

What’s your business model?

I’ll consult the Twitter guys. Just kidding :). Well because such a system would benefit both parties of a distribution, distributors and users, I’m sure there is some cost benefit somewhere that one can tap into to fund the system.

Of course if you’re an investor and are interested in helping me realize this, shoot me an email at [email protected].

Ultimately if you’re looking for something, you no longer have to care what it’s called, where it’s hosted, if all the resources are available and safe, or even how the bits are transferred to you. You just get what the other party intended you to have.

58 insightful thoughts

Bryant says:

July 10, 2009 at 4:09 pm

Regarding trust, try using public/private keypairs *as the client names* which are randomly generated on first run for each client. Clients somehow sharing the same public keys would be ignored since matching public keys would indicate one or both clients were maliciously compromised. The odds of a legitimate conflict in matching keypairs is slim and ignoring victims of said occurence is negligible collateral for this kind of implementation.

I’ve got an idea for decentralizing the whole system, but I’m saving that for private discussions 🙂
Bryant says:

July 10, 2009 at 4:23 pm

Also, your solution to our earlier collision discussion isn’t exactly adequate. In fact, it causes problems.

Switching to a new hash down the road just because of a few collisions adds all sorts of things of which you need to keep track now that you’d need to accomodate two hash-sets. Granted, it shouldn’t technically be an issue if the hashes themselves are of a different length and configuration, but if they’re syntactically similar while varying only in how the hash itself is generated, you could see a spike in collisions. Among other things, you’d need to determine which hash is of which kind, possibly maintain two separate databases (one for each hashtype)… it’ll turn into a mess.

In hindsight, your best bet would be to append a unique identifier to the ends of hashes which have collided before spitting out the fresh link to the user, which is similar to how sql servers of modern day handle it. That should make collision handling a bit easier.
Bryant says:

July 10, 2009 at 4:29 pm

(apologies for triple-commenting) It’s somewhat a lie to say that hashes shouldn’t collide. The nature of hashes is that they will collide, but that said collisions are to be minimized whenever possible. Think about it logically and you’ll see what I mean: you’re representing a 2GB file with a 160bit value. Having collisions with different files is impossible when it’s the full filesizes being compared, but when it’s an ubersmall digest… yeah, collisions are rare, but they will eventually happen.

Your best approach might actually be to avoid hashes as the identifier altogether but to instead follow a unique ID generated incrementally the same way it’s done for URLs on URL shorteners.
Long Zheng says:

July 10, 2009 at 4:33 pm

@Bryant: No it has to be a hash, otherwise the whole system doesn’t work because it reintroduces identifiers.

A SHA-1 hash has 2^160 possibilities, which is 1.46E+48. That’s at least a 1 followed by 48 zeroes.
Bryant says:

July 10, 2009 at 4:36 pm

Yes, I did the math, but the fact that private SQL servers have workarounds for collisions means you should too, especially since you’re considering playing with fire.

Also, just because that’s the number of possibilities doesn’t mean that’s the number it takes before a collision happens. That’s the number it’ll take before a collision has a probability of 1 (certain).

You’re likely to run into collisions at much smaller amounts. Your system has a noticeable chance of colliding somewhere after 1 billion files have been hashed.
Long Zheng says:

July 10, 2009 at 4:38 pm

@Bryant: We could move to SHA-512 I guess, which is 2^512 or 1.34E+154. I think that’s more than the number of atoms in the universe.
Bryant says:

July 10, 2009 at 4:40 pm

…or you could just look at how current implementations of SQL deal with collisions. =)
Long Zheng says:

July 10, 2009 at 4:42 pm

@Bryant: Alright I’ll look into how SQL deals with collisions, thanks.
Matt Sharpe says:

July 10, 2009 at 4:47 pm

Sounds cool!!! Keep us posted! Are you going to make a community project of it?
Fowl says:

July 10, 2009 at 4:58 pm

Sounds like Gnutella.
Fowl says:

July 10, 2009 at 5:01 pm

I’m sorry, I meant Gnutella2, the one with more metadata.
Long Zheng says:

July 10, 2009 at 5:04 pm

@Fowl: Thanks for the heads up. I had a quick look at Gnutella (and Gnutella2) and correct me if I’m wrong, but this sounds like a closed-loop system. That is, it only identifies locations of resources within the “Gnutella network” – among its peers and some sort of Gnutella server?

If that is the case, then I really think that’s why it didn’t take off, since transitioning from one distribution method to another is more difficult.

If it’s not and is very similar to what I’ve described, then I might just have to admit they came up with it first 🙂
Jack Ukleja says:

July 10, 2009 at 5:18 pm

Ok, I’ll be the first to admit I don’t get it. 🙂

From what I can tell you are talking about having a centralized database of file hashes. Fine. Could be useful, especially if servers register the fact they have that file for download.

You are then talking about building a protocol and client/server to orchestrate this somehow? And allow people to use many protocol you like to transfer the files. Okay. But this sounds like a lot of work – I am unsure about the problem you are trying to solve. Can you give a use-case or two?

Also surely the Universal Hash Database could not be guaranteed to identify every file in the world uniquely – there is surely some chance (not sure how big) that two files could end up with the same hash…?
Long Zheng says:

July 10, 2009 at 5:36 pm

@Jack Ukleja: As people have previously stated, there is a chance, a very small chance but a chance nonetheless, that two files may end up with the same hash. But perhaps in the interest of simplicity, let’s assume that’s not happening.

I’ll illustrate what I’m proposing with an example. Say you want to download an ISO of Ubuntu 10 for x86 PCs. Most people might start at the Ubuntu website, but since its freely distributed, they might go to other sources such as a file mirror or even BitTorrent.

Fortunately with something like an Ubuntu ISO its filename is more consistent, but lets say each of these locations rename the file slightly differently, enough so that you can’t guarantee the file is the same file. On the other hand, one of these resources might even be corrupt.

To be able to treat all possible resources as one, you must first normalize the file. To do this, we can use a hashing algorithm that generates a hash based on the contents of the file. If hashes match, it is the same file. Furthermore, block-level hashes can allow us to download segments of the same file from different resources, whilst each resources integrity can be assured.

Resources gets contributed to a central database which maintains a list of possible resources for a particular hash. It does not need to orchestrate the transaction between the resources and the client.

A fundamental factor of my idea that I believe helps in adoption is that it utilities existing protocols. Clients, which anyone could develop, will ultimately decide which protocols it can support and “orchestrate” how they work in synergy. This is very possible as many protocols support segment-based downloads.

Does that help?
Sam says:

July 10, 2009 at 5:37 pm

So effectively, it’s a protocol-independent, mirror URLs service, with built-in file integrity and trust measures. As long as I have a file hash, all I need to do is hit up your service and I’ll get a list of URLs for various protocols which the client could use to obtain the file.

In terms of transition, I don’t think it will be that hard, and will necessarily require content creators/distributors to participate. Because you’re leveraging their existing distribution methods anyway (unless they have some evil scheme of unique URLs or security) what you really need is enough users for it to contribute links and verify integrity, in order to gain critical mass. BT never really had the blessing of content creators, yet has taking off enormously.

It’s interesting to see this problem being tackled from the other angle – rather than trying to distribute file requests all coming via a single protocol and URL to different servers so that they are all served efficiently and with integrity, you’re using the existing distributed nature of the net itself to download the file and check integrity.

I’m not completely convinced it is worth it though – most user ‘mirrors’ are via BT, and BT already has this in-built. Mirroring over protocols such as FTP and HTTP is out of reach for most people; those that do often have already gained the blessing of the originating server so trust exists already (e.g. linux distro mirror sites).

Maybe I’m not thinking far into the future enough – will we all be running web servers over our connections, like the Opera Unite vision?
wanga says:

July 10, 2009 at 5:37 pm

I second Jack, a use case would clear things up.

The way of the web is decentralisation, so architecturally what you are suggesting raises lots of hurdles that would take more than a bridge but require applications to change their behaviour.

Collisions will occur much more frequently than you assume, after all it is the real world. Have you also considered computational performance, eg. SHA 512, and taking into account storage sub-system bottlenecks, etc – it may not be as light weight as you think.

Keep us posted, best of luck!
Jack Ukleja says:

July 10, 2009 at 5:56 pm

I am still unsure of the problem to be solved.

If its download speeds (as alluded by the talk about file segments) then I do not personally have problems in this area these days. I either use Bittorrent which often maxes my pipe, or locate it on my ISP’s mirrors and get as as fast as my pipe will allow.

If its file location+integrity, then I can see the hash database being useful. Really all we are talking about is moving to a system where instead of using imprecise file names to identify resources we use the hashes. Makes sense and a lot of the value here can be derived without any client software being used – a website/service alone would be useful, assuming you could live without the multi-origin file segments. Even a ‘movement’ to standardize/register the hashes that vendors often already provide would be useful.

The REALLY difficult bit with an idea like this is how to get the ball rolling – its one of those things that needs critical mass to be useful (surely?). Until I can a) Find out a file’s hash b) Find where that file is hosted c) download it smoothly and everything wrapped up in a nice UX (even if thats just logos and convention) its going to be a hard sell.
Yert says:

July 10, 2009 at 6:00 pm

The second your hash system is broken (and it will be broken) you have to be able to move to a new hash system while supporting older versions of the software. Be aware of this while designing the protocol you have in mind for it.

Trust based systems with karma systems are somewhat easy to zerg, as the case that even the most diligent social bookmarking sites getting spam on their front page can attest. However, by building the reporting system into the client, you can have a client that has downloaded a file (which would be difficult as your system has redundancy for the smaller pieces) which is wrong (or pieces of the file) automatically delete the file that was incorrect after submitting the hash to the system along side the hash that should have been entered. Getting so many reports of a bad server (like three to five), especially from established users, might work.

On this note, you should also be building on reliability of a website. So a website like Microsoft.com would be more likely to be serving the right files over a website like somebadsite.com (which you might want to get back, some domain name parker is using the link from your post from yesteryear), so take reporting accordingly.

Another way to help would be to have a comments page about the reliability of a file, and vs separate services, something most bit torrent trackers do. But I think you’ll find certain websites will naturally be used as mirrors more based on their reliability and fame.
Long Zheng says:

July 10, 2009 at 6:05 pm

@Jack Ukleja: Let’s take BitTorrent and your ISP mirror for example. Both of these have unique set of advantages and disadvantages.

The advantage of BitTorrent is that it is decentralized, so as long as one peer on the P2P network has it, you’ll have no problems getting it. The disadvantage of this is that when seeds are low, it can be very troublesome to download the file at a reasonable speed.

BitTorrent has recently tried to overcome this problem by adding in support for web seeding in the .torrent file, which takes HTTP resources to help with the initial file distribution. This in a way is partially what my idea also does, but do it across all protocols and is self-sustaining.

On the other hand HTTP distribution has a much more reliable speed, but does not handle high volumes well. In addition, because it is a server-client network, if the server no longer hosts the resource, it introduces difficulties for the client to locate alternate sources.

To get the ball rolling I think is very possible since by using magnet links (which I support), you can specify alternate sources. That way, it takes very little effort for applications to get started on this system, a HTTP downloader can stay just as a HTTP downloader but with magnet link support. The same goes for distributors too, it can be a magnet link alongside their existing HTTP download link or .torrent link. As the whole system gains more momentum, applications can then add support for more features.
Yert says:

July 10, 2009 at 6:13 pm

Anyways, this download method would only really take off if you made it so chunks of files could be downloaded from multiple sources. Otherwise you are just linking different mirrors and methods much like distros and popular browsers do, then verifying the file like any end user could. The only major breakthrough I really see here is if you could take bits from different sources and then snap it together.

Additionally, if you want this to take off, you’ll built it on top off other technologies, not beside them. The reason the World Wide Web took off is because it built off of the Internet, much like a front end.
Long Zheng says:

July 10, 2009 at 6:59 pm

@Yert: I think you might have overlooked some of the details since what you’re describing are essential components to the idea already.
Guillaume says:

July 10, 2009 at 7:35 pm

The idea of a universal hash database is really interesting.

If you need some help, or want to go further with this idea … contact me. It should be great to work on this project.
Dario Solera says:

July 10, 2009 at 7:51 pm

I’m a bit dubious about hash collisions. In my experience they happen quite often with real-world data. What you can do is to try to hash all files in your PC and record any collision.
Long Zheng says:

July 10, 2009 at 8:08 pm

@Dario Solera: Thanks for the feedback. Here’s an idea. Maybe we can concatenate two hashes together, generated by two different hashing algorithms? That way we can still keep the algorithmic quality of the hashes which are critical to the system.
VasiS says:

July 10, 2009 at 8:21 pm

“Assuming this client supports HTTP and BitTorrent, it will now begin to download the segments of the file concurrently”

How does the client split up pieces of the file that it wants to download and assign it to different protocols? and how does all these protocols be able to provide you such granularity of downloading that particular piece of the file? You seem to have let out this part of the details from your explanation. Is it that simple that you need not explain, am i missing something obvious?
VasiS says:

July 10, 2009 at 8:23 pm

I believe part of my concern is already brought up by @Yert and I still didnot get your answer to his question.
Long Zheng says:

July 10, 2009 at 8:32 pm

@VasiS: That is ultimately up to the design of the client, but one particular implementation might just equally distribute the file blocks (who’s size will be calculated algorithmically) across all available protocols.

Most protocols, and in this case both HTTP and BitTorrent, support transfers that doesn’t have to start at the first bit of the file and end at the last bit of the file.

If you look at a HTTP download application which uses multiple threads, you’d notice it can request transfers begin and end at any position in the file.

The same is for BitTorrent. A standard BitTorrent client can and will download any part of the file, smallest of which defined by a block size specified in the .torrent file.

Different clients may approach this problem differently. Some may attempt to download the first segments using HTTP with more probability of a minimum transfer speed, and BitTorrent to start at the other end. Others may mix and match protocols depending on what they “sense” if the fastest approach.

I hope that answers it for you.
PortalCake says:

July 10, 2009 at 9:46 pm

BTW, Long, how is the “query” going to give you a result of all matches found for the given hash?
And how large do you think a query database this’ll take? Seems like a good idea but it seems like it’ll take a heavy hit on the server in case someone tries to overload it with queries. I’m imagining a list of “hash list peers” to make the querying faster (like RAID) and then limit the number of queries per MAC/IP per second.
Greisha says:

July 10, 2009 at 10:11 pm

Isn’t this what happens in serverless p2p overlay networks like Kademila or even Bittorrent’s DHT? The entire list of files participating in the network is indexed by their hashes. As long as you operate within the confines of the protocol, you get all the benefits you are trying to achieve.

As I understand it, you are trying to make the index resolution process work in a protocol agnostic way? That would be easy for protocols that already have a distributed hash table-based indexing built in (Kademilla and BitTorrent DHT amongst others) if you can get them to agree on the hashing algorithms used and then use the Magnet link to inform the client of the hashes which will then choose the protocol based on the protocol(s) it supports. As for HTTP and FTP, you need the resolution service. But having a single service (may be its load balanaced and all, but its still one URL) doesn’t seem a robust idea. May be going the DHT way here would be a better idea. Open up the DHT spec and prod the server manufacturers to support the DHT network by automatically keeping it updated.
Brandon says:

July 10, 2009 at 11:04 pm

Someone could you know… implement this for free with no expectation of a return.
Long Zheng says:

July 10, 2009 at 11:09 pm

@Brandon: It’s possible. DNS is a massive-scale internet service that is maintained without cost to most users.
Stephen Edgar says:

July 10, 2009 at 11:10 pm

@LongZheng Have you read any of the posts on http://blogs.msdn.com/p2p/ regarding DHT’s and DRT’s ? Much food for thought.
Simon says:

July 11, 2009 at 12:02 am

I feel I have to comment on the hash collision probability discussion.

In short? The people arguing that because a hash *can* happen, it is likely to happen, haven’t worked out the probabilities.

For a perfect 128-bit hash, the number of different (random) documents you’d have to hash before the probability of a collision rises to 10^-18 (i.e. one in a billion billion) is about 26 billion. For a 256-bit hash, that number rises to a 500 billion billion billion documents. (A 160-bit hash like SHA-1 will be somewhere between those two). For a 512-bit hash, it is about 1.6×10^68: 160,000 billion billion billion billion billion billion billion. (All numbers from http://en.wikipedia.org/wiki/Birthday_attack )

And remember, all that’s for a probability of a collision of 10^-18. So in particular, the statement that “Your system has a noticeable chance of colliding somewhere after 1 billion files have been hashed” is nonsense.

The reason SQL servers have workarounds for collisions is because, AIUI, they use the first 20 bytes of the hash, not the whole thing. In which case of course collisions are likely.

In short, don’t waste your time looking at how SQL servers handle collisions, because if you use the whole hash, the probability of a collision happening in the lifetime of this universe is negligible.

Of course, all this is the probability of random collisions; if the hash is broken, then someone working algorithmically might be able to generate two documents with the same hash. This is an orthogonal issue from strength to random document generation (e.g. A CRC-type “hash” is trivially vulnerable to the former, though with the same strength against the latter). MD5 is broken. SHA-1 is *also* broken — though not yet to the extent that anyone has yet found *any* two things with the same hash. But nevertheless, it is broken, so I suggest using a newer one, such as SHA-256 or -512.

Regarding your idea of concatenating two hashes together, generated by two different hashing algorithms: this is a bad idea. The strength of the result will be as strong as the stronger of the hash functions you use, but not much stronger. So e.g. if you concatenate MD5 and SHA-256, the result will be slightly harder to crack than SHA-256 alone, but *not* as strong as using those 384 bits for SHA-384 would be. (Of course, this is strength against deliberate hacking; it is precisely as strong as SHA-384 against random document generation). It’s only a good idea as a hedge against future hash brokenness, if you won’t be able to easily change the hash algorithm later: if you concatenate two currently unbroken hashes, you know that it will be at least as strong as the stronger of the two. (Source: http://snurl.com/hashconcat [Wikipedia])
Long Zheng says:

July 11, 2009 at 12:13 am

@Simon: Thanks for the insight about hashes.

Since you sound like you know quite a bit about hashing, do you know from experience how much performance difference there is between the SHA-1 hash function and SHA-512?
Antgiant says:

July 11, 2009 at 12:35 am

@Long Zheng I think looking into the Birthday Problem (http://en.wikipedia.org/wiki/Birthday_problem) may help you understand the collision problem referenced by @Bryant. Also, this sounds similar to many of the Distributed File Systems (DFS) that exist. Even if they are not quite what you had in mind their architecture may give some additional insight into the challenges of this problem.
Joshua Morgan says:

July 11, 2009 at 12:43 am

The size of these segments will be calculated algorithmically so that it is consistent across the system. Assuming a 1GB file is divided into 10MB pieces, we then hash the individual 10MB piece. It turns out BitTorrent does this too, to ensure integrity at the block level. However to deliver one step beyond BitTorrent, we can use a hash tree to ensure that not only is there integrity at the block level and at the whole-file level, but that we also ensure integrity between these levels. That is to say, we can verify the individual block hashes do in fact add up to the root hash.

Doesn’t this imply that this system would need not one hash but 1 + (FileSize/BlockSize)? More actually if you wanted a tree of hashes. How would the block hashes get into the system? If you plan to put them all in the url you probably run afoul of most servers max url limit on larger files.
Bryant says:

July 11, 2009 at 12:45 am

@Simon, “noticeable” doesn’t mean “high.”
Dorian Muthig says:

July 11, 2009 at 12:49 am

File hash, File size and type of hashing algorithm should suffice to identify a file. Though, at sizes below 1024 bytes, generating files that match a certain criteria becomes very easy. Furthermore, since file sharing systems already do this kind of thing, they never really took off because of their inconvenient usage. Just for illegal file sharing they end up as a ‘last resort’ because the decentralized file distribution cannot be easily blocked. Legal files are very rare in file sharing networks, because direct HTTP downloads are faster and more convenient in most cases.
Bryant says:

July 11, 2009 at 12:58 am

Well, I suppose it doesn’t matter anyway. A quick look over Simon’s link re: the birthday attack cleared up a lot, so feel free to proceed with SHA-1 without covering for collisions if you want.
jmurphy says:

July 11, 2009 at 1:38 am

Something similar has been around for years,.

bitzi

http://www.bitzi.com

http://en.wikipedia.org/wiki/Bitzi

John Murphy
Long Zheng says:

July 11, 2009 at 1:44 am

@jmurphy: Interesting John. I’ve never heard of Bitzi. It definitely shares a lot of similarities with what I’ve described, but I think one major difference is that Bitzi is very much centric in P2P, and at the individual protocol level. It doesn’t seem to allow for the abstraction of networks and protocols.
Pete Smith says:

July 11, 2009 at 1:46 am

Regarding the issue of resource trust, rather than a karma-based system I’d propose something a little different.

For each resource that a file hash request returns, you could look at how many users report a resource as improper out of the total populous that requested the resources for that file hash. If you have a rolling time frame of say an hour, you could check the report percentage and if it reaches a certain threshold you could flag the resource as having a higher possibility of being improper. If the resource’s improper risk (for want of a better word), becomes too high in any 24 hour period then the resource can be removed.

Since a high proportion of the total users requesting a resource must consistently flag it as improper throughout the day, the chances of mistaken or malicious claims would be greatly diminished. If a distributed attempt was made to flag the resource as improper was made, the process would still be exploitable however.
Long Zheng says:

July 11, 2009 at 1:50 am

@Pete Smith: Thanks for the suggestion. The idea of a ratio-based (how many users report out of total pool of users) system did cross my mind before, but I think it might be more susceptible to gaming since someone with malicious intent could try and flood the network with as many “users” as he could that makes false reports.

I don’t know if this is applicable in the real environment, but it seems possible to overwhelm the ratio easily.
Simon says:

July 11, 2009 at 2:30 am

@Long Zheng:
> Since you sound like you know quite a bit about hashing, do you know from experience
> how much performance difference there is between the SHA-1 hash function and SHA-512?

I don’t have much actual experience with hash functions, I just like reading about them ;-). But a quick unscientific test on a (highly random) 7MB file gives ~75ms for SHA-1, ~115ms for SHA-256, and ~280ms for SHA-512.

Personally, I’d go for SHA-256: significantly increased security over SHA-1, without that much performance difference.
Long Zheng says:

July 11, 2009 at 2:42 am

@Simon: Thanks for the imprompt test. There seems to be a direct relations between the SHA hash function “size” and the processing time. SHA-256 seems to be a good compromise.
Pete Smith says:

July 11, 2009 at 2:52 am

@Long Zheng: Building on the scenario I suggested – If you restricted each user so they could only report a resource once every 24 hours, a distributed attack would be pretty hard to achieve. For the malicious user to succeed they would not only need enough unique users to exceed the hourly ratio, but also enough unique users to consistently exceed the hourly ratio for the full 24 hours. Given that in a real world scenario, a malicious user is only going to be interested in attacking a file that a lot of people wish to download, you’d need a very co-ordinated group of users or some kind of botnet to realise the attack.

From what I can see, any system that involves the user in being able to report a resource as improper is going to be exploitable by a large scale distributed attack.
Xepol says:

July 11, 2009 at 6:24 pm

I don’t see turning a simple URL into a complicated URL with obscure, random alphanumerics contained there in as ever becoming a usefull public facing method of communication.

Ya, I don’t ever, EVER see it happening. Where I have seen it used, it is typically more of a hidden behind the scenes detail to allow 2 pieces of software to communicate search terms more efficently. (WinMX for example, can search for identical files regardless of their name)

Where this could be used? Perhaps as a detail in a .download file that mimics the purpose of a .torrent file -> namely hashes, filenames, sizes, and where to go to get the files. But that seems VERY redundant. Maybe the web could have embeded URLs with that info and hides the primary details, but both approaches would have to make the web significantly more static – a return to the old days of hand maintained websites with frequent broken links. (in fact, .torrent files are frequently broken links, but since most torrents are typically very timely in nature, it doesn’t upset those who use them too much)

I just can’t see it myself. That said, if you do decide to develop the idea further, you should include a file size with the hash. 2 files of the same size should generate different hashes, but once you start dealing with different size files, hash collision goes up dramatically – adding the size will help with that.
Long Zheng says:

July 11, 2009 at 6:52 pm

@Xepol: Thanks for the comment, but I believe you’re assuming the user-experience is the technicalities at its bare, without any sort of user-friendliness. Although it is true I haven’t presented any practical examples of an ideal user-experience with this system, but it doesn’t mean one can’t exist.

For example, RSS is a very ugly technology. At the heart of it is an XML document which is not “pretty” at all. Then if you think about, it’s all come together with a simple icon and beautiful software we know as “feed readers”.
alirobe says:

July 12, 2009 at 3:25 am

I don’t really see the point of this centralisation. In terms of releasing free software, Isn’t it better just to sign the files (i.e. like commercial software)? That way the verification is not subject to man-in-the-middle attacks, there is no central point of failure, and less database maintainence/hardware is required. I can’t imagine any use cases for this technology where there isn’t some form of public/private key provisioning model that wouldn’t do a better job.
Long Zheng says:

July 12, 2009 at 9:41 am

@alirobe: This isn’t a code-signing or security idea. It would appear you stopped reading after the first couple of paragraphs when I was just discussing hashes.
alirobe says:

July 12, 2009 at 1:47 pm

I didn’t stop reading, I’m really at a loss as to what else this could achieve that couldn’t be achieved better with some sort of file certification scheme. OSes already have code-signing built in for applications, so extending that functionality to file transfers would probably just require a little thought and standardization (e.g. for things like block level hashes). With integrated file-signing, file naming wouldn’t be an issue, the service would be more reliable/open because of decentralization, it would be more secure because it relies on private/public key infrastructure, and you’d be providing authentication as well as file integrity.

Here’s one example of a free signed file container that could be made a standard: http://www.pkware.com/software-data-security/windows-file-encryption
Long Zheng says:

July 12, 2009 at 4:01 pm

@alirobe: I’m sorry you’ve come to that conclusion, but none of this is about file certification or file signing. Nothing in this system checks the contents of a file. Its purpose is to make distribution more flexible.

The functions you’re referring to wraps around content, and ultimately modifies the file. What I’m proposing does not change files at all. In fact every file on the internet today can be utilised.
alirobe says:

July 12, 2009 at 5:30 pm

@long – Gotcha. I guess this is what happens when you read technical proposals at 3 AM! Especially after seeing hashDB.

The concept strikes me as the insertion of another point of failure for reasons of rather pointless pedantry – but I’m not a mass content provider, so perhaps I’m not someone who’d see the benefits from this.
Judah Himango says:

July 14, 2009 at 4:15 am

Hey Long,

I think this is a really cool idea. I’m a C# dev (mostly desktop app stuff) with several years of experience, email me if you need any help developing it.

-Judah
Xepol says:

July 14, 2009 at 4:28 pm

@Long Zheng – Actually, I have used a number of RSS readers and feeds. I find the experience to be ugly on all levels from the RSS feed all the way to the readers. The only good user experience I have run across is DelphiFeeds, and many would argue that it is frequently awash with off topic posts.

I suspect that part of the reason it is such an mess starts all the way from the bottom – the RSS feed standard – the design reflects its way all the way up the chain.

Torrents do a great job of avoiding the end UI issues, but it would be the exception not the rule and there are a fair share of bad torrent UIs from the early days – in fact, the only thing that really saved it was how easy it was for third parties to enter the market and improve on it.

This technology would have to be, at its heart, baked in just a few select pieces of software where most of the variation comes not from competition but rather different platforms. That means we are likely to see poor UI design from the masters: Microsoft, Firefox and my personal favorite for worst UI experiences if you aren’t a Unix Savant – Google/Chrome. (A notable mention goes out to Apple for making their Windows browser an alien experience to every single target user by writing copious volumns of code to make it look and act just like the mac OS)
Alec Perkins says:

July 15, 2009 at 12:36 am

This is a really interesting idea. While it’s not good to just create new protocols and such haphazardly, I think this is fundamentally different and useful enough to be worth pursuing. Are you familiar with the Git version management system? It uses the SHA hashes as the identifier for each revision. Instead of an arbitrary identifier (or filename in your case), the identifier is effectively intrinsic in the content. If someone wants to pull a specific revision, they just type enough of the identifier to make it unique within the given project. I think it’d be a good idea to look into how Git handles things.
"R" says:

July 17, 2009 at 3:46 pm

Why on earth would I want to report where I obtained some illegal mp3/movie? So my sources can be easily traced and cut off sooner??

And if it’s legal content we’re talking about, why on earth would a sane capitalistic company (forget Open Source for now) that pays beaucoup $$ for http web serving would be interested in you grabbing their content without first going through their heavily-commercialized website? So you can download the wrong file or call them up later complaining that your PC crashed when you downloaded and installed a sheitty old version of the file? Had you visited their website first, you’d have known that a new stable release was just made 2 weeks ago… But no, you grabbed the wrong file cuz it came to you through “other channels”, say, some guy gave you a hash (to the then-current-release) in email 3 weeks ago… Not to mentioned they just lost sales because you didn’t get the “in your face” ad for their newest game/biz-app/subscription-service/whatever. All companies that have websites WANT YOU TO BE ON THEIR WEBSITE. In other words, they’ll do everything they can (long term) so you cannot bypass their control. You won’t be able to hash their http content simply because it’ll become dynamic URLs (as many companies already do)…

In other words, I see very little practical use for the concept of global hashes. I see much more use for hashing files on all decentralized (or quasi-decentralized) networks (bittorrent, emule, etc) a la Bitzi that was mentioned, for the purpose of finding clean (commented) illegal content easily.

Bitzi is a more powerful and practical idea than global hashing. If taken seriously, then you will know problems/tips about what you are downloading right as you are downloading it (if web servers start giving hashes out before full download), or just after you finished downloading (if you have to do the hashing yourself). In my example above, just after you finished downloading the file, you’ll get a message from the devs telling you “file you just downloaded (who knows from where) is an old vers with known bugs. Trust this msg, it is digitally signed by us, the makers of the original file you’ve downloaded”.

I think commenting on URLs and unique files (hashes) is the most powerful concept yet fully realized, and not the directory of where to obtain those unique files from…
Jack says:

September 15, 2009 at 11:58 pm

Just came across this.. might be similar to your idea…
http://www.metalinker.org/

Comments are closed.