
For perfectly uniform random errors, it doesn't matter how you hash / error check. But if there's a certain class of errors you expect, you can do far better than a generic random-looking hash.

In particular: CRCs are theoretically perfect/ideal for burst errors. A CRC32 can detect all burst errors of 32 bits or fewer (aka: if the error is confined to a 32-bit span, the CRC32 will always catch it). A CRC extended out to 256 bits would catch all 256-bit bursts, which SHA1 cannot guarantee.

EDIT: "Burst errors" are incredibly common in communications. If there's some kind of electrical noise (static shock or the like) would destroy maybe 20-bits all in a row. Aka: a 20-bit burst error.

Similarly, CD-ROMs were designed primarily to protect against burst errors, because people assumed that CDs would get scratched. I forget exactly how the calculations were done, but some number of millimeters of damage (maybe 4mm or 5mm) was assumed as typical, and CD-ROMs were designed to be resilient even in the face of scratches of that size (or less).
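
To make the burst-error guarantee concrete, here's a small sketch using Python's zlib.crc32 (the message contents, burst positions, and trial count are arbitrary): any contiguous error span of 32 bits or fewer always changes the CRC-32, because an error polynomial shorter than the degree-32 generator can never be a multiple of it.

    import random
    import zlib

    # Sketch: a CRC-32 detects every burst error spanning <= 32 bits, because
    # such an error polynomial can never be a multiple of the degree-32
    # generator.  Message contents and burst positions here are arbitrary.
    msg = bytearray(random.randbytes(256))
    good_crc = zlib.crc32(msg)

    def corrupt_with_burst(data, start_bit, length):
        # Flip the first and last bits of the span; interior bits flip randomly.
        out = bytearray(data)
        for offset in range(length):
            endpoint = offset in (0, length - 1)
            if endpoint or random.random() < 0.5:
                bit = start_bit + offset
                out[bit // 8] ^= 1 << (bit % 8)
        return out

    for _ in range(10_000):
        length = random.randint(1, 32)
        start = random.randint(0, len(msg) * 8 - length)
        corrupted = corrupt_with_burst(msg, start, length)
        assert zlib.crc32(corrupted) != good_crc   # always detected
    print("all bursts of <= 32 bits were detected")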



You are right, but I have not seen any published data about the properties of concrete 128-bit or 256-bit CRCs that could be chosen.

For 32-bit or smaller CRCs there are a large number of publications analyzing various useful polynomials, and there is also some published data about a few 64-bit polynomials.

So if you want to use a CRC larger than 64 bits, you must first spend some time learning and understanding the theory of CRCs very well, then search for suitable polynomials and test them.

It is unlikely that this would be worthwhile in most cases.

In practice, CRCs of up to 64 bits are very useful for detecting errors in data packets of limited length, such as the packets sent and received through a communication protocol, or the headers and other size-limited components of a structured data file.

CRCs of the available lengths are usually not suitable for checking the integrity of entire data files, because on modern computers files have become too large and too numerous.

Therefore a file utility like the traditional UNIX cksum has become obsolete when used to compute CRC values for files.

(The latest GNU cksum has become a generic utility, where you can specify other hash algorithms instead of the default CRC-32)

This is expected due to the short length of CRC-32: I have actually verified that when computing the CRC-32 values for all the files on my 2 TB SSD, I obtained about seven hundred CRC collisions, i.e. pairs of completely different files that had identical CRC-32 values.
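
That count is roughly what the birthday approximation predicts; a quick back-of-the-envelope check (the file count below is an assumed round number for illustration, not an exact measurement):

    # Expected number of colliding pairs among n random 32-bit hashes is
    # about C(n, 2) / 2**32 (birthday approximation).  The file count is an
    # assumed round number for illustration, not an exact measurement.
    n_files = 2_500_000   # assumption: ~2.5 million files on a 2 TB SSD
    expected_pairs = n_files * (n_files - 1) / 2 / 2**32
    print(f"{expected_pairs:.0f} expected CRC-32 collision pairs")   # -> 728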

Even with the obsolete 128-bit MD5 hash there were no collisions, but as another poster mentioned, on modern CPUs with hardware SHA support MD5 is slower than SHA-1 or SHA-256, so there is no reason to ever use it.

While SHA-1 is insecure for digital signatures and other applications where an opponent must not be able to find collisions, it remains faster than SHA-256, so it is still useful for error detection in many cases, e.g. for detecting whole-file corruption.


> You are right, but I have not seen any published data about the properties of some concrete 128-bit or 256-bit CRC's that could be chosen.

I mean, sure. But what published data do you have about SHA1's error detection properties?

SHA1 doesn't have a Hamming-distance analysis associated with it at all, nor is there any kind of "burst error" analysis (or the like) associated with it. The only analysis you get on SHA1 is differential cryptanalysis and other such math that has little or no relevance to the kinds of errors we expect to see in data.

It's not too hard to pick a CRC-128 that handles 128-bit burst errors. In fact, any CRC-128 will detect 128-bit burst errors. The "high quality CRC" question is the harder one of Hamming distances and other such properties.

-----

IMO: picking SHA1 is basically saying you don't care about those details. Which is a fine choice to make. But it seems odd that you'd criticize CRC for "not having enough analysis" when SHA1 barely has any at all.


It's been a while, but is it not the other way around?

CRCs can detect up to x error bits in a w-bit message (or correct y of them, for y smaller than x). If there are more errors than that in a burst, the CRC will not catch it. SHA will detect any error of any size, but you can't correct it, so for many applications such as CDs it's not very useful.

IIRC in CDs they solved this by simply spreading the Reed-Solomon correction codes and data around, so bursts aren't bursts anymore.
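
Roughly, the trick looks like this (a toy block interleaver, not the actual CIRC interleaving used on CDs): write the data row by row, read it out column by column, and a physical burst on the medium turns into isolated errors spread across many codewords.

    # Toy block interleaver (NOT the real CIRC interleaving used on CDs):
    # write symbols row by row, read them out column by column.  A burst of
    # consecutive symbols on the medium then maps back to isolated symbols
    # scattered across many codewords, which Reed-Solomon can correct.
    ROWS, COLS = 8, 16

    def interleave(data):
        assert len(data) == ROWS * COLS
        return bytes(data[r * COLS + c] for c in range(COLS) for r in range(ROWS))

    def deinterleave(data):
        assert len(data) == ROWS * COLS
        return bytes(data[c * ROWS + r] for r in range(ROWS) for c in range(COLS))

    original = bytes(range(ROWS * COLS))
    on_disc = bytearray(interleave(original))
    on_disc[40:48] = b"\xff" * 8            # an 8-symbol burst "scratch"
    recovered = deinterleave(bytes(on_disc))

    # The burst now appears as a single-symbol error in 8 different rows,
    # i.e. at most one error per codeword instead of 8 errors in one codeword.
    bad_rows = {i // COLS for i, (a, b) in enumerate(zip(original, recovered)) if a != b}
    print(sorted(bad_rows))   # -> [0, 1, 2, 3, 4, 5, 6, 7]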


> If there are more errors in a burst CRC will not catch it.

*Might* not catch it. Just like how SHA1 might not catch a 3-bit error if you find a collision in just the right spots.

Consider a 1024-bit message where 1023 bits are in error. The CRC would trivially and obviously catch that error, because it is "obviously" an odd-weight error, and a CRC whose generator polynomial includes the (x + 1) factor (as many standard CRCs do) acts as a simple parity check on top of everything else.

In fact, there are all sorts of errors a CRC is guaranteed to catch. The burst error is the one most people care about, but odd-weight errors (ex: 511-bit or 4095-bit errors, thanks to that parity factor) are guaranteed to be caught as well.

---------

CRC is "so regular", it is very easy to prove the specific kinds of errors that can, or cannot, be caught. According to classical theory of communications, there are really only two errors to care about:

1. Random errors -- your methodology doesn't matter. Simple checksum or even XOR is provably optimal vs random errors (!!), and indistinguishable from a SHA1.

2. Burst errors -- your connection goes dead for a time, leading to, say, 10 bits in a row in error (let's say an electrical storm, a static shock, or some other phenomenon causes the antenna / line / whatever to be dead for 10 microseconds on a 1 Mbaud connection. That's 10 bits that disappear, aka a 10-bit burst error).

Because #1 doesn't care which methodology you use, it's all about optimizing against #2 -- see the sketch below.
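
Here's a rough Monte Carlo sketch of point #1. The digests are truncated to 8 bits purely so that misses are visible in a modest number of trials; the message size and trial count are arbitrary. Against uniformly random corruption, a plain additive checksum, a (truncated) CRC, and (truncated) SHA1 all miss at essentially the same ~1/256 rate.

    import hashlib
    import random
    import zlib

    # Sketch of point #1: against a uniformly random error pattern, an 8-bit
    # additive checksum, a CRC truncated to 8 bits, and SHA-1 truncated to
    # 8 bits all miss errors at about the same 1/256 rate.  The truncation is
    # only so misses show up in a feasible number of trials.
    def checksum8(data):
        return sum(data) & 0xFF

    def crc8(data):            # CRC-32 truncated, just for equal output size
        return zlib.crc32(data) & 0xFF

    def sha8(data):
        return hashlib.sha1(data).digest()[0]

    TRIALS, SIZE = 200_000, 64
    misses = {"checksum": 0, "crc": 0, "sha1": 0}
    for _ in range(TRIALS):
        msg = random.randbytes(SIZE)
        # uniform random error: every bit flips independently with probability 1/2
        corrupted = bytes(b ^ random.randrange(256) for b in msg)
        if corrupted == msg:
            continue
        misses["checksum"] += checksum8(corrupted) == checksum8(msg)
        misses["crc"] += crc8(corrupted) == crc8(msg)
        misses["sha1"] += sha8(corrupted) == sha8(msg)

    for name, count in misses.items():
        print(f"{name}: miss rate ~{count / TRIALS:.4f}")   # all near 1/256 ~ 0.0039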


> A CRC extended out to 256-bits would catch all 256-bit bursts, which SHA1 cannot do.

This is technically true, but misses the practical point. The chance of a contiguous 256-bit burst causing a SHA1 hash collision with the original unmodified byte string is 1 in 2^160, which is about 1 in 10^48 for a given input value. That's 1 in (1 trillion)^4, or one per trillion-trillion-trillion-trillion.

(Remember that the "birthday paradox" doesn't apply: we're not comparing a large set of values against all of the others, we're comparing a value with its may-be-corrupted copy only.)

For reference, typical computer hardware has 1 bit error per hour for each 10^11 to 10^30 bits stored or processed.[1] Just cosmic rays or background radiation alone will cause bit flips in all hardware at a rate comparable to this. While ECC memory and ECC processor caches reduce the error rate, they don't eliminate it. Typical improvements for contiguous errors in ECC DRAM are a factor of 256 (8-bit parity).

This means that even if you had a perfect checksum, the boolean returned by the "crc256()" function itself is a physical bit in memory subject to this corruption! Hence, there's a "noise floor" that no checksum can improve upon, because the output storage is also noisy. It might correctly detect the error, but then its output will be flipped back to "no error" by an errant relativistic particle!

The SHA1 hash collision rate is 10^16 times better than the optimistic 10^30 error rate for reliable components, even with an 8-bit ECC thrown on top! In other words, even on really, really good hardware, you'll get a thousand trillion times more false detections caused by bit flips than missed detections caused by the choice of hash algorithm. That's with SHA1. If you're paranoid, SHA512 is in the league of "not if every atom in the universe was a computer" levels of assurance.

As long as the hash is significantly better than this "system-wide" bit error rate, it is effectively perfect when implemented by practical, physical computers. Further software improvements are pointless without accompanying improvements to the hardware.
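
A back-of-the-envelope version of that comparison, using the optimistic end of the cited error-rate range and the 8-bit ECC factor mentioned above:

    # Back-of-the-envelope comparison of the two failure modes discussed above.
    # Numbers are the optimistic figures from the comment: 1 bit error per
    # 10^30 bits per hour of storage/processing, improved 256x by ECC.
    sha1_miss = 2 ** -160                    # chance a given corruption collides in SHA-1
    hw_bit_error = 1 / (10 ** 30 * 256)      # per-bit error rate with ECC on top

    print(f"SHA-1 miss probability:  {sha1_miss:.3e}")     # ~6.8e-49
    print(f"hardware bit error rate: {hw_bit_error:.3e}")  # ~3.9e-33
    print(f"ratio: {hw_bit_error / sha1_miss:.1e}")        # hardware errors dominate by ~10^16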

[1] https://en.wikipedia.org/wiki/ECC_memory#Research

PS: This kind of "analog" thinking about digital computer reliability becomes important at large scales. For example, filesystems like ZFS are sensitive to bit flips: ZFS has a known failure mode where corruption of certain regions of its Merkle tree will corrupt the whole disk array. Similarly, I've heard of algorithms being redesigned to be tolerant of bit flips, returning at least partially valid data. E.g.: instead of maintaining a count and incrementing/decrementing it over time, prefer to measure the count from scratch every time. That way it'll be correct most of the time, instead of becoming permanently corrupt and staying that way.


And my point: assuming a random error, your SHA1 is no better than a simple checksum.

So what's your error model? You make a big post about the nature of errors but you fail to specify the distribution of errors. If your distribution of errors is uniformly random, then a 128-bit checksum is optimal.

The reason a plain checksum isn't used is that, in practice, data and errors on communication lines contain long strings of 0s, and a checksum fails to distinguish between 20 zeros in a row and 200 zeros in a row. Is that a problem you expect to find in files?

If your distribution of errors is bursty, then 128-bit CRC is optimal.

What distribution of errors are you assuming for SHA1? What kind of file errors are you expecting that makes SHA1 better than other methodologies?

---------

If all errors are random, and you also want to handle the "runs of 0s" problem, then Adler32-like checks are sufficient to fix both of those problems, and the Adler32 methodology obviously extends out to 128 bits (just have two 64-bit sums).
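
A rough sketch of that "two 64-bit sums" construction (this is just the Adler/Fletcher idea widened to 128 bits, not any standardized checksum):

    # Rough sketch of the "two 64-bit sums" idea: an Adler/Fletcher-style
    # construction widened to 128 bits.  This is just the approach described
    # above, not any standardized 128-bit checksum.
    MASK64 = (1 << 64) - 1

    def checksum128(data):
        s1 = 1   # running sum of bytes (mod 2**64)
        s2 = 0   # sum of the running sums -- position-sensitive, so runs of
                 # zeros of different lengths no longer collide
        for byte in data:
            s1 = (s1 + byte) & MASK64
            s2 = (s2 + s1) & MASK64
        return (s2 << 64) | s1

    print(hex(checksum128(b"\x00" * 20)))    # differs from the next line,
    print(hex(checksum128(b"\x00" * 200)))   # unlike a plain additive sum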

CRC, specifically, is designed to optimally check for burst errors up to its full length (a 32-bit CRC finds all 32-bit burst errors; a 128-bit CRC would find all 128-bit burst errors). That's all CRC is, and that's all CRC is designed / math'd out to do.


The key thing about cryptographic hashes (not mere checksums) like SHA1 is that the distribution of the errors doesn't matter. Effectively, they're all the same. That's the point. If mere "runs" of bits were sufficient to trigger a collision, then the hash wouldn't be strong enough for cryptography!

This means that you can simply throw out any such modelling, as it is no longer relevant. You "care" only about bit-error rates and hash collision rates, but even mere SHA1 is so thoroughly past the BER that it is essentially perfect. That is, it is indistinguishable from perfect on all practical computers for all intents and purposes.

CRC codes have their uses, but if you just need to detect any corruption of a large-ish file (over 10KB), then cryptographic hashes are both fast and "perfect" in this physical sense. You will never get a collision with SHA256 or SHA512, even including adversarial, crafted inputs. The same "strength" attribute is not valid for CRC codes; they're vulnerable to deliberate corruption by an attacker.
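
For whole-file checks, that's the usual chunked hash-and-compare; a minimal sketch (the file name and recorded digest are placeholders):

    import hashlib

    # Minimal whole-file integrity check: hash in chunks so large files
    # don't need to fit in memory, then compare against a known-good digest.
    def file_sha256(path, chunk_size=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()

    # Usage: compare against a digest recorded when the file was known good.
    # known_good = "..."   # previously recorded hex digest (placeholder)
    # assert file_sha256("backup.img") == known_good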

So in that sense, SHA hashes are stronger than CRC checksums.


> Effectively, they're all the same.

No they're not.

The birthday bound says that a perfect 160-bit cryptographic hash is expected to produce a collision after only about 2^80 samples. This means that an 80-bit burst error would probabilistically contain a potential SHA1 collision. (An 80-bit burst error doesn't mean that all the bits are flipped, btw: it means that 80 bits have been randomized.)

In contrast, CRC is designed specifically against burst errors. CRC is "regular" and "tweaked" in such a way that a 160-bit CRC would be immune to 160-bit burst errors of any and all kinds!

So if you care about burst errors, then CRC is, in fact, better than crypto-level hashes. And burst errors are the primary error that occurs in practice (scratches on a CD-ROM, bad sectors on a hard drive, a lightning storm cutting out a few microseconds of WiFi, etc.).

That is: noise isn't random in the real world. Noise is "clustered" around bursty events in practice.

--------

If burst-error is king, you can do far, far better than random methodologies. CRC is proof of that. That's why error distributions matter.


I did say that the birthday attack doesn't apply!

It only applies if you're comparing a large set of samples against each other. An example would be a "content-based indexing" system where a database primary key is the hash. Every insert then compares the hash against every entry that already exists. If there are 1 billion stored items, each insert can have a potential collision with all 1 billion.

For validation, you have 1 input being compared against 1 valid value (or its hash/crc). There's no "billion inputs" in this scenario... just 1 potentially corrupt vs 1 known good.

Hence, no birthday attack.

It's the difference between two random people meeting and having the same birthday, versus any two people in a room full of people having the same birthday. Not the same scenario!
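
The numbers bear that out; a quick calculation of the two scenarios:

    # Two specific people sharing a birthday vs. any pair in a room of 23:
    # the "one value vs. one known-good value" validation case is the first
    # scenario; content-addressed storage with many entries is the second.
    p_two_people = 1 / 365
    p_room_of_23 = 1.0
    for k in range(23):
        p_room_of_23 *= (365 - k) / 365      # probability all 23 birthdays distinct
    p_room_of_23 = 1 - p_room_of_23

    print(f"two specific people: {p_two_people:.3f}")   # ~0.003
    print(f"any pair among 23:   {p_room_of_23:.3f}")   # ~0.507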

In practice, cryptographic hashes are always superior to checksums, once both have more than 128 bits. They're both strong enough, but the cryptographic hash is resistant to deliberate attacks. The CRC won't be.



