While working on a tool to eliminate duplicate photos from old hard drives, I wrote this little snippet of code that's worth saving:
It's a command-line one-liner that generates MD5 hashes for two files, compares them, and states whether they are the same or different.
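The one-liner itself isn't reproduced here, so what follows is only a rough sketch of the same idea in Python, not the original snippet. The names `md5_of_file` and `compare_files` are made up for illustration, and reading in 64 KB chunks is an assumption so that large photos don't have to fit in memory.

```python
import hashlib
import sys


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_files(path_a, path_b):
    """Return "same" if both files hash identically, else "different"."""
    if md5_of_file(path_a) == md5_of_file(path_b):
        return "same"
    return "different"


# Hypothetical command-line use: python compare.py first.jpg second.jpg
if __name__ == "__main__" and len(sys.argv) == 3:
    print(compare_files(sys.argv[1], sys.argv[2]))
```

Run with two file paths as arguments, it prints `same` or `different`, which is all the verification step needs.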
For those unfamiliar with MD5, it's a "cryptographic hash function that produces a 128-bit hash value." The useful part for this snippet is that MD5 can be fed a file of any size and the result is always a 32-character hexadecimal string. Most importantly, the same input will always produce the same output, and any difference (no matter how minor) produces a very different result. For example, any computer can run MD5 on a file with the contents "asdfasdf-1" and it will produce the hash signature:
If you change the one to a two (i.e. "asdfasdf-2") the signature changes to:
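To see the two signatures for yourself, here is a small Python sketch. It assumes the file contains exactly those ten characters encoded as UTF-8 with no trailing newline; a trailing newline would give a different hash. The helper name `md5_hex` is just for illustration.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 hex digest of a string encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


first = md5_hex("asdfasdf-1")
second = md5_hex("asdfasdf-2")

print(first)
print(second)

# Count how many of the 32 hex characters changed between the two hashes.
differing = sum(1 for a, b in zip(first, second) if a != b)
print(f"{differing} of {len(first)} hex characters differ")
```

Running it repeatedly, on any machine, gives the same two strings each time, which is the property the duplicate finder relies on.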
More about this will come in another post, but what this means for a duplicate photo finder is that MD5 hashes can be generated for every photo and then compared. Any two files with the same hash signature are the same* and the duplicates can be pared down. That is done with a larger program. The little snippet of code is used for verification. It's also useful enough to be broken out on its own.
**Note: For the tech/cryptography-minded folks out there, I know that MD5 can have collisions. For what I'm doing, the chances are so small that I'm not worried about it.*
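The larger duplicate finder isn't shown in this post. As a rough idea of how the hash-and-compare step could work, here is a minimal Python sketch that walks a folder, hashes every file, and groups paths that share a hash. The function names and the choice to walk the folder with `os.walk` are assumptions for illustration, not the actual program.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Map each MD5 hash to the files under root that share it.

    Only hashes seen more than once are returned.
    """
    by_hash = defaultdict(list)
    for folder, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            by_hash[file_md5(path)].append(path)
    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}
```

Each value in the returned dictionary is a list of files with identical hashes. For anyone who does worry about collisions, `filecmp.cmp(a, b, shallow=False)` from the standard library can confirm a match byte by byte before anything is deleted.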