Slide 1

Slide 1 text

hash functions and you

Slide 2

Slide 2 text

or, why breaking SHA-1 is a thing

Slide 3

Slide 3 text

this deep-dive is on hash functions, not cryptography in general

Slide 4

Slide 4 text

Google figured out how to generate two PDF files with the same SHA-1 hash

Slide 5

Slide 5 text

why does this matter? what should npm do about it?

Slide 6

Slide 6 text

the term hash, like so many computer terms, is metaphorical

Slide 7

Slide 7 text

hash functions chop up data and hash it up. ha ha, get it? (except we're not sure of the origin)

Slide 8

Slide 8 text

what's a hash function?

Slide 9

Slide 9 text

hash function
"Any function that can be used to map data of arbitrary size to data of fixed size."
-- the Wikipedia entry on hash functions

Slide 10

Slide 10 text

data ➜ function ➜ number

Slide 11

Slide 11 text

data ➜ fn ➜ number
message ➜ hash ➜ digest/hash/code

Slide 12

Slide 12 text

really simple hash function, from Knuth
If k is an integer key, and n is the number of buckets, not a power of 2:
f(k) = k(k+3) mod n
const buckets = 19;
function knuth(k) { return (k * (k + 3)) % buckets; }

Slide 13

Slide 13 text

mostly we hash character data...
So we do this, roughly:
• initialize to 0
• take the input in chunks of the size that you want
• do something to the chunk & XOR it into the result
• when out of chunks, return
The "something" can either be very simple, like the Knuth division hash, or very complex.
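Here is a rough sketch of that recipe in JavaScript, using one byte as the chunk and the Knuth function from the earlier slide as the "something". The name `toyHash` and the trick of adding the byte's position before mixing are assumptions made for illustration. This is a teaching toy, not an algorithm anyone should ship.

```javascript
const buckets = 19;

// The Knuth-style hash from the earlier slide.
function knuth(k) {
  return (k * (k + 3)) % buckets;
}

// Toy string hash: start at 0, walk the input one byte at a time,
// mix each byte, and XOR the mixed value into the running result.
function toyHash(text) {
  const bytes = Buffer.from(text, 'utf8');
  let result = 0; // initialize to 0
  bytes.forEach((byte, i) => {
    // "do something to the chunk": add the position so that byte order
    // matters, then run the Knuth function over it
    result ^= knuth(byte + i);
  });
  return result; // out of chunks, return
}

console.log(toyHash('cat'), toyHash('car'));
```

Because every mixed value is below 19, the result always fits in five bits. That is far too small for real use, but it makes the fixed-size output easy to see.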

Slide 14

Slide 14 text

the hash value is (usually) a lot smaller than the data!

Slide 15

Slide 15 text

a hash value is a representation of your data related to its content that isn't a full copy

Slide 16

Slide 16 text

hash tables!
Suppose you have a lot of data that you want to stick into memory for fast access by a key. You use a hash function to map the keys into buckets, one for every output number, and put the matching data for a key into the key's bucket. There might be more than one thing in the bucket, but that's fine because you can look inside the bucket to find your item in a small collection, instead of having to search the whole thing. This is the data structure underlying associative arrays, caches, and a million other things in software.
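A minimal sketch of the bucket idea, assuming string-ish keys and a fixed number of buckets. `TinyHashTable` and `hashKey` are hypothetical names for this example, and real hash tables also resize themselves as they fill up, which this one does not.

```javascript
// Map a key to a bucket number with a simple multiply-and-add hash.
function hashKey(key, bucketCount) {
  let h = 0;
  for (const ch of String(key)) {
    h = (h * 31 + ch.codePointAt(0)) % bucketCount;
  }
  return h;
}

class TinyHashTable {
  constructor(bucketCount = 19) {
    // one (initially empty) list of [key, value] pairs per bucket
    this.buckets = Array.from({ length: bucketCount }, () => []);
  }

  set(key, value) {
    const bucket = this.buckets[hashKey(key, this.buckets.length)];
    const entry = bucket.find(([k]) => k === key);
    if (entry) entry[1] = value; // key already present: overwrite
    else bucket.push([key, value]);
  }

  get(key) {
    // only the one small bucket is searched, not the whole table
    const bucket = this.buckets[hashKey(key, this.buckets.length)];
    const entry = bucket.find(([k]) => k === key);
    return entry ? entry[1] : undefined;
  }
}

const table = new TinyHashTable();
table.set('alice', 'publisher');
table.set('bob', 'downloader');
console.log(table.get('alice'));
```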

Slide 17

Slide 17 text

data integrity!
Alice publishes a package on npm. Bob wants to know if the package he's downloading is the data Alice published, because he's worried about data corruption or tampering.

Slide 18

Slide 18 text

Alice creates a checksum of the data:
$ md5 < package.tgz
0030be42121988078dca0ec982d04f72
She gives that output number to Bob, who compares it to the md5 sum of his download to see if he has the data she meant to give him.
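The same check can be sketched with Node's built-in `crypto` module. The helper names `checksum` and `matchesPublished` are made up for this example. The default algorithm is `md5` only to mirror the slide, and the algorithm name is a parameter so a stronger one can be swapped in.

```javascript
const crypto = require('crypto');

// Hex digest of some bytes, using whichever algorithm name is passed
// through to crypto.createHash.
function checksum(data, algorithm = 'md5') {
  return crypto.createHash(algorithm).update(data).digest('hex');
}

// Bob's check: does the download hash to the value Alice published?
function matchesPublished(download, publishedDigest, algorithm = 'md5') {
  return checksum(download, algorithm) === publishedDigest.toLowerCase();
}

// Stand-in bytes; in practice this would be the contents of package.tgz.
const published = checksum(Buffer.from('pretend tarball bytes'));
console.log(matchesPublished(Buffer.from('pretend tarball bytes'), published));
console.log(matchesPublished(Buffer.from('tampered tarball bytes'), published));
```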

Slide 19

Slide 19 text

features of a good hash function
• output is deterministic
• output is uniformly distributed, not clustered
• output value has a fixed size
• usually: fast
• sometimes: similar inputs produce nearby outputs
• sometimes: similar inputs produce distant outputs (avalanche effect)

Slide 20

Slide 20 text

avalanche effect
a small change in input ➜ a large change in output

Slide 21

Slide 21 text

hashing cat and car with MD5
$ echo cat | md5
54b8617eca0e54c7d3c8e6732c6b687a
$ echo car | md5
5cd3e81fb747479797b62794c6bf6aaf
Even the battered md5 has the avalanche effect.
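One way to put a number on the avalanche effect is to count how many of the 128 output bits differ between the two digests. This is a small sketch using Node's `crypto` module, and `md5Bytes` and `differingBits` are names invented for the example. With a good avalanche effect you would expect roughly half the bits to flip.

```javascript
const crypto = require('crypto');

// Raw 16-byte MD5 digest of a string.
function md5Bytes(text) {
  return crypto.createHash('md5').update(text).digest();
}

// Count the bits that differ between two equal-length buffers.
function differingBits(a, b) {
  let count = 0;
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i]; // 1 bits mark the positions that differ
    while (x) {
      count += x & 1;
      x >>= 1;
    }
  }
  return count;
}

// echo appends a newline, so hash 'cat\n' and 'car\n' to match the slide.
const flipped = differingBits(md5Bytes('cat\n'), md5Bytes('car\n'));
console.log(`${flipped} of 128 bits differ`);
```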

Slide 22

Slide 22 text

locality-sensitive hashes, or similarity hashes, do exactly the opposite
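As a hedged illustration of "the opposite", here is a small SimHash-style sketch. SimHash is one well-known similarity hash, swapped in here as an example and not something the slides name. Each word votes on every output bit, so texts that share most of their words tend to end up with fingerprints that differ in only a few bits. The function names and the 32-bit size are assumptions for the example.

```javascript
// FNV-1a, 32-bit, used here only to turn each word into 32 pseudo-random bits.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (const byte of Buffer.from(text, 'utf8')) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// SimHash-style fingerprint: every word votes +1 or -1 on each bit position,
// and the final bit is 1 wherever the votes come out positive.
function simhash(text) {
  const votes = new Array(32).fill(0);
  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    const h = fnv1a(word);
    for (let bit = 0; bit < 32; bit++) {
      votes[bit] += (h >>> bit) & 1 ? 1 : -1;
    }
  }
  let result = 0;
  votes.forEach((v, bit) => {
    if (v > 0) result |= 1 << bit;
  });
  return result >>> 0;
}

// Number of bit positions where two fingerprints differ.
function hammingDistance(a, b) {
  let x = (a ^ b) >>> 0;
  let count = 0;
  while (x) {
    count += x & 1;
    x >>>= 1;
  }
  return count;
}

const a = simhash('the quick brown fox jumps over the lazy dog');
const b = simhash('the quick brown fox jumped over the lazy dog');
console.log(hammingDistance(a, b));
```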

Slide 23

Slide 23 text

copyright violation detection on youtube will have similarity hashing behind it

Slide 24

Slide 24 text

mostly we want the avalanche effect (sometimes called the butterfly effect)

Slide 25

Slide 25 text

often we want cryptographic hashes

Slide 26

Slide 26 text

a good cryptographic hash
• deterministic
• fast to compute
• has the avalanche effect
• reconstructing the original message from the hash is infeasible
• finding collisions is infeasible

Slide 27

Slide 27 text

collisions mean we can't tell that two inputs are different
deliberate collisions would be an attack

Slide 28

Slide 28 text

Back to the classic example!
Carol wishes to trick Bob into running her npm package instead of the one Alice wrote. Because Alice used the weak MD5 algorithm to checksum her data, Carol is able to craft a tarball that has the same MD5 digest but different data inside. She man-in-the-middle attacks Bob and serves him her cleverly-crafted package.tgz instead of Alice's. Bob is now pwned.

Slide 29

Slide 29 text

collision-resistance is crucial for verifying data integrity
This is why "breaking" SHA-1 matters.

Slide 30

Slide 30 text

Collision-resistance might not always matter

Slide 31

Slide 31 text

non-cryptographic hashes
• usually a lot faster than cryptographic hashes
• finding collisions might be feasible
• ditto reconstructing the original
• use when speed matters
• use when defense against malicious input doesn't matter

Slide 32

Slide 32 text

uses for non-cryptographic hashes
• bloom filters
• lookup tables
• sharding data uniformly
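To make the first of those uses concrete, here is a minimal Bloom filter sketch built on 32-bit FNV-1a. The `seed` parameter, the class name, and the default sizes are assumptions for illustration. Deriving several hashes by XORing a seed into the starting value is a shortcut, and a production filter would pick its hash functions more carefully.

```javascript
// FNV-1a, 32-bit: a small, fast non-cryptographic hash. The seed is an
// addition for this sketch so one function can stand in for several.
function fnv1a(text, seed = 0) {
  let hash = (0x811c9dc5 ^ seed) >>> 0;
  for (const byte of Buffer.from(text, 'utf8')) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

class BloomFilter {
  constructor(bitCount = 1024, hashCount = 3) {
    this.bits = new Uint8Array(bitCount); // one byte per bit, for clarity
    this.hashCount = hashCount;
  }

  // The bit positions an item maps to, one per seeded hash.
  positions(item) {
    const out = [];
    for (let seed = 0; seed < this.hashCount; seed++) {
      out.push(fnv1a(item, seed) % this.bits.length);
    }
    return out;
  }

  add(item) {
    for (const p of this.positions(item)) this.bits[p] = 1;
  }

  // false means "definitely never added"; true means "probably added".
  mightContain(item) {
    return this.positions(item).every((p) => this.bits[p] === 1);
  }
}

const seen = new BloomFilter();
seen.add('left-pad');
console.log(seen.mightContain('left-pad'));
```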

Slide 33

Slide 33 text

hashing functions to know
• MurmurHash
• CityHash
• HighwayHash
• xxHash
• seahash
• FNV, or Fowler-Noll-Vo

Slide 34

Slide 34 text

you can design your own non-crypto hash with a little math
cryptographic hashes are harder

Slide 35

Slide 35 text

cryptographic hashes are the workhorses of security

Slide 36

Slide 36 text

use a cryptographic hash
• to verify message integrity
• to verify passwords without knowing them
• to identify data

Slide 37

Slide 37 text

choose a cryptographic hash to defend against malicious input or attack

Slide 38

Slide 38 text

hashes suitable for passwords have some unusual properties

Slide 39

Slide 39 text

hashing passwords
• salt + password ➜ hashing function ➜ output
• store the output only
• repeat the transformation when checking the password to see if you get the same result
An attacker who can run that transformation frequently & quickly is one who can brute-force attack your users' passwords.

Slide 40

Slide 40 text

password hashes are tunably expensive

Slide 41

Slide 41 text

slow to run, use a lot of memory to slow down attackers

Slide 42

Slide 42 text

password hash algorithms
• bcrypt
• scrypt
• argon2
• PBKDF2
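A minimal sketch of the salt, hash, store, re-check flow, using the scrypt implementation that ships with Node's `crypto` module. scrypt is picked here only because it needs no extra dependency. `hashPassword` and `verifyPassword` are hypothetical helper names, the cost parameters are left at Node's defaults, and the salt has to be stored next to the output so the transformation can be repeated.

```javascript
const crypto = require('crypto');

// salt + password -> scrypt -> output; keep the salt and output only.
function hashPassword(password) {
  const salt = crypto.randomBytes(16); // fresh random salt per password
  const hash = crypto.scryptSync(password, salt, 64);
  return { salt: salt.toString('hex'), hash: hash.toString('hex') };
}

// Repeat the transformation with the stored salt and compare the results.
function verifyPassword(password, stored) {
  const salt = Buffer.from(stored.salt, 'hex');
  const expected = Buffer.from(stored.hash, 'hex');
  const actual = crypto.scryptSync(password, salt, expected.length);
  // constant-time comparison, so timing doesn't leak how many bytes matched
  return crypto.timingSafeEqual(actual, expected);
}

const record = hashPassword('correct horse battery staple');
console.log(verifyPassword('correct horse battery staple', record));
console.log(verifyPassword('hunter2', record));
```

`crypto.scryptSync` accepts an options object for its cost settings, which is where the "tunably expensive" part comes in.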

Slide 43

Slide 43 text

some cryptographic hashes you should know!
• md5
• the SHA family
• blake2
• siphash

Slide 44

Slide 44 text

MD5
• designed in 1991
• blown apart by 1996
• really fast
• do not use as a cryptographic hash
• can use for data bucketing if you trust the input

Slide 45

Slide 45 text

the SHA family
Secure Hash Algorithm
NIST standards

Slide 46

Slide 46 text

standardization means SHA algos are widely available & widely used

Slide 47

Slide 47 text

SHA-1
• 160-bit result (40 hex digits)
• very widely used (git shasums!)
• designed by NSA in 1995, 1st attack in 2005
• better attack in 2015
• collision found faster than brute force, by Google in 2017
• do not use as a cryptographic hash

Slide 48

Slide 48 text

npm uses SHA-1 for data integrity checks for tarballs

Slide 49

Slide 49 text

Back to that classic example!
Alice is using SHA-1 to checksum her package tarballs. Carol is an employee of Google or maybe of a nation-state's spy agency. Carol has a lot of computing power available, and really wants to pwn Bob. She cleverly crafts a tarball that has the same shasum as Alice's, and serves it to Bob. In ten years, Carol will be able to do this with a Raspberry Pi 7 the size of her thumbnail instead of a fleet of cloud computers.

Slide 50

Slide 50 text

SHA-2
• comes in 224, 256, 384 or 512 bit variants
• SHA-256 & SHA-512 are the most-used
• designed by NSA in 2001
• no feasible attacks known
• do use freely
• replace SHA-1 with this, generally
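With Node's `crypto` module, replacing SHA-1 with SHA-2 is mostly a matter of changing the algorithm name. This sketch uses a made-up helper, `hexDigest`, to print the digest size for each algorithm on the same input.

```javascript
const crypto = require('crypto');

// Hex digest of `data` under the named algorithm.
function hexDigest(algorithm, data) {
  return crypto.createHash(algorithm).update(data).digest('hex');
}

// Each hex digit carries 4 bits, so digest length * 4 is the output size.
for (const algorithm of ['sha1', 'sha256', 'sha512']) {
  const digest = hexDigest(algorithm, 'abc');
  console.log(algorithm, digest.length * 4, 'bits', digest);
}
```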

Slide 51

Slide 51 text

SHA-3
• comes in 224, 256, 384 or 512 bit variants
• won a competition to choose next SHA standard
• adopted as standard in 2015
• chosen for its differences from SHA-2
• no feasible attacks known
• do use freely

Slide 52

Slide 52 text

blake2
• successor to BLAKE, a SHA-3 finalist
• 32 bit word variant, BLAKE2s (produces 256-bit results)
• 64 bit word variant, BLAKE2b (produces 512-bit results)
• faster than the SHA-3 selection
• do use freely

Slide 53

Slide 53 text

hash flooding attacks
Hash-flooding DoS attacks use collisions to mount denial-of-service attacks by attacking a language's underlying hash implementation.
• send crafted data to an app, which stores it in a hash table
• the keys are designed to cause collisions
• all the data goes into one hash bucket
• hash table performance collapses into merely a linked list
• which is, as you know, Bob, a truly awful data structure
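The collapse is easy to reproduce against the toy Knuth hash from the earlier slide. In this sketch the "attacker" sends only multiples of 19, which all land in bucket 0 because k(k+3) mod 19 is 0 whenever 19 divides k. The helper `longestChain` and the key counts are assumptions for the demo, and attacks on real hash functions take far more work to craft.

```javascript
const buckets = 19;

// The Knuth-style hash from the earlier slide.
function knuth(k) {
  return (k * (k + 3)) % buckets;
}

// Put every key into its bucket and report the length of the longest chain.
function longestChain(keys) {
  const table = Array.from({ length: buckets }, () => []);
  for (const k of keys) table[knuth(k)].push(k);
  return Math.max(...table.map((chain) => chain.length));
}

// Ordinary keys spread over several buckets; crafted keys all collide.
const ordinary = Array.from({ length: 190 }, (_, i) => i);
const crafted = Array.from({ length: 190 }, (_, i) => i * buckets);

console.log('ordinary keys, longest chain:', longestChain(ordinary));
console.log('crafted keys, longest chain:', longestChain(crafted));
```

With the crafted keys, every lookup has to walk a single chain holding all 190 entries, which is the linked-list behaviour the slide describes.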

Slide 54

Slide 54 text

somebody did this to perl in 2003, and in 2011 to other languages by attacking MurmurHash

Slide 55

Slide 55 text

siphash
• designed to defend against hash flooding attacks
• optimized for speed with small input
• used in hash table implementations in many languages
• use it for your own hash table

Slide 56

Slide 56 text

which one should npm adopt?
• SHA-2, SHA-3, and BLAKE2 are all fine choices
• SHA-2 is safer because of implementation availability
• we are not size-sensitive about the output
• we care more about picking an algo that will last
• SHA-512 is a solid, safe choice

Slide 57

Slide 57 text

I heard git uses SHA-1. What's up with that?

Slide 58

Slide 58 text

git commit ids are SHA-1 hashes. collisions do bad things to your tree. Linus doesn't think it's a big deal.

Slide 59

Slide 59 text

John Gilmore is eating popcorn
"I tried to fix this when git was young, when it would've been easy. Linus rejected the suggestion and didn't seem to understand the threat. He wired assumptions about SHA1 deeply into git. In the next few years, nasty people will teach him the threat model, with ungentle manipulations of his and many other peoples' source trees."
--Gilmore

Slide 60

Slide 60 text

summary
• stop using MD5
• stop using SHA-1
• do use SHA-2 and newer hashes
• use bcrypt to store passwords
• don't invent unless you're an expert
• in which case you should be giving this talk not me

Slide 61

Slide 61 text

Questions? More-of-a-comment-reallys?

Slide 62

Slide 62 text

all the links
Every link in this presentation, in one place:
Hash function wiki page • bcrypt • scrypt • PBKDF2 • argon2 • MD5 • SHA-1 • SHA-2 • SHA-3 • BLAKE2 • seahash • xxhash • siphash • designing your own • Hash-flooding • git & sha-1 • Gilmore eats popcorn
Go! Learn more!
love, @ceejbot