Slide 1

Slide 1 text

Testing with Hypothesis
Aaron Meurer
2020-08-28

Slide 2

Slide 2 text

What is Hypothesis and property-based testing?

Slide 3

Slide 3 text

This is what typical tests look like

    # Function
    def sum(x, y):
        return x + y

    # Test
    def test_sum():
        assert sum(1, 2) == 3
        assert sum(1, -1) == 0

Inputs (the arguments) and outputs (the expected values) are written out explicitly.

Slide 4

Slide 4 text

Property tests are different

• Instead of writing several assert f(input) == output statements, we write tests for properties.
• A property is any (testable) statement about f that should hold true for all input/output pairs.
• We don’t worry about finding input/output pairs that check the property. Instead, assume we have a magical program that tells us whether the property is actually true for the code we are testing.

Slide 5

Slide 5 text

Examples of properties

• Mathematical properties:
  • f is commutative: assert f(x, y) == f(y, x)
  • f is idempotent: assert f(f(x)) == f(x)
  • If g is the inverse of f: assert g(f(x)) == x; assert f(g(y)) == y (f and g “round-trip”)
  • f satisfies some more advanced condition
• “Code” properties:
  • f returns the correct type
  • f does not raise an unexpected exception
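A few of the properties listed above can be sketched as Hypothesis tests. This is only an illustration: the functions under test (the built-in sorted() and UTF-8 encoding) and the test names are chosen here for the example and do not come from the talk.

```python
from hypothesis import given
from hypothesis.strategies import integers, lists, text


@given(lists(integers()))
def test_sorted_is_idempotent(xs):
    # Idempotence: f(f(x)) == f(x). Sorting a sorted list changes nothing.
    assert sorted(sorted(xs)) == sorted(xs)


@given(text())
def test_utf8_round_trip(s):
    # Round trip: g(f(x)) == x. Decoding an encoded string returns the original.
    assert s.encode("utf-8").decode("utf-8") == s


@given(lists(integers()))
def test_sorted_returns_list(xs):
    # "Code" property: the return type is always a list.
    assert isinstance(sorted(xs), list)
```

Calling any of these functions with no arguments runs it against generated inputs.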

Slide 6

Slide 6 text

Example property test

    # Function
    def sum(x, y):
        return x + y

    # Test
    def test_sum(x, y):
        # sum() is commutative
        assert sum(x, y) == sum(y, x)
        # 0 is identity
        assert sum(x, 0) == x

Inputs and outputs are not specified explicitly

Slide 7

Slide 7 text

Example property test

    from hypothesis import given
    from hypothesis.strategies import integers

    def sum(x, y):
        return x + y

    @given(integers(), integers())
    def test_sum(x, y):
        # sum() is commutative
        assert sum(x, y) == sum(y, x)
        # 0 is identity
        assert sum(x, 0) == x

    >>> test_sum()
    >>> # Passed

Slide 8

Slide 8 text

How to write a property test

• Tell Hypothesis what the types of the inputs are, and it will automatically generate examples.
• This is done with the @given decorator and “strategies”.
• Hypothesis has built-in strategies for all the common Python datatypes, as well as strategies for NumPy datatypes.
• Hypothesis strategies are easy to compose.
• Example:

    json = recursive(
        none() | booleans() | floats() | text(string.printable),
        lambda children: lists(children, min_size=1)
        | dictionaries(text(string.printable), children, min_size=1))
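As a smaller illustration of composing strategies, the sketch below combines built-in strategies with tuples(), lists(), the | operator and .map(). The "record" shape of (id, name) pairs is hypothetical and exists only for this example.

```python
from hypothesis import given
from hypothesis.strategies import integers, lists, none, text, tuples

# Hypothetical record shape for illustration: (positive id, non-empty name).
records = tuples(integers(min_value=1), text(min_size=1))

# lists() wraps any strategy; .map() transforms each generated value.
record_dicts = lists(records, max_size=5).map(dict)

# The | operator picks from either strategy.
optional_ids = integers(min_value=1) | none()


@given(record_dicts, optional_ids)
def test_composed_strategies(d, maybe_id):
    assert isinstance(d, dict)
    assert all(key >= 1 and len(name) >= 1 for key, name in d.items())
    assert maybe_id is None or maybe_id >= 1
```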

Slide 9

Slide 9 text

Example property test

Now try with floats instead of integers

    from hypothesis import given
    from hypothesis.strategies import floats

    def sum(x, y):
        return x + y

    @given(floats(), floats())
    def test_sum(x, y):
        # sum() is commutative
        assert sum(x, y) == sum(y, x)
        # 0 is identity
        assert sum(x, 0) == x

    >>> test_sum()
    Falsifying example: test_sum(
        x=0.0, y=nan,
    )
    Traceback (most recent call last):
      File "", line 1, in
      File "", line 5, in test_sum
      File "/Users/aaronmeurer/anaconda3/lib/python3.7 1142, in wrapped_test
        raise the_error_hypothesis_found
      File "", line 7, in test_sum
    AssertionError

Hypothesis found an example that failed the test

Slide 10

Slide 10 text

Example property test

Hypothesis found an example that failed the test (x=0.0, y=nan).

• We can fix it by doing one of three things:
  • Fix sum() to make the property hold (the function has a bug).
  • Fix the property to account for the bad example (the tested property is wrong).
  • Filter the bad example from the input strategies (we don’t care if this corner case invalidates the tested property).
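The second option, fixing the property, might look like the sketch below. The helper same_float() is hypothetical and introduced only for this example; it compares floats while treating NaN as equal to NaN, so the test keeps running on all floats.

```python
import math

from hypothesis import given
from hypothesis.strategies import floats


def sum(x, y):
    return x + y


def same_float(a, b):
    """Hypothetical helper: equality that treats NaN as equal to NaN."""
    return (math.isnan(a) and math.isnan(b)) or a == b


@given(floats(), floats())
def test_sum(x, y):
    # sum() is commutative, counting NaN results as matching
    assert same_float(sum(x, y), sum(y, x))
    # 0 is identity, counting NaN results as matching
    assert same_float(sum(x, 0), x)
```

Whether this or filtering out NaN is the better choice depends on whether NaN inputs matter for the code being tested.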

Slide 11

Slide 11 text

Example property test

• For this case, we might want to filter out NaNs, which looks like this:

    from hypothesis import given
    from hypothesis.strategies import floats

    def sum(x, y):
        return x + y

    @given(floats(allow_nan=False), floats(allow_nan=False))
    def test_sum(x, y):
        # sum() is commutative
        assert sum(x, y) == sum(y, x)
        # 0 is identity
        assert sum(x, 0) == x

    >>> test_sum()
    >>> # Passed

Slide 12

Slide 12 text

More advanced example

Slide 13

Slide 13 text

More advanced example

Slide 14

Slide 14 text

More advanced example

Slide 15

Slide 15 text

Demo: Hypothesis testing in ndindex

Slide 16

Slide 16 text

How Hypothesis works

• Hypothesis is not the same as random testing.
• Think of it as adversarial testing. It is random, but also always tries examples that are known to trigger corner cases (like NaN in our example). This includes a database of previous known failures.
• It’s similar to fuzzing, if you are familiar with that concept.*

* The Hypothesis developers assert that “property based testing” and “fuzzing” are not exactly the same thing.

Slide 17

Slide 17 text

When does Hypothesis work well?

• Any situation where the behavior of code is defined by easily testable properties.
  • Mathematical functions or mathematical-like objects (like arrays)
  • Even non-mathematical code has desirable properties that can be tested, e.g., doesn’t raise exceptions, doesn’t take too long to run, etc.
• Ideal situation: the code is defined by a spec or reference implementation
  • Can write tests for exactly what the spec says
  • assert our_function(x) == reference_function(x)
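The reference-implementation pattern can be sketched as follows. Here insertion_sort() is a hypothetical stand-in for "our function", and the built-in sorted() plays the role of the reference.

```python
from hypothesis import given
from hypothesis.strategies import integers, lists


def insertion_sort(xs):
    """Hypothetical function under test: a simple insertion sort."""
    result = []
    for x in xs:
        # Walk left past every element that is larger than x.
        i = len(result)
        while i > 0 and result[i - 1] > x:
            i -= 1
        result.insert(i, x)
    return result


@given(lists(integers()))
def test_matches_reference(xs):
    # our_function(x) == reference_function(x)
    assert insertion_sort(xs) == sorted(xs)
```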

Slide 18

Slide 18 text

Benefits of Hypothesis

• You will find bugs you wouldn’t have found otherwise (though your users might have).
• Writing tests as high-level properties helps to keep your code consistent, even in corner cases.
• Your tests become better the more they are run.

Slide 19

Slide 19 text

Benefits of Hypothesis

• Writing code against a Hypothesis test feels almost like solving a puzzle or playing a programming game. “That’s close, but you didn’t consider this corner case.”
• You can write a test that tests code that isn’t written yet. Just ignore NotImplementedError or omit cases from the input strategies. When you implement it, the test will automatically start testing it.
• Shrinking
  • Hypothesis will automatically try to make a counterexample as “small” as possible, e.g., if a function has a bug with odd integers it will give 1 as the failing example input instead of 5728175.
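Shrinking can be observed directly with hypothesis.find(), which searches a strategy for a value satisfying a condition and returns the smallest one it can reach. In this sketch the predicate is_odd() is hypothetical and stands in for "inputs that trigger the bug".

```python
from hypothesis import find
from hypothesis.strategies import integers


def is_odd(n):
    """Hypothetical stand-in for 'this input triggers the bug'."""
    return n % 2 == 1


# find() shrinks whatever odd integer it first hits toward a minimal one.
smallest = find(integers(), is_odd)
```

The result should be a small odd integer rather than an arbitrary large one, in line with the example of 1 versus 5728175 above.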

Slide 20

Slide 20 text

Benefits of Hypothesis

• Generally very user friendly:
  • Helpful error messages.
  • Verbose flags to help you see what is going on internally.
  • Built-in health checks check for common errors or other problems.
• Very helpful developer community.

Slide 21

Slide 21 text

Caveats of Hypothesis

Writing property tests is hard

• It’s not obvious what properties should hold for a given function.
• What is the correct way to test things like equality (especially a problem for tests involving floats)?
• Hypothesis may help you find out that a property that you thought should always be true actually isn’t.
• You are forced to think, and actually understand the code you are testing.

Raymond Hettinger has a great set of tweets on this topic: raymondh/status/1292548482109067265
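One common answer to the float equality question is to compare with a tolerance. The sketch below assumes a relative tolerance of 1e-12 is acceptable for the property being tested; both the tolerance and the choice of math.sqrt as the function under test are illustrative.

```python
import math

from hypothesis import given
from hypothesis.strategies import floats


@given(floats(min_value=0.0, max_value=1e6))
def test_sqrt_round_trip(x):
    # Exact equality (math.sqrt(x) ** 2 == x) fails for inputs such as 2.0,
    # so the property is stated with a relative tolerance instead.
    assert math.isclose(math.sqrt(x) ** 2, x, rel_tol=1e-12)
```

Choosing the tolerance is itself part of understanding the code: too tight and the test fails on rounding noise, too loose and it hides real errors.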

Slide 22

Slide 22 text

Caveats of Hypothesis

All the usual caveats of random tests.

• Sometimes a failing example is only in a very small part of the search space. You may not find it right away.
• You might find a failure much later on CI in an unrelated PR.
• But:
  • You can seed the tests with --hypothesis-seed or @seed.
  • You can add explicit examples with @example.
  • Previous failures are recorded and saved in a local example database, so they will show up in future test runs.
  • Hypothesis supplements classical tests; it does not replace them.
  • These are exactly the sorts of bugs you want to find.
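The @seed and @example decorators mentioned above can be combined on one test, roughly as in this sketch. The seed value and the explicit examples are arbitrary choices for illustration.

```python
from hypothesis import example, given, seed
from hypothesis.strategies import integers


@seed(20200828)   # arbitrary seed: makes the generated examples reproducible
@given(integers())
@example(0)       # explicit examples are always tried
@example(-1)
def test_abs_is_nonnegative(n):
    assert abs(n) >= 0
```

Explicit examples are a convenient way to pin a failure that Hypothesis once found, so it is checked on every run regardless of what is generated.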

Slide 23

Slide 23 text

Caveats of Hypothesis

Upstream Bugs

• You will find bugs in upstream dependencies.
• Case in point: with ndindex I’ve found half a dozen bugs in NumPy. About half of those were not known to the NumPy developers before I reported them. Others were deprecated behavior.
• You will have to work around these (or submit upstream fixes).
• You will also, obviously, find bugs in your own code. This can make adding Hypothesis to an existing codebase difficult.

Slide 24

Slide 24 text

Caveats of Hypothesis

The threshold problem

• Sometimes shrinking makes a bug seem much less interesting than it actually is.
• For example, if a test tests assert abs(value) < 1e-15, Hypothesis may find an example that makes value == 1.000000000000001e-15. It doesn’t necessarily mean the 1e-15 threshold is slightly too small, just that Hypothesis made value as small as possible to still fail the test.
• It also likes to find examples involving “trivial” cases, like empty lists, empty arrays, etc. Anything that isn’t relevant to the failure will be shrunk away.
• You can easily be tricked into thinking the bug only occurs in empty or corner cases.
• It’s just something to be aware of when using Hypothesis.
• See for more information.

Slide 25

Slide 25 text

Caveats of Hypothesis

Search space

• If the search space for the examples is too large, it may have a hard time finding meaningful examples.
• Can be fixed by building custom strategies that generate useful examples more often.
• Can also use target(), which attempts to do this automatically via an optimization algorithm.
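A minimal sketch of target() follows. The property (the mean of two floats lies between them) and the score passed to target() are illustrative assumptions; the idea is that Hypothesis tries to generate inputs that increase the score.

```python
from hypothesis import given, target
from hypothesis.strategies import floats

bounded = floats(min_value=0.0, max_value=1e6)


@given(bounded, bounded)
def test_mean_is_between_inputs(x, y):
    mean = (x + y) / 2
    # Steer the search toward inputs that are far apart.
    target(abs(x - y), label="distance between inputs")
    assert min(x, y) <= mean <= max(x, y)
```

target() only guides generation; the assertions still decide whether the test passes.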

Slide 26

Slide 26 text

Caveats of Hypothesis

Search space

• Example: in ndindex, I have strategies to generate random indices idx and random arrays a to test a[idx]. If idx is a boolean array index, a[idx] raises an IndexError unless its shape matches the shape of a.
• If Hypothesis generates the two randomly, the chance that they match is slim.
• It wasn’t finding simple bugs, because it couldn’t generate examples that didn’t trivially give an IndexError, even after running millions of examples.
• Solution: The shape for idx can be made to be related to the shape of a using the shared() strategy.

Slide 27

Slide 27 text

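Caveats of Hypothesis

Search space

    from numpy import bool_
    from hypothesis.extra.numpy import arrays
    from hypothesis.strategies import composite, integers, shared, tuples

    shapes = tuples(integers(0, 10))
    # shared() makes every use within one test case draw the same shape
    shared_shapes = shared(shapes)

    @composite
    def subsequences(draw, sequence):
        # Draw a contiguous slice of the drawn sequence
        seq = draw(sequence)
        start = draw(integers(0, max(0, len(seq)-1)))
        stop = draw(integers(start, len(seq)))
        return seq[start:stop]

    # dtype bool_ (the slide text reads intp) so these are boolean array indices
    boolean_arrays = arrays(bool_, subsequences(shared_shapes))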

Slide 28

Slide 28 text

Caveats of Hypothesis

Search space

• The important thing is to remember that Hypothesis isn’t magic. It may take some work in more advanced cases to make it useful.
• A Hypothesis test passing may mean your code works, or it may mean Hypothesis just hasn’t found the bugs in it (yet).
• You should regularly “test” that your tests work (e.g., introduce a known bug and see if they catch it).
• Hypothesis runs ~100 examples per test by default, but you can run many more by raising the max_examples setting.
  • e.g., run a million or billion examples overnight to see if it finds anything
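Raising the number of examples for a single test can be done with the settings decorator, as in this sketch. The value 1000 and the property under test are arbitrary choices for illustration.

```python
from hypothesis import given, settings
from hypothesis.strategies import integers, lists


@settings(max_examples=1000)  # the default is about 100
@given(lists(integers()))
def test_reverse_twice_is_identity(xs):
    assert list(reversed(list(reversed(xs)))) == xs
```

For an overnight run the same setting can simply be given a much larger value.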

Slide 29

Slide 29 text

Caveats of Hypothesis

Hard to ignore bugs

• Sometimes Hypothesis finds a bug that you don’t really care about.
  • Or at least don’t want to care about right now.
• But if an example fails a test, Hypothesis will always find it. The only way to shut it up is to either fix the bug, or modify the test or strategies to ignore certain cases (in other words, make the test “wrong”). It’s not like traditional tests where you can just XFAIL the specific test case and ignore it.
• This is ultimately a good thing. It leads to correct code. But it can make Hypothesis difficult to use, especially for already existing codebases.
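One way to modify a test so that it ignores certain cases is assume(), which discards an example instead of failing on it. In this sketch int_sqrt() is a hypothetical function with a known gap for negative input; it requires Python 3.8 or later for math.isqrt.

```python
import math

from hypothesis import assume, given
from hypothesis.strategies import integers


def int_sqrt(n):
    """Hypothetical function under test; negative input is a known gap."""
    if n < 0:
        raise NotImplementedError("negative input is not handled yet")
    return math.isqrt(n)


@given(integers())
def test_int_sqrt(n):
    assume(n >= 0)  # deliberately skip the cases we do not care about yet
    r = int_sqrt(n)
    assert r * r <= n < (r + 1) ** 2
```

Once negative input is handled, removing the assume() line makes the test cover it automatically. Restricting the strategy itself, for example integers(min_value=0), has the same effect without discarding examples.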

Slide 30

Slide 30 text
