Aaron Meurer
August 28, 2020

# Testing with Hypothesis

These slides go over the basics of testing with the Python Hypothesis library. They list some of the advantages of using Hypothesis, as well as some of the most common caveats to be aware of.

Unfortunately, SpeakerDeck does not allow you to click on links in decks. Here are the links from the deck:

- Hypothesis main webpage from slide 1: https://hypothesis.readthedocs.io/

- Computer Science Fact tweet from slide 12: https://twitter.com/CompSciFact/status/1294265334187384835

- The Raymond Hettinger tweet from slide 21: https://twitter.com/raymondh/status/1292548482109067265

- The article about the threshold problem from slide 24: https://hypothesis.works/articles/threshold-problem/

- Hypothesis with ndindex: https://quansight-labs.github.io/ndindex/#testing-and-correctness

Feel free to email me if you have any questions: [email protected]


## Transcript

3. ### This is what typical tests look like

    ```python
    # Function
    def sum(x, y):
        return x + y

    # Test
    def test_sum():
        assert sum(1, 2) == 3
        assert sum(1, -1) == 0
    ```

    The inputs and outputs are specified explicitly.
4. ### Property tests are different

    - Instead of writing several `assert f(input) == output` checks, we write tests for *properties*.
    - A property is any (testable) statement about `f` that should hold true for all input/output pairs.
    - We don't worry about finding input/output pairs that check the property. Instead, assume we have a magical program that tells us if it is actually true for the code we are testing.
5. ### Examples of properties

    - Mathematical properties:
        - `f` is commutative: `assert f(x, y) == f(y, x)`
        - `f` is idempotent: `assert f(f(x)) == f(x)`
        - If `g` is the inverse of `f`: `assert g(f(x)) == x; assert f(g(y)) == y` (`f` and `g` "round-trip")
        - `f` satisfies some more advanced condition
    - "Code" properties:
        - `f` returns the correct type
        - `f` does not raise an unexpected exception
6. ### Example property test

    ```python
    # Function
    def sum(x, y):
        return x + y

    # Test
    def test_sum(x, y):
        # sum() is commutative
        assert sum(x, y) == sum(y, x)
        # 0 is identity
        assert sum(x, 0) == x
    ```

    Inputs and outputs are not specified explicitly.
7. ### Example property test

    ```python
    from hypothesis import given
    from hypothesis.strategies import integers

    def sum(x, y):
        return x + y

    @given(integers(), integers())
    def test_sum(x, y):
        # sum() is commutative
        assert sum(x, y) == sum(y, x)
        # 0 is identity
        assert sum(x, 0) == x
    ```

    ```
    >>> test_sum()
    >>> # Passed
    ```
8. ### How to write a property test

    - Tell Hypothesis what the types of the inputs are, and it will automatically generate examples.
    - This is done with the `@given` decorator and "strategies".
    - Hypothesis has built-in strategies for all the common Python datatypes, as well as strategies for NumPy datatypes.
    - Hypothesis strategies are easy to compose. Example:

        ```python
        json = recursive(
            none() | booleans() | floats() | text(string.printable),
            lambda children: lists(children, min_size=1)
            | dictionaries(text(string.printable), children, min_size=1),
        )
        ```
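The composed JSON strategy above can be exercised directly. Here is a minimal runnable sketch; the test name and the recursive type check are my own additions, not from the deck:

```python
import string

from hypothesis import given
from hypothesis.strategies import (
    booleans, dictionaries, floats, lists, none, recursive, text,
)

# Leaves are None, booleans, floats, or printable strings; branches are
# non-empty lists or non-empty dicts with string keys.
json = recursive(
    none() | booleans() | floats() | text(string.printable),
    lambda children: lists(children, min_size=1)
    | dictionaries(text(string.printable), children, min_size=1),
)

@given(json)
def test_json_values_are_json_shaped(value):
    # Every generated value is built only from the allowed types.
    def check(v):
        if isinstance(v, list):
            for item in v:
                check(item)
        elif isinstance(v, dict):
            for key, item in v.items():
                assert isinstance(key, str)
                check(item)
        else:
            assert v is None or isinstance(v, (bool, float, str))

    check(value)

test_json_values_are_json_shaped()
```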
9. ### Example property test

    Now try with floats instead of integers:

    ```python
    from hypothesis import given
    from hypothesis.strategies import floats

    def sum(x, y):
        return x + y

    @given(floats(), floats())
    def test_sum(x, y):
        # sum() is commutative
        assert sum(x, y) == sum(y, x)
        # 0 is identity
        assert sum(x, 0) == x
    ```

    ```
    >>> test_sum()
    Falsifying example: test_sum(
        x=0.0,
        y=nan,
    )
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 5, in test_sum
      File "/Users/aaronmeurer/anaconda3/lib/python3.7…", line 1142, in wrapped_test
        raise the_error_hypothesis_found
      File "<stdin>", line 7, in test_sum
    AssertionError
    ```

    Hypothesis found an example that failed the test.
10. ### Example property test

    We can fix the falsifying example from the previous slide by doing one of three things:

    - Fix `sum()` to make the property hold (the function has a bug).
    - Fix the property to account for the bad example (the tested property is wrong).
    - Filter the bad example from the input strategies (we don't care if this corner case invalidates the tested property).
11. ### Example property test

    For this case, we might want to filter out NaNs, which looks like this:

    ```python
    from hypothesis import given
    from hypothesis.strategies import floats

    def sum(x, y):
        return x + y

    @given(floats(allow_nan=False), floats(allow_nan=False))
    def test_sum(x, y):
        # sum() is commutative
        assert sum(x, y) == sum(y, x)
        # 0 is identity
        assert sum(x, 0) == x
    ```

    ```
    >>> test_sum()
    >>> # Passed
    ```
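Filtering (the third option from slide 10) can also be applied to an existing strategy with `.filter()`. This sketch, which is my own and not from the deck, keeps only finite floats, which excludes infinities as well as NaN and is therefore slightly stricter than `allow_nan=False`:

```python
import math

from hypothesis import given
from hypothesis.strategies import floats

def sum(x, y):
    return x + y

# .filter() discards generated examples for which the predicate is False.
finite_floats = floats().filter(math.isfinite)

@given(finite_floats, finite_floats)
def test_sum(x, y):
    # sum() is commutative
    assert sum(x, y) == sum(y, x)
    # 0 is identity
    assert sum(x, 0) == x

test_sum()
```

Filtering is convenient, but if the predicate rejects most examples, Hypothesis will warn or fail; prefer strategy arguments like `allow_nan=False` when they exist.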

16. ### How Hypothesis works

    - Hypothesis is not the same as random testing.
    - Think of it as adversarial testing. It is random, but it also always tries examples that are known to trigger corner cases (like NaN in our example). This includes a database of previous known failures.
    - It's similar to fuzzing, if you are familiar with that concept.*

    \* The Hypothesis developers assert that "property based testing" and "fuzzing" are not exactly the same thing.
17. ### When does Hypothesis work well?

    - Any situation where the behavior of code is defined by easily testable properties.
        - Mathematical functions or mathematical-like objects (like arrays)
        - Even non-mathematical code has desirable properties that can be tested, e.g., doesn't raise exceptions, doesn't take too long to run, etc.
    - Ideal situation: the code is defined by a spec or reference implementation.
        - Can write tests for exactly what the spec says
        - `assert our_function(x) == reference_function(x)`
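The `assert our_function(x) == reference_function(x)` pattern can be sketched with a hypothetical hand-written sort checked against Python's built-in `sorted()` as the reference implementation (both names here are illustrative, not from the deck):

```python
from hypothesis import given
from hypothesis.strategies import integers, lists

# Hypothetical code under test: a simple insertion sort.
def our_sort(seq):
    result = list(seq)
    for i in range(1, len(result)):
        j = i
        # Bubble the new element left until it is in place.
        while j > 0 and result[j - 1] > result[j]:
            result[j - 1], result[j] = result[j], result[j - 1]
            j -= 1
    return result

@given(lists(integers()))
def test_matches_reference(seq):
    # sorted() is the trusted reference implementation.
    assert our_sort(seq) == sorted(seq)

test_matches_reference()
```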
18. ### Benefits of Hypothesis

    - You will find bugs you wouldn't have found otherwise (though your users might have).
    - Writing tests as high-level properties helps to keep your code consistent, even in corner cases.
    - Your tests become better the more they are run.
19. ### Benefits of Hypothesis

    - Writing code against a Hypothesis test feels almost like solving a puzzle or playing a programming game: "That's close, but you didn't consider this corner case."
    - You can write a test that tests code that isn't written yet. Just ignore `NotImplementedError` or omit cases from the input strategies. When you implement it, the test will automatically start testing it.
    - Shrinking: Hypothesis will automatically try to make a counterexample as "small" as possible. E.g., if a function has a bug with odd integers, it will give 1 as the failing example input instead of 5728175.
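Shrinking can be observed directly with `hypothesis.find()`, which returns a minimal example satisfying a condition using the same shrinking machinery applied to failing test inputs. A small sketch of the odd-integer case described above:

```python
from hypothesis import find
from hypothesis.strategies import integers

# find() locates an example satisfying the predicate, then shrinks it.
# For "is odd" over all integers, the shrunk result is 1, not some
# arbitrarily large random odd number.
smallest_odd = find(integers(), lambda x: x % 2 == 1)
print(smallest_odd)
```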
20. ### Benefits of Hypothesis

    Generally very user friendly:

    - Helpful error messages.
    - Verbose flags to help you see what is going on internally.
    - Built-in health checks check for common errors or other problems.
    - Very helpful developer community.
21. ### ~~Downsides~~ Caveats of Hypothesis: Writing property tests is hard

    - It's not obvious what properties should hold for a given function.
    - What is the correct way to test things like equality (especially a problem for tests involving floats)?
    - Hypothesis may help you find out that a property that you thought should always be true actually isn't.
    - You are forced to think, and actually understand the code you are testing.

    Raymond Hettinger has a great set of tweets on this topic: https://twitter.com/raymondh/status/1292548482109067265
22. ### ~~Downsides~~ Caveats of Hypothesis: All the usual caveats of random tests

    - Sometimes a failing example is only in a very small part of the search space. You may not find it right away.
        - You might find a failure much later on CI in an unrelated PR.
    - But:
        - You can seed the tests with `--hypothesis-seed` or `@seed`.
        - You can add explicit examples with `@example`.
        - Previous failures are recorded and saved in a local example database, so they will show up in future test runs.
        - Hypothesis supplements, not replaces, classical tests.
        - These are exactly the sorts of bugs you want to find.
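The `@seed` and `@example` decorators mentioned above can be sketched together; the seed value and pinned example here are my own illustrations:

```python
from hypothesis import example, given, seed
from hypothesis.strategies import floats

def sum(x, y):
    return x + y

# @seed makes this test's example generation reproducible;
# @example pins explicit inputs that are tried on every run,
# in addition to the randomly generated ones.
@seed(12345)
@given(floats(allow_nan=False, allow_infinity=False),
       floats(allow_nan=False, allow_infinity=False))
@example(x=0.0, y=-0.0)  # a corner case we always want covered
def test_sum(x, y):
    # sum() is commutative
    assert sum(x, y) == sum(y, x)

test_sum()
```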
23. ### ~~Downsides~~ Caveats of Hypothesis: Upstream bugs

    - You ~~might~~ *will* find bugs in upstream dependencies.
        - Case in point: with ndindex I've found half a dozen bugs in NumPy. About half of those were not known to the NumPy developers before I reported them. Others were deprecated behavior.
        - You will have to work around these (or submit upstream fixes).
    - You will also, obviously, find bugs in your own code. This can make adding Hypothesis to an existing codebase difficult.
24. ### ~~Downsides~~ Caveats of Hypothesis: The threshold problem

    - Sometimes shrinking makes a bug seem much less interesting than it actually is.
        - For example, if a test asserts `abs(value) < 1e-15`, Hypothesis may find an example that makes `value == 1.000000000000001e-15`. This doesn't necessarily mean the 1e-15 threshold is slightly too small, just that Hypothesis made `value` as small as possible while still failing the test.
    - It also likes to find examples involving "trivial" cases, like empty lists, empty arrays, etc. Anything that isn't relevant to the failure will be shrunk away.
        - You can easily be tricked into thinking the bug only occurs in empty or corner cases.
    - It's just something to be aware of when using Hypothesis.
    - See https://hypothesis.works/articles/threshold-problem/ for more information.
25. ### ~~Downsides~~ Caveats of Hypothesis: Search space

    - If the search space for the examples is too large, Hypothesis may have a hard time finding meaningful examples.
    - This can be fixed by building custom strategies that generate useful examples more often.
    - You can also use `target()`, which attempts to do this automatically via an optimization algorithm.
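A minimal sketch of `target()`: inside a `@given` test, you report a numeric score, and Hypothesis steers generation toward inputs that maximize it. The property and bounds below are my own illustration, not from the deck:

```python
from hypothesis import given, target
from hypothesis.strategies import floats

@given(floats(0, 100), floats(0, 100))
def test_triangle_inequality(x, y):
    # Steer Hypothesis toward large sums, where overflow-like
    # or boundary behavior would be most likely to show up.
    target(x + y, label="sum of inputs")
    assert abs(x + y) <= abs(x) + abs(y)

test_triangle_inequality()
```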
26. ### ~~Downsides~~ Caveats of Hypothesis: Search space

    - Example: in ndindex, I have strategies to generate random indices `idx` and random arrays `a` to test `a[idx]`. If `idx` is a boolean array index, `a[idx]` is an `IndexError` unless its shape matches the shape of `a`.
    - If Hypothesis generates the two randomly, the chance they will match is slim.
    - It wasn't finding simple bugs, because it couldn't generate examples that didn't trivially give an `IndexError`, even after running millions of examples.
    - Solution: the shape for `idx` can be made to be related to the shape of `a` using the `shared()` strategy.
27. ### ~~Downsides~~ Caveats of Hypothesis: Search space

    ```python
    from hypothesis.strategies import composite, integers, shared, tuples
    from hypothesis.extra.numpy import arrays
    from numpy import intp

    shapes = tuples(integers(0, 10))
    shared_shapes = shared(shapes)

    @composite
    def subsequences(draw, sequence):
        seq = draw(sequence)
        start = draw(integers(0, max(0, len(seq) - 1)))
        stop = draw(integers(start, len(seq)))
        return seq[start:stop]

    boolean_arrays = arrays(intp, subsequences(shared_shapes))
    ```
28. ### ~~Downsides~~ Caveats of Hypothesis: Search space

    - The important thing to remember is that Hypothesis isn't magic. It may take some work in more advanced cases to make it useful.
    - A Hypothesis test passing may mean your code works, or it may mean Hypothesis just hasn't found the bugs in it (yet).
    - You should regularly "test" that your tests work (e.g., introduce a known bug and see if they catch it).
    - Hypothesis runs ~100 examples per test by default, but you can manually run a lot of examples by setting the `max_examples` setting.
        - E.g., run a million or a billion examples overnight to see if it finds anything.
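Raising the example count is done with the `settings` decorator; a minimal sketch (the count here is illustrative):

```python
from hypothesis import given, settings
from hypothesis.strategies import integers

def sum(x, y):
    return x + y

# Override the default of ~100 examples per test run. For an overnight
# run, this could be set to a much larger number.
@settings(max_examples=1000)
@given(integers(), integers())
def test_sum(x, y):
    # sum() is commutative
    assert sum(x, y) == sum(y, x)

test_sum()
```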
29. ### ~~Downsides~~ Caveats of Hypothesis: Hard to ignore bugs

    - Sometimes Hypothesis finds a bug that you don't really care about, or at least don't want to care about right now.
    - But if an example fails a test, Hypothesis will always find it. The only way to shut it up is to either fix the bug, or modify the test or strategies to ignore certain cases (in other words, make the test "wrong"). It's not like traditional tests, where you can just XFAIL the specific test case and ignore it.
    - This is ultimately a good thing. It leads to correct code. But it can make Hypothesis difficult to use, especially for already existing codebases.