Testing with Hypothesis

These slides go over the basics of testing with the Python Hypothesis library. They list some of the advantages of using Hypothesis, as well as some of the most common caveats to be aware of.

Unfortunately, SpeakerDeck does not allow you to click on links in decks. Here are the links from the deck:

- Hypothesis main webpage from slide 1: https://hypothesis.readthedocs.io/

- Computer Science Fact tweet from slide 12: https://twitter.com/CompSciFact/status/1294265334187384835

- The Raymond Hettinger tweet from slide 21: https://twitter.com/raymondh/status/1292548482109067265

- The article about the threshold problem from slide 24: https://hypothesis.works/articles/threshold-problem/

- Hypothesis with ndindex: https://quansight-labs.github.io/ndindex/#testing-and-correctness

Feel free to email me at [email protected] if you have any questions.

Aaron Meurer

August 28, 2020

Transcript

  1. This is what typical tests look like

         # Function
         def sum(x, y):
             return x + y

         # Test
         def test_sum():
             assert sum(1, 2) == 3
             assert sum(1, -1) == 0

     (The slide annotates the literal arguments as the inputs and the expected values as the outputs.)

  2. Property tests are different

     • Instead of writing several assert f(input) == output checks, we write tests for properties.
     • A property is any (testable) statement about f that should hold true for all input/output pairs.
     • We don’t worry about finding input/output pairs that check the property. Instead, assume we have a magical program that tells us whether the property actually holds for the code we are testing.

  3. Examples of properties

     • Mathematical properties:
       • f is commutative: assert f(x, y) == f(y, x)
       • f is idempotent: assert f(f(x)) == f(x)
       • If g is the inverse of f: assert g(f(x)) == x; assert f(g(y)) == y (f and g “round-trip”; see the sketch below)
       • f satisfies some more advanced condition
     • “Code” properties:
       • f returns the correct type
       • f does not raise an unexpected exception

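     A concrete round-trip example (not from the deck; it uses the standard library's json module):

         import json

         from hypothesis import given
         from hypothesis.strategies import text

         # dumps/loads should round-trip: loads is the inverse of dumps.
         @given(text())
         def test_json_round_trip(s):
             assert json.loads(json.dumps(s)) == s
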
  4. Example property test

         # Function
         def sum(x, y):
             return x + y

         # Test
         def test_sum(x, y):
             # sum() is commutative
             assert sum(x, y) == sum(y, x)
             # 0 is identity
             assert sum(x, 0) == x

     Inputs and outputs are not specified explicitly.

  5. Example property test

         from hypothesis import given
         from hypothesis.strategies import integers

         def sum(x, y):
             return x + y

         @given(integers(), integers())
         def test_sum(x, y):
             # sum() is commutative
             assert sum(x, y) == sum(y, x)
             # 0 is identity
             assert sum(x, 0) == x

         >>> test_sum()
         >>> # Passed

  6. How to write a property test

     • Tell Hypothesis what the types of the inputs are, and it will automatically generate examples.
     • This is done with the @given decorator and “strategies”.
     • Hypothesis has built-in strategies for all the common Python datatypes, as well as strategies for NumPy datatypes.
     • Hypothesis strategies are easy to compose. Example (a strategy for arbitrary JSON-like data, shown here with its imports added):

         import string

         from hypothesis.strategies import (booleans, dictionaries, floats,
                                            lists, none, recursive, text)

         json = recursive(
             none() | booleans() | floats() | text(string.printable),
             lambda children: lists(children, min_size=1)
                 | dictionaries(text(string.printable), children, min_size=1))

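     As an aside not on the slide: you can sanity-check a strategy by drawing sample values with .example() (intended for interactive use only, not inside tests):

         # Each call returns a freshly generated example; output will vary.
         print(json.example())
         print(json.example())
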
  7. Example property test (now with floats instead of integers)

         from hypothesis import given
         from hypothesis.strategies import floats

         def sum(x, y):
             return x + y

         @given(floats(), floats())
         def test_sum(x, y):
             # sum() is commutative
             assert sum(x, y) == sum(y, x)
             # 0 is identity
             assert sum(x, 0) == x

         >>> test_sum()
         Falsifying example: test_sum(
             x=0.0,
             y=nan,
         )
         Traceback (most recent call last):
           File "<stdin>", line 1, in <module>
           File "<stdin>", line 5, in test_sum
           File "/Users/aaronmeurer/anaconda3/lib/python3.7 1142, in wrapped_test
             raise the_error_hypothesis_found
           File "<stdin>", line 7, in test_sum
         AssertionError

     Hypothesis found an example that failed the test.

  8. Example property test (continued)

     Hypothesis found an example that failed the test (the falsifying example x=0.0, y=nan from the previous slide). We can fix it by doing one of three things:

     • Fix sum() to make the property hold (the function has a bug).
     • Fix the property to account for the bad example (the tested property is wrong).
     • Filter the bad example from the input strategies (we don’t care if this corner case invalidates the tested property).

  9. Example property test (continued)

     For this case, we might want to filter out NaNs, which looks like this:

         from hypothesis import given
         from hypothesis.strategies import floats

         def sum(x, y):
             return x + y

         @given(floats(allow_nan=False), floats(allow_nan=False))
         def test_sum(x, y):
             # sum() is commutative
             assert sum(x, y) == sum(y, x)
             # 0 is identity
             assert sum(x, 0) == x

         >>> test_sum()
         >>> # Passed

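     Two other general-purpose ways to filter examples (a sketch, not from the deck): assume() rejects examples from inside the test body, and .filter() constrains the strategy itself. Infinities are also excluded here, since inf + -inf is NaN.

         import math

         from hypothesis import assume, given
         from hypothesis.strategies import floats

         # Option 1: reject unwanted examples inside the test with assume().
         @given(floats(), floats())
         def test_sum_assume(x, y):
             assume(math.isfinite(x) and math.isfinite(y))
             assert x + y == y + x

         # Option 2: filter the strategy itself with .filter().
         finite = floats().filter(math.isfinite)

         @given(finite, finite)
         def test_sum_filter(x, y):
             assert x + y == y + x
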
 10. How Hypothesis works

     • Hypothesis is not the same as random testing.
     • Think of it as adversarial testing. It is random, but it also always tries examples that are known to trigger corner cases (like NaN in our example). This includes a database of previously known failures.
     • It’s similar to fuzzing, if you are familiar with that concept.*

     * The Hypothesis developers assert that “property-based testing” and “fuzzing” are not exactly the same thing.

 11. When does Hypothesis work well?

     • Any situation where the behavior of code is defined by easily testable properties.
       • Mathematical functions or mathematical-like objects (like arrays).
       • Even non-mathematical code has desirable properties that can be tested, e.g., it doesn’t raise exceptions, doesn’t take too long to run, etc.
     • Ideal situation: the code is defined by a spec or reference implementation, so you can write tests for exactly what the spec says: assert our_function(x) == reference_function(x) (see the sketch below).

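     A minimal sketch of the reference-implementation pattern (the insertion sort is a hypothetical function under test; the built-in sorted() plays the reference):

         from hypothesis import given
         from hypothesis.strategies import integers, lists

         def our_sort(xs):
             # Hypothetical hand-written insertion sort under test.
             result = list(xs)
             for i in range(1, len(result)):
                 j = i
                 while j > 0 and result[j - 1] > result[j]:
                     result[j - 1], result[j] = result[j], result[j - 1]
                     j -= 1
             return result

         @given(lists(integers()))
         def test_matches_reference(xs):
             # The property is exactly what the "spec" says.
             assert our_sort(xs) == sorted(xs)
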
 12. Benefits of Hypothesis

     • You will find bugs you wouldn’t have found otherwise (though your users might have).
     • Writing tests as high-level properties helps to keep your code consistent, even in corner cases.
     • Your tests become better the more they are run.

 13. Benefits of Hypothesis (continued)

     • Writing code against a Hypothesis test feels almost like solving a puzzle or playing a programming game: “That’s close, but you didn’t consider this corner case.”
     • You can write a test for code that isn’t written yet: just ignore NotImplementedError or omit cases from the input strategies. When you implement the code, the test will automatically start testing it (see the sketch below).
     • Shrinking: Hypothesis will automatically try to make a counterexample as “small” as possible. E.g., if a function has a bug with odd integers, it will give 1 as the failing example input instead of 5728175.

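     One way testing not-yet-written code can look (a sketch; frobnicate and its idempotence property are hypothetical):

         from hypothesis import given
         from hypothesis.strategies import integers

         def frobnicate(n):
             raise NotImplementedError  # hypothetical unimplemented function

         @given(integers())
         def test_frobnicate(n):
             try:
                 result = frobnicate(n)
             except NotImplementedError:
                 return  # silently pass until the code exists
             # Once implemented, this hypothetical property kicks in.
             assert frobnicate(result) == result
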
 14. Benefits of Hypothesis (continued)

     • Generally very user friendly:
       • Helpful error messages.
       • Verbosity settings to help you see what is going on internally (see the sketch below).
       • Built-in health checks that catch common errors and other problems.
       • Very helpful developer community.

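     For instance (not on the slide), verbosity can be raised per test with the settings decorator:

         from hypothesis import Verbosity, given, settings
         from hypothesis.strategies import integers

         # Print every example Hypothesis tries for this test.
         @settings(verbosity=Verbosity.verbose)
         @given(integers())
         def test_noisy(x):
             assert x == x
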
 15. Caveats of Hypothesis: writing property tests is hard

     (On the slides, the recurring section header shows “Downsides” struck through and replaced with “Caveats”.)

     • It’s not obvious what properties should hold for a given function.
     • What is the correct way to test things like equality (especially a problem for tests involving floats)? See the sketch below.
     • Hypothesis may help you find out that a property you thought should always be true actually isn’t.
     • You are forced to think, and actually understand the code you are testing. Raymond Hettinger has a great set of tweets on this topic: https://twitter.com/raymondh/status/1292548482109067265

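     One common answer to the float-equality question (a sketch, not from the deck) is to compare with a tolerance, e.g. math.isclose:

         import math

         from hypothesis import given
         from hypothesis.strategies import floats

         finite = floats(-1e6, 1e6)

         # Exact equality of (x + y) + z and x + (y + z) fails for floats,
         # so test associativity only up to a tolerance.
         @given(finite, finite, finite)
         def test_add_assoc_approx(x, y, z):
             assert math.isclose((x + y) + z, x + (y + z),
                                 rel_tol=1e-9, abs_tol=1e-9)
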
 16. Caveats of Hypothesis: all the usual caveats of random tests

     • Sometimes a failing example lies in only a very small part of the search space. You may not find it right away.
     • You might find a failure much later, on CI, in an unrelated PR.
     • But:
       • You can seed the tests with --hypothesis-seed or @seed.
       • You can add explicit examples with @example (see the sketch below).
       • Previous failures are recorded in a local example database, so they will show up in future test runs.
       • Hypothesis supplements, rather than replaces, classical tests.
       • These are exactly the sorts of bugs you want to find.

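     A sketch of @seed and @example in use (the test itself is hypothetical):

         from hypothesis import example, given, seed
         from hypothesis.strategies import floats

         @seed(12345)             # reproduce one particular random sequence
         @given(floats())
         @example(float("nan"))   # always try this known-tricky input
         def test_plus_zero(x):
             # NaN is the only float for which x != x holds.
             assert x + 0 == x or x != x
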
 17. Caveats of Hypothesis: upstream bugs

     • You will, not just might, find bugs in upstream dependencies.
     • Case in point: with ndindex I’ve found half a dozen bugs in NumPy. About half of those were not known to the NumPy developers before I reported them. Others were deprecated behavior.
     • You will have to work around these (or submit upstream fixes).
     • You will also, obviously, find bugs in your own code. This can make adding Hypothesis to an existing codebase difficult.

 18. Caveats of Hypothesis: the threshold problem

     • Sometimes shrinking makes a bug seem much less interesting than it actually is.
     • For example, if a test asserts abs(value) < 1e-15, Hypothesis may find an example where value == 1.000000000000001e-15. That doesn’t necessarily mean the 1e-15 threshold is only slightly too small; it just means Hypothesis made value as small as possible while still failing the test (see the demonstration below).
     • It also likes to find examples involving “trivial” cases, like empty lists, empty arrays, etc. Anything that isn’t relevant to the failure will be shrunk away.
     • You can easily be tricked into thinking a bug only occurs in empty or corner cases.
     • It’s just something to be aware of when using Hypothesis.
     • See https://hypothesis.works/articles/threshold-problem/ for more information.

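     A toy demonstration of the threshold problem (not from the deck; the failing bound is made up):

         from hypothesis import given
         from hypothesis.strategies import floats

         # Suppose the real error can be as large as 1.0, but the test asserts
         # a tight bound. Hypothesis will shrink the counterexample to sit just
         # barely above 1e-15, making a large bug look like a marginal one.
         @given(floats(min_value=0, max_value=1.0))
         def test_error_is_tiny(error):
             assert abs(error) < 1e-15
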
 19. Caveats of Hypothesis: search space

     • If the search space for the examples is too large, Hypothesis may have a hard time finding meaningful examples.
     • This can be fixed by building custom strategies that generate useful examples more often.
     • You can also use target(), which attempts to do this automatically via an optimization algorithm (see the sketch below).

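     A minimal sketch of target() (the test itself is hypothetical):

         from hypothesis import given, target
         from hypothesis.strategies import floats

         @given(floats(min_value=0, max_value=1000))
         def test_guided(x):
             # target() tells the engine to prefer examples that maximize
             # the given score, here steering generation toward large x.
             target(x)
             assert x < 999.99  # a failure plain random search may miss
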
 20. Caveats of Hypothesis: search space (continued)

     • Example: in ndindex, I have strategies that generate random indices idx and random arrays a to test a[idx]. If idx is a boolean array index, a[idx] is an IndexError unless its shape matches the shape of a.
     • If Hypothesis generates the two independently at random, the chance that they match is slim.
     • It wasn’t finding simple bugs, because even after running millions of examples it couldn’t generate any that didn’t trivially raise IndexError.
     • Solution: make the shape of idx depend on the shape of a using the shared() strategy (code on the next slide).

 21. Caveats of Hypothesis: search space (continued)

         from numpy import bool_

         from hypothesis.extra.numpy import arrays
         from hypothesis.strategies import composite, integers, shared, tuples

         # Shapes here are 1-tuples, simplified for the slide.
         shapes = tuples(integers(0, 10))
         shared_shapes = shared(shapes)

         @composite
         def subsequences(draw, sequence):
             # A contiguous subsequence of a drawn sequence.
             seq = draw(sequence)
             start = draw(integers(0, max(0, len(seq) - 1)))
             stop = draw(integers(start, len(seq)))
             return seq[start:stop]

         # Boolean mask indices whose shape is tied to the shared shape.
         boolean_arrays = arrays(bool_, subsequences(shared_shapes))

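     As a minimal illustration of what shared() guarantees (not from the deck): every use of a given shared strategy within a single example returns the same underlying drawn value.

         from hypothesis import given
         from hypothesis.strategies import integers, shared

         # Strategies shared under the same key see the same draw per example.
         n = shared(integers(1, 5), key="n")

         @given(n, n)
         def test_shared_draws_match(a, b):
             assert a == b  # both arguments come from the same single draw
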
 22. Caveats of Hypothesis: search space (continued)

     • The important thing to remember is that Hypothesis isn’t magic. It may take some work in more advanced cases to make it useful.
     • A Hypothesis test passing may mean your code works, or it may mean Hypothesis just hasn’t found the bugs in it (yet).
     • You should regularly “test” that your tests work (e.g., introduce a known bug and see if they catch it).
     • Hypothesis runs ~100 examples per test by default, but you can run many more by raising the max_examples setting (see the sketch below), e.g., run a million or a billion examples overnight to see if it finds anything.

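     A sketch of raising max_examples for a long run (the tested property here is a placeholder):

         from hypothesis import given, settings
         from hypothesis.strategies import integers

         # Run far more examples than the ~100 default, e.g. overnight.
         # deadline=None disables the per-example time limit for long runs.
         @settings(max_examples=1_000_000, deadline=None)
         @given(integers())
         def test_soak(x):
             assert x + 1 > x  # placeholder property
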
 23. Caveats of Hypothesis: hard-to-ignore bugs

     • Sometimes Hypothesis finds a bug that you don’t really care about, or at least don’t want to deal with right now.
     • But if an example fails a test, Hypothesis will keep finding it. The only way to quiet it is to either fix the bug or modify the test or strategies to ignore certain cases (in other words, make the test “wrong”). It’s not like traditional tests, where you can just XFAIL the specific test case and ignore it.
     • This is ultimately a good thing: it leads to correct code. But it can make Hypothesis difficult to use, especially for already-existing codebases.