Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Experiences debugging memory bloat and high CPU...

Pycon ZA
October 07, 2016

Experiences debugging memory bloat and high CPU consumption in python by Alexandre Hardy

This talk is targeted at python developers who develop long running services, which are susceptible to memory issues or unacceptable CPU usage (as determined by the developer or operations teams). We focus on debugging techniques that we have used in constrained environments (production like environments) where installation of additional software packages is not permitted, and techniques which we were able to use to debug a python process without restarting it (to avoid state loss).

Topics covered in the talk: 1. An example debugging session in which memory bloat is diagnosed. 2. An example debugging session in which high CPU usage is diagnosed. 3. Tools which can be used in debugging these issues, and their limitations. 4. Some thoughts about improving how we think about python, and assumptions made by python developers. 5. Projects we are (slowly) working on to help us debug python processes in production, or production like environments.

The audience should learn about some useful standard python libraries which can be used to debug memory and CPU usage related issues, and how they were effectively used to solve problems in an enterprise product. The audience will also be encouraged to think differently about python programming, and encouraged to think about what it means to select python (or any other programming language) for a given task.

Pycon ZA

October 07, 2016
Tweet

More Decks by Pycon ZA

Other Decks in Programming

Transcript

  1. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Experiences debugging memory bloat and high CPU consumption in python Alexandre Hardy [email protected] Oracle Cloud Development Centre, Cape Town PyConZA 2016 Alexandre [email protected] Python GC and CPU debugging
  2. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Table of Contents 1 Introduction 2 A RESTful service with problems 3 Work in progress 4 Conclusion 5 Questions? Alexandre [email protected] Python GC and CPU debugging
  3. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Introduction ... or what we will present Simplified debugging examples based on real debugging sessions. This is not production code, and serves only to instruct. I am not a python memory management expert, so there will be some hand waving and avoidance of detail. Mistakes are made on purpose (mostly). Debug memory leak. Debug memory fragmentation. Debug reference cycles. Debug high CPU (tight loop actually) issue. Lessons learned. Alexandre [email protected] Python GC and CPU debugging
  4. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Why do we care? Originally we had a control plane co-existing with VMs. We want the memory for VMs. A large memory footprint can impact other operations (for example copying page tables when forking the process). The process can become slower due to the large memory footprint (cache coherence?). The Twisted HTTP server did not have a graceful shutdown. Requests were terminated abruptly. (So we don’t really want to kill the process to rectify the situation). Alexandre [email protected] Python GC and CPU debugging
  5. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Production (like) restrictions All packages must be vetted by a security team, and other organizational structures. Customer experience is paramount, service disruption for debugging is not acceptable. Our QA environments are setup with similar restrictions, since other teams are our customers, and we need to give them a production experience. Luckily in QA environments we are given more time to debug. Alexandre [email protected] Python GC and CPU debugging
  6. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Frameworks used Python 2.7 (we want specific memory management behaviour for the demos) Twisted framework 16.0.0 (asynchronous IO etc). Twisted manhole. Python’s own libraries are used for debugging. Alexandre [email protected] Python GC and CPU debugging
  7. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Table of Contents 1 Introduction 2 A RESTful service with problems 3 Work in progress 4 Conclusion 5 Questions? Alexandre [email protected] Python GC and CPU debugging
  8. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? A RESTful service to tell us about PyConZA participants ... to the code! Get it at https://github.com/alexandrehardy/ahardy pyconza2016 gc1.py model.py ... let’s send a couple of 100 requests to the service. Alexandre [email protected] Python GC and CPU debugging
  9. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Debugging a memory leak ... using third party tools (Guppy) Guppy / heapy for computing heap deltas (https://pypi.python.org/pypi/guppy/) >>> from guppy import hpy; hp=hpy() >>> before = hp.heap() >>> after = hp.heap() >>> leak = after - before >>> leak Partition of a set of 48477 objects. Total size = 3265516 bytes. Index Count % Size % Cumulative % Kind (class / dict of class) 0 25773 53 1612820 49 1612820 49 str 1 11699 24 483960 15 2096780 64 tuple 2 174 0 241584 7 2338364 72 dict of module 3 3478 7 222592 7 2560956 78 types.CodeType 4 3296 7 184576 6 2745532 84 function 5 401 1 175112 5 2920644 89 dict of class 6 108 0 81888 3 3002532 92 dict (no owner) 7 114 0 79632 2 3082164 94 dict of type 8 117 0 51336 2 3133500 96 type 9 667 1 24012 1 3157512 97 __builtin__.wrapper_descriptor <76 more rows. Type e.g. ’_.more’ to view.> Alexandre [email protected] Python GC and CPU debugging
  10. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Debugging a memory leak ... using third party tools (objgraph) Objgraph for graphing reference chains (https://pypi.python.org/pypi/objgraph) >>> # Forward references >>> import objgraph >>> x = [] >>> y = [x, [x], dict(x=x)] >>> objgraph.show_refs([y], filename=’sample-graph.png’) Graph written to ....dot (... nodes) Image generated as sample-graph.png >>> # Backward references >>> objgraph.show_backrefs([x], filename=’sample-backref-graph.png’) ... Graph written to ....dot (8 nodes) Image generated as sample-backref-graph.png Alexandre [email protected] Python GC and CPU debugging
  11. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Debugging a memory leak I ... using the python gc module Python 2.6 / 2.7 uses reference counting with generational collection less frequently. First guess: reference is not zero, so not garbage collected. Introducing our own heap delta analysis module: debug.py Alexandre [email protected] Python GC and CPU debugging
  12. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Debugging a memory leak II ... using the python gc module Alexandre [email protected] Python GC and CPU debugging
  13. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Debugging a memory leak III ... using the python gc module Alexandre [email protected] Python GC and CPU debugging
  14. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Debugging a memory leak IV ... using the python gc module Alexandre [email protected] Python GC and CPU debugging
  15. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Let’s find the leak ... Alexandre [email protected] Python GC and CPU debugging
  16. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? And now fix the leak ... I model.py @@ -7,20 +7,13 @@ database = ’db’ table = ’table’ - _field_cache = {} def __init__(self): pass - @staticmethod - def get_fields(cls): - if cls not in DBModel._field_cache: - DBModel._field_cache[cls] = cls.get_fields() - return DBModel._field_cache[cls] - @classmethod def get(cls, identifier): result = cls() deferred = defer.Deferred() - fields = DBModel.get_fields(cls) + fields = cls.get_fields() for field in fields: @@ -33,5 +26,5 @@ Alexandre [email protected] Python GC and CPU debugging
  17. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? And now fix the leak ... II def to_primitive(self): - fields = DBModel.get_fields(self.__class__) + fields = self.get_fields() primitive = {} for field in fields: @@ -41,5 +34,5 @@ class Attendee(DBModel): - database = ’pyconza’ + database = ’pyconza2016’ table = ’attendees’ fields = { Alexandre [email protected] Python GC and CPU debugging
  18. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? The leak is now fixed, but ... I ... we’ve added something and the RSS is still high! gc2.py @@ -46,4 +46,20 @@ [email protected] +def audit(): + work = [] + i = 0 + while True: + i = i + 1 + work.append(’work %s’ % i) + deferred = defer.Deferred() + reactor.callLater(0.0001, deferred.callback, True) + yield deferred + + if len(work) > 10000: + # remove every second item + work = work[::2] + + if __name__ == ’__main__’: site = server.Site(Demo()) @@ -53,4 +69,5 @@ manhole.password = ’demo’ Alexandre [email protected] Python GC and CPU debugging
  19. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? The leak is now fixed, but ... II ... we’ve added something and the RSS is still high! reactor.listenTCP(20000, manhole) + reactor.callLater(0, audit) print ’Running web server, pid=%s’ % os.getpid() open(’gc.pid’, ’w’).write(’%s’ % os.getpid()) Alexandre [email protected] Python GC and CPU debugging
  20. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Analyzing memory layout Alexandre [email protected] Python GC and CPU debugging
  21. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Analyzing memory layout I ... heat maps heatmap.py: Sort of. Visualize memory usage. Fragmentation. Let’s reduce fragmentation. Alexandre [email protected] Python GC and CPU debugging
  22. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Analyzing memory layout ... where do the pointers come from? Alexandre [email protected] Python GC and CPU debugging
  23. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Reduce fragmentation ... I gc2.py @@ -25,16 +25,31 @@ return NewAttendee.get(registration_number) + def get_db_cursor(self): + # Pretend to be a cursor! + for registration in xrange(0, 1000): + yield self.get_attendee_from_db(registration) + @defer.inlineCallbacks def get_attendees(self): - attendees = [(yield self.get_attendee_from_db(registration)) for registration in xrange(0, 1000)] - defer.returnValue(attendees) + # Pretend to contact the database, need this to be a generator + yield True + defer.returnValue(self.get_db_cursor()) @defer.inlineCallbacks def render_list(self, request): db_results = yield self.get_attendees() - self.log(’Got results %s’ % len(db_results)) - results = [r.to_primitive() for r in db_results] - self.log(’Shipped results %s’ % len(results)) - request.write(json.dumps(results, indent=4)) + self.log(’Got results’) Alexandre [email protected] Python GC and CPU debugging
  24. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Reduce fragmentation ... II + request.write(’[\n’) + + # Get the first value. We should be checking for StopIteration here ... + first = yield db_results.next() + request.write(json.dumps(first.to_primitive(), indent=4)) + + for deferred_result in db_results: + result = yield deferred_result + request.write(’,\n%s’ % json.dumps(result.to_primitive(), indent=4)) + + request.write(’\n]\n’) + self.log(’Shipped results’) request.finish() self.log(’All done’) Alexandre [email protected] Python GC and CPU debugging
  25. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Affect of circular references Easier than you might think... (mro) Harder to reproduce than you might think (I took the easy way out ...) Alexandre [email protected] Python GC and CPU debugging
  26. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Third example I gc3.py @@ -1,2 +1,3 @@ +from copy import deepcopy import json import os @@ -11,4 +12,31 @@ +class Address(object): + def __init__(self, attendee): + self.attendee = attendee + + def __del__(self): + # Remove a reference to the parent + self.attendee = None + + +def AttendeeWithDiet(diet=[’none’, ’vegetarian’]): + class AttendeeWithDietClass(Attendee): + fields = dict(dietary_preference=[’none’, ’vegetarian’], **deepcopy(Attendee.fields)) + + def __init__(self): + self.address = Address(self) + + @classmethod Alexandre [email protected] Python GC and CPU debugging
  27. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Third example II + def get_fields(cls): + return AttendeeWithDietClass.fields + + def __del__(self): + # Remove the reference to the fields we have + self.address = None + + return AttendeeWithDietClass + + class Demo(resource.Resource): isLeaf = True @@ -21,7 +49,5 @@ def get_attendee_from_db(self, registration_number): - class NewAttendee(Attendee): - database = ’pyconza2016’ - return NewAttendee.get(registration_number) + return AttendeeWithDiet().get(registration_number) def get_db_cursor(self): @@ -62,17 +88,12 @@ @defer.inlineCallbacks -def audit(): - work = [] - i = 0 Alexandre [email protected] Python GC and CPU debugging
  28. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Third example III +def audit(rate): while True: - i = i + 1 - work.append(’work %s’ % i) + to_audit = [AttendeeWithDiet().get(i) for i in xrange(0, 1000)] deferred = defer.Deferred() - reactor.callLater(0.0001, deferred.callback, True) + reactor.callLater(rate, deferred.callback, True) yield deferred - - if len(work) > 10000: - # remove every second item - work = work[::2] + for entry in to_audit: + entry.done = True @@ -84,5 +105,5 @@ manhole.password = ’demo’ reactor.listenTCP(20000, manhole) - reactor.callLater(0, audit) + reactor.callLater(0, audit, 0.1) print ’Running web server, pid=%s’ % os.getpid() open(’gc.pid’, ’w’).write(’%s’ % os.getpid()) Alexandre [email protected] Python GC and CPU debugging
  29. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Finding circular references Alexandre [email protected] Python GC and CPU debugging
  30. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Tracing a python program ... because we can So what other modules does python provide that we can use for debugging? highcpu.py strace doesn’t offer too much information. trace.py - python can tell us exactly which functions are invoked. Alexandre [email protected] Python GC and CPU debugging
  31. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Tight loop ... I highcpu.py @@ -1,2 +1,3 @@ +from copy import deepcopy import json import os @@ -9,4 +10,5 @@ from model import Attendee +from trace import install_trace_handler @@ -21,7 +23,5 @@ def get_attendee_from_db(self, registration_number): - class NewAttendee(Attendee): - database = ’pyconza2016’ - return NewAttendee.get(registration_number) + return Attendee.get(registration_number) def get_db_cursor(self): @@ -36,4 +36,7 @@ defer.returnValue(self.get_db_cursor()) + def include_result(self, result): + return False + Alexandre [email protected] Python GC and CPU debugging
  32. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Tight loop ... II @defer.inlineCallbacks def render_list(self, request): @@ -42,11 +45,16 @@ request.write(’[\n’) - # Get the first value. We should be checking for StopIteration here ... - first = yield db_results.next() - request.write(json.dumps(first.to_primitive(), indent=4)) - - for deferred_result in db_results: - result = yield deferred_result - request.write(’,\n%s’ % json.dumps(result.to_primitive(), indent=4)) + done = False + all_results = [] + next_item = yield db_results.next() + while not done: + if self.include_result(next_item): + all_results.append(next_item) + # Forgot to get a new value for next_item + # A loop without a function call won’t be picked up by tracing. + # Tracing only picks up function calls. + + for item in all_results: + request.write(json.dumps(item.to_primitive(), indent=4)) request.write(’\n]\n’) @@ -86,5 +94,6 @@ Alexandre [email protected] Python GC and CPU debugging
  33. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Tight loop ... III reactor.callLater(0, audit) print ’Running web server, pid=%s’ % os.getpid() - open(’gc.pid’, ’w’).write(’%s’ % os.getpid()) + open(’cpu.pid’, ’w’).write(’%s’ % os.getpid()) + install_trace_handler() reactor.run() print ’Stopped!’ Alexandre [email protected] Python GC and CPU debugging
  34. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Tracing a python program Alexandre [email protected] Python GC and CPU debugging
  35. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Table of Contents 1 Introduction 2 A RESTful service with problems 3 Work in progress 4 Conclusion 5 Questions? Alexandre [email protected] Python GC and CPU debugging
  36. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Debug tools in progress / complete Turn debug logging on (and off) without service restart. Turn the twisted manhole on (and off) without service restart. Turn on (and off) process tracing with an intermittent consumer (unix pipe). Python memory dump (for when all else fails) with analysis tools. ... all with security in mind. Alexandre [email protected] Python GC and CPU debugging
  37. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Table of Contents 1 Introduction 2 A RESTful service with problems 3 Work in progress 4 Conclusion 5 Questions? Alexandre [email protected] Python GC and CPU debugging
  38. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Lessons learned You still have to understand memory management in garbage collected languages. Garbage collector implementations can be suprising. The advantage of garbage collection is that you can write a program quickly, with minimal concerns. The cost of garbage collection is that you no longer have an explicit way to control memory management. You sometimes have to convince the garbage collection to behave the way you want it to. A decent understanding is required of any programming language selected for a task. Alexandre [email protected] Python GC and CPU debugging
  39. Introduction A RESTful service with problems Work in progress Conclusion

    Questions? Table of Contents 1 Introduction 2 A RESTful service with problems 3 Work in progress 4 Conclusion 5 Questions? Alexandre [email protected] Python GC and CPU debugging