Testing in production: A Gentle Walk on Software Monitoring Research

Testing in production: A Gentle Walk on Software Monitoring Research

"Testing in production" used to be a joke among developers. However, given the complexity of the large and distributed systems that take care of important parts of our lives, "testing in development", or, in other words, prevention, might not be enough anymore. In this talk, I'll discuss the importance of systems monitoring, logging, and log analysis to modern software systems. I'll reflect on the current state-of-the-art in industry and research fields, as well as the current open challenges. A great part of this talk is based on the research we conducted at Adyen, a large-scale payment company, that serves companies such as Facebook, Uber, and Spotify.

77cc77ccdaf329705ce14a9d3ff1e8a8?s=128

Mauricio Aniche

November 22, 2019
Tweet

Transcript

  1. Maurício F. Aniche M.F.Aniche@tudelft.nl @mauricioaniche “Testing in Production”: A gentle

    walk on software monitoring research
  2. https://xkcd.com/1428/ Contemporary (tech) companies: • Move fast • Are highly

    complex
  3. None
  4. Monitoring as a first-class citizen!

  5. None
  6. • 3,500 customers, including eBay, Uber, Spotify, and Facebook. •

    In 2015, Adyen achieved a valuation of $2.3 billion, making it the sixth largest European unicorn, and the largest Dutch unicorn. • In April 2017, the company was granted a European banking license. • 1000+ employees from 80+ countries. • Adyen has processed more than 159 billion euros in payments.
  7. 1 1 1 2 1 2 1 1 1 3

    1 1 5 1 6 3 1 5 2 4 2 8 1 4 2 2 6 1 3 6 5 2 5 5 1 '17 '16 '09 '18 '12 '14 '03 '04 '13 '15 '11 '10 '07 '05 '08 '98 '00 '06 '92 Year Publication Type industry track paper journal research track paper
  8. /RJ 3URFHVVRU /RJ,QIUDVWUXFWXUH 5XQQLQJ$SSOLFDWLRQ %XLOGDQG'HSOR\PHQW ½»¼ÄŠ»ÀŠ½¾¼Àˆ»¿   @O>CDIB?<« ½»¼ÄŠ»ÀŠ½¾¼Àˆ»¿I<=G@OJA«

    ½»¼ÄŠ»ÀŠ½¾¼Àˆ»¿  @NOJMDIBN« /RJGDWD ORJ JHQHUDWLRQ ([HFXWLRQ(QYLURQPHQW 6RXUFHFRGH ORJ WUDQVPLVVLRQ 6WRUDJH 23V(QJLQHHU /RJ(QJLQHHULQJ /RJ$QDO\VLV 'DVKERDUGV 'HYHORSHU @NKJIN@?<O<Ó« DAŸ?<O<ÓÓIPGG £ JB‰@MMJMŸ¯I<=G@OJA« /RJ3ODWIRUPV 4XHU\ £¯HNB¯ˆ¯I<=G@OJA@O« ¯ODH@NO<HK¯ˆ¯½»¼ÄŠ»À« ¯N@Q@MDOT¯ˆ¯¯¤ /RJ*HQHUDWLRQ /RJ&RQVXPSWLRQ Jeanderson Candido, Maurício Aniche, Arie van Deursen. Contemporary Software Monitoring: A Systematic Literature Review. 2019. In submission (Transactions on Software Engineering).
  9. Contemporary Logging Framework Log Engineering Log Infrastructure Log Analysis Log

    Platforms Anti-patterns in logging code Implementation of log statements Empirical studies Parsing Storage Anomaly detection Security and privacy Root cause analysis Failure prediction Software testing Model inference and invariant mining Reliability and dependability End-to-end analysis tools
  10. Log Engineering • Deals with the logging decisions from the

    development side. • What to log? • Where to log? • How severe should it be? • 3 – 6% of code are related to logs. • Those decisions matter a lot in practice! • Research focus has been: • Anti-patterns in logging code • Implementation of log statements • Empirical Studies in Log Engineering /RJ 3URFHVVRU /RJ,QIUDVWUXFWXUH 5XQQLQJ$SSOLFDWLRQ %XLOGDQG'HSOR\PHQW ½»¼ÄŠ»ÀŠ½¾¼Àˆ»¿   @O>CDIB?<« ½»¼ÄŠ»ÀŠ½¾¼Àˆ»¿I<=G@OJA« ½»¼ÄŠ»ÀŠ½¾¼Àˆ»¿  @NOJMDIBN« /RJGDWD ORJ JHQHUDWLRQ ([HFXWLRQ(QYLURQPHQW 6RXUFHFRGH ORJ WUDQVPLVVLRQ 6WRUDJH 23V(QJLQHHU /RJ(QJLQHHULQJ /RJ$QDO\VLV 'DVKERDUGV 'HYHORSHU @NKJIN@?<O<Ó« DAŸ?<O<ÓÓIPGG £ JB‰@MMJMŸ¯I<=G@OJA« /RJ3ODWIRUPV 4XHU\ £¯HNB¯ˆ¯I<=G@OJA@O« ¯ODH@NO<HK¯ˆ¯½»¼ÄŠ»À« ¯N@Q@MDOT¯ˆ¯¯¤ /RJ*HQHUDWLRQ /RJ&RQVXPSWLRQ
  11. Common log engineering problems • Improper log message • Improper

    log levels • Runtime issues (e.g., logging a NULL variable) • Configuration errors • Missing log statements • Overwhelming volume of data Shang, W., Nagappan, M., Hassan, A. E., & Jiang, Z. M. (2014, September). Understanding log lines using development knowledge. In 2014 IEEE International Conference on Software Maintenance and Evolution (pp. 21-30). IEEE. Hassani, M., Shang, W., Shihab, E., & Tsantalis, N. (2018). Studying and detecting log-related issues. Empirical Software Engineering, 23(6), 3248-3280.
  12. Can ML help? Can we learn ‘where’ to log? Zhu,

    J., He, P., Fu, Q., Zhang, H., Lyu, M. R., & Zhang, D. (2015, May). Learning to log: Helping developers make informed logging decisions. In Proceedings of the 37th International Conference on Software Engineering-Volume 1 (pp. 415-425). IEEE Press.
  13. Can DL help? • Can we learn not only from

    the structure, but also from the semantics of code? • Like DeepBugs or Code2Vec. • First exploration shows that a “simple language models” won’t work. (Best we got was balanced accuracy=.71, recall=.48, but this was a first exploration by our MSc students; more to come)
  14. Log Infrastructure • The tool support necessary to make the

    further analysis feasible. • Logs are highly unstructured and in massive quantities. • Research has been focusing on: • Log Parsing • Log Storage and Compression • We need more work here! /RJ 3URFHVVRU /RJ,QIUDVWUXFWXUH 5XQQLQJ$SSOLFDWLRQ %XLOGDQG'HSOR\PHQW ½»¼ÄŠ»ÀŠ½¾¼Àˆ»¿   @O>CDIB?<« ½»¼ÄŠ»ÀŠ½¾¼Àˆ»¿I<=G@OJA« ½»¼ÄŠ»ÀŠ½¾¼Àˆ»¿  @NOJMDIBN« /RJGDWD ORJ JHQHUDWLRQ ([HFXWLRQ(QYLURQPHQW 6RXUFHFRGH ORJ WUDQVPLVVLRQ 6WRUDJH 23V(QJLQHHU /RJ(QJLQHHULQJ /RJ$QDO\VLV 'DVKERDUGV 'HYHORSHU @NKJIN@?<O<Ó« DAŸ?<O<ÓÓIPGG £ JB‰@MMJMŸ¯I<=G@OJA« /RJ3ODWIRUPV 4XHU\ £¯HNB¯ˆ¯I<=G@OJA@O« ¯ODH@NO<HK¯ˆ¯½»¼ÄŠ»À« ¯N@Q@MDOT¯ˆ¯¯¤ /RJ*HQHUDWLRQ /RJ&RQVXPSWLRQ
  15. log.info(“Customer “ + customer + “ paying ” + paymentValue);

    [2019-02-03 15:43:24] [MagicPayment.java][L456] Customer Maurício paying 235.67 Class name Line number Schipper, D., Aniche, M., & van Deursen, A. (2019). Tracing Back Log Data to its Log Statement: From Research to Practice. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR): Proceedings (pp. 545-549). [8816773] IEEE. Short paper.
  16. None
  17. In practice… • Large companies can’t really log the class

    and the line number that originates a log message. • Developers ‘grep’ the source code • Can we automatically detect where origin of the log lines? • W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 117–132. ACM, 2009.
  18. log.info(“Customer “ + customer + “ ID “ + id);

    à Customer .* ID \d*
  19. RQ1: Accuracy • We collect 100k messages from a week

    day. • These 100k messages pointed to 676 different locations in the source code. • 95% CL, 5% CI sample = 245 links. • Manual investigation = 97.6% accuracy (239 out of 245 correct links) • 99.8% accuracy in the original paper (two projects: HDFS and Darkstar, millions of log lines, # of log statements not specified).
  20. RQ2: Performance • Related work: not really reported Building ASTs

    of the entire codebase is the most expensive part of the process.
  21. When does it fail? • JSON-based logs • The developer

    logs a JSON (that is created in runtime) and thus the log statement is simply “log.info(json)”, making our template inaccurate. • Unknown logging method • Developers create their own logging methods, which our tool can't recognize. • Log strings created on-the-fly • Some log messages are too complex and developers create them by means of multiple line of code (e.g. String log = “content1”; log = log “content 2”, ...), which makes the analysis harder. • Related work: “Almost all of these (failed) messages contain long string variables.” => same for us.
  22. Logs of Complex API Integrations • Payment APIs are complex

    • Integration faults are easily made • Merchant needs assistance with API usage • Merchant may not notice mistakes • 2.5M http error responses per month (back in 2016!) • What can we learn from them? 22 Aué, J., Aniche, M., Lobbezoo, M., & van Deursen, A. (2018). An Exploratory Study on Faults in Web API Integration in a Large-Scale Payment Company. In ICSE-SEIP '18: 40th International Conference on Software Engineering: Software Engineering in Practice Track (pp. 13-22). Association for Computing Machinery (ACM).
  23. 11 Common Causes for API Error Reponses

  24. 11 Common Causes for API Error Reponses Integrators are the

    main responsible for API integration problems!
  25. 11 Common Causes for API Error Reponses Maybe these are

    the exceptions one cares more about?
  26. Log Analysis • Knowledge acquisition from log data for a

    specific purpose. • The most explored area in the topic: • Anomaly detection • Security and privacy • Root cause analysis • Failure prediction • Software testing • Model inference and invariant mining • Reliability and dependability /RJ 3URFHVVRU /RJ,QIUDVWUXFWXUH 5XQQLQJ$SSOLFDWLRQ %XLOGDQG'HSOR\PHQW ½»¼ÄŠ»ÀŠ½¾¼Àˆ»¿   @O>CDIB?<« ½»¼ÄŠ»ÀŠ½¾¼Àˆ»¿I<=G@OJA« ½»¼ÄŠ»ÀŠ½¾¼Àˆ»¿  @NOJMDIBN« /RJGDWD ORJ JHQHUDWLRQ ([HFXWLRQ(QYLURQPHQW 6RXUFHFRGH ORJ WUDQVPLVVLRQ 6WRUDJH 23V(QJLQHHU /RJ(QJLQHHULQJ /RJ$QDO\VLV 'DVKERDUGV 'HYHORSHU @NKJIN@?<O<Ó« DAŸ?<O<ÓÓIPGG £ JB‰@MMJMŸ¯I<=G@OJA« /RJ3ODWIRUPV 4XHU\ £¯HNB¯ˆ¯I<=G@OJA@O« ¯ODH@NO<HK¯ˆ¯½»¼ÄŠ»À« ¯N@Q@MDOT¯ˆ¯¯¤ /RJ*HQHUDWLRQ /RJ&RQVXPSWLRQ
  27. Open/data challenges • An avalanche of different techniques • Supervised

    and unsupervised ML, passive learning and state machines, graphs, ... • More recently, deep learning and NLP techniques. • There is no single technique that works everywhere… Gunter, D., Tierney, B. L., Brown, A., Swany, M., Bresnahan, J., & Schopf, J. M. (2007, September). Log summarization and anomaly detection for troubleshooting distributed systems. In 2007 8th IEEE/ACM International Conference on Grid Computing (pp. 226-234). IEEE. • Lack of more representative datasets • Lots of HPC logs • No industry/enterprise logs • Starting point: https://github.com/logpai/loghub • Building labeled datasets
  28. Beschastnikh, Ivan, et al. "Inferring models of concurrent systems from

    logs of their behavior with CSight." Proceedings of the 36th International Conference on Software Engineering. ACM, 2014. Beschastnikh, I., Brun, Y., Ernst, M. D., Krishnamurthy, A., & Anderson, T. E. (2012). Mining temporal invariants from partially ordered logs. ACM SIGOPS Operating Systems Review, 45(3), 39-46. Model inference
  29. Passive learning Identifying system behavior from observations, and representing it

    in the smallest possible model. 20170101160001 Adyen version: ****** 20170101160002 Starting TX/amt=10001/currency=978 20170101160003 Starting EMV 20170101160004 EMV started 20170101160005 Magswipe opened 20170101160006 CTLS started 20170101160007 Transaction initialised 20170101160008 Run TX as EMV transaction 20170101160009 Application selected app:****** 20170101160010 read_application_data succeeded 20170101160011 data_authentication succeeded 20170101160012 validate 0 20170101160013 DCC rejected 20170101160014 terminal_risk_management succeeded 20170101160015 verify_card_holder succeeded 20170101160016 generate_first_ac succeeded 20170101160017 Authorizing online 20170101160018 Data returned by the host succeeded 20170101160019 Transaction authorized by card 20170101160020 Approved receipt printed 20170101160021 pos_result_code:APPROVED 20170101160022 Final status: Approved 20170101160001 Adyen version: ****** 20170101160002 Starting TX/amt=10001/currency=978 20170101160003 Starting EMV 20170101160004 EMV started 20170101160005 Magswipe opened 20170101160006 CTLS started 20170101160007 Transaction initialised 20170101160008 Run TX as EMV transaction 20170101160009 Application selected app:****** 20170101160010 read_application_data succeeded 20170101160011 data_authentication succeeded 20170101160012 validate 0 20170101160013 DCC rejected 20170101160014 terminal_risk_management succeeded 20170101160015 verify_card_holder succeeded 20170101160016 generate_first_ac succeeded 20170101160017 Authorizing online 20170101160018 Data returned by the host succeeded 20170101160019 Transaction authorized by card 20170101160020 Approved receipt printed 20170101160021 pos_result_code:APPROVED 20170101160022 Final status: Approved 20170101160001 Adyen version: ****** 20170101160002 Starting TX/amt=10001/currency=978 20170101160003 Starting EMV 20170101160004 EMV started 20170101160005 Magswipe opened 20170101160006 CTLS started 20170101160007 Transaction initialised 20170101160008 Run TX as EMV transaction 20170101160009 Application selected app:****** 20170101160010 read_application_data succeeded 20170101160011 data_authentication succeeded 20170101160012 validate 0 20170101160013 DCC rejected 20170101160014 terminal_risk_management succeeded 20170101160015 verify_card_holder succeeded 20170101160016 generate_first_ac succeeded 20170101160017 Authorizing online 20170101160018 Data returned by the host succeeded 20170101160019 Transaction authorized by card 20170101160020 Approved receipt printed 20170101160021 pos_result_code:APPROVED 20170101160022 Final status: Approved 20170101160001 Adyen version: ****** 20170101160002 Starting TX/amt=10001/currency=978 20170101160003 Starting EMV 20170101160004 EMV started 20170101160005 Magswipe opened 20170101160006 CTLS started 20170101160007 Transaction initialised 20170101160008 Run TX as EMV transaction 20170101160009 Application selected app:****** 20170101160010 read_application_data succeeded 20170101160011 data_authentication succeeded 20170101160012 validate 0 20170101160013 DCC rejected 20170101160014 terminal_risk_management succeeded 20170101160015 verify_card_holder succeeded 20170101160016 generate_first_ac succeeded 20170101160017 Authorizing online 20170101160018 Data returned by the host succeeded 20170101160019 Transaction authorized by card 20170101160020 Approved receipt printed 20170101160021 pos_result_code:APPROVED 20170101160022 Final status: Approved 20170101160001 Adyen version: ****** 20170101160002 Starting TX/amt=10001/currency=978 20170101160003 Starting EMV 20170101160004 EMV started 20170101160005 Magswipe opened 20170101160006 CTLS started 20170101160007 Transaction initialised 20170101160008 Run TX as EMV transaction 20170101160009 Application selected app:****** 20170101160010 read_application_data succeeded 20170101160011 data_authentication succeeded 20170101160012 validate 0 20170101160013 DCC rejected 20170101160014 terminal_risk_management succeeded 20170101160015 verify_card_holder succeeded 20170101160016 generate_first_ac succeeded 20170101160017 Authorizing online 20170101160018 Data returned by the host succeeded 20170101160019 Transaction authorized by card 20170101160020 Approved receipt printed 20170101160021 pos_result_code:APPROVED 20170101160022 Final status: Approved https://automatonlearning.net/ DFASAT / FlexFringe Heule & Verwer, ICGI 2010 Rick Wieman, Maurício Aniche, Willem Lobbezoo, Sicco Verwer and Arie van Deursen. An Experience Report on Applying Passive Learning in a Large-Scale Payment Company. ICSME Industry Track, 2017
  30. Use Inferred Models to Analyze: Bugs in Test Phase •

    Terminal asked for PIN • AND asked for signature • Domain expert noted this unwanted behavior in inferred model. • Fixed before it went into production
  31. Use Inferred Models to Analyze: Differences Between Card Brands Twice

    as many chip errors Informed merchant about issue.
  32. Use Inferred Models to Analyze: Time out problems Timeout Improved

    performance under network instability by adding targeted retry mechanism
  33. Scalability is needed! Distributed system Log collector Message Broker Trace

    aggregator Prefix tree generator Prefix tree aggregator Prefix tree database Collects the logs generated by the many systems Aggregates traces in prefix trees Aggregates individual trees in a complete one Groups log events per trace Rick Proost’s MSc thesis. Under development. Next steps: • Implementation (almost CHECK) • Evaluation • Case study • Future work: State merging (tricky!!)
  34. Domain-based Fuzzing with Equivalence Partitioning for Supervised Learning of Anomaly

    Detection in a Cyber Physical System. Herman Nuradhy Wijaya. MSc Thesis. SUTD, 2019. Short paper under development. • Hackers might get control of the sensors • Domain testing to determine attack models • Data: • Normal traces: 4.5h of normal execution • Abnormal traces: 3 hours of 30s attacks + 60s of rest • Feature vector = 90 sec • Supervised Machine Learning • Extra Trees, accuracy of 99.1% • Lots of future work to be done! SUTD’s testbed Cyber Physical Systems (CPS)
  35. Zhang, X., Xu, Y., Lin, Q., Qiao, B., Zhang, H.,

    Dang, Y., ... & Chen, J. (2019, August). Robust log-based anomaly detection on unstable log data. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 807-817). ACM. The state-of-the- art, IMHO. (I like that it tackles the reality that log lines evolve…)
  36. Log Platforms • Focus on the integration of all the

    dimensions, in an end-to-end log analysis tool. • Not so many papers here. • Quite tool focused.
  37. Monitoring-aware IDEs Winter, J., Aniche, M., Cito, J., & van

    Deursen, A. (2019). Monitoring-aware IDEs. In ESEC/FSE 2019 : Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 420-431). Association for Computing Machinery (ACM). How it is How it should be
  38. A Monitoring-Aware IDE prototype • Timely Integrated Feedback • Activity

    of log statements in production • Traceability • Link from monitoring to source • Link from source to monitoring • Search Capability • Searchable monitoring information YRLGPHWKRG ^  ORJHUURU HUURURFFXUUHGYDU  LI  ^  ORJZDUQ ZDUQYDU   ` WKURZQHZ5XQWLPH([FHSWLRQ  ` N N N $MDYDNQHZH[FHSWLRQV %MDYDNQHZH[FHSWLRQV &MDYDNQHZH[FHSWLRQV 7IEVGLEFPI QSRMXSVMRK MRJSVQEXMSRSJXLI IRXMVIWSJX[EVI W]WXIQ 5YMGOP]MRHMGEXIW LS[SJXIREPSK WXEXIQIRXSV I\GITXMSRLETTIRW MRTVSHYGXMSR (IXEMPIHMRJSVQEXMSR EFSYXXLEXWXEXIQIRX 0EWXLSYVW  ?+SXSXLIQSRMXSVMRKA
  39. 0 200 400 600 P1 P2 P3 P4 P5 P6

    P7 P8 P9 P10 P11 P12 Participant Number of files with monitoring information Week 4 Week 3 Week 2 Week 1 Developers use it!
  40. 0 5 10 15 20 25 P1 P2 P3 P4

    P5 P6 P7 P8 P9 P10 P11 P12 Participant Qty of actions impact no impact 2 (4%) 2 (4%) 1 (2%) 2 (4%) 3 (7%) 1 (2%) 2 (4%) 3 (7%) 5 (11%) 9 (20%) 15 (33%) L3: Removed log line L4: Added log line I1: Fixed a bug L1: Improved log message L2: Changed log severity O4: Identified issue in log code I2: Improve code quality O1: Identified a bug O2: Identified performance issue O6: Understood implementation stability O5: Understood business process 0 5 10 15 20 Quantity Actions
  41. Developers’ Perceptions • Less context switching • Saves time •

    Reduce cognitive load
  42. Opportunities for Future Work • Log Engineering • AI for

    a better log engineering
  43. Opportunities for Future Work • Log Engineering • AI for

    a better log engineering • Log Infrastructure • Scalability and real-time analysis as first-class citizens (think of billions, trillions of logs a month!)
  44. Opportunities for Future Work • Log Engineering • AI for

    a better log engineering • Log Infrastructure • Scalability and real-time analysis as first-class citizens (think of billions, trillions of logs a month!) • Log Analysis • Need for systematic comparisons among the different approaches • Embrace unsupervised learning and unlabeled data
  45. Opportunities for Future Work • Log Engineering • AI for

    a better log engineering • Log Infrastructure • Scalability and real-time analysis as first-class citizens (think of billions, trillions of logs a month!) • Log Analysis • Need for systematic comparisons among the different approaches • Embrace unsupervised learning and unlabeled data • Log platforms • Closing the gap between research and industry by means of better standards
  46. Thanks to all my colleagues! Jeanderson Barros Joop Aué Jos

    Winter Rick Wieman Rick Proost Daan Schipper Arie van Deursen Jürgen Cito Sicco Verwer Herman Wijaya Asterios Katsifodimos
  47. Summary • Monitoring is a first-class citizen in contemporary software

    development. • Log research can be classified into log engineering, log analysis, log infrastructure, and log platforms. And we are moving fast! • Open challenges: unstable, unstructured, and unlabeled data, need for scalability, better standards. • There is lots of engineering work to be done, before good research can happen. • Maurício Aniche, M.FinavaroAniche@tudelft.nl, @mauricioaniche
  48. License • You can use and share this material, but

    you should not use it for any commercial purposes. • You must give credits to the original author of this presentation. • You agree not to sell it or make profit in any way with this. • Papers, tools, and everything else I refer to have their own license.
  49. Images in this presentation • Cover photos by Chris Liverani,

    Stephen Dawson, and Luke Chesser (Unsplash) • DevOps picture, by Temitope Oteyowo: https://medium.com/tech- tajawal/devops-in-a-scaling-environment-9d5416ecb928 • Dev productivity, by undraw.co. • Diagrams and charts made by the authors. • You should not reuse authors’ photos for your own purpose.