Testing in production: A Gentle Walk on Software Monitoring Research

Maurício F. Aniche [email protected] @mauricioaniche “Testing in Production”: A gentle
walk on software monitoring research

https://xkcd.com/1428/ Contemporary (tech) companies: • Move fast • Are highly
complex

Monitoring as a first-class citizen!

• 3,500 customers, including eBay, Uber, Spotify, and Facebook. •
In 2015, Adyen achieved a valuation of $2.3 billion, making it the sixth largest European unicorn, and the largest Dutch unicorn. • In April 2017, the company was granted a European banking license. • 1000+ employees from 80+ countries. • Adyen has processed more than 159 billion euros in payments.

1 1 1 2 1 2 1 1 1 3
1 1 5 1 6 3 1 5 2 4 2 8 1 4 2 2 6 1 3 6 5 2 5 5 1 '17 '16 '09 '18 '12 '14 '03 '04 '13 '15 '11 '10 '07 '05 '08 '98 '00 '06 '92 Year Publication Type industry track paper journal research track paper

/RJ 3URFHVVRU /RJ,QIUDVWUXFWXUH 5XQQLQJ$SSOLFDWLRQ %XLOGDQG'HSOR\PHQW ½»¼Ä»À½¾¼À»¿ @O>CDIB?<« ½»¼Ä»À½¾¼À»¿I<=G@OJA«
½»¼Ä»À½¾¼À»¿ @NOJMDIBN« /RJGDWD ORJ JHQHUDWLRQ ([HFXWLRQ(QYLURQPHQW 6RXUFHFRGH ORJ WUDQVPLVVLRQ 6WRUDJH 23V(QJLQHHU /RJ(QJLQHHULQJ /RJ$QDO\VLV 'DVKERDUGV 'HYHORSHU @NKJIN@?<O<Ó« DA?<O<ÓÓIPGG £ JB@MMJM¯I<=G@OJA« /RJ3ODWIRUPV 4XHU\ £¯HNB¯¯I<=G@OJA@O« ¯ODH@NO<HK¯¯½»¼Ä»À« ¯N@Q@MDOT¯¯¯¤ /RJ*HQHUDWLRQ /RJ&RQVXPSWLRQ Jeanderson Candido, Maurício Aniche, Arie van Deursen. Contemporary Software Monitoring: A Systematic Literature Review. 2019. In submission (Transactions on Software Engineering).

Contemporary Logging Framework Log Engineering Log Infrastructure Log Analysis Log
Platforms Anti-patterns in logging code Implementation of log statements Empirical studies Parsing Storage Anomaly detection Security and privacy Root cause analysis Failure prediction Software testing Model inference and invariant mining Reliability and dependability End-to-end analysis tools

Log Engineering • Deals with the logging decisions from the
development side. • What to log? • Where to log? • How severe should it be? • 3 – 6% of code are related to logs. • Those decisions matter a lot in practice! • Research focus has been: • Anti-patterns in logging code • Implementation of log statements • Empirical Studies in Log Engineering /RJ 3URFHVVRU /RJ,QIUDVWUXFWXUH 5XQQLQJ$SSOLFDWLRQ %XLOGDQG'HSOR\PHQW ½»¼Ä»À½¾¼À»¿ @O>CDIB?<« ½»¼Ä»À½¾¼À»¿I<=G@OJA« ½»¼Ä»À½¾¼À»¿ @NOJMDIBN« /RJGDWD ORJ JHQHUDWLRQ ([HFXWLRQ(QYLURQPHQW 6RXUFHFRGH ORJ WUDQVPLVVLRQ 6WRUDJH 23V(QJLQHHU /RJ(QJLQHHULQJ /RJ$QDO\VLV 'DVKERDUGV 'HYHORSHU @NKJIN@?<O<Ó« DA?<O<ÓÓIPGG £ JB@MMJM¯I<=G@OJA« /RJ3ODWIRUPV 4XHU\ £¯HNB¯¯I<=G@OJA@O« ¯ODH@NO<HK¯¯½»¼Ä»À« ¯N@Q@MDOT¯¯¯¤ /RJ*HQHUDWLRQ /RJ&RQVXPSWLRQ

Common log engineering problems • Improper log message • Improper
log levels • Runtime issues (e.g., logging a NULL variable) • Configuration errors • Missing log statements • Overwhelming volume of data Shang, W., Nagappan, M., Hassan, A. E., & Jiang, Z. M. (2014, September). Understanding log lines using development knowledge. In 2014 IEEE International Conference on Software Maintenance and Evolution (pp. 21-30). IEEE. Hassani, M., Shang, W., Shihab, E., & Tsantalis, N. (2018). Studying and detecting log-related issues. Empirical Software Engineering, 23(6), 3248-3280.

Can ML help? Can we learn ‘where’ to log? Zhu,
J., He, P., Fu, Q., Zhang, H., Lyu, M. R., & Zhang, D. (2015, May). Learning to log: Helping developers make informed logging decisions. In Proceedings of the 37th International Conference on Software Engineering-Volume 1 (pp. 415-425). IEEE Press.

Can DL help? • Can we learn not only from
the structure, but also from the semantics of code? • Like DeepBugs or Code2Vec. • First exploration shows that a “simple language models” won’t work. (Best we got was balanced accuracy=.71, recall=.48, but this was a first exploration by our MSc students; more to come)

Log Infrastructure • The tool support necessary to make the
further analysis feasible. • Logs are highly unstructured and in massive quantities. • Research has been focusing on: • Log Parsing • Log Storage and Compression • We need more work here! /RJ 3URFHVVRU /RJ,QIUDVWUXFWXUH 5XQQLQJ$SSOLFDWLRQ %XLOGDQG'HSOR\PHQW ½»¼Ä»À½¾¼À»¿ @O>CDIB?<« ½»¼Ä»À½¾¼À»¿I<=G@OJA« ½»¼Ä»À½¾¼À»¿ @NOJMDIBN« /RJGDWD ORJ JHQHUDWLRQ ([HFXWLRQ(QYLURQPHQW 6RXUFHFRGH ORJ WUDQVPLVVLRQ 6WRUDJH 23V(QJLQHHU /RJ(QJLQHHULQJ /RJ$QDO\VLV 'DVKERDUGV 'HYHORSHU @NKJIN@?<O<Ó« DA?<O<ÓÓIPGG £ JB@MMJM¯I<=G@OJA« /RJ3ODWIRUPV 4XHU\ £¯HNB¯¯I<=G@OJA@O« ¯ODH@NO<HK¯¯½»¼Ä»À« ¯N@Q@MDOT¯¯¯¤ /RJ*HQHUDWLRQ /RJ&RQVXPSWLRQ

log.info(“Customer “ + customer + “ paying ” + paymentValue);
[2019-02-03 15:43:24] [MagicPayment.java][L456] Customer Maurício paying 235.67 Class name Line number Schipper, D., Aniche, M., & van Deursen, A. (2019). Tracing Back Log Data to its Log Statement: From Research to Practice. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR): Proceedings (pp. 545-549). [8816773] IEEE. Short paper.

In practice… • Large companies can’t really log the class
and the line number that originates a log message. • Developers ‘grep’ the source code • Can we automatically detect where origin of the log lines? • W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan. Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, pages 117–132. ACM, 2009.

log.info(“Customer “ + customer + “ ID “ + id);
à Customer .* ID \d*

RQ1: Accuracy • We collect 100k messages from a week
day. • These 100k messages pointed to 676 different locations in the source code. • 95% CL, 5% CI sample = 245 links. • Manual investigation = 97.6% accuracy (239 out of 245 correct links) • 99.8% accuracy in the original paper (two projects: HDFS and Darkstar, millions of log lines, # of log statements not specified).

RQ2: Performance • Related work: not really reported Building ASTs
of the entire codebase is the most expensive part of the process.

When does it fail? • JSON-based logs • The developer
logs a JSON (that is created in runtime) and thus the log statement is simply “log.info(json)”, making our template inaccurate. • Unknown logging method • Developers create their own logging methods, which our tool can't recognize. • Log strings created on-the-fly • Some log messages are too complex and developers create them by means of multiple line of code (e.g. String log = “content1”; log = log “content 2”, ...), which makes the analysis harder. • Related work: “Almost all of these (failed) messages contain long string variables.” => same for us.

Logs of Complex API Integrations • Payment APIs are complex
• Integration faults are easily made • Merchant needs assistance with API usage • Merchant may not notice mistakes • 2.5M http error responses per month (back in 2016!) • What can we learn from them? 22 Aué, J., Aniche, M., Lobbezoo, M., & van Deursen, A. (2018). An Exploratory Study on Faults in Web API Integration in a Large-Scale Payment Company. In ICSE-SEIP '18: 40th International Conference on Software Engineering: Software Engineering in Practice Track (pp. 13-22). Association for Computing Machinery (ACM).

11 Common Causes for API Error Reponses

11 Common Causes for API Error Reponses Integrators are the
main responsible for API integration problems!

11 Common Causes for API Error Reponses Maybe these are
the exceptions one cares more about?

Log Analysis • Knowledge acquisition from log data for a
specific purpose. • The most explored area in the topic: • Anomaly detection • Security and privacy • Root cause analysis • Failure prediction • Software testing • Model inference and invariant mining • Reliability and dependability /RJ 3URFHVVRU /RJ,QIUDVWUXFWXUH 5XQQLQJ$SSOLFDWLRQ %XLOGDQG'HSOR\PHQW ½»¼Ä»À½¾¼À»¿ @O>CDIB?<« ½»¼Ä»À½¾¼À»¿I<=G@OJA« ½»¼Ä»À½¾¼À»¿ @NOJMDIBN« /RJGDWD ORJ JHQHUDWLRQ ([HFXWLRQ(QYLURQPHQW 6RXUFHFRGH ORJ WUDQVPLVVLRQ 6WRUDJH 23V(QJLQHHU /RJ(QJLQHHULQJ /RJ$QDO\VLV 'DVKERDUGV 'HYHORSHU @NKJIN@?<O<Ó« DA?<O<ÓÓIPGG £ JB@MMJM¯I<=G@OJA« /RJ3ODWIRUPV 4XHU\ £¯HNB¯¯I<=G@OJA@O« ¯ODH@NO<HK¯¯½»¼Ä»À« ¯N@Q@MDOT¯¯¯¤ /RJ*HQHUDWLRQ /RJ&RQVXPSWLRQ

Open/data challenges • An avalanche of different techniques • Supervised
and unsupervised ML, passive learning and state machines, graphs, ... • More recently, deep learning and NLP techniques. • There is no single technique that works everywhere… Gunter, D., Tierney, B. L., Brown, A., Swany, M., Bresnahan, J., & Schopf, J. M. (2007, September). Log summarization and anomaly detection for troubleshooting distributed systems. In 2007 8th IEEE/ACM International Conference on Grid Computing (pp. 226-234). IEEE. • Lack of more representative datasets • Lots of HPC logs • No industry/enterprise logs • Starting point: https://github.com/logpai/loghub • Building labeled datasets

Beschastnikh, Ivan, et al. "Inferring models of concurrent systems from
logs of their behavior with CSight." Proceedings of the 36th International Conference on Software Engineering. ACM, 2014. Beschastnikh, I., Brun, Y., Ernst, M. D., Krishnamurthy, A., & Anderson, T. E. (2012). Mining temporal invariants from partially ordered logs. ACM SIGOPS Operating Systems Review, 45(3), 39-46. Model inference

Passive learning Identifying system behavior from observations, and representing it
in the smallest possible model. 20170101160001 Adyen version: ****** 20170101160002 Starting TX/amt=10001/currency=978 20170101160003 Starting EMV 20170101160004 EMV started 20170101160005 Magswipe opened 20170101160006 CTLS started 20170101160007 Transaction initialised 20170101160008 Run TX as EMV transaction 20170101160009 Application selected app:****** 20170101160010 read_application_data succeeded 20170101160011 data_authentication succeeded 20170101160012 validate 0 20170101160013 DCC rejected 20170101160014 terminal_risk_management succeeded 20170101160015 verify_card_holder succeeded 20170101160016 generate_first_ac succeeded 20170101160017 Authorizing online 20170101160018 Data returned by the host succeeded 20170101160019 Transaction authorized by card 20170101160020 Approved receipt printed 20170101160021 pos_result_code:APPROVED 20170101160022 Final status: Approved 20170101160001 Adyen version: ****** 20170101160002 Starting TX/amt=10001/currency=978 20170101160003 Starting EMV 20170101160004 EMV started 20170101160005 Magswipe opened 20170101160006 CTLS started 20170101160007 Transaction initialised 20170101160008 Run TX as EMV transaction 20170101160009 Application selected app:****** 20170101160010 read_application_data succeeded 20170101160011 data_authentication succeeded 20170101160012 validate 0 20170101160013 DCC rejected 20170101160014 terminal_risk_management succeeded 20170101160015 verify_card_holder succeeded 20170101160016 generate_first_ac succeeded 20170101160017 Authorizing online 20170101160018 Data returned by the host succeeded 20170101160019 Transaction authorized by card 20170101160020 Approved receipt printed 20170101160021 pos_result_code:APPROVED 20170101160022 Final status: Approved 20170101160001 Adyen version: ****** 20170101160002 Starting TX/amt=10001/currency=978 20170101160003 Starting EMV 20170101160004 EMV started 20170101160005 Magswipe opened 20170101160006 CTLS started 20170101160007 Transaction initialised 20170101160008 Run TX as EMV transaction 20170101160009 Application selected app:****** 20170101160010 read_application_data succeeded 20170101160011 data_authentication succeeded 20170101160012 validate 0 20170101160013 DCC rejected 20170101160014 terminal_risk_management succeeded 20170101160015 verify_card_holder succeeded 20170101160016 generate_first_ac succeeded 20170101160017 Authorizing online 20170101160018 Data returned by the host succeeded 20170101160019 Transaction authorized by card 20170101160020 Approved receipt printed 20170101160021 pos_result_code:APPROVED 20170101160022 Final status: Approved 20170101160001 Adyen version: ****** 20170101160002 Starting TX/amt=10001/currency=978 20170101160003 Starting EMV 20170101160004 EMV started 20170101160005 Magswipe opened 20170101160006 CTLS started 20170101160007 Transaction initialised 20170101160008 Run TX as EMV transaction 20170101160009 Application selected app:****** 20170101160010 read_application_data succeeded 20170101160011 data_authentication succeeded 20170101160012 validate 0 20170101160013 DCC rejected 20170101160014 terminal_risk_management succeeded 20170101160015 verify_card_holder succeeded 20170101160016 generate_first_ac succeeded 20170101160017 Authorizing online 20170101160018 Data returned by the host succeeded 20170101160019 Transaction authorized by card 20170101160020 Approved receipt printed 20170101160021 pos_result_code:APPROVED 20170101160022 Final status: Approved 20170101160001 Adyen version: ****** 20170101160002 Starting TX/amt=10001/currency=978 20170101160003 Starting EMV 20170101160004 EMV started 20170101160005 Magswipe opened 20170101160006 CTLS started 20170101160007 Transaction initialised 20170101160008 Run TX as EMV transaction 20170101160009 Application selected app:****** 20170101160010 read_application_data succeeded 20170101160011 data_authentication succeeded 20170101160012 validate 0 20170101160013 DCC rejected 20170101160014 terminal_risk_management succeeded 20170101160015 verify_card_holder succeeded 20170101160016 generate_first_ac succeeded 20170101160017 Authorizing online 20170101160018 Data returned by the host succeeded 20170101160019 Transaction authorized by card 20170101160020 Approved receipt printed 20170101160021 pos_result_code:APPROVED 20170101160022 Final status: Approved https://automatonlearning.net/ DFASAT / FlexFringe Heule & Verwer, ICGI 2010 Rick Wieman, Maurício Aniche, Willem Lobbezoo, Sicco Verwer and Arie van Deursen. An Experience Report on Applying Passive Learning in a Large-Scale Payment Company. ICSME Industry Track, 2017

Use Inferred Models to Analyze: Bugs in Test Phase •
Terminal asked for PIN • AND asked for signature • Domain expert noted this unwanted behavior in inferred model. • Fixed before it went into production

Use Inferred Models to Analyze: Differences Between Card Brands Twice
as many chip errors Informed merchant about issue.

Use Inferred Models to Analyze: Time out problems Timeout Improved
performance under network instability by adding targeted retry mechanism

Scalability is needed! Distributed system Log collector Message Broker Trace
aggregator Prefix tree generator Prefix tree aggregator Prefix tree database Collects the logs generated by the many systems Aggregates traces in prefix trees Aggregates individual trees in a complete one Groups log events per trace Rick Proost’s MSc thesis. Under development. Next steps: • Implementation (almost CHECK) • Evaluation • Case study • Future work: State merging (tricky!!)

Domain-based Fuzzing with Equivalence Partitioning for Supervised Learning of Anomaly
Detection in a Cyber Physical System. Herman Nuradhy Wijaya. MSc Thesis. SUTD, 2019. Short paper under development. • Hackers might get control of the sensors • Domain testing to determine attack models • Data: • Normal traces: 4.5h of normal execution • Abnormal traces: 3 hours of 30s attacks + 60s of rest • Feature vector = 90 sec • Supervised Machine Learning • Extra Trees, accuracy of 99.1% • Lots of future work to be done! SUTD’s testbed Cyber Physical Systems (CPS)

Zhang, X., Xu, Y., Lin, Q., Qiao, B., Zhang, H.,
Dang, Y., ... & Chen, J. (2019, August). Robust log-based anomaly detection on unstable log data. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 807-817). ACM. The state-of-the- art, IMHO. (I like that it tackles the reality that log lines evolve…)

Log Platforms • Focus on the integration of all the
dimensions, in an end-to-end log analysis tool. • Not so many papers here. • Quite tool focused.

Monitoring-aware IDEs Winter, J., Aniche, M., Cito, J., & van
Deursen, A. (2019). Monitoring-aware IDEs. In ESEC/FSE 2019 : Proceedings of the 27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 420-431). Association for Computing Machinery (ACM). How it is How it should be

A Monitoring-Aware IDE prototype • Timely Integrated Feedback • Activity
of log statements in production • Traceability • Link from monitoring to source • Link from source to monitoring • Search Capability • Searchable monitoring information YRLGPHWKRG ^ ORJHUURU HUURURFFXUUHGYDU LI ^ ORJZDUQ ZDUQYDU ` WKURZQHZ5XQWLPH([FHSWLRQ ` N N N $MDYDNQHZH[FHSWLRQV %MDYDNQHZH[FHSWLRQV &MDYDNQHZH[FHSWLRQV 7IEVGLEFPI QSRMXSVMRK MRJSVQEXMSRSJXLI IRXMVIWSJX[EVI W]WXIQ 5YMGOP]MRHMGEXIW LS[SJXIREPSK WXEXIQIRXSV I\GITXMSRLETTIRW MRTVSHYGXMSR (IXEMPIHMRJSVQEXMSR EFSYXXLEXWXEXIQIRX 0EWXLSYVW ?+SXSXLIQSRMXSVMRKA

0 200 400 600 P1 P2 P3 P4 P5 P6
P7 P8 P9 P10 P11 P12 Participant Number of files with monitoring information Week 4 Week 3 Week 2 Week 1 Developers use it!

0 5 10 15 20 25 P1 P2 P3 P4
P5 P6 P7 P8 P9 P10 P11 P12 Participant Qty of actions impact no impact 2 (4%) 2 (4%) 1 (2%) 2 (4%) 3 (7%) 1 (2%) 2 (4%) 3 (7%) 5 (11%) 9 (20%) 15 (33%) L3: Removed log line L4: Added log line I1: Fixed a bug L1: Improved log message L2: Changed log severity O4: Identified issue in log code I2: Improve code quality O1: Identified a bug O2: Identified performance issue O6: Understood implementation stability O5: Understood business process 0 5 10 15 20 Quantity Actions

Developers’ Perceptions • Less context switching • Saves time •
Reduce cognitive load

Opportunities for Future Work • Log Engineering • AI for
a better log engineering

a better log engineering • Log Infrastructure • Scalability and real-time analysis as first-class citizens (think of billions, trillions of logs a month!)

a better log engineering • Log Infrastructure • Scalability and real-time analysis as first-class citizens (think of billions, trillions of logs a month!) • Log Analysis • Need for systematic comparisons among the different approaches • Embrace unsupervised learning and unlabeled data

a better log engineering • Log Infrastructure • Scalability and real-time analysis as first-class citizens (think of billions, trillions of logs a month!) • Log Analysis • Need for systematic comparisons among the different approaches • Embrace unsupervised learning and unlabeled data • Log platforms • Closing the gap between research and industry by means of better standards

Thanks to all my colleagues! Jeanderson Barros Joop Aué Jos
Winter Rick Wieman Rick Proost Daan Schipper Arie van Deursen Jürgen Cito Sicco Verwer Herman Wijaya Asterios Katsifodimos

Summary • Monitoring is a first-class citizen in contemporary software
development. • Log research can be classified into log engineering, log analysis, log infrastructure, and log platforms. And we are moving fast! • Open challenges: unstable, unstructured, and unlabeled data, need for scalability, better standards. • There is lots of engineering work to be done, before good research can happen. • Maurício Aniche, [email protected], @mauricioaniche

License • You can use and share this material, but
you should not use it for any commercial purposes. • You must give credits to the original author of this presentation. • You agree not to sell it or make profit in any way with this. • Papers, tools, and everything else I refer to have their own license.

Images in this presentation • Cover photos by Chris Liverani,
Stephen Dawson, and Luke Chesser (Unsplash) • DevOps picture, by Temitope Oteyowo: https://medium.com/tech- tajawal/devops-in-a-scaling-environment-9d5416ecb928 • Dev productivity, by undraw.co. • Diagrams and charts made by the authors. • You should not reuse authors’ photos for your own purpose.

Testing in production: A Gentle Walk on Softwar...

Testing in production: A Gentle Walk on Software Monitoring Research

More Decks by Mauricio Aniche

Other Decks in Technology

Featured

Transcript