Slide 1

Slide 1 text

Osquery Performance @ Scale
Strategies for managing osquery performance in production environments.

Slide 2

Slide 2 text

Zach Wasserman
Principal Engineer, Dactiv LLC
@osqueryatscale @thezachw

Zach has been involved with osquery since the inception of the project at Facebook in 2014. He serves as a member of the Osquery Technical Steering Committee, and works with a variety of clients to integrate osquery into their operations as the Principal Engineer at Dactiv LLC. Zach believes that we can make security accessible to everyone through open-source tools.

Slide 3

Slide 3 text

An Update From the Technical Steering Committee
This content represents my opinion and is not an official position of the TSC or the Linux Foundation.

Slide 4

Slide 4 text

Osquery in Transition
● Early 2014 - Osquery project is created by Mike Arpaia at Facebook.
● October 2014 - Facebook open-sources osquery.
● 2014-2019 - Facebook maintains osquery as an open-source project.
● June 2019 - Osquery project is handed to the Linux Foundation for a community support model.
● October 2019 - Osquery 4.0.2 becomes the first stable release of osquery as a project of the Linux Foundation.
● Current - Community is working to establish a regular release cycle, define a roadmap for the future, raise funds for the project, and help osquery grow.

Slide 5

Slide 5 text

Get Involved
● TSC members and the public meet biweekly to discuss status of the project and develop plans.
  ○ Next meeting: February 4, 10 AM PST
  ○ Join #officehours in osquery Slack for announcements.
● Donate to the osquery project.
  ○ Funds are not yet earmarked, but raising funds for the project will allow us to improve testing infrastructure, and hire devs to work on osquery features and maintenance.
● Volunteer to test releases.
  ○ Organizations that can deploy beta versions of the agent can help ensure the quality of stable releases.

Slide 6

Slide 6 text

Osquery Performance @ Scale
The main event.

Slide 7

Slide 7 text

Motivations
● As security practitioners, we need visibility into the state of the systems we manage.
● Resource utilization has a real impact on the bottom line of our business.
  ○ Production: Resource utilization is multiplied over each production server. More performance impact = higher cost.
  ○ Workstations: We need to ensure that security workloads do not interfere with employees doing their jobs.
● Osquery is built for performance, but it is easy to schedule queries that have a significant performance impact on the system.

Slide 8

Slide 8 text

Goals
● Limit the performance impact of osquery.
  ○ Osquery Watchdog
● Develop monitoring for resource consumption of queries.
  ○ The osquery_schedule table
● Deploy new queries in a controlled manner.
  ○ Host grouping and query sharding
● Investigate performance.
  ○ Profiling and SQLite explain query plan

Slide 9

Slide 9 text

Osquery Watchdog
Limit the performance impact of osquery.

Slide 10

Slide 10 text

Osquery Watchdog
The view from osqueryi

osquery> SELECT path, pid, pgroup, parent, cmdline
    ...> FROM processes WHERE name = 'osqueryd';
   path = /usr/local/bin/osqueryd
    pid = 45472
 pgroup = 45472
 parent = 8938
cmdline = /usr/local/bin/osqueryd osqueryd --flagfile=/osquery.flags

   path = /usr/local/bin/osqueryd
    pid = 45473
 pgroup = 45472
 parent = 45472
cmdline = /usr/local/bin/osqueryd osqueryd

Slide 11

Slide 11 text

Osquery Watchdog
Background
● When we start osqueryd, we get two processes:
  ○ Parent process - The “watchdog”
  ○ Child process - The “worker”
● Potentially resource-intensive operations are performed in the worker process.
  ○ Run queries, output logs, etc.
● The watchdog process checks the utilization stats for the worker on an interval.
  ○ Resource utilization limits exceeded -> Watchdog kills/respawns worker
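
The check-and-respawn decision described on this slide can be modeled in a few lines of Python. This is a simplified sketch and not osquery's implementation: the class name `WatchdogModel`, its fields, and the idea of counting consecutive checks above the CPU threshold are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WatchdogModel:
    """Toy model of a watchdog enforcing limits on a worker (hypothetical)."""

    cpu_limit_percent: float      # CPU threshold for the worker
    memory_limit_mb: float        # memory ceiling for the worker
    max_checks_over_limit: int    # consecutive checks tolerated above the CPU threshold
    checks_over_limit: int = 0    # running count of consecutive over-limit checks

    def should_kill(self, cpu_percent: float, memory_mb: float) -> bool:
        """Return True when one sample of worker stats violates the limits."""
        if memory_mb > self.memory_limit_mb:
            return True
        if cpu_percent > self.cpu_limit_percent:
            self.checks_over_limit += 1
        else:
            # A check under the threshold resets the sustained-usage counter.
            self.checks_over_limit = 0
        return self.checks_over_limit > self.max_checks_over_limit


def first_kill_index(model, samples):
    """Return the index of the first sample that triggers a kill, or None."""
    for index, (cpu_percent, memory_mb) in enumerate(samples):
        if model.should_kill(cpu_percent, memory_mb):
            return index
    return None
```

In this model a short CPU spike is tolerated, while sustained usage above the threshold or any sample above the memory ceiling leads to a kill.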

Slide 12

Slide 12 text

Osquery Watchdog
Managing query execution
What happens when a query runs on the worker?
1. Worker writes to RocksDB the name of the query being run.
2. Query executes.
3. Worker removes notation of running query.

Slide 13

Slide 13 text

Osquery Watchdog
Managing query execution
Now suppose the watchdog kills the worker during query execution.
1. Worker writes to RocksDB the name of the query being run.
2. Query begins executing.
3. Watchdog kills worker.
4. Worker respawns, reads RocksDB, and sees that the previous worker was in the middle of execution.
5. Worker logs the failure during query execution and “blacklists” the query.

Slide 14

Slide 14 text

Osquery Watchdog
Managing query execution

I0121 08:44:48.398947 270000128 scheduler.cpp:96] Executing scheduled query expensive_query: select 1 from users, users, users, users, users, users
W0121 08:45:13.591068 127172608 watcher.cpp:331] osqueryd worker (71861) stopping: Maximum sustainable CPU utilization limit exceeded: 21
I0121 08:45:13.996376 127172608 watcher.cpp:583] osqueryd watcher (71860) executing worker (71928)
I0121 08:45:14.841640 163079616 init.cpp:415] osquery worker initialized [watcher=71860]
I0121 08:45:14.842711 163079616 rocksdb.cpp:131] Opening RocksDB handle: /tmp/osquery.db
...
W0121 08:45:23.252063 163079616 config.cpp:317] Scheduled query may have failed: expensive_query

Slide 15

Slide 15 text

Osquery Watchdog
Query Blacklisting
● Queries are “blacklisted” when execution fails.
● Blacklisted queries are removed from the schedule for 24 hours.
● This prevents crash-looping and the unnecessary use of resources on queries that will be killed anyway.
● Observe query blacklist status using the blacklisted column of the osquery_schedule table.
  ○ More on this later.
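
The marker-and-blacklist sequence from the last few slides can be sketched as follows. This is an illustrative model only: a plain dict stands in for RocksDB, and the names `WorkerModel`, `executing_query`, and `blacklist` are hypothetical, not osquery internals. The 24-hour duration is the one stated in the talk.

```python
BLACKLIST_SECONDS = 24 * 60 * 60  # blacklist duration stated in the talk


class WorkerModel:
    """Toy worker that records the running query in a persistent store."""

    def __init__(self, store):
        # `store` is a dict standing in for RocksDB; it outlives the worker.
        self.store = store
        self.store.setdefault("blacklist", {})

    def start(self, now):
        """On startup, blacklist any query a previous worker died running."""
        name = self.store.pop("executing_query", None)
        if name is not None:
            self.store["blacklist"][name] = now + BLACKLIST_SECONDS
        return name

    def is_blacklisted(self, name, now):
        return self.store["blacklist"].get(name, 0) > now

    def run(self, name, execute, now):
        """Run a query unless it is blacklisted; return True if it ran."""
        if self.is_blacklisted(name, now):
            return False
        self.store["executing_query"] = name   # step 1: write the marker
        execute()                              # step 2: run the query
        del self.store["executing_query"]      # step 3: clear the marker
        return True
```

If the worker dies inside `execute()`, the marker is never cleared, so the next worker finds it in `start()` and skips the query until the blacklist entry expires.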

Slide 16

Slide 16 text

Osquery Watchdog
core/watcher.cpp:467

return SQL::selectFrom(
    {"parent", "user_time", "system_time", "resident_size"},
    "processes", "pid", EQUALS, INTEGER(p));

Slide 17

Slide 17 text

Osquery Watchdog
Configuring the Watchdog
● The watchdog is enabled in osquery by default.
  ○ Default settings: “normal”
    ■ CPU - Over 10% for up to 12 seconds
    ■ Memory - Up to 200MB
  ○ --watchdog_level=1: “restrictive”
    ■ CPU - Over 5% for up to 6 seconds
    ■ Memory - Up to 100MB
  ○ --watchdog_level=-1: “off”
    ■ Performance limits are disabled

Slide 18

Slide 18 text

Osquery Watchdog
Configuring the Watchdog
● Settings can be customized to specific needs.
● --watchdog_utilization_limit
  ○ Threshold percentage of CPU
  ○ Time allowed over the threshold is defined by the intervals from --watchdog_level
    ■ --watchdog_level=0 - 10 seconds above limit
    ■ --watchdog_level=1 - 5 seconds above limit
● --watchdog_memory_limit
  ○ Maximum memory in MB
● Tradeoff: Visibility <-> Performance safety
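
As a small illustration, the flags named on this slide could be written to a flagfile with a helper like the one below. The helper name `render_flagfile` and the values used in the usage note are assumptions; only the three flag names come from the slides.

```python
def render_flagfile(level=0, utilization_limit=None, memory_limit_mb=None):
    """Render watchdog flags as flagfile text, one flag per line."""
    lines = [f"--watchdog_level={level}"]
    if utilization_limit is not None:
        # Threshold percentage of CPU.
        lines.append(f"--watchdog_utilization_limit={utilization_limit}")
    if memory_limit_mb is not None:
        # Maximum memory in MB.
        lines.append(f"--watchdog_memory_limit={memory_limit_mb}")
    return "\n".join(lines) + "\n"
```

For example, `render_flagfile(level=1, utilization_limit=8, memory_limit_mb=150)` produces a restrictive profile with hypothetical custom limits. Tighter limits trade visibility for performance safety, so values like these would need to be tuned per environment.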

Slide 19

Slide 19 text

Osquery Watchdog
Extensions
● It is also possible to use the watchdog to limit utilization by osquery extensions.
  ○ Extensions typically run as child processes spawned by osqueryd (with the --extensions_autoload flag).
  ○ Use --enable_extensions_watchdog to turn on this feature.

Slide 20

Slide 20 text

Monitoring
Use osquery to monitor osquery.

Slide 21

Slide 21 text

Monitoring
osquery_schedule
● Osquery itself provides excellent facilities for monitoring osquery performance.
● The osquery_schedule table exposes performance information for all scheduled queries.
● Performance information is collected by looking at the difference in CPU time and memory of the worker process during execution.

Slide 22

Slide 22 text

osquery_schedule Schema
What information is available in the osquery_schedule table?

Slide 23

Slide 23 text

Monitoring
Blacklisted Queries
● SELECT * FROM osquery_schedule WHERE blacklisted = 1;
● Returns information about all of the queries that are currently blacklisted.
● Depending on your requirements, this may be worth alerting on!
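
One way to collect this data continuously is to schedule the query itself. The sketch below builds a config fragment using osquery's `schedule` layout with `query` and `interval` keys; the query name `blacklisted_queries` and the one-hour interval are assumptions chosen for illustration.

```python
import json


def monitoring_config(interval_seconds=3600):
    """Build a config fragment that schedules the blacklist check."""
    return {
        "schedule": {
            # Hypothetical query name; any unique name would do.
            "blacklisted_queries": {
                "query": "SELECT * FROM osquery_schedule WHERE blacklisted = 1;",
                "interval": interval_seconds,
            }
        }
    }


# Print the fragment as JSON so it can be merged into an existing config.
print(json.dumps(monitoring_config(), indent=2))
```

Any rows logged by this query mean that some query is currently blacklisted on that host, which makes the results a natural source for an alert.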

Slide 24

Slide 24 text

Monitoring
Dashboards
● Consider using the data from osquery_schedule to create osquery performance dashboards.
● Useful charts:
  ○ Blacklisted queries
  ○ Memory usage of queries
  ○ System + user time usage of queries
● Visualizing middle and top percentiles can help find outliers.
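
The percentile idea can be sketched with the standard library. The example assumes rows from osquery_schedule are available as dicts with `user_time` and `system_time` values (possibly as strings, as they often appear in logged results); the function name and the choice of p50/p95/p99 are illustrative.

```python
from statistics import quantiles


def cpu_time_percentiles(rows):
    """Compute middle and top percentiles of system + user time per query."""
    totals = sorted(int(r["user_time"]) + int(r["system_time"]) for r in rows)
    # n=100 yields 99 cut points; index k-1 holds the k-th percentile.
    cuts = quantiles(totals, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def outliers(rows, threshold):
    """Return names of queries whose combined CPU time exceeds the threshold."""
    return sorted(
        r["name"]
        for r in rows
        if int(r["user_time"]) + int(r["system_time"]) > threshold
    )
```

A dashboard would typically chart these percentiles over time. Calling `outliers` with the p95 value gives a quick list of queries worth a closer look.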

Slide 25

Slide 25 text

Monitoring

Slide 26

Slide 26 text

Monitoring

Slide 27

Slide 27 text

Monitoring

Slide 28

Slide 28 text

Query Deployment
Deploy new queries in a controlled manner.

Slide 29

Slide 29 text

Query Deployment
Deployment Strategies
Two major strategies for controlling deployment of new queries:
1. Segment hosts and deploy queries to groups of hosts.
2. Use the shard option of scheduled queries to slow roll queries within a host group.
Use these strategies together for the best control of rollouts.

Slide 30

Slide 30 text

Query Deployment
Group Hosts
● Segment hosts by risk tolerance for performance issues.
  ○ Start with lower risk hosts.
● Different techniques can be used depending on the deployment/configuration strategy of osquery.
● With tools like Chef/Puppet/Ansible:
  ○ Use the tooling to deploy different pack files to each group of hosts.
● With plain osquery:
  ○ Use the discovery query feature of query packs to gate pack execution based on the results of dynamic queries.
● With Fleet:
  ○ Use labels to segment hosts and target packs to labels.
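
For the plain osquery case, a pack gated by a discovery query might be generated as in the sketch below. The `discovery` and `queries` keys follow osquery's pack format; the helper name `build_pack`, the nginx discovery query, and the example query are hypothetical.

```python
import json


def build_pack(discovery_sql, queries):
    """Build a query pack that only runs where the discovery query returns rows."""
    return {
        "discovery": [discovery_sql],
        "queries": queries,
    }


# Hypothetical pack: only hosts running nginx execute the pack's queries.
pack = build_pack(
    "SELECT pid FROM processes WHERE name = 'nginx';",
    {
        "listening_ports": {
            "query": "SELECT pid, port, address FROM listening_ports;",
            "interval": 600,
        }
    },
)
print(json.dumps(pack, indent=2))
```

The discovery query works as a dynamic host group: the pack's queries run only on hosts where it returns results, with no external tooling required.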

Slide 31

Slide 31 text

Query Deployment
Shard Queries
● Set the shard option in a scheduled query to enable the query on a subset of hosts that receive the pack.
  ○ Shard is a percentage of hosts on which to enable the query.
    ■ 0 - No hosts
    ■ 100 - All hosts
● Increase the shard value as confidence in the query performance increases.
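
The sketch below shows a scheduled query entry with the shard option set, plus one way a percentage can map to a stable subset of hosts. The bucketing function is an assumption for illustration and is not osquery's actual selection algorithm; the helper names are hypothetical.

```python
import hashlib


def with_shard(query_entry, percent):
    """Return a copy of a scheduled query entry with the shard option set."""
    if not 0 <= percent <= 100:
        raise ValueError("shard must be a percentage between 0 and 100")
    return {**query_entry, "shard": percent}


def host_bucket(host_identifier):
    """Map a host to a stable bucket in [0, 100) (illustrative scheme only)."""
    digest = hashlib.sha256(host_identifier.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def query_enabled(host_identifier, shard_percent):
    """A host runs the query when its bucket falls below the shard value."""
    return host_bucket(host_identifier) < shard_percent
```

Under this scheme, raising the shard value only ever adds hosts, so a host that ran the query at 10 keeps running it at 50. That property is what makes a gradual rollout easy to reason about.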

Slide 32

Slide 32 text

Query Deployment
Monitor Rollout
● Use each of the rollout techniques to begin sending the new query to hosts.
● Ensure that you have visibility (alerting, dashboards, etc.) into the performance of osquery on those systems.
● Deploy to more hosts as confidence increases.
● Good monitoring dashboards really pay off at this stage.

Slide 33

Slide 33 text

Query Deployment

Slide 34

Slide 34 text

Investigate Performance
Use tooling to understand performance problems.

Slide 35

Slide 35 text

Investigate Performance
Tools
● Osquery provides tools that we can use to investigate the performance of queries.
● Profiling
  ○ Use the profile script to preview the performance of query packs on the local machine.
● Query planning
  ○ Use SQLite explain query plan to begin debugging performance problems.

Slide 36

Slide 36 text

Investigate Performance
Profiling

$ ./tools/analysis/profile.py --config pack.json --shell /usr/bin/osqueryi
Profiling query: select * from processes
 U:1 C:0 M:2 F:0 D:0 processes (1/1): utilization: 9.8 cpu_time: 0.099889228 memory: 18640896 fds: 4 duration: 0.5181262493133545
Profiling query: select * from users join user_groups using (uid) join groups using (gid)
 U:2 C:1 M:2 F:0 D:2 user_groups (1/1): utilization: 28.299999999999997 cpu_time: 0.570734208 memory: 19369984 fds: 4 duration: 1.5286788940429688
Profiling query: select * from time
 U:0 C:0 M:2 F:0 D:0 time (1/1): utilization: 5.35 cpu_time: 0.056201881999999995 memory: 16080896 fds: 4 duration: 0.5209510326385498
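
Output like the above can be parsed to flag queries before they are deployed. This sketch assumes the result lines keep the shape shown on the slide; the regular expression, function names, and the 10 percent threshold in the usage note are illustrative assumptions, not part of the profile script.

```python
import re

# Matches result lines such as:
#   "processes (1/1): utilization: 9.8 cpu_time: 0.09 memory: 123 fds: 4 duration: 0.5"
PROFILE_LINE = re.compile(
    r"(?P<name>\S+) \(\d+/\d+\): utilization: (?P<utilization>[\d.]+) "
    r"cpu_time: (?P<cpu_time>[\d.]+) memory: (?P<memory>\d+) "
    r"fds: (?P<fds>\d+) duration: (?P<duration>[\d.]+)"
)


def parse_profile_output(text):
    """Extract per-query stats from profile output into a dict keyed by name."""
    results = {}
    for match in PROFILE_LINE.finditer(text):
        results[match["name"]] = {
            "utilization": float(match["utilization"]),
            "cpu_time": float(match["cpu_time"]),
            "memory": int(match["memory"]),
            "fds": int(match["fds"]),
            "duration": float(match["duration"]),
        }
    return results


def over_utilization(results, limit_percent):
    """Return names of queries whose utilization exceeds the given limit."""
    return sorted(
        name for name, stats in results.items()
        if stats["utilization"] > limit_percent
    )
```

Applied to the slide's output with a limit of 10, only `user_groups` is flagged. Since that is the join across users, user_groups, and groups, it would be the query to examine before rollout.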

Slide 37

Slide 37 text

Investigate Performance
Query Plan

osquery> EXPLAIN QUERY PLAN
    ...> SELECT * FROM processes JOIN process_open_sockets USING (pid) JOIN hash USING (path)
    ...> WHERE pid IN (SELECT pid FROM processes WHERE uid = 0);
+----+--------+---------+---------------------------------------------------------+
| id | parent | notused | detail                                                  |
+----+--------+---------+---------------------------------------------------------+
| 4  | 0      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 62:            |
| 8  | 0      | 0       | LIST SUBQUERY 1                                         |
| 10 | 8      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 68:            |
| 30 | 0      | 0       | SCAN TABLE process_open_sockets VIRTUAL TABLE INDEX 64: |
| 38 | 0      | 0       | SCAN TABLE hash VIRTUAL TABLE INDEX 66:                 |
+----+--------+---------+---------------------------------------------------------+

Slide 38

Slide 38 text

Investigate Performance
Query Plan

osquery> EXPLAIN QUERY PLAN
    ...> SELECT * FROM processes JOIN process_open_sockets USING (pid) JOIN hash USING (path)
    ...> WHERE uid = 0;
+----+--------+---------+---------------------------------------------------------+
| id | parent | notused | detail                                                  |
+----+--------+---------+---------------------------------------------------------+
| 4  | 0      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 70:            |
| 10 | 0      | 0       | SCAN TABLE process_open_sockets VIRTUAL TABLE INDEX 71: |
| 18 | 0      | 0       | SCAN TABLE hash VIRTUAL TABLE INDEX 73:                 |
+----+--------+---------+---------------------------------------------------------+
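
The same comparison can be tried outside osquery with Python's built-in sqlite3 module. In this sketch, ordinary in-memory tables stand in for osquery's virtual tables, so the plan text will differ from the slides and between SQLite versions; the table definitions and the `query_plan` helper are assumptions for illustration.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the detail text of each step in SQLite's plan for a query."""
    # The detail text is the last column of each EXPLAIN QUERY PLAN row.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")
# Ordinary tables standing in for osquery's virtual tables (hypothetical schema).
conn.executescript(
    """
    CREATE TABLE processes (pid INTEGER PRIMARY KEY, uid INTEGER, path TEXT);
    CREATE TABLE process_open_sockets (pid INTEGER, socket INTEGER);
    """
)

base = "SELECT * FROM processes JOIN process_open_sockets USING (pid) "
subquery_plan = query_plan(
    conn, base + "WHERE pid IN (SELECT pid FROM processes WHERE uid = 0)"
)
direct_plan = query_plan(conn, base + "WHERE uid = 0")

for title, plan in (("subquery", subquery_plan), ("direct", direct_plan)):
    print(title)
    for step in plan:
        print("  ", step)
```

Comparing the two plans side by side makes it easier to spot extra steps, such as a subquery that scans a table a second time.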

Slide 39

Slide 39 text

Wrapping It Up
● Use the watchdog to constrain performance.
● Monitor performance and blacklists with the osquery_schedule table.
● Be strategic in rollout of new queries.
● Learn to use the tooling to evaluate performance.

Slide 40

Slide 40 text

Thank You!
Email - zach@dactiv.llc
Twitter - @thezachw
Osquery Slack - @zwass