Osquery Performance @ Scale

Strategies for managing osquery performance in production environments.

Zach Wasserman
Principal Engineer, Dactiv LLC
@osqueryatscale @thezachw

Zach has been involved with osquery since the inception of the project at Facebook in 2014. He serves as a member of the Osquery Technical Steering Committee and works with a variety of clients to integrate osquery into their operations as the Principal Engineer at Dactiv LLC. Zach believes that we can make security accessible to everyone through open-source tools.

An Update From the Technical Steering Committee
This content represents my opinion and is not an official position of the TSC or the Linux Foundation.

Osquery in Transition
• Early 2014 - Osquery project is created by Mike Arpaia at Facebook.
• October 2014 - Facebook open-sources osquery.
• 2014-2019 - Facebook maintains osquery as an open-source project.
• June 2019 - Osquery project is handed to the Linux Foundation for a community support model.
• October 2019 - Osquery 4.0.2 becomes the first stable release of osquery as a project of the Linux Foundation.
• Current - Community is working to establish a regular release cycle, define a roadmap for the future, raise funds for the project, and help osquery grow.

Get Involved
• TSC members and the public meet biweekly to discuss the status of the project and develop plans.
  ◦ Next meeting: February 4, 10 AM PST
  ◦ Join #officehours in osquery Slack for announcements.
• Donate to the osquery project.
  ◦ Funds are not yet earmarked, but raising funds for the project will allow us to improve testing infrastructure and hire developers to work on osquery features and maintenance.
• Volunteer to test releases.
  ◦ Organizations that can deploy beta versions of the agent can help ensure the quality of stable releases.

Osquery Performance @ Scale
The main event.

Motivations
• As security practitioners, we need visibility into the state of the systems we manage.
• Resource utilization has a real impact on the bottom line of our business.
  ◦ Production: Resource utilization is multiplied over each production server. More performance impact = higher cost.
  ◦ Workstations: We need to ensure that security workloads do not interfere with employees doing their jobs.
• Osquery is built for performance, but it is easy to schedule queries that will have a significant performance impact on the system.

Goals
• Limit the performance impact of osquery.
  ◦ Osquery Watchdog
• Develop monitoring for resource consumption of queries.
  ◦ The osquery_schedule table
• Deploy new queries in a controlled manner.
  ◦ Host grouping and query sharding
• Investigate performance.
  ◦ Profiling and SQLite explain query plan

Osquery Watchdog
Limit the performance impact of osquery.

Osquery Watchdog
The view from osqueryi

    osquery> SELECT path, pid, pgroup, parent, cmdline
        ...> FROM processes WHERE name = 'osqueryd';
       path = /usr/local/bin/osqueryd
        pid = 45472
     pgroup = 45472
     parent = 8938
    cmdline = /usr/local/bin/osqueryd osqueryd --flagfile=/osquery.flags

       path = /usr/local/bin/osqueryd
        pid = 45473
     pgroup = 45472
     parent = 45472
    cmdline = /usr/local/bin/osqueryd osqueryd

Osquery Watchdog
Background
• When we start osqueryd, we get two processes:
  ◦ Parent process - The “watchdog”
  ◦ Child process - The “worker”
• Potentially resource-intensive operations are performed in the worker process.
  ◦ Run queries, output logs, etc.
• The watchdog process checks the utilization stats for the worker on an interval.
  ◦ Resource utilization limits exceeded -> Watchdog kills/respawns worker
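As a rough illustration of the parent/child relationship, the rows returned by the processes query on the previous slide can be split into watchdog and worker by following the parent column. This is only a sketch: classify_osqueryd is a hypothetical helper, and it assumes the watchdog is the one osqueryd process whose parent is not itself osqueryd.

```python
def classify_osqueryd(rows):
    """Split osqueryd process rows into (watchdog, workers).

    Assumption: the watchdog is the only osqueryd process whose parent
    is not another osqueryd process; workers are its direct children.
    """
    pids = {row["pid"] for row in rows}
    watchdogs = [row for row in rows if row["parent"] not in pids]
    if len(watchdogs) != 1:
        raise ValueError("expected exactly one watchdog process")
    watchdog = watchdogs[0]
    workers = [row for row in rows if row["parent"] == watchdog["pid"]]
    return watchdog, workers


# Values copied from the osqueryi output shown earlier (osquery returns strings).
rows = [
    {"pid": "45472", "pgroup": "45472", "parent": "8938"},
    {"pid": "45473", "pgroup": "45472", "parent": "45472"},
]
watchdog, workers = classify_osqueryd(rows)
print("watchdog:", watchdog["pid"], "workers:", [w["pid"] for w in workers])
```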

Osquery Watchdog
Managing query execution
What happens when a query runs on the worker?
1. Worker writes to RocksDB the name of the query being run.
2. Query executes.
3. Worker removes the record of the running query.

Osquery Watchdog
Managing query execution
Now suppose the watchdog kills the worker during query execution.
1. Worker writes to RocksDB the name of the query being run.
2. Query begins executing.
3. Watchdog kills worker.
4. Worker respawns, reads RocksDB, and sees that the previous worker was in the middle of execution.
5. Worker logs the failure during query execution and “blacklists” the query.
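The bookkeeping in these steps can be modeled in a few lines. The sketch below is not osquery's implementation: a plain dict stands in for RocksDB, the key names are hypothetical, a raised exception simulates the watchdog killing the worker, and the 24-hour blacklist expiry is not modeled.

```python
RUNNING_KEY = "executing_query"  # hypothetical key name
BLACKLIST_KEY = "blacklist"      # hypothetical key name


class WorkerKilled(Exception):
    """Simulates the watchdog killing the worker mid-query."""


def start_worker(store):
    """On startup, blacklist any query the previous worker left unfinished."""
    failed = store.pop(RUNNING_KEY, None)
    if failed is not None:
        store.setdefault(BLACKLIST_KEY, set()).add(failed)
    return failed


def run_query(store, name, execute):
    """Record the query name, execute it, then clear the record."""
    if name in store.get(BLACKLIST_KEY, set()):
        return None  # blacklisted queries are skipped
    store[RUNNING_KEY] = name
    # Deliberately no try/finally: a killed process gets no chance to clean up.
    result = execute()
    del store[RUNNING_KEY]
    return result


def expensive():
    raise WorkerKilled()


store = {}           # stand-in for the persistent RocksDB store
start_worker(store)  # first worker: nothing to recover
try:
    run_query(store, "expensive_query", expensive)
except WorkerKilled:
    pass             # the worker is gone; the running-query record remains

failed = start_worker(store)  # respawned worker finds the leftover record
skipped = run_query(store, "expensive_query", lambda: "rows")
print("blacklisted:", failed, "| result of rerun:", skipped)
```

Because the record is written before execution and cleared only afterwards, a leftover record is the only signal the respawned worker needs.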

Osquery Watchdog
Managing query execution

    I0121 08:44:48.398947 270000128 scheduler.cpp:96] Executing scheduled query expensive_query: select 1 from users, users, users, users, users, users
    W0121 08:45:13.591068 127172608 watcher.cpp:331] osqueryd worker (71861) stopping: Maximum sustainable CPU utilization limit exceeded: 21
    I0121 08:45:13.996376 127172608 watcher.cpp:583] osqueryd watcher (71860) executing worker (71928)
    I0121 08:45:14.841640 163079616 init.cpp:415] osquery worker initialized [watcher=71860]
    I0121 08:45:14.842711 163079616 rocksdb.cpp:131] Opening RocksDB handle: /tmp/osquery.db
    ...
    W0121 08:45:23.252063 163079616 config.cpp:317] Scheduled query may have failed: expensive_query

Osquery Watchdog
Query Blacklisting
• Queries are “blacklisted” when execution fails.
• Blacklisted queries are removed from the schedule for 24 hours.
• This prevents crash-looping and avoids wasting resources on queries that will be killed anyway.
• Observe query blacklist status using the blacklisted column of the osquery_schedule table.
  ◦ More on this later.

Osquery Watchdog
core/watcher.cpp:467

    return SQL::selectFrom(
        {"parent", "user_time", "system_time", "resident_size"},
        "processes",
        "pid",
        EQUALS,
        INTEGER(p));

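The watcher reads user_time and system_time for the worker from the processes table. One way to turn two such samples into a CPU utilization percentage is sketched below. This illustrates the idea and is not osquery's exact calculation; it assumes both columns are cumulative CPU time in milliseconds, and the sample values and 3-second interval are made up.

```python
def cpu_utilization_percent(prev, curr, interval_seconds):
    """Percent of one CPU used between two samples of the processes table.

    Assumption: user_time and system_time are cumulative milliseconds.
    """
    prev_ms = prev["user_time"] + prev["system_time"]
    curr_ms = curr["user_time"] + curr["system_time"]
    return 100.0 * (curr_ms - prev_ms) / (interval_seconds * 1000.0)


# Hypothetical samples taken 3 seconds apart.
prev = {"user_time": 1000, "system_time": 500}
curr = {"user_time": 1400, "system_time": 700}
utilization = cpu_utilization_percent(prev, curr, interval_seconds=3)
print(utilization)
```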
Osquery Watchdog
Configuring the Watchdog
• The watchdog is enabled in osquery by default.
  ◦ Default settings: “normal”
    ▪ CPU - Over 10% for up to 12 seconds
    ▪ Memory - Up to 200MB
  ◦ --watchdog_level=1: “restrictive”
    ▪ CPU - Over 5% for up to 6 seconds
    ▪ Memory - Up to 100MB
  ◦ --watchdog_level=-1: “off”
    ▪ Performance limits are disabled

Osquery Watchdog
Configuring the Watchdog
• Settings can be customized to specific needs.
• --watchdog_utilization_limit
  ◦ Threshold percentage of CPU
  ◦ Time allowed over the threshold is defined by the intervals from --watchdog_level
    ▪ --watchdog_level=0 - 10 seconds above limit
    ▪ --watchdog_level=1 - 5 seconds above limit
• --watchdog_memory_limit
  ◦ Maximum memory in MB
• Tradeoff: Visibility <-> Performance safety
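The combination of a CPU threshold and a time allowed above it can be pictured as a sustained-limit check over periodic samples. The function below is a simplified model and not the watchdog's actual accounting (osquery may track time over the limit differently). The 3-second check interval and the sample values are assumptions, while the 10% and 12-second figures come from the "normal" defaults.

```python
def first_violation(samples, limit_percent, allowed_seconds, interval_seconds):
    """Return the index of the sample at which the worker would be stopped.

    A violation occurs once utilization has stayed above limit_percent for
    longer than allowed_seconds. Returns None if that never happens.
    """
    seconds_over = 0
    for index, utilization in enumerate(samples):
        if utilization > limit_percent:
            seconds_over += interval_seconds
            if seconds_over > allowed_seconds:
                return index
        else:
            seconds_over = 0  # assumption: dropping below the limit resets the clock
    return None


# Hypothetical utilization samples (percent), one per check interval.
samples = [4, 15, 20, 8, 30, 30, 30, 30, 30]
stopped_at = first_violation(samples, limit_percent=10, allowed_seconds=12,
                             interval_seconds=3)
short_burst = first_violation([15, 20, 8], limit_percent=10, allowed_seconds=12,
                              interval_seconds=3)
print(stopped_at, short_burst)
```

A short burst over the threshold is tolerated; only sustained use triggers a kill, which is why raising the limit trades performance safety for visibility.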

Osquery Watchdog
Extensions
• It is also possible to use the watchdog to limit utilization by osquery extensions.
  ◦ Extensions typically run as child processes spawned by osqueryd (with the --extensions_autoload flag).
  ◦ Use --enable_extensions_watchdog to turn on this feature.
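If flag files are generated by tooling, the watchdog flags from these slides could be rendered as follows. This is a minimal sketch: render_flagfile is a hypothetical helper, and the limit values are placeholders rather than recommendations.

```python
def render_flagfile(flags):
    """Render a dict of flag names and values as flagfile lines (--name=value)."""
    lines = []
    for name, value in flags.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"--{name}={value}")
    return "\n".join(lines) + "\n"


# Placeholder values chosen only for illustration.
flags = {
    "watchdog_level": 0,
    "watchdog_utilization_limit": 20,
    "watchdog_memory_limit": 300,
    "enable_extensions_watchdog": True,
}
flagfile_text = render_flagfile(flags)
print(flagfile_text, end="")
```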

Monitoring
Use osquery to monitor osquery.

Monitoring
osquery_schedule
• Osquery itself provides excellent facilities for monitoring osquery performance.
• The osquery_schedule table exposes performance information for all scheduled queries.
• Performance information is collected by looking at the difference in CPU time and memory of the worker process during execution.
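The "difference during execution" idea can be demonstrated on any process: sample CPU time and memory before and after a unit of work, then subtract. The sketch below measures the current Python process as an analogy and is not how osquery itself is instrumented. measure is a hypothetical helper, and the resource module is Unix-only.

```python
import os
import resource


def measure(run_query):
    """Run a callable and report the CPU time and peak memory it added."""
    cpu_before = os.times()
    rss_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    result = run_query()
    cpu_after = os.times()
    rss_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    stats = {
        "user_time": cpu_after.user - cpu_before.user,        # seconds
        "system_time": cpu_after.system - cpu_before.system,  # seconds
        # Peak RSS growth; kilobytes on Linux, bytes on macOS.
        "peak_memory_growth": rss_after - rss_before,
    }
    return result, stats


def workload():
    # Stand-in for a query: some CPU-bound work.
    return sum(i * i for i in range(200_000))


result, stats = measure(workload)
print(stats)
```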

osquery_schedule Schema
What information is available in the osquery_schedule table?

Monitoring
Blacklisted Queries
• SELECT * FROM osquery_schedule WHERE blacklisted = 1;
• Returns information about all of the queries that are currently blacklisted.
• Depending on your requirements, this may be worth alerting on!
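An alerting hook might consume the result of this query as JSON, for example from osqueryi --json. The sketch below only parses such output. It assumes rows arrive as a JSON array of objects with string values; the sample rows are invented, and blacklisted_queries is a hypothetical helper.

```python
import json


def blacklisted_queries(osquery_json):
    """Return the sorted names of blacklisted queries from osquery_schedule rows."""
    rows = json.loads(osquery_json)
    # Filter defensively so this also works on an unfiltered SELECT *.
    return sorted(row["name"] for row in rows if row.get("blacklisted") == "1")


# Invented sample rows; osquery emits column values as strings.
sample = json.dumps([
    {"name": "expensive_query", "blacklisted": "1", "executions": "3"},
    {"name": "time", "blacklisted": "0", "executions": "120"},
])
alerts = blacklisted_queries(sample)
print(alerts)
```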

Monitoring
Dashboards
• Consider using the data from osquery_schedule to create osquery performance dashboards.
• Useful charts:
  ◦ Blacklisted queries
  ◦ Memory usage of queries
  ◦ System + user time usage of queries
• Visualizing middle and top percentiles can help find outliers.
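To make the percentile idea concrete, the sketch below groups osquery_schedule rows collected from several hosts by query name and computes the 50th and 95th percentile of user plus system time. The rows are invented, and the nearest-rank percentile is just one reasonable choice; a real dashboard would normally do this in the log pipeline.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def cpu_time_percentiles(rows, pcts=(50, 95)):
    """Map each query name to {percentile: user_time + system_time}."""
    by_query = defaultdict(list)
    for row in rows:
        total = int(row["user_time"]) + int(row["system_time"])
        by_query[row["name"]].append(total)
    return {name: {p: percentile(times, p) for p in pcts}
            for name, times in by_query.items()}


# Invented rows: one per host for the same scheduled query.
rows = [
    {"name": "processes", "user_time": "8", "system_time": "2"},
    {"name": "processes", "user_time": "15", "system_time": "5"},
    {"name": "processes", "user_time": "20", "system_time": "10"},
    {"name": "processes", "user_time": "30", "system_time": "10"},
    {"name": "processes", "user_time": "900", "system_time": "100"},
]
summary = cpu_time_percentiles(rows)
print(summary)
```

Here the median looks healthy while the 95th percentile exposes the one host where the query is expensive.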


Query Deployment
Deploy new queries in a controlled manner.

Query Deployment
Deployment Strategies
Two major strategies for controlling deployment of new queries:
1. Segment hosts and deploy queries to groups of hosts.
2. Use the shard option of scheduled queries to slow roll queries within a host group.
Use these strategies together for the best control of rollouts.

Query Deployment
Group Hosts
• Segment hosts by risk tolerance for performance issues.
  ◦ Start with lower risk hosts.
• Different techniques can be used depending on the deployment/configuration strategy of osquery.
• With tools like Chef/Puppet/Ansible:
  ◦ Use the tooling to deploy different pack files to each group of hosts.
• With plain osquery:
  ◦ Use the discovery query feature of query packs to gate pack execution based on the results of dynamic queries.
• With Fleet:
  ◦ Use labels to segment hosts and target packs to labels.
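For the plain osquery case, a pack gated by a discovery query might be generated like this. The pack name, query name, and the choice to gate on Ubuntu hosts are illustrative assumptions. The pack's queries are expected to be scheduled only when every discovery query returns at least one row.

```python
import json

# Hypothetical pack: only hosts where the discovery query returns rows run it.
pack = {
    "discovery": [
        "SELECT 1 FROM os_version WHERE platform = 'ubuntu';",
    ],
    "queries": {
        "new_process_query": {
            "query": "SELECT name, path, pid FROM processes;",
            "interval": 3600,
        },
    },
}

pack_json = json.dumps(pack, indent=2)
print(pack_json)
```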

Query Deployment
Shard Queries
• Set the shard option in a scheduled query to enable the query on a subset of hosts that receive the pack.
  ◦ Shard is a percentage of hosts on which to enable the query.
    ▪ 0 - No hosts
    ▪ 100 - All hosts
• Increase the shard value as confidence in the query performance increases.
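Percentage sharding of this kind is usually deterministic: each host hashes to a fixed bucket, and the query is enabled when the bucket falls under the shard value. The sketch below shows that general technique only. The SHA-256 hash and bucket arithmetic are assumptions for illustration and may not match how osquery assigns hosts to shards.

```python
import hashlib


def host_bucket(hostname):
    """Map a hostname to a stable bucket in the range 0..99 (illustrative hashing)."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return digest[0] * 100 // 256


def query_enabled(hostname, shard):
    """True if a query with the given shard percentage runs on this host."""
    return host_bucket(hostname) < shard


# Hypothetical fleet of hosts.
hosts = [f"host-{n}.example.com" for n in range(200)]
for shard in (0, 10, 50, 100):
    enabled = sum(query_enabled(h, shard) for h in hosts)
    print(f"shard={shard}: enabled on {enabled} of {len(hosts)} hosts")
```

Because the bucket depends only on the hostname, raising the shard value adds hosts without removing any that already run the query.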

Query Deployment
Monitor Rollout
• Use each of the rollout techniques to begin sending the new query to hosts.
• Ensure that you have visibility (alerting, dashboards, etc.) into the performance of osquery on those systems.
• Deploy to more hosts as confidence increases.
• Good monitoring dashboards really pay off at this stage.


Investigate Performance
Use tooling to understand performance problems.

Investigate Performance
Tools
• Osquery provides tools that we can use to investigate the performance of queries.
• Profiling
  ◦ Use the profile script to preview the performance of query packs on the local machine.
• Query planning
  ◦ Use SQLite explain query plan to begin debugging performance problems.

Investigate Performance
Profiling

    $ ./tools/analysis/profile.py --config pack.json --shell /usr/bin/osqueryi
    Profiling query: select * from processes
    U:1 C:0 M:2 F:0 D:0 processes (1/1): utilization: 9.8 cpu_time: 0.099889228 memory: 18640896 fds: 4 duration: 0.5181262493133545
    Profiling query: select * from users join user_groups using (uid) join groups using (gid)
    U:2 C:1 M:2 F:0 D:2 user_groups (1/1): utilization: 28.299999999999997 cpu_time: 0.570734208 memory: 19369984 fds: 4 duration: 1.5286788940429688
    Profiling query: select * from time
    U:0 C:0 M:2 F:0 D:0 time (1/1): utilization: 5.35 cpu_time: 0.056201881999999995 memory: 16080896 fds: 4 duration: 0.5209510326385498
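When a pack contains many queries, it can help to rank the profiler's results. The sketch below parses result lines in the format shown above and sorts them by CPU time. The regular expression is inferred from this one sample of output, so treat it as an assumption and not a stable interface of the profile script.

```python
import re

# Pattern inferred from the sample output above; not a documented format.
RESULT = re.compile(
    r"(?P<name>\S+) \(\d+/\d+\): "
    r"utilization: (?P<utilization>[\d.]+) "
    r"cpu_time: (?P<cpu_time>[\d.]+) "
    r"memory: (?P<memory>\d+) "
    r"fds: (?P<fds>\d+) "
    r"duration: (?P<duration>[\d.]+)"
)


def parse_profile(text):
    """Return one dict per profiled query, most CPU time first."""
    results = []
    for match in RESULT.finditer(text):
        results.append({
            "name": match["name"],
            "utilization": float(match["utilization"]),
            "cpu_time": float(match["cpu_time"]),
            "memory": int(match["memory"]),
            "duration": float(match["duration"]),
        })
    return sorted(results, key=lambda r: r["cpu_time"], reverse=True)


output = """
U:1 C:0 M:2 F:0 D:0 processes (1/1): utilization: 9.8 cpu_time: 0.099889228 memory: 18640896 fds: 4 duration: 0.5181262493133545
U:2 C:1 M:2 F:0 D:2 user_groups (1/1): utilization: 28.299999999999997 cpu_time: 0.570734208 memory: 19369984 fds: 4 duration: 1.5286788940429688
U:0 C:0 M:2 F:0 D:0 time (1/1): utilization: 5.35 cpu_time: 0.056201881999999995 memory: 16080896 fds: 4 duration: 0.5209510326385498
"""
ranked = parse_profile(output)
for entry in ranked:
    print(entry["name"], entry["cpu_time"])
```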

Investigate Performance
Query Plan

    osquery> EXPLAIN QUERY PLAN
        ...> SELECT * FROM processes JOIN process_open_sockets USING (pid) JOIN hash USING (path)
        ...> WHERE pid IN (SELECT pid FROM processes WHERE uid = 0);
    +----+--------+---------+---------------------------------------------------------+
    | id | parent | notused | detail                                                  |
    +----+--------+---------+---------------------------------------------------------+
    | 4  | 0      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 62:            |
    | 8  | 0      | 0       | LIST SUBQUERY 1                                         |
    | 10 | 8      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 68:            |
    | 30 | 0      | 0       | SCAN TABLE process_open_sockets VIRTUAL TABLE INDEX 64: |
    | 38 | 0      | 0       | SCAN TABLE hash VIRTUAL TABLE INDEX 66:                 |
    +----+--------+---------+---------------------------------------------------------+

Investigate Performance
Query Plan

    osquery> EXPLAIN QUERY PLAN
        ...> SELECT * FROM processes JOIN process_open_sockets USING (pid) JOIN hash USING (path)
        ...> WHERE uid = 0;
    +----+--------+---------+---------------------------------------------------------+
    | id | parent | notused | detail                                                  |
    +----+--------+---------+---------------------------------------------------------+
    | 4  | 0      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 70:            |
    | 10 | 0      | 0       | SCAN TABLE process_open_sockets VIRTUAL TABLE INDEX 71: |
    | 18 | 0      | 0       | SCAN TABLE hash VIRTUAL TABLE INDEX 73:                 |
    +----+--------+---------+---------------------------------------------------------+
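The same comparison can be tried outside osquery with Python's sqlite3 module, which is handy for getting used to reading plans. In the sketch below, ordinary empty tables stand in for osquery's virtual tables, so the plan text will differ from the output above (and varies by SQLite version). The point is only to compare the subquery form against the direct filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Ordinary tables standing in for osquery's virtual tables (an assumption).
conn.executescript("""
    CREATE TABLE processes (pid INTEGER, uid INTEGER, path TEXT);
    CREATE TABLE process_open_sockets (pid INTEGER, remote_address TEXT);
""")


def plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


with_subquery = plan(
    "SELECT * FROM processes JOIN process_open_sockets USING (pid) "
    "WHERE pid IN (SELECT pid FROM processes WHERE uid = 0)"
)
direct_filter = plan(
    "SELECT * FROM processes JOIN process_open_sockets USING (pid) "
    "WHERE uid = 0"
)

print("With subquery:")
for step in with_subquery:
    print("  ", step)
print("Direct filter:")
for step in direct_filter:
    print("  ", step)
```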

Wrapping It Up
• Use the watchdog to constrain performance.
• Monitor performance and blacklists with the osquery_schedule table.
• Be strategic in rollout of new queries.
• Learn to use the tooling to evaluate performance.

Thank You!
Email - [email protected]
Twitter - @thezachw
Osquery Slack - @zwass
