Osquery Performance @ Scale

My talk from the osquery@scale conference.

Zach Wasserman

January 22, 2020

Transcript

  1. Osquery
    Performance
    @ Scale
    Strategies for managing osquery performance
    in production environments.

  2. Zach has been involved with osquery since the
    inception of the project at Facebook in 2014.
    He serves as a member of the Osquery Technical
    Steering Committee, and works with a variety of clients
    to integrate osquery into their operations as the Principal
    Engineer at Dactiv LLC.
    Zach believes that we can make security accessible to
    everyone through open-source tools.
    Zach Wasserman
    Principal Engineer
    Dactiv LLC
    Page 2
    @osqueryatscale
    @thezachw

  3. An Update From the Technical Steering Committee
    This content represents my opinion and is not an official position of the TSC or the Linux
    Foundation.

  4. Osquery in Transition
    ● Early 2014 - Osquery project is created by Mike Arpaia at Facebook.
    ● October 2014 - Facebook open-sources osquery.
    ● 2014-2019 - Facebook maintains osquery as an open-source project.
    ● June 2019 - Osquery project is handed to the Linux Foundation for a
    community support model.
    ● October 2019 - Osquery 4.0.2 becomes the first stable release of osquery
    as a project of the Linux Foundation.
    ● Current - Community is working to establish a regular release cycle, define
    a roadmap for the future, raise funds for the project, and help osquery
    grow.

  5. Get Involved
    ● TSC members and the public meet biweekly to discuss the status of the
    project and develop plans.
    ○ Next meeting: February 4, 10AM PST
    ○ Join #officehours in osquery Slack for announcements.
    ● Donate to the osquery project.
    ○ Funds are not yet earmarked, but raising funds will allow us to improve
    testing infrastructure and hire developers to work on osquery features
    and maintenance.
    ● Volunteer to test releases.
    ○ Organizations that can deploy beta versions of the agent can help
    ensure the quality of stable releases.

  6. Osquery Performance @ Scale
    The main event.

  7. Motivations
    ● As security practitioners, we need visibility into the state of the systems we
    manage.
    ● Resource utilization has a real impact on the bottom line of our business.
    ○ Production: Resource utilization is multiplied over each production
    server. More performance impact = higher cost.
    ○ Workstations: We need to ensure that security workloads do not
    interfere with employees doing their jobs.
    ● Osquery is built for performance, but it is easy to schedule queries that will
    have significant performance impacts on the system.

  8. Goals
    ● Limit the performance impact of osquery.
    ○ Osquery Watchdog
    ● Develop monitoring for resource consumption of queries.
    ○ The osquery_schedule table
    ● Deploy new queries in a controlled manner.
    ○ Host grouping and query sharding
    ● Investigate performance.
    ○ Profiling and SQLite explain query plan

  9. Osquery Watchdog
    Limit the performance impact of osquery.

  10. Osquery Watchdog
    The view from osqueryi
    osquery> SELECT path, pid, pgroup, parent, cmdline
    ...> FROM processes WHERE name = 'osqueryd';
    path = /usr/local/bin/osqueryd
    pid = 45472
    pgroup = 45472
    parent = 8938
    cmdline = /usr/local/bin/osqueryd osqueryd --flagfile=/osquery.flags
    path = /usr/local/bin/osqueryd
    pid = 45473
    pgroup = 45472
    parent = 45472
    cmdline = /usr/local/bin/osqueryd osqueryd

  11. Osquery Watchdog
    Background
    ● When we start osqueryd, we get two processes:
    ○ Parent process - The “watchdog”
    ○ Child process - The “worker”
    ● Potentially resource-intensive operations are performed in the worker
    process.
    ○ Run queries, output logs, etc.
    ● The watchdog process checks the utilization stats for the worker on an
    interval.
    ○ Resource utilization limits exceeded -> Watchdog kills/respawns
    worker
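As a rough illustration of the parent/child relationship described above, the following sketch classifies osqueryd process rows: a row whose parent is itself an osqueryd pid is treated as a worker. The function name `split_watchdog_and_workers` is hypothetical, and the rows reuse the pids from the osqueryi output on the earlier slide (values are strings, as in osquery's JSON output).

```python
def split_watchdog_and_workers(rows):
    """Split osqueryd process rows into (watchdogs, workers).

    A row counts as a worker when its parent pid belongs to another
    osqueryd process in the same result set.
    """
    pids = {row["pid"] for row in rows}
    watchdogs = [row for row in rows if row["parent"] not in pids]
    workers = [row for row in rows if row["parent"] in pids]
    return watchdogs, workers


# Rows shaped like the result of:
#   SELECT pid, pgroup, parent FROM processes WHERE name = 'osqueryd';
rows = [
    {"pid": "45472", "pgroup": "45472", "parent": "8938"},
    {"pid": "45473", "pgroup": "45472", "parent": "45472"},
]

watchdogs, workers = split_watchdog_and_workers(rows)
print("watchdog pids:", [row["pid"] for row in watchdogs])
print("worker pids:", [row["pid"] for row in workers])
```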

  12. Osquery Watchdog
    Managing query execution
    What happens when a query runs on the worker?
    1. Worker writes to RocksDB the name of the query being run.
    2. Query executes.
    3. Worker removes the record of the running query.

  13. Osquery Watchdog
    Managing query execution
    Now suppose the watchdog kills the worker during query execution.
    1. Worker writes to RocksDB the name of the query being run.
    2. Query begins executing.
    3. Watchdog kills worker.
    4. Worker respawns, reads RocksDB, and sees that the previous worker was
    in the middle of execution.
    5. Worker logs the failure during query execution and “blacklists” the query.
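The sequence above can be sketched as a toy model. This is not osquery's implementation: a plain dict stands in for RocksDB, the key name `executing_query` and the `Worker` and `WorkerKilled` names are hypothetical, and the blacklist is kept in memory purely for illustration.

```python
class WorkerKilled(Exception):
    """Stands in for the watchdog killing the worker mid-query."""


class Worker:
    RUNNING_KEY = "executing_query"  # hypothetical key name

    def __init__(self, store):
        self.store = store  # dict standing in for the persistent RocksDB store
        self.blacklist = set()
        # A leftover marker means the previous worker died during this query.
        leftover = store.pop(self.RUNNING_KEY, None)
        if leftover is not None:
            print(f"Scheduled query may have failed: {leftover}")
            self.blacklist.add(leftover)

    def run(self, name, query_fn):
        if name in self.blacklist:
            return None  # blacklisted queries are skipped
        self.store[self.RUNNING_KEY] = name  # 1. record the query name
        result = query_fn()                  # 2. execute the query
        del self.store[self.RUNNING_KEY]     # 3. clear the marker
        return result


def expensive():
    raise WorkerKilled("killed by watchdog")


store = {}
first = Worker(store)
try:
    first.run("expensive_query", expensive)
except WorkerKilled:
    pass  # the marker is still in the store at this point

second = Worker(store)  # the respawned worker finds the marker
print("blacklisted:", sorted(second.blacklist))
```

Because the marker is written before execution and removed only afterwards, a crash at any point during the query leaves evidence behind for the next worker.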

  14. Osquery Watchdog
    Managing query execution
    I0121 08:44:48.398947 270000128 scheduler.cpp:96] Executing scheduled query expensive_query: select 1 from users, users, users, users, users, users
    W0121 08:45:13.591068 127172608 watcher.cpp:331] osqueryd worker (71861) stopping: Maximum sustainable CPU utilization limit exceeded: 21
    I0121 08:45:13.996376 127172608 watcher.cpp:583] osqueryd watcher (71860) executing worker (71928)
    I0121 08:45:14.841640 163079616 init.cpp:415] osquery worker initialized [watcher=71860]
    I0121 08:45:14.842711 163079616 rocksdb.cpp:131] Opening RocksDB handle: /tmp/osquery.db
    ...
    W0121 08:45:23.252063 163079616 config.cpp:317] Scheduled query may have failed: expensive_query

  15. Osquery Watchdog
    Query Blacklisting
    ● Queries are “blacklisted” when execution fails.
    ● Blacklisted queries are removed from the schedule for 24 hours.
    ● This prevents crash-looping and avoids wasting resources on queries that
    would be killed anyway.
    ● Observe query blacklist status using the blacklisted column of the
    osquery_schedule table.
    ○ More on this later.

  16. Osquery Watchdog
    core/watcher.cpp:467
    return SQL::selectFrom(
    {"parent", "user_time", "system_time",
    "resident_size"},
    "processes",
    "pid",
    EQUALS,
    INTEGER(p));

  17. Osquery Watchdog
    Configuring the Watchdog
    ● The watchdog is enabled in osquery by default.
    ○ Default settings: “normal”
    ■ CPU - Over 10% for up to 12 seconds
    ■ Memory - Up to 200MB
    ○ --watchdog_level=1: “restrictive”
    ■ CPU - Over 5% for up to 6 seconds
    ■ Memory - Up to 100MB
    ○ --watchdog_level=-1: “off”
    ■ Performance limits are disabled
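To make the "over X% for up to Y seconds" wording concrete, here is a simplified sketch of a sustained-utilization check. It is an illustration only, not the logic in watcher.cpp: the `Limits` class, the `should_kill` function, and the 3-second sampling interval are assumptions, while the threshold values are the ones listed on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    cpu_percent: float
    sustained_seconds: float
    memory_mb: float


# Values as listed on the slide.
NORMAL = Limits(cpu_percent=10, sustained_seconds=12, memory_mb=200)
RESTRICTIVE = Limits(cpu_percent=5, sustained_seconds=6, memory_mb=100)


def should_kill(samples, limits, interval_seconds=3):
    """Decide whether a worker exceeded its limits.

    samples is a list of (cpu_percent, memory_mb) tuples taken every
    interval_seconds (an assumed sampling interval). Returns True if memory
    ever exceeds the limit, or if CPU stays above the threshold for longer
    than sustained_seconds without dropping back below it.
    """
    seconds_over = 0.0
    for cpu, memory in samples:
        if memory > limits.memory_mb:
            return True
        seconds_over = seconds_over + interval_seconds if cpu > limits.cpu_percent else 0.0
        if seconds_over > limits.sustained_seconds:
            return True
    return False


burst = [(50, 50)] * 3 + [(1, 50)] * 3  # 9 seconds of high CPU, then idle
sustained = [(50, 50)] * 5              # 15 seconds of high CPU

print(should_kill(burst, NORMAL))       # short burst tolerated at "normal"
print(should_kill(burst, RESTRICTIVE))  # same burst exceeds "restrictive"
print(should_kill(sustained, NORMAL))
```

The point of the time window is that short bursts are tolerated; only sustained utilization triggers a kill.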

  18. Osquery Watchdog
    Configuring the Watchdog
    ● Settings can be customized to specific needs.
    ● --watchdog_utilization_limit
    ○ Threshold percentage of CPU
    ○ Time allowed over the threshold is defined by the intervals from
    --watchdog_level
    ■ --watchdog_level=0 - 12 seconds above limit
    ■ --watchdog_level=1 - 6 seconds above limit
    ● --watchdog_memory_limit
    ○ Maximum memory in MB
    ● Tradeoff: Visibility <-> Performance safety

  19. Osquery Watchdog
    Extensions
    ● It is also possible to use the watchdog to limit utilization by osquery
    extensions.
    ○ Extensions typically run as child processes spawned by osqueryd (with
    the --extensions_autoload flag).
    ○ Use --enable_extensions_watchdog to turn on this feature.

  20. Monitoring
    Use osquery to monitor osquery.

  21. Monitoring
    osquery_schedule
    ● Osquery itself provides excellent facilities for monitoring osquery
    performance.
    ● The osquery_schedule table exposes performance information for all
    scheduled queries.
    ● Performance information is collected by measuring the change in the worker
    process's CPU time and memory over the course of each execution.
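Since osquery_schedule reports time totals alongside an execution count, a per-execution average is often easier to compare across queries. A minimal sketch, assuming rows with the `executions`, `user_time`, and `system_time` columns of osquery_schedule as strings (as in JSON output); the sample row and the function name `cpu_time_per_execution` are made up for illustration.

```python
def cpu_time_per_execution(row):
    """Average user + system time per execution for one osquery_schedule row."""
    executions = int(row["executions"])
    if executions == 0:
        return 0.0  # the query has not run yet
    total = int(row["user_time"]) + int(row["system_time"])
    return total / executions


# Hypothetical row; the values are invented for the example.
row = {
    "name": "example_query",
    "executions": "4",
    "user_time": "120",
    "system_time": "40",
}
print(cpu_time_per_execution(row))
```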

  22. osquery_schedule
    Schema
    What information is available in the
    osquery_schedule table?

  23. Monitoring
    Blacklisted Queries
    ● SELECT * FROM osquery_schedule WHERE blacklisted = 1;
    ● Returns information about all of the queries that are currently blacklisted.
    ● Depending on your requirements, this may be worth alerting on!
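One way to turn this query into an alert is to post-process its JSON results. The sketch below assumes the results arrive as a JSON array of objects with string values (the shape produced by `osqueryi --json`); the sample payload and the function name `blacklisted_queries` are hypothetical, and how the alert is delivered is left out.

```python
import json


def blacklisted_queries(osquery_json):
    """Return the sorted names of queries whose blacklisted column is '1'."""
    rows = json.loads(osquery_json)
    return sorted(row["name"] for row in rows if row.get("blacklisted") == "1")


# Hypothetical results of: SELECT name, blacklisted FROM osquery_schedule;
sample = json.dumps([
    {"name": "expensive_query", "blacklisted": "1"},
    {"name": "cheap_query", "blacklisted": "0"},
])

names = blacklisted_queries(sample)
if names:
    print("ALERT: blacklisted queries:", ", ".join(names))
```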

  24. Monitoring
    Dashboards
    ● Consider using the data from osquery_schedule to create osquery
    performance dashboards.
    ● Useful charts:
    ○ Blacklisted queries
    ○ Memory usage of queries
    ○ System + user time usage of queries
    ● Visualizing middle and top percentiles can help find outliers.
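To show why percentiles help, the following sketch computes nearest-rank percentiles over per-host memory figures for a single query. The numbers are invented, and the `percentile` helper is a simple stand-in for whatever the dashboarding tool provides.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical average_memory readings (in MB) for one query across ten hosts.
memory_mb = [10, 12, 11, 13, 12, 11, 10, 12, 11, 95]

p50 = percentile(memory_mb, 50)
p95 = percentile(memory_mb, 95)
print(f"p50={p50} MB, p95={p95} MB")
```

The median looks healthy here, while the top percentile exposes the single host where the query behaves very differently.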

  25. Monitoring

  26. Monitoring

  27. Monitoring

  28. Query Deployment
    Deploy new queries in a controlled manner.

  29. Query Deployment
    Deployment Strategies
    Two major strategies for controlling deployment of new queries:
    1. Segment hosts and deploy queries to groups of hosts.
    2. Use the shard option of scheduled queries to roll queries out gradually
    within a host group.
    Use these strategies together for the best control of rollouts.

  30. Query Deployment
    Group Hosts
    ● Segment hosts by risk tolerance for performance issues.
    ○ Start with lower risk hosts.
    ● Different techniques can be used depending on the
    deployment/configuration strategy of osquery.
    ● With tools like Chef/Puppet/Ansible:
    ○ Use the tooling to deploy different pack files to each group of hosts.
    ● With plain osquery:
    ○ Use the discovery query feature of query packs to gate pack execution
    based on the results of dynamic queries.
    ● With Fleet:
    ○ Use labels to segment hosts and target packs to labels.
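For the plain osquery case, a pack gated by a discovery query might look like the sketch below. The pack contents (query names, SQL, interval) are hypothetical examples, and `pack_is_active` is an illustrative helper that models the gating rule as "every discovery query returned at least one row".

```python
import json

# Hypothetical pack: it should only run on hosts where the discovery query
# returns rows (here, hosts whose platform is reported as 'ubuntu').
pack = {
    "discovery": [
        "SELECT 1 FROM os_version WHERE platform = 'ubuntu';",
    ],
    "queries": {
        "root_processes": {
            "query": "SELECT pid, name FROM processes WHERE uid = 0;",
            "interval": 3600,
        },
    },
}


def pack_is_active(discovery_row_counts):
    """Model of the gate: every discovery query must return at least one row."""
    return all(count > 0 for count in discovery_row_counts)


print(json.dumps(pack, indent=2))
print(pack_is_active([1]))     # discovery query matched on this host
print(pack_is_active([1, 0]))  # one discovery query returned nothing
```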

  31. Query Deployment
    Shard Queries
    ● Set the shard option in a scheduled query to enable the query on a subset
    of hosts that receive the pack.
    ○ Shard is a percentage of hosts on which to enable the query.
    ■ 0 - No hosts
    ■ 100 - All hosts
    ● Increase the shard value as confidence in the query performance increases.
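A staged rollout can be expressed as a series of pack revisions that differ only in the shard value. In this sketch the query name `new_query`, its SQL, the interval, and the rollout stages are all hypothetical; only the `shard` key itself comes from the slide.

```python
import json


def pack_with_shard(shard):
    """Build a pack whose single query is enabled on `shard` percent of hosts."""
    if not 0 <= shard <= 100:
        raise ValueError("shard must be a percentage between 0 and 100")
    return {
        "queries": {
            "new_query": {
                "query": "SELECT * FROM users;",
                "interval": 3600,
                "shard": shard,
            },
        },
    }


# Hypothetical rollout stages; widen only after monitoring looks healthy.
for stage in (1, 10, 50, 100):
    print(json.dumps(pack_with_shard(stage)))
```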

  32. Query Deployment
    Monitor Rollout
    ● Use each of the rollout techniques to begin sending the new query to hosts.
    ● Ensure that you have visibility (alerting, dashboards, etc.) into the
    performance of osquery on those systems.
    ● Deploy to more hosts as confidence increases.
    ● Good monitoring dashboards really pay off at this stage.

  33. Query Deployment

  34. Investigate Performance
    Use tooling to understand performance problems.

  35. Investigate Performance
    Tools
    ● Osquery provides tools that we can use to investigate the performance of
    queries.
    ● Profiling
    ○ Use the profile script to preview the performance of query packs on the
    local machine.
    ● Query planning
    ○ Use SQLite explain query plan to begin debugging performance
    problems.

  36. Investigate Performance
    Profiling
    $ ./tools/analysis/profile.py --config pack.json --shell /usr/bin/osqueryi
    Profiling query: select * from processes
    U:1 C:0 M:2 F:0 D:0 processes (1/1): utilization: 9.8 cpu_time: 0.099889228 memory: 18640896 fds: 4 duration: 0.5181262493133545
    Profiling query: select * from users join user_groups using (uid) join groups using (gid)
    U:2 C:1 M:2 F:0 D:2 user_groups (1/1): utilization: 28.299999999999997 cpu_time: 0.570734208 memory: 19369984 fds: 4 duration: 1.5286788940429688
    Profiling query: select * from time
    U:0 C:0 M:2 F:0 D:0 time (1/1): utilization: 5.35 cpu_time: 0.056201881999999995 memory: 16080896 fds: 4 duration: 0.5209510326385498
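If the profiler output needs to feed a report or a CI check, a result line can be parsed once any wrapped output is joined into a single line. The sketch below is based only on the line shape shown in this slide; the regular expressions and the function name `parse_profile_line` are assumptions, and the single-letter fields presumably rate utilization, CPU time, memory, file descriptors, and duration.

```python
import re

LINE_RE = re.compile(
    r"^(?P<flags>(?:[UCMFD]:\d+\s+)+)"           # e.g. "U:1 C:0 M:2 F:0 D:0 "
    r"(?P<name>\S+)\s+"                          # query name
    r"\((?P<run>\d+)/(?P<total>\d+)\):\s+"       # e.g. "(1/1):"
    r"(?P<metrics>.*)$"                          # "utilization: 9.8 cpu_time: ..."
)


def parse_profile_line(line):
    """Parse one profiler result line into name, flags, and metrics."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized profile line: {line!r}")
    flags = {key: int(value) for key, value in re.findall(r"([UCMFD]):(\d+)", match["flags"])}
    metrics = {key: float(value) for key, value in re.findall(r"(\w+):\s+([\d.]+)", match["metrics"])}
    return {"name": match["name"], "flags": flags, "metrics": metrics}


line = (
    "U:1 C:0 M:2 F:0 D:0 processes (1/1): utilization: 9.8 cpu_time: 0.099889228 "
    "memory: 18640896 fds: 4 duration: 0.5181262493133545"
)
parsed = parse_profile_line(line)
print(parsed["name"], parsed["flags"], parsed["metrics"]["utilization"])
```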

  37. Investigate Performance
    Query Plan
    osquery> EXPLAIN QUERY PLAN
    ...> SELECT * FROM processes JOIN process_open_sockets USING (pid) JOIN hash USING (path)
    ...> WHERE pid IN (SELECT pid FROM processes WHERE uid = 0);
    +----+--------+---------+---------------------------------------------------------+
    | id | parent | notused | detail                                                  |
    +----+--------+---------+---------------------------------------------------------+
    | 4  | 0      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 62:            |
    | 8  | 0      | 0       | LIST SUBQUERY 1                                         |
    | 10 | 8      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 68:            |
    | 30 | 0      | 0       | SCAN TABLE process_open_sockets VIRTUAL TABLE INDEX 64: |
    | 38 | 0      | 0       | SCAN TABLE hash VIRTUAL TABLE INDEX 66:                 |
    +----+--------+---------+---------------------------------------------------------+

  38. Investigate Performance
    Query Plan
    osquery> EXPLAIN QUERY PLAN
    ...> SELECT * FROM processes JOIN process_open_sockets USING (pid) JOIN hash USING (path)
    ...> WHERE uid = 0;
    +----+--------+---------+---------------------------------------------------------+
    | id | parent | notused | detail                                                  |
    +----+--------+---------+---------------------------------------------------------+
    | 4  | 0      | 0       | SCAN TABLE processes VIRTUAL TABLE INDEX 70:            |
    | 10 | 0      | 0       | SCAN TABLE process_open_sockets VIRTUAL TABLE INDEX 71: |
    | 18 | 0      | 0       | SCAN TABLE hash VIRTUAL TABLE INDEX 73:                 |
    +----+--------+---------+---------------------------------------------------------+
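The same comparison can be tried outside osquery with Python's built-in sqlite3 module. This sketch uses ordinary in-memory tables with made-up columns as stand-ins for osquery's virtual tables, so the plan text will differ from the osqueryi output above (and varies by SQLite version); the idea of comparing the plans of two equivalent queries is the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Ordinary tables standing in for osquery's virtual tables (simplified columns).
conn.executescript(
    """
    CREATE TABLE processes (pid INTEGER, uid INTEGER, path TEXT);
    CREATE TABLE process_open_sockets (pid INTEGER, remote_address TEXT);
    """
)


def query_plan(sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN for a statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


with_subquery = query_plan(
    "SELECT * FROM processes JOIN process_open_sockets USING (pid) "
    "WHERE pid IN (SELECT pid FROM processes WHERE uid = 0)"
)
direct = query_plan(
    "SELECT * FROM processes JOIN process_open_sockets USING (pid) "
    "WHERE uid = 0"
)

print("with subquery:")
for step in with_subquery:
    print("  ", step)
print("direct filter:")
for step in direct:
    print("  ", step)
```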

  39. Wrapping It Up
    ● Use the watchdog to constrain performance.
    ● Monitor performance and blacklists with the osquery_schedule table.
    ● Be strategic in rollout of new queries.
    ● Learn to use the tooling to evaluate performance.

  40. Thank You!
    Email - [email protected]
    Twitter - @thezachw
    Osquery Slack - @zwass