
Using Julia in an Anomaly Detection Pipeline (JuliaCon 2024)

Julia is a great multi-purpose language, but it also fits in as a component in a larger multi-language ecosystem. At Akamai, we use Julia as part of our data pipeline to do Anomaly Detection and Alerting on web performance data.

In this talk, I'll cover the tasks delegated to Julia as well as how it fits into the rest of our development and operations stack.

Julia is very good at running data analysis on columnar and temporal data, i.e., the kind of data we have a lot of where I work. We have packages for regression analysis, hypothesis testing, signal processing, and more, allowing our development team to focus on business logic and data pipelines.

In this talk, we'll cover how our Data Scientists use Julia to analyze data and develop algorithms that can then be operationalized into a real-time data pipeline. We'll also see how Julia complements a Java-based web application that handles data collection and alerting.

Philip Tellis

July 10, 2024


Transcript

  1. • Bread was rationed: 1 kg loaf per person
     • Bread is handmade, so isn’t exactly 1 kg
     • Normal distribution centered below 1 kg
     • Showed that the baker was skimming
     • Baker adjusted process so loaves measured over 1 kg
     • Distribution showed that the only change was in who received the heavier loaves
  2. Background on mPulse
     • We collect web performance data from end user browsers using the boomerang JavaScript library.
     • This is beaconed back to our performance analytics application – mPulse – written in Java (there’s also a Julia interface to the API).
     • The data is cleaned, filtered, sorted, and streamed to various backend tasks mostly written in Java for storage, analysis, visualization, and alerting.
     • We also have a Julia backend task to do advanced data analysis and visualization.
  3. High Level View
     [Architecture diagram. Labels: User, Website, Page, resources, BOOMR, beacon, mPulse, Streaming Processing, storage, Predefined dashboards, Advanced Analysis, Customer Alerts: Email, Slack, etc.]
  4. Timeline
     • 2010: boomerang publicly released to collect performance data
     • 2011: precursor to mPulse to analyze the data
     • 2014: Added Advanced Analytics in Julia to get more meaningful insights
       ◦ Started with Julia 0.2
       ◦ 2015: migrated to Julia 0.4
     • 2016: Added Anomaly Detection built in Julia
     • 2020-2023: Migrated from Julia 0.4 to Julia 1.6
       ◦ Come to my lightning talk on Friday 11:40-11:50 for a zip through the migration
  5. Our Data
     • Mostly time series (though we can ignore time for interesting views).
     • Web performance data related to real user experiences.
     • Data about multiple events during the page load process; analyze them independently or in relation to each other.
     • Real Users implies an uncontrolled environment, and we can’t even be sure that all those users are human with non-malicious intent.
  6. And of course time series
     [Chart] This isn’t really outside expected bounds… But it is for this time of day
  7. • The Java process has a sliding window of in-memory data.
     • Used for dashboards, API calls, and alerts.
     • Java is good at applying a set of quick arithmetic rules to a short window of data.
     • Not so good at figuring out the rules on a very large amount of data.
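
As a rough picture of "quick arithmetic rules on a short window", here is a minimal sketch. It is written in Python rather than Java to keep the examples in one language, and the rule shape (fixed low/high bounds plus a minimum number of violations inside the window) is an assumption for illustration, not the actual mPulse rule format.

```python
from collections import deque

class SlidingWindowRule:
    """Keep the last `size` values and flag when too many are out of bounds."""

    def __init__(self, size, low, high, min_violations):
        self.window = deque(maxlen=size)  # old values drop off automatically
        self.low = low
        self.high = high
        self.min_violations = min_violations

    def observe(self, value):
        """Add a value and return True if the rule fires on the current window."""
        self.window.append(value)
        violations = sum(
            1 for v in self.window if not (self.low <= v <= self.high)
        )
        return violations >= self.min_violations

# Hypothetical rule: fire when 3 of the last 5 values fall outside 0..100.
rule = SlidingWindowRule(size=5, low=0, high=100, min_violations=3)
fired = [rule.observe(v) for v in (50, 150, 160, 40, 170, 30, 20)]
```

Checking a rule like this is cheap per data point; the expensive part is choosing `low`, `high`, and `min_violations` from a large history, which is the work the slide describes as a poor fit for the Java process.
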
  8. • Julia is great at processing large amounts of columnar data
     • Fast matrix multiplications
     • Modules for Fourier Analysis, Regression, Hypothesis Testing
     • Can try out multiple algorithms in parallel
     • But, not something you want to start up on every request
  9. [Architecture diagram, as in the High Level View, with two added components. Labels: User, Website, Page, resources, BOOMR, beacon, mPulse, Streaming Processing, storage, Predefined dashboards, Advanced Analysis, Customer Alerts: Email, Slack, etc., Anomaly Modeler, Anomaly Rules]
  10. IPC
      [Diagram. Labels: mPulse, storage, Customer Alerts: Email, Slack, etc., Anomaly Modeler, Anomaly Rules (REST API), Jupyter, JSON over HTTPS, JSON over ZeroMQ]
  11. Inter-process communication
      • Modified Jupyter with 3 execution modes.
        ◦ Notebook mode is standard, with our own authentication for multi-user notebooks.
        ◦ Widgets: Only accepts a set of filters (no code), and responds with data
        ◦ Anomaly Modeling: Only accepts a set of filters (no code), and does not respond.
      • AuthN/AuthZ is kind of like OAuth, but not exactly.
      • In Anomaly Modeling mode, Julia acknowledges the request, and then processes it asynchronously.
      • Once the best ruleset is determined, it writes that back to the mPulse engine.
      • mPulse immediately starts using the new rules on its in-memory data stream.
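
The acknowledge-then-process pattern described for Anomaly Modeling mode can be sketched as follows. This is a Python illustration under stated assumptions, not the Julia/Jupyter implementation: the class name, the `fit_rules` and `write_back` callbacks, the response shape, and the example ruleset are all hypothetical, and a plain in-process queue stands in for the real transport.

```python
import logging
import queue
import threading

class AnomalyModeler:
    """Acknowledge a modeling request right away, then fit rules in the background."""

    def __init__(self, fit_rules, write_back):
        self._jobs = queue.Queue()
        self._fit_rules = fit_rules      # hypothetical: filters -> ruleset
        self._write_back = write_back    # hypothetical: publish ruleset to the engine
        threading.Thread(target=self._run, daemon=True).start()

    def handle_request(self, filters):
        """Accept only a set of filters (no code) and return an acknowledgement."""
        if not isinstance(filters, dict):
            return {"status": "rejected"}
        self._jobs.put(filters)
        return {"status": "accepted"}

    def _run(self):
        while True:
            filters = self._jobs.get()
            try:
                self._write_back(self._fit_rules(filters))
            except Exception:
                logging.exception("modeling failed for %r", filters)
            finally:
                self._jobs.task_done()

    def wait_idle(self):
        """Block until all queued jobs are done (useful for demos and tests)."""
        self._jobs.join()

published = []
modeler = AnomalyModeler(
    fit_rules=lambda f: {"metric": f["metric"], "upper": 2500},
    write_back=published.append,
)
ack = modeler.handle_request({"metric": "page_load_time"})
modeler.wait_idle()
```

The caller gets its acknowledgement before any modeling happens; the ruleset only reaches `published` once the background worker has finished.
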
  12. Converting JSON Config to kwargs
      • Our Julia application receives a JSON configuration object.
      • It then checks to see if the keys and types on the JSON object match any of the fixed or keyword arguments to our processor method.
      • We built a module called MethodInspector.jl to do this.
      • Four methods:
        ◦ arg_names
        ◦ arg_types
        ◦ kwarg_names
        ◦ kwarg_types
      • Works in Julia 1.6+
      • Anything unknown is discarded and a warning logged.
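
To make the idea concrete, here is a rough Python analogue of matching JSON keys and types against a function's arguments. It does not reproduce the MethodInspector.jl API; it uses Python's standard `inspect` module instead, and `kwargs_from_json`, `detect`, and the config keys are hypothetical names chosen for the example.

```python
import inspect
import json
import logging

logger = logging.getLogger("config")

def _type_matches(value, annotation):
    """Check a JSON value against a parameter annotation, if there is a usable one."""
    if annotation is inspect.Parameter.empty or not isinstance(annotation, type):
        return True  # nothing to check against, accept as-is
    if annotation is float:
        # JSON does not distinguish 3 from 3.0, so accept ints for floats.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return isinstance(value, annotation)

def kwargs_from_json(func, raw_json):
    """Keep only the config entries that match func's argument names and types."""
    config = json.loads(raw_json)
    params = inspect.signature(func).parameters
    accepted = {}
    for key, value in config.items():
        param = params.get(key)
        if param is None:
            logger.warning("discarding unknown key %r", key)
            continue
        if not _type_matches(value, param.annotation):
            logger.warning("discarding %r: wrong type %s", key, type(value).__name__)
            continue
        accepted[key] = value
    return accepted

# Hypothetical processor method with one fixed and two keyword arguments.
def detect(metric: str, *, window: int = 60, sensitivity: float = 3.0):
    return (metric, window, sensitivity)

raw = '{"metric": "page_load_time", "window": 120, "sensitivity": "high", "colour": "red"}'
kwargs = kwargs_from_json(detect, raw)
result = detect(**kwargs)
```

Here `sensitivity` is dropped for having the wrong type and `colour` for being unknown, each with a logged warning, so `detect` falls back to its default sensitivity.
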
  13. Development Process
      • Data Scientists work with the data in Julia, prototyping different algorithms for use in anomaly detection.
      • Once a stable proof of concept is developed, the team then works with engineers to implement a lightweight version of this algorithm in Java.
      • The two teams also define a specification for sharing model parameters between components.
      • At runtime, an algorithm is only used if both components are capable of executing it on the available data.
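
The runtime check in the last bullet amounts to a capability intersection. A minimal sketch, assuming each component advertises the algorithm names it implements and each algorithm has a minimum amount of data it needs; the function, the algorithm names, and the thresholds are all hypothetical and not taken from the talk.

```python
def usable_algorithms(modeler_supported, engine_supported, available_points, min_points):
    """Return algorithms both components implement and that have enough data.

    min_points maps an algorithm name to the number of data points it needs;
    algorithms missing from the map are assumed to have no minimum.
    """
    common = set(modeler_supported) & set(engine_supported)
    return sorted(
        name for name in common
        if available_points >= min_points.get(name, 0)
    )

# Hypothetical capabilities: the modeler knows one algorithm the engine lacks.
modeler = {"threshold", "seasonal", "regression"}
engine = {"threshold", "seasonal"}
needs = {"threshold": 10, "seasonal": 1000}

few_points = usable_algorithms(modeler, engine, 500, needs)
many_points = usable_algorithms(modeler, engine, 5000, needs)
```

With 500 data points only `threshold` qualifies; with 5000 both shared algorithms do, while `regression` is never used because only one side implements it.
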
  14. Summary
      • As with bread, preparing the dough (the model) takes way more time than consuming it.
      • Use the right tool for each step of the way.
      • Communicate your ingredients and process.
      • Allow function calls to be mapped to JSON configs.
  15. Thank You!
      • JuliaStats
        ◦ Clustering, Distances, Distributions, GLM, Loess, Statistics, StatsBase
      • DSP.jl
      • DataFrames.jl
      • DataStructures.jl
      • JuliaLang!
      Try our packages
      • MethodInspector.jl
      • CurlHTTP.jl
      • NelsonRules.jl
      • AnomalyBenchmark.jl
      • mPulseAPI.jl