
Denis Batalov (Amazon Web Services), Have your metrics gone crazy? An anomaly detection algorithm and Streaming SQL will help!, CodeFest 2017

CodeFest
February 01, 2018


https://2017.codefest.ru/lecture/1220

I will talk about the new anomaly detection algorithm Random Cut Forest, as well as the Streaming SQL concept for processing streaming data in real time, and show an example of using both technologies in the Amazon Kinesis Analytics service.


Transcript

  1. The anomaly detection algorithm and Streaming SQL in Amazon Kinesis Analytics

     Denis Batalov, PhD, @dbatalov, Sr. Solutions Architect, ML and AI specialist, Amazon Web Services, Luxembourg
  2. Today you will learn about: 1. The Random Cut Forest anomaly detection

     algorithm 2. The Amazon EC2 Spot market for virtual machines 3. Streaming SQL for stream processing 4. Detecting Spot-market price anomalies with Amazon Kinesis Analytics
  3. Random Cut Tree. Repeat the cuts: the cutting ends only

     when all points are isolated. Slide captions: cutting the long side, lots of data, an unlucky cut.
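     Not from the deck: a minimal Python sketch of this construction, assuming distinct numeric points given as tuples. The cut dimension is chosen with probability proportional to the bounding-box side length (so the long side is cut more often), the cut position is uniform along that side, and cutting repeats until every point sits alone in its own leaf.

     import random

     def build_random_cut_tree(points):
         # Leaf: a single isolated point ends the recursion.
         if len(points) == 1:
             return {"point": points[0]}

         dims = len(points[0])
         lows = [min(p[d] for p in points) for d in range(dims)]
         highs = [max(p[d] for p in points) for d in range(dims)]
         sides = [highs[d] - lows[d] for d in range(dims)]

         # Pick the dimension in proportion to its side length ("cut the long
         # side"), then cut at a uniformly random position along it.
         while True:
             dim = random.choices(range(dims), weights=sides, k=1)[0]
             cut = random.uniform(lows[dim], highs[dim])
             left = [p for p in points if p[dim] <= cut]
             right = [p for p in points if p[dim] > cut]
             if left and right:      # an unlucky cut separates nothing; redraw
                 break

         return {"dim": dim, "val": cut, "bbox": (lows, highs),
                 "left": build_random_cut_tree(left),
                 "right": build_random_cut_tree(right)}

     # Example: isolate a handful of 2-D points.
     tree = build_random_cut_tree([(1.0, 2.0), (1.5, 2.1), (9.0, 8.5), (1.2, 1.9)])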
  4. Random sampling from a stream: "reservoir sampling" [Vitter]. How do you keep a random

     sample of 5 values from a stream? Keep the n-th value with probability 5/n and discard it with probability (n - 5)/n; the fractions on the slide are 5/6 and 1/6, then 5/7 and 2/7.
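     Not from the deck: a short Python sketch of the sampling rule above. The n-th value from the stream is kept with probability 5/n (5/6, then 5/7, ...) and discarded otherwise; a kept value evicts a uniformly chosen value already in the reservoir, which keeps the sample uniform over everything seen so far.

     import random

     def reservoir_sample(stream, k=5):
         # Maintain a uniform random sample of k values from a stream of unknown length.
         reservoir = []
         for n, value in enumerate(stream, start=1):
             if n <= k:
                 reservoir.append(value)                 # the first k values always fit
             elif random.random() < k / n:               # keep the n-th value with probability k/n
                 reservoir[random.randrange(k)] = value  # it replaces a uniformly chosen old value
             # otherwise the new value is discarded (probability (n - k) / n)
         return reservoir

     print(reservoir_sample(range(1000), k=5))           # e.g. [314, 17, 921, 5, 608]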
  5. Insert operation: Case I. Start at the root node. If the

     value lies inside the contour (the node's bounding box), descend the tree along the corresponding branch.
  6. Anomaly score. A value is anomalous if inserting it into the tree

     significantly increases the size of the tree, that is, the sum of the lengths of all branches (or the description length of the data). A normal value barely changes it.
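     Not from the deck: a rough Monte Carlo sketch of this score in Python, reusing the tree structure from the sketch under slide 3. The candidate point descends while it stays inside a node's bounding box (Case I of the insert); once a random cut over the box extended with the point separates it, the growth in total branch length is counted. A point cut off near the root pushes a large subtree one level deeper, so its estimated growth, and hence its anomaly score, is large.

     import random

     def num_leaves(node):
         return 1 if "point" in node else num_leaves(node["left"]) + num_leaves(node["right"])

     def insertion_growth(node, p, depth=0):
         # Estimated increase in tree size (sum of branch lengths) if p were inserted.
         if "point" in node:
             return depth + 2                      # a pair of new leaves replaces one leaf
         lows, highs = node["bbox"]
         lo = [min(a, x) for a, x in zip(lows, p)]
         hi = [max(b, x) for b, x in zip(highs, p)]
         sides = [b - a for a, b in zip(lo, hi)]
         dim = random.choices(range(len(p)), weights=sides, k=1)[0]
         cut = random.uniform(lo[dim], hi[dim])
         if (cut < lows[dim] and p[dim] <= cut) or (cut > highs[dim] and p[dim] > cut):
             # The cut separates p from the whole subtree: a new leaf appears at
             # depth + 1 and every existing leaf below moves one level down.
             return depth + 1 + num_leaves(node)
         # Case I: p lies inside the contour, so descend along the node's own cut.
         child = node["left"] if p[node["dim"]] <= node["val"] else node["right"]
         return insertion_growth(child, p, depth + 1)

     def anomaly_score(trees, p, trials=20):
         # Average the estimated growth over a forest and several random draws.
         return sum(insertion_growth(t, p) for t in trees for _ in range(trials)) / (len(trees) * trials)

     # Usage with the earlier sketches (hypothetical sizes):
     #   trees = [build_random_cut_tree(reservoir_sample(points, k=100)) for _ in range(40)]
     #   print(anomaly_score(trees, new_point))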
  7. New York City taxi rides (numPassengers), Monday 1 December through Sunday 7 December 2014,

     with day and time-of-day annotations on the plot (Mon 8am, 6pm, 4pm; Sat 11pm, 11am; Tue-Fri). Data is aggregated every 30 minutes; shingle size: 48.
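     Not from the deck: "shingle size: 48" means the forest is not fed single 30-minute counts but the vector of the last 48 counts, i.e. one full day, so a count that is normal in size but occurs at an unusual time of day still stands out. A small Python sketch of the shingling step, assuming the counts are already in a plain list:

     def shingle(counts, size=48):
         # Each shingle is a window of `size` consecutive 30-minute counts, treated as one point.
         return [tuple(counts[i:i + size]) for i in range(len(counts) - size + 1)]

     half_hours = list(range(100))            # stand-in for the numPassengers series
     points = shingle(half_hours, size=48)
     print(len(points), len(points[0]))       # 53 48: 53 overlapping one-day windows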
  8. New York City taxi rides (numPassengers), 16 September 2014 through 30 January 2015 (plot).
  9. New York City taxi rides with the anomaly score overlaid: numPassengers and Anomaly Score (0-100), 16 September 2014 through 30 January 2015 (plot).
  10. Dig deeper: "Robust Random Cut Forest Based Anomaly Detection on

     Streams" [Guha, Mishra, Roy, Schrijvers] http://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-anomaly-detection.html
  11. Compute Purchasing Models. On-Demand: pay for compute capacity by the

     hour with no long-term commitments; for spiky workloads, or to define needs. Reserved: make a low, one-time payment and receive a significant discount on the hourly charge; for committed utilization. Spot: bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand; for time-insensitive or transient workloads. Dedicated: launch instances within Amazon VPC that run on hardware dedicated to a single customer; for highly sensitive or compliance-related workloads. Free Tier: get started on AWS with free usage and no commitment; for POCs and getting started.
  12. Reserved Instances (RI). For example: reserve capacity for one or

     three years; pay a low, one-time fee for the capacity reservation; receive a significant discount on the hourly charge for your instance.
  13. Reserved Instance payment options explained. No Upfront option: up to

     a 55% discount compared to On-Demand; does not require an upfront payment; a low hourly rate for the RI on an ongoing hourly basis. Partial Upfront option: balances the payments of an RI between upfront and hourly; provides a higher discount (up to 76%) compared to the No Upfront option; pay a very low hourly rate for every hour in the term regardless of usage. All Upfront option: the highest discount compared to On-Demand (up to 77% off).
  14. Reserved Instance vs. On-Demand. Chart: yearly cost of an m3.xlarge

     (1-year term) against utilization over a year (30%-100%), for On-Demand, No Upfront, Partial Upfront, and All Upfront. What are the "break-even" points of each of these options in relation to purchasing instances On-Demand?
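     Not from the deck: the break-even point is simply the utilization at which a year of On-Demand hours costs the same as the fixed yearly cost of the RI. A toy Python calculation with made-up prices, not actual AWS rates:

     HOURS_PER_YEAR = 8760

     on_demand_rate = 0.266      # $/hour, hypothetical On-Demand price for an m3.xlarge-class instance
     ri_upfront = 650.0          # hypothetical one-time payment (Partial Upfront style)
     ri_hourly = 0.10            # hypothetical $/hour billed for every hour in the term

     ri_yearly = ri_upfront + ri_hourly * HOURS_PER_YEAR          # paid regardless of usage
     break_even = ri_yearly / (on_demand_rate * HOURS_PER_YEAR)   # fraction of the year in use

     print(f"RI costs ${ri_yearly:,.0f}/year; cheaper than On-Demand above {break_even:.0%} utilization")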
  15. Spot instances. What are Spot instances? Spare EC2 instances bid

     on in hourly increments, one hour at a time; they behave exactly like regular instances. Cost benefits: up to 92% off regular On-Demand prices per hour. What is the trade-off? An instance may be interrupted if its capacity is needed by EC2, and there is no charge for any partial hour due to termination.
  16. Amazon Kinesis Streams Easy administration: Create a stream, set capacity

    level with shards. Scale to match your data throughput rate & volume. Build real-time applications: Process streaming data with Kinesis Client Library (KCL), Apache Spark/Storm, AWS Lambda, .... Low cost: Cost-efficient for workloads of any scale.
  17. Amazon Kinesis Firehose. Zero administration: capture and deliver streaming data

     to Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service without writing an app or managing infrastructure. Direct-to-data-store integration: batch, compress, and encrypt streaming data for delivery in as little as 60 seconds. Seamless elasticity: seamlessly scales to match data throughput without intervention. Diagram labels: capture and submit streaming data to Firehose; analyze streaming data using your favorite BI tools; Firehose loads streaming data continuously into S3, Amazon Redshift, and Amazon ES.
  18. Amazon Kinesis Analytics. Apply SQL on streams: easily connect to

     a Kinesis stream or Firehose delivery stream and apply SQL skills. Build real-time applications: perform continual processing on streaming big data with sub-second processing latencies. Easy scalability: elastically scales to match data throughput. Diagram labels: connect to Kinesis streams, Firehose delivery streams; run standard SQL queries against data streams; Kinesis Analytics can send processed data to analytics tools so you can create alerts and respond in real time.
  19. Amazon Kinesis: streaming data made easy. Services make it easy

     to capture, deliver, and process streams on AWS. Kinesis Analytics, for all developers and data scientists: easily analyze data streams using standard SQL queries. Kinesis Firehose, for all developers and data scientists: easily load massive volumes of streaming data into S3, Amazon Redshift, or Amazon ES. Kinesis Streams, for technical developers: collect and stream data for ordered, replayable, real-time processing.
  20. Kinesis Analytics: pay for only what you use; automatic elasticity;

     standard SQL for analytics; real-time processing; easy to use.
  21. Use SQL to build real-time applications: easily write SQL code

     to process streaming data, connect to a streaming source, and continuously deliver SQL results.
  22. Connect to streaming source • Streaming data sources include Firehose

    or Streams • Input formats include JSON, .csv, variable column, unstructured text • Each input has a schema; schema is inferred, but you can edit • Reference data sources (S3) for data enrichment
  23. Write SQL code • Build streaming applications with one-to-many SQL

    statements • Robust SQL support and advanced analytic functions • Extensions to the SQL standard to work seamlessly with streaming data • Support for at-least-once processing semantics
  24. Continuously deliver SQL results • Send processed data to multiple

    destinations • S3, Amazon Redshift, Amazon ES (through Firehose) • Streams (with AWS Lambda integration for custom destinations) • End-to-end processing speed as low as sub-second • Separation of processing and data delivery
  25. Generate time-series analytics • Compute key performance indicators over time

     windows • Combine with historical data in S3 or Amazon Redshift. Diagram labels: Analytics, Streams, Firehose, Amazon Redshift, S3, Streams, Firehose, custom real-time destinations.
  26. Feed real-time dashboards • Validate and transform raw data, and

     then process to calculate meaningful statistics • Send processed data downstream for visualization in BI and visualization services. Diagram labels: Amazon QuickSight, Analytics, Amazon ES, Amazon Redshift, Amazon RDS, Streams, Firehose.
  27. Create real-time alarms and notifications • Build sequences of events

     from the stream, like user sessions in a clickstream or app behavior through logs • Identify events (or a series of events) of interest, and react to the data through alarms and notifications. Diagram labels: Analytics, Streams, Firehose, Streams, Amazon SNS, Amazon CloudWatch, Lambda.
  28. SQL on streaming data • SQL is an API to

    your data • Ask for what you want, system decides how to get it • For all data, not just “flat” data in a database • Opportunity for novel data organization and algorithms • A standard (ANSI 2008, 2011) and the most commonly used data manipulation language
  29. A simple streaming query • Tweets about the AWS NYC

    Summit • Selecting from a STREAM of tweets, an in-application stream • Each row has a corresponding ROWTIME
    SELECT STREAM ROWTIME, author, text
    FROM Tweets
    WHERE text LIKE '%#AWSNYCSummit%'
  30. A streaming table is a STREAM • In relational databases,

    you work with SQL tables • With Analytics, you work with STREAMs • SELECT, INSERT, and CREATE can be used with STREAMs
    CREATE STREAM Tweets (author VARCHAR(20), text VARCHAR(140));
    INSERT INTO Tweets SELECT …
  31. Writing queries on unbounded data sets • Streams are unbounded

    data sets • Need continuous queries, row-by-row or across rows • WINDOWs define a start and end to the query
    SELECT STREAM author, count(author) OVER ONE_MINUTE
    FROM Tweets
    WINDOW ONE_MINUTE AS (PARTITION BY author RANGE INTERVAL '1' MINUTE PRECEDING);
  32. Anomalies in Spot prices

     CREATE OR REPLACE PUMP "WEIGHTED_FAMILY_STREAM_PUMP" AS
     INSERT INTO "WEIGHTED_FAMILY_STREAM"
     SELECT STREAM "ts", "availabilityzone", "instancetype", "family", "size", "magnitude",
            "spotprice"/"magnitude" AS "weightedprice", "spotprice"
     FROM (SELECT STREAM "ts", "availabilityzone", "instancetype",
                  instance_family("instancetype") AS "family",
                  instance_size("instancetype") AS "size",
                  instance_magnitude("instancetype") AS "magnitude",
                  "spotprice"
           FROM "SOURCE_SQL_STREAM_001"
           WHERE "productdescription" = 'Linux/UNIX')
     WHERE "family" = 'C4';
  33. Anomalies in Spot prices

     CREATE OR REPLACE PUMP "AZ_PRICE_STREAM_PUMP" AS
     INSERT INTO "AZ_PRICE_STREAM"
     SELECT STREAM "ts", "eu-west-1a-price", "eu-west-1b-price", "eu-west-1c-price",
            "ANOMALY_SCORE" AS "anomaly_score"
     FROM TABLE(RANDOM_CUT_FOREST(
         CURSOR(SELECT STREAM "ts",
                       avg(case when "availabilityzone" = 'eu-west-1a' then "weightedprice" else null end) over w1 AS "eu-west-1a-price",
                       avg(case when "availabilityzone" = 'eu-west-1b' then "weightedprice" else null end) over w1 AS "eu-west-1b-price",
                       avg(case when "availabilityzone" = 'eu-west-1c' then "weightedprice" else null end) over w1 AS "eu-west-1c-price"
                FROM "WEIGHTED_FAMILY_STREAM"
                WINDOW W1 AS (RANGE INTERVAL '10' MINUTE PRECEDING)),
         100, 100, 10000, 10));
  34. @dbatalov Questions? aws.amazon.com/ru Denis Batalov, PhD, Sr. Solutions Architect,

     ML and AI specialist, Amazon Web Services Luxembourg @awsoblako aws.amazon.com/ru/blogs/rus/