Microservice Architecture Series: Building Cloud Native Apps • Part 3 of 12: Kinesis Data Streams, Kinesis Firehose, Kinesis Data Analytics, Apache Flink • A tech presentorial (a combination of presentation and tutorial) • Araf Karsh Hamid, Co-Founder / CTO, MetaMagic Global Inc., NJ, USA (@arafkarsh) • Architecting & Building Apps: Blockchain, 8 Years Cloud Computing, 8 Years Distributed Computing
Kinesis Video Streams helps you to securely stream video from systems to AWS for processing such as Analytics, Machine Learning and others. Kinesis Data Streams is a highly scalable, durable, real-time data streaming service that can capture gigabytes of data per second from different data sources. Kinesis Data Firehose is used to Extract, Transform, Load (ETL) data streams into AWS stores like S3, Redshift, OpenSearch etc. for near real-time data analytics. Kinesis Data Analytics is used to process the real-time streams in SQL, Java, or Python.
Rekognition • AWS SageMaker • TensorFlow • HLS Playback • Custom Video Processing • Automatically scales the infrastructure needed for streaming video data from devices • Streams video from connected devices to AWS for Analytics, Machine Learning, Playback etc. • Stores, encrypts, and indexes video data and provides access to it through APIs • HLS: HTTP Live Streaming INPUT Kinesis Video Stream
Data Analytics • Spark • AWS EC2 • AWS Lambda • Kinesis Data Streams provide highly scalable and durable real-time streaming • Streams data from connected devices to AWS for Analytics, Machine Learning, etc. INPUT Kinesis Data Stream
Events are coming from Cart Checkout • Using Lambda, the raw event is enriched and sent to another stream for further processing Event Producer Kinesis Data Stream Raw Events Kinesis Data Stream Enriched Events Enrich the Checkout Event IN OUT Example Source: https://github.com/MetaArivu/Kinesis-Quickstart
S3 • AWS Redshift • AWS Elasticsearch • Splunk • Kinesis Data Firehose is used to store streaming data in data stores, data lakes etc. • Firehose is used to capture, transform, and load data into S3, Redshift etc. Kinesis Data Stream Kinesis Data Firehose Data Transformation using Lambda
Kinesis Data Analytics is used to analyze streaming data • Reduces the complexity of building and deploying analytics applications • Provides built-in functions to filter, aggregate, and transform streaming data • Serverless architecture • Under the hood it's Apache Flink (v1.13) • Flow: INPUT (Kinesis Data Stream) → Kinesis Data Analytics → OUTPUT (Kinesis Data Stream)
unit of data stored in a Kinesis Data Stream. Data Stream: a collection of Data Records streamed and stored in multiple shards. [Diagram: a Data Stream holding Data Records in Shard 1, Shard 2, … Shard n] Producers put Data Records into the shards and consumers retrieve the data from the shards. Source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html
A shard is a uniquely identified sequence of data records in a stream. • A stream is composed of one or more shards, each of which provides a fixed unit of capacity. • Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second, and up to 1,000 records per second for writes, up to a maximum total data write rate of 1 MB per second (including partition keys). • The data capacity of your stream is a function of the number of shards that you specify for the stream. • The total capacity of the stream is the sum of the capacities of its shards. [Diagram: a Data Stream holding Data Records in Shard 1, Shard 2, … Shard n] Source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html
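The capacity arithmetic on this slide can be sketched in a few lines. This is a minimal illustration that assumes the per-shard limits quoted above stay fixed; the names `Capacity`, `stream_capacity`, and `shards_needed` are hypothetical helpers, not part of any AWS SDK.

```python
import math
from dataclasses import dataclass

# Per-shard limits as quoted on the slide (assumed constant in this sketch).
WRITE_MB_PER_SEC = 1
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2
READ_TX_PER_SEC = 5


@dataclass(frozen=True)
class Capacity:
    write_mb_per_sec: int
    write_records_per_sec: int
    read_mb_per_sec: int
    read_tx_per_sec: int


def stream_capacity(shards: int) -> Capacity:
    """Total stream capacity is the sum of the capacities of its shards."""
    if shards < 1:
        raise ValueError("a stream has at least one shard")
    return Capacity(
        shards * WRITE_MB_PER_SEC,
        shards * WRITE_RECORDS_PER_SEC,
        shards * READ_MB_PER_SEC,
        shards * READ_TX_PER_SEC,
    )


def shards_needed(write_mb_per_sec: float, records_per_sec: float,
                  read_mb_per_sec: float) -> int:
    """Smallest shard count that satisfies all three throughput requirements."""
    return max(
        1,
        math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC),
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC),
        math.ceil(read_mb_per_sec / READ_MB_PER_SEC),
    )
```

For example, a workload writing 3.5 MB/s in 2,500 records/s and reading 9 MB/s is read-bound in this model and would need 5 shards.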
Partition Key Data BLOB • A partition key is used to group data by shard within a stream. • Kinesis Data Streams segregates the data records belonging to a stream into multiple shards. • It uses the partition key that is associated with each data record to determine which shard a given data record belongs to. • Partition keys are Unicode strings, with a maximum length limit of 256 characters for each key. • An MD5 hash function is used to map partition keys to 128-bit integer values and to map associated data records to shards using the hash key ranges of the shards. • When an application puts data into a stream, it must specify a partition key.
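The mapping from partition key to shard described above can be sketched as follows. The MD5 step follows the slide; the evenly split hash key ranges and the helper names (`hash_key`, `even_hash_ranges`, `shard_for`) are assumptions made for illustration, since a real stream's ranges depend on how its shards were created, split, or merged.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields 128-bit integer values


def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer using MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def even_hash_ranges(shard_count: int):
    """Split the hash space into contiguous, equally sized ranges (an assumption)."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last range absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key: str, ranges) -> int:
    """Return the index of the shard whose hash key range contains the key's hash."""
    h = hash_key(partition_key)
    for index, (start, end) in enumerate(ranges):
        if start <= h <= end:
            return index
    raise ValueError("hash key ranges do not cover the hash space")
```

Because the hash is deterministic, every record with the same partition key lands in the same shard, which is what allows the key to group data within a stream.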
Each data record has a sequence number that is unique per partition-key within its shard. • Kinesis Data Streams assigns the sequence number after you write to the stream with client.putRecords or client.putRecord. • Sequence numbers for the same partition key generally increase over time. • The longer the time period between write requests, the larger the sequence numbers become. Source: https://docs.aws.amazon.com/streams/latest/dev/key-concepts.html
A producer can be any video-generating device, such as • a security camera, • a body-worn camera, • a smartphone camera, or • a dashboard camera. • A producer can also send non-video data, such as audio feeds, images, or RADAR data. A single producer can generate one or more video streams. For example, a video camera can push video data to one Kinesis video stream and audio data to another. Kinesis Video Streams Producer libraries • Install and configure on your devices. • Securely connect and reliably stream video in different ways, including in real time, after buffering it for a few seconds, or as after-the-fact media uploads.
Data to Kinesis Video Streams • Example: Kinesis Video Streams Producer SDK GStreamer Plugin: Shows how to build the Kinesis Video Streams Producer SDK to use as a GStreamer destination. • Run the GStreamer Element in a Docker Container: Shows how to use a pre-built Docker image for sending RTSP video from an IP camera to Kinesis Video Streams. • Example: Streaming from an RTSP Source: Shows how to build your own Docker image and send RTSP video from an IP camera to Kinesis Video Streams. • Example: Sending Data to Kinesis Video Streams Using the PutMedia API: Shows how to use the Java Producer Library to send data that is already in a container format (Matroska, MKV) to Kinesis Video Streams using the PutMedia API. GStreamer is a popular media framework used by a multitude of cameras and video sources to create custom media pipelines by combining modular plugins. • RTSP Camera on Ubuntu • USB Camera on Ubuntu • Camera on Raspberry Pi Source: https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/examples-gstreamer-plugin.html
Transport live video data, optionally store it • Data available for consumption both in real time and on a batch or ad hoc basis. • A Kinesis video stream has only one producer publishing data into it. The stream can carry • audio, • video, and • similar time-encoded data streams, such as • depth sensing feeds, • RADAR feeds, and more. Source: https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/how-it-works.html Kinesis Video Stream Consumer (App) • Gets data, such as fragments and frames, from a Kinesis video stream • To view, process, or analyse it. Kinesis Video Stream Parser Library • To reliably get media from Kinesis video streams in a low-latency manner. • It parses the frame boundaries in the media so that applications can focus on processing and analysing the frames themselves.
This class reads specified MKV elements from a video stream. • FragmentMetadataVisitor: This class retrieves metadata for fragments (media elements) and tracks (individual data streams containing media information, such as audio or subtitles). • OutputSegmentMerger: This class merges consecutive fragments or chunks in a video stream. • KinesisVideoExample: This is a sample application that shows how to use the Kinesis Video Stream Parser Library. The library also includes tests that show how the tools are used.
• The record ID is passed from Kinesis Data Firehose to Lambda during the invocation. • The transformed record must contain the same record ID. • Any mismatch between the ID of the original record and the ID of the transformed record is treated as a data transformation failure. result The status of the data transformation of the record. The possible values are: • Ok (the record was transformed successfully), • Dropped (the record was dropped intentionally by your processing logic), and • ProcessingFailed (the record could not be transformed). If a record has a status of Ok or Dropped, Kinesis Data Firehose considers it successfully processed. Otherwise, Kinesis Data Firehose considers it unsuccessfully processed. data The transformed data payload, after base64-encoding. Source: https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html
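The three response fields described above (record ID, result, data) can be seen together in a small transformation handler. This is a hedged sketch, not the handler from the referenced project: the `processed` flag and the rule that drops `heartbeat` events are hypothetical business logic, and the event shape assumed here is a `records` list whose entries carry `recordId` and base64-encoded `data`.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform each Firehose record and echo back its original record ID."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]  # must be returned unchanged
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Could not be transformed: hand the original data back as failed.
            output.append({"recordId": record_id,
                           "result": "ProcessingFailed",
                           "data": record["data"]})
            continue

        if payload.get("type") == "heartbeat":  # hypothetical drop rule
            output.append({"recordId": record_id,
                           "result": "Dropped",
                           "data": record["data"]})
            continue

        payload["processed"] = True  # hypothetical enrichment
        encoded = base64.b64encode(json.dumps(payload).encode("utf-8"))
        output.append({"recordId": record_id,
                       "result": "Ok",
                       "data": encoded.decode("ascii")})
    return {"records": output}
```

Records marked `Ok` or `Dropped` count as successfully processed, while `ProcessingFailed` records are treated as transformation failures, matching the status rules listed on the slide.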
Kinesis Data Analytics is used to analyze streaming data • Reduces the complexity of building and deploying analytics applications • Provides built-in functions to filter, aggregate, and transform streaming data • Serverless architecture • Under the hood it's Apache Flink (v1.13, as of December 2021) • Flow: INPUT (Kinesis Data Stream) → Kinesis Data Analytics → OUTPUT (Kinesis Data Stream)
High Performance Strong Data Integrity Flexible APIs for Programming Low Latency & Horizontally Scalable Stores Application States Exactly Once Processing & Consistent State Is an Open-Source Stream Processing Framework
Data Streams Batch Processing Process Static & historic Data Data Stream Processing Realtime Results from Data Streams Event Driven Applications Data Driven Actions and Services Instead of Spark + Hadoop
Traditional Periodic ETL • External Tool Periodically triggers ETL Batch Job Batch Processing Process Static & historic Data Data Stream Processing Realtime Results from Data Streams Continuous Streaming Data Pipeline • Ingestion with Low Latency • No Artificial Boundaries Streaming App Ingest Append Real Time Events Event Logs Batch Process Module Read Write Transactional Data Extract, Transform, Load Capture, Transform, Load State Source: GoTo: Intro to Stateful Stream Processing – Robert Metzger
Ad-Hoc Queries • Queries change faster than data Batch Analytics Stream Analytics Ingest K-V Data Store Real Time Events Batch Analytics Read Write Recorded Events • High Performance Low Latency Result • Data changes faster than queries Analytics App State State Update Source: GoTo: Intro to Stateful Stream Processing – Robert Metzger
& Data Tier Architecture • React to Process Events • State is stored in (Remote) Database Traditional Application Design Event Driven Application • High Performance Low Latency Result • Data Changes faster than Queries Application Read Write Events Trigger Action Ingest Real Time Events Application State Append Periodically write asynchronous checkpoints in Remote Database Event Logs Event Logs Trigger Action Source: GoTo: Intro to Stateful Stream Processing – Robert Metzger
distributes the work onto the Task Managers, where the actual operators such as 1. sources, 2. transformations and 3. sinks are running. Job Manager is the name of the central work coordination component of Flink. Task Managers are the services actually performing the work of a Flink job.
Job Manager: Resource Manager It is responsible for resource de-/allocation and provisioning in a Flink cluster — it manages task slots, which are the unit of resource scheduling in a Flink cluster. Dispatcher It provides a REST interface to submit Flink applications for execution and starts a new Job Master for each submitted job. Job Master It is responsible for managing the execution of a single JobGraph. Multiple jobs can run simultaneously in a Flink cluster, each having its own Job Master.
with two high availability service implementations: • ZooKeeper: ZooKeeper HA services can be used with every Flink cluster deployment. They require a running ZooKeeper quorum. • Kubernetes: Kubernetes HA services only work when running on Kubernetes. Flink’s high availability services encapsulate the required services to make everything work: • Leader election: Selecting a single leader out of a pool of n candidates • Service discovery: Retrieving the address of the current leader • State persistence: Persisting state which is required for the successor to resume the job execution (Job Graphs, user code jars, completed checkpoints).
For distributed execution, Flink chains operator subtasks together into tasks • Each task is executed by one thread. • Chaining operators together into tasks is a useful optimization: • it reduces the overhead of thread-to-thread handover and buffering, • and increases overall throughput while decreasing latency. T1 T2 T3 T4 T5
Each worker (Task Manager) is a JVM process and may execute one or more subtasks in separate threads. • To control how many tasks a Task Manager accepts, it has so-called task slots (at least one). • Memory is divided equally across the slots. • There is no CPU isolation across task slots. • Having multiple slots means more subtasks share the same JVM. • Tasks in the same JVM share TCP connections (via multiplexing) and heartbeat messages. • They may also share data sets and data structures, thus reducing the per-task overhead. • Flink allows subtasks to share slots even if they are subtasks of different tasks, so long as they are from the same job.
Each program consists of the same basic parts: 1. Obtain an execution environment, 2. Load/create the initial data, 3. Specify transformations on this data, 4. Specify where to put the results of your computations, 5. Trigger the program execution. The program either runs on your local machine or is submitted for execution on a cluster. [Diagram: Source → Transform → Transform → Sink, annotated with steps 1 to 5] Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/overview/#anatomy-of-a-flink-program
Availability Service Provider Flink's Job Manager can be run in high availability mode which allows Flink to recover from Job Manager faults. In order to failover faster, multiple standby Job Managers can be started to act as backups. • Zookeeper • Kubernetes HA 2 File Storage and Persistency For checkpointing (recovery mechanism for streaming jobs) Flink relies on external file storage systems See FileSystems page. 3 Resource Provider Flink can be deployed through different Resource Provider Frameworks, such as Kubernetes, YARN or Mesos. • Kubernetes • YARN • Mesos 4 Metrics Storage Flink components report internal metrics and Flink jobs can report additional, job specific metrics as well. See Metrics Reporter page. 5 Application-level data sources and sinks While application-level data sources and sinks are not technically part of the deployment of Flink cluster components, they should be considered when planning a new Flink production deployment. Colocating frequently used data with Flink can have significant performance benefits For example: • Apache Kafka • Amazon S3 • Amazon Kinesis • Elastic Search See Connectors page. Source: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/overview/
a regular Java Collection in terms of usage but is quite different in some key ways. • They are immutable, meaning that once they are created you cannot add or remove elements. • You also cannot simply inspect the elements inside; you can only work on them using the DataStream API operations, which are also called transformations. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/overview/ [Example: Reading from Socket]
- Reads text files, i.e. files that respect the TextInputFormat specification, line-by-line and returns them as Strings. • readFile(fileInputFormat, path) - Reads (once) files as dictated by the specified file input format. • readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo) - This is the method called internally by the two previous ones. It reads files in the path based on the given fileInputFormat. Depending on the provided watchType, this source may periodically monitor (every interval ms) the path for new data (FileProcessingMode.PROCESS_CONTINUOUSLY), or process once the data currently in the path and exit (FileProcessingMode.PROCESS_ONCE). Using the pathFilter, the user can further exclude files from being processed.
from a socket. Elements can be separated by a delimiter. Collection-based: • fromCollection(Collection) - Creates a data stream from the Java java.util.Collection. All elements in the collection must be of the same type. • fromCollection(Iterator, Class) - Creates a data stream from an iterator. The class specifies the data type of the elements returned by the iterator. • fromElements(T ...) - Creates a data stream from the given sequence of objects. All objects must be of the same type. • fromParallelCollection(SplittableIterator, Class) - Creates a data stream from an iterator, in parallel. The class specifies the data type of the elements returned by the iterator. • generateSequence(from, to) - Generates the sequence of numbers in the given interval, in parallel. Custom: • addSource - Attach a new source function. For example, to read from Apache Kafka you can use addSource(new FlinkKafkaConsumer<>(...)). See connectors for more details. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/overview/
TextOutputFormat - Writes elements line-wise as Strings. The Strings are obtained by calling the toString() method of each element. • writeAsCsv(...) / CsvOutputFormat - Writes tuples as comma-separated value files. Row and field delimiters are configurable. The value for each field comes from the toString() method of the objects. • print() / printToErr() - Prints the toString() value of each element on the standard out / standard error stream. Optionally, a prefix (msg) can be provided which is prepended to the output. This can help to distinguish between different calls to print. If the parallelism is greater than 1, the output will also be prepended with the identifier of the task which produced the output. • writeUsingOutputFormat() / FileOutputFormat - Method and base class for custom file outputs. Supports custom object-to-bytes conversion. • writeToSocket - Writes elements to a socket according to a SerializationSchema • addSink - Invokes a custom sink function. Flink comes bundled with connectors to other systems (such as Apache Kafka) that are implemented as sink functions. Data sinks consume DataStreams and forward them to files, sockets, external systems, or print them
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/execution_mode/ The execution mode can be configured via the execution.runtime-mode setting. There are three possible values: 1. STREAMING: The classic DataStream execution mode (default) 2. BATCH: Batch-style execution on the DataStream API 3. AUTOMATIC: Let the system decide based on the boundedness of the sources • The BATCH execution mode can only be used for Jobs/Flink Programs that are bounded. • Boundedness is a property of a data source that tells us whether all the input coming from that source is known before execution or whether new data will show up, potentially indefinitely. • A job, in turn, is bounded if all its sources are bounded, and unbounded otherwise. • STREAMING execution mode, on the other hand, can be used for both bounded and unbounded jobs. • As a rule of thumb, you should be using BATCH execution mode when your program is bounded because this will be more efficient.
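The boundedness rules above reduce to a small decision function. The sketch below is a plain-Python model of that reasoning, not Flink code: `RuntimeMode`, `job_is_bounded`, and `resolve_mode` are hypothetical names, and each source is represented only by a boolean saying whether it is bounded.

```python
from enum import Enum


class RuntimeMode(Enum):
    STREAMING = "STREAMING"
    BATCH = "BATCH"
    AUTOMATIC = "AUTOMATIC"


def job_is_bounded(source_bounded_flags):
    """A job is bounded only if every one of its sources is bounded."""
    flags = list(source_bounded_flags)
    if not flags:
        raise ValueError("a job needs at least one source")
    return all(flags)


def resolve_mode(configured, source_bounded_flags):
    """Pick the effective mode, following the rules listed on the slide."""
    bounded = job_is_bounded(source_bounded_flags)
    if configured is RuntimeMode.AUTOMATIC:
        # Let the system decide based on the boundedness of the sources.
        return RuntimeMode.BATCH if bounded else RuntimeMode.STREAMING
    if configured is RuntimeMode.BATCH and not bounded:
        raise ValueError("BATCH mode requires all sources to be bounded")
    return configured  # STREAMING works for bounded and unbounded jobs
```

A job that mixes a bounded file source with an unbounded Kinesis source therefore resolves to STREAMING under AUTOMATIC, because one unbounded source makes the whole job unbounded.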
and produces one element. A map function that doubles the values of the input stream: Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/overview/ Flat Map Takes one element and produces zero, one, or more elements. A flatmap function that splits sentences to words: Filter Evaluates a Boolean function for each element and retains those for which the function returns true. Key By Logically partitions a stream into disjoint partitions. All records with the same key are assigned to the same partition.
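A plain-Python analogy of these four operators may help fix the semantics. It is a sketch over ordinary iterables, not Flink's DataStream API; the function names are hypothetical, and `key_by` collects partitions into a dictionary purely to make the grouping visible.

```python
from collections import defaultdict


def map_stream(fn, stream):
    """Map: one element in, exactly one element out."""
    for element in stream:
        yield fn(element)


def flat_map_stream(fn, stream):
    """Flat Map: one element in, zero, one, or more elements out."""
    for element in stream:
        yield from fn(element)


def filter_stream(predicate, stream):
    """Filter: keep only elements for which the predicate returns True."""
    for element in stream:
        if predicate(element):
            yield element


def key_by(key_fn, stream):
    """Key By: all records with the same key end up in the same partition."""
    partitions = defaultdict(list)
    for element in stream:
        partitions[key_fn(element)].append(element)
    return dict(partitions)


doubled = list(map_stream(lambda x: 2 * x, [1, 2, 3]))
words = list(flat_map_stream(str.split, ["to be", "or not"]))
evens = list(filter_stream(lambda x: x % 2 == 0, [1, 2, 3, 4]))
by_user = key_by(lambda e: e[0], [("ann", 1), ("bob", 2), ("ann", 3)])
```

The first two calls mirror the slide's examples: a map that doubles each value and a flat map that splits sentences into words.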
“rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value. Union Union of two or more data streams creating a new stream containing all the elements from all the streams. Join Join two data streams on a given key and a common window. Interval Join Join two elements e1 and e2 of two keyed streams with a common key over a given time interval, so that e1.timestamp + lowerBound <= e2.timestamp <= e1.timestamp + upperBound
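The rolling reduce and the interval-join condition can both be expressed in a few lines of plain Python. This is an illustrative model rather than Flink code: events are represented as tuples, the nested loop ignores state cleanup and ordering concerns, and the function names are hypothetical.

```python
def rolling_reduce(keyed_events, reduce_fn):
    """Combine each (key, value) with the last reduced value for that key
    and emit the new reduced value after every element."""
    state = {}
    for key, value in keyed_events:
        state[key] = value if key not in state else reduce_fn(state[key], value)
        yield key, state[key]


def interval_join(left, right, lower_bound, upper_bound):
    """Join (key, timestamp, value) tuples e1 from left and e2 from right when
    keys match and e1.ts + lower_bound <= e2.ts <= e1.ts + upper_bound."""
    for k1, t1, v1 in left:
        for k2, t2, v2 in right:
            if k1 == k2 and t1 + lower_bound <= t2 <= t1 + upper_bound:
                yield k1, v1, v2


sums = list(rolling_reduce([("a", 1), ("b", 10), ("a", 2), ("a", 3)],
                           lambda x, y: x + y))
pairs = list(interval_join(
    [("k", 10, "L")],
    [("k", 8, "R1"), ("k", 12, "R2"), ("k", 13, "R3"), ("x", 10, "R4")],
    lower_bound=-2, upper_bound=2))
```

Here `R3` is excluded because its timestamp 13 lies outside the interval [8, 12], and `R4` is excluded because its key differs.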
be defined on regular Data Streams. Windows group all the stream events according to some characteristic. Window Apply Applies a general function to the window as a whole. Below is a function that manually sums the elements of a window. Window Reduce Applies a functional reduce function to the window and returns the reduced value. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/overview/ Window Windows can be defined on already partitioned Keyed Streams. Windows group the data in each key according to some characteristic.
Data Source of Application 2. They are part of the Stream and carry a timestamp 3. A Watermark asserts that all earlier events have probably arrived • Watermark w9 asserts that all the events with time < w9 have arrived. • Watermark w15 asserts that all the events with time < w15 have arrived. [Diagram: an out-of-order event stream showing event timestamps, watermarks w5, w9, w15 and w21, and late events]
fired, its state is freed and all the late events are dropped. • You can avoid dropping late events by configuring the maximum time to wait for them. • With sufficient lateness allowed, the late events are added to their respective windows and the result is updated (R2). stream.window(<window assigner>).allowedLateness(<timer>)
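The effect of allowed lateness can be simulated with a toy tumbling-window sum. This is a simplified model written for illustration, under stated assumptions: a window [start, start + size) fires once the watermark reaches its last timestamp, its state is freed once the watermark passes that timestamp plus the allowed lateness, and a late element that still finds window state triggers an updated result. The function name and the tuple encoding of events are hypothetical.

```python
def run_tumbling_sum(items, size, allowed_lateness=0):
    """items: ("event", ts, value) or ("watermark", ts) tuples, in arrival order.
    Returns (results, dropped): emitted (window_start, sum) pairs and dropped events."""
    watermark = float("-inf")
    sums = {}      # window start -> running sum (the window state)
    fired = set()  # windows that have already emitted a first result
    results, dropped = [], []
    for item in items:
        if item[0] == "watermark":
            watermark = max(watermark, item[1])
            for start in sorted(sums):
                if start not in fired and watermark >= start + size - 1:
                    fired.add(start)
                    results.append((start, sums[start]))
            # Free the state of windows whose lateness period has expired.
            expired = [s for s in sums
                       if s + size - 1 + allowed_lateness <= watermark]
            for start in expired:
                del sums[start]
                fired.discard(start)
        else:
            _, ts, value = item
            start = ts - ts % size
            max_ts = start + size - 1
            if max_ts + allowed_lateness <= watermark:
                dropped.append((ts, value))  # too late: state already freed
                continue
            sums[start] = sums.get(start, 0) + value
            if watermark >= max_ts:
                # Late but within the allowed lateness: emit an updated result.
                fired.add(start)
                results.append((start, sums[start]))
    return results, dropped


stream = [("event", 1, 1), ("event", 5, 2), ("watermark", 9), ("event", 7, 4)]
strict = run_tumbling_sum(stream, size=10)
lenient = run_tumbling_sum(stream + [("watermark", 14), ("event", 3, 1)],
                           size=10, allowed_lateness=5)
```

With no lateness the event at timestamp 7 is dropped. With a lateness of 5 it updates the window result from 3 to 7, and only the event arriving after watermark 14 is dropped.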
o timerService.registerEventTimeTimer(event.timestamp); // Time in millis
o timerService.registerProcessingTimeTimer(event.timestamp); // Time in millis
Implicit
o stream.window(TumblingEventTimeWindows.of(Time.seconds(7)))
o stream.window(TumblingProcessingTimeWindows.of(Time.seconds(7)))
o SELECT user, SUM(amount) FROM Orders GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), user
Source: Streaming Concepts & Introduction – Feb 1, 2021: https://www.youtube.com/watch?v=QVDJFZVHZ3c
To measure progress in event time. • Watermarks flow as part of the data stream and carry a timestamp t. • A Watermark(t) declares that event time has reached time t in that stream. • This means that there should be no more elements from the stream with a timestamp t' <= t (i.e. events with timestamps older than or equal to the watermark). Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/time/
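One common way to produce watermarks is to trail the highest timestamp seen so far by a fixed bound. The sketch below models that idea in plain Python under two assumptions: the watermark is recomputed after every event (real generators typically emit periodically), and an event counts as late when its timestamp is less than or equal to the current watermark, as stated above. The function name is hypothetical.

```python
def assign_watermarks(timestamps, max_out_of_orderness):
    """Return (timestamp, watermark_after_event, is_late) for each event."""
    max_seen = None
    watermark = None
    out = []
    for ts in timestamps:
        # Late means the watermark already declared this timestamp complete.
        late = watermark is not None and ts <= watermark
        max_seen = ts if max_seen is None else max(max_seen, ts)
        # Trail the maximum by the bound; the extra -1 keeps events exactly
        # max_out_of_orderness behind the maximum from counting as late.
        watermark = max_seen - max_out_of_orderness - 1
        out.append((ts, watermark, late))
    return out


trace = assign_watermarks([10, 12, 11, 20, 13], max_out_of_orderness=3)
```

In this trace the event with timestamp 13 is late: by the time it arrives, the event at 20 has already pushed the watermark to 16.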
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/time/ • A watermark is a declaration that by that point in the stream, all events up to a certain timestamp should have arrived. • Once a watermark reaches an operator, the operator can advance its internal event time clock to the value of the watermark.
• Watermarks are generated at, or directly after, source functions. • Each parallel subtask of a source function usually generates its watermarks independently. • These watermarks define the event time at that particular parallel source.
In order to work with event time, Flink needs to know the events' timestamps, meaning each element in the stream needs to have its event timestamp assigned. This is usually done by accessing/extracting the timestamp from some field in the element by using a Timestamp Assigner. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/event-time/generating_watermarks/ Specifying a Timestamp Assigner is optional, and, in most cases, you don’t actually want to specify one. For example, when using Kafka or Kinesis you would get timestamps directly from the Kafka/Kinesis records. Idle Input Source If one of the input splits/partitions/shards does not carry events for a while, the Watermark Generator does not get any new information on which to base a watermark. To deal with this, you can use a Watermark Strategy that will detect idleness and mark an input as idle.
There are two places in Flink applications where a Watermark Strategy can be used: 1. directly on sources (RECOMMENDED) and 2. after a non-source operation. The first option is preferable, because • it allows sources to exploit knowledge about shards/partitions/splits in the watermarking logic. • Sources can usually then track watermarks at a finer level and • the overall watermark produced by a source will be more accurate. The second option (setting a Watermark Strategy after arbitrary operations) should only be used if you cannot set a strategy directly on the source. [Example: After non-source operation]
of Time • Event Time • Processing Time • With Event Time • Events can be out of Order • Expect Deterministic Results • Event time Applications are Responsible for • Providing Watermarks • Deciding how to handle late events • Streaming Applications must trade off Completeness for Latency • Can wait longer to have more complete information before acting • Can wait less to reduce latency • Watermarks are the mechanism for managing this trade off Source: https://www.youtube.com/watch?v=QVDJFZVHZ3c
& hindsight State Complex Business Logic Consistency with out-of- order data & Late data Event Time Snapshots Forking / versioning / Time Travel Source: Flink Forward 2021: https://www.youtube.com/watch?v=vLLn5PxF2Lw
3 Messaging Layer Kafka / Kinesis Data Streams Event Time Broker Time Source: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/concepts/time/ Event Producer Flink Data Source Flink Window Operator [ ] [ ] Processing Time Ingestion Time
of Event Time • Increases Latency for Ordered Event Time • Flink Reconstruct the order Event time: Event time is the time that each individual event occurred on its producing device. Processing time: Processing time refers to the system time of the machine that is executing the respective operation. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/concepts/time/
of processing infinite streams. • Windows split the stream into “buckets” of finite size, over which we can apply computations. • A window is created as soon as the first element that should belong to it arrives, and it is completely removed when the time (event or processing time) passes its end timestamp plus the user-specified allowed lateness. • Flink guarantees removal only for time-based windows. • Two categories of windows: keyed (keyBy(…)) and non-keyed (windowAll(…)). Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/ Types of Windows 1. Time Windows 2. Count Windows
fixed size and do not overlap. • Without offsets, hourly tumbling windows are aligned with the epoch, that is, you will get windows such as • 1:00:00.000 - 1:59:59.999, 2:00:00.000 - 2:59:59.999 and so on. • With an offset of 15 minutes you would, for example, get 1:15:00.000 - 2:14:59.999. • An important use case for offsets is to adjust windows to time zones other than UTC-0. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
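The alignment of tumbling windows, with and without an offset, comes down to modular arithmetic on the timestamp. The sketch below reproduces the slide's hourly examples; `window_start` and `window_bounds` are hypothetical helper names, and timestamps are milliseconds since the epoch.

```python
HOUR_MS = 60 * 60 * 1000
MINUTE_MS = 60 * 1000


def window_start(timestamp_ms, size_ms, offset_ms=0):
    """Start of the tumbling window that contains the timestamp."""
    return timestamp_ms - (timestamp_ms - offset_ms + size_ms) % size_ms


def window_bounds(timestamp_ms, size_ms, offset_ms=0):
    """Return (first, last) millisecond of the window, both inclusive."""
    start = window_start(timestamp_ms, size_ms, offset_ms)
    return start, start + size_ms - 1


half_past_one = 1 * HOUR_MS + 30 * MINUTE_MS
aligned = window_bounds(half_past_one, HOUR_MS)
shifted = window_bounds(half_past_one, HOUR_MS, offset_ms=15 * MINUTE_MS)
```

An event at 1:30 falls into 1:00:00.000 - 1:59:59.999 without an offset and into 1:15:00.000 - 2:14:59.999 with a 15-minute offset.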
groups elements by sessions of activity. • Session windows do not overlap and do not have a fixed start and end time. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
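Session windows can be modelled as intervals that merge while activity continues. In this sketch each element opens a candidate session [ts, ts + gap) and is merged into the current session if it arrives before that session ends. The treatment of an element arriving exactly one gap after the previous one (it starts a new session here) is an assumption of the sketch, and the function name is hypothetical.

```python
def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by at least `gap` of inactivity.
    Returns a list of (start, end, element_count) with end exclusive."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts < sessions[-1][1]:
            # Still inside the current session: extend its end and count it.
            sessions[-1][1] = max(sessions[-1][1], ts + gap)
            sessions[-1][2] += 1
        else:
            sessions.append([ts, ts + gap, 1])
    return [tuple(s) for s in sessions]


result = session_windows([1, 3, 4, 20, 22, 50], gap=5)
```

The output shows the two properties named on the slide: sessions do not overlap, and their start and end times depend entirely on the data.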
is only useful if you also specify a custom trigger. • Otherwise, no computation will be performed, as the global window does not have a natural end at which we could process the aggregated elements. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
Function Process Window Function Process Window Function with Incremental Aggregation • Window functions are used to specify the computation that needs to happen on the window. • This is done when a Window is ready for processing. • Triggers are used to determine when the Window is ready for computation. The window function can be one of Reduce Function, Aggregate Function, or Process Window Function. The Reduce Function and Aggregate Function can be executed more efficiently because Flink can incrementally aggregate the elements for each window as they arrive. A Process Window Function gets an Iterable for all the elements contained in a window and additional meta information about the window to which the elements belong.
is a generalized version of a Reduce Function that has three types: 1. an input type (IN), 2. accumulator type (ACC), 3. and an output type (OUT). The input type is the type of elements in the input stream and the Aggregate Function has a method for adding one input element to an accumulator. The interface also has methods for 1. creating an initial accumulator, 2. for merging two accumulators into one accumulator and for 3. extracting an output (of type OUT) from an accumulator. Same as with Reduce Function, Flink will incrementally aggregate input elements of a window as they arrive.
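The four methods described above can be illustrated with an average, where the accumulator (ACC) is a (sum, count) pair, the input (IN) is a number, and the output (OUT) is a float. This is a plain-Python model of the interface shape, not Flink's Java `AggregateFunction`; the class, method, and helper names are chosen for illustration.

```python
class AverageAggregate:
    """IN = number, ACC = (sum, count), OUT = float."""

    def create_accumulator(self):
        """Create the initial accumulator."""
        return (0, 0)

    def add(self, value, accumulator):
        """Add one input element to the accumulator."""
        total, count = accumulator
        return (total + value, count + 1)

    def merge(self, a, b):
        """Merge two accumulators into one."""
        return (a[0] + b[0], a[1] + b[1])

    def get_result(self, accumulator):
        """Extract the output from an accumulator."""
        total, count = accumulator
        return total / count if count else 0.0


def aggregate(fn, values):
    """Fold values into an accumulator one at a time, as they arrive."""
    acc = fn.create_accumulator()
    for value in values:
        acc = fn.add(value, acc)
    return acc


avg = AverageAggregate()
first = aggregate(avg, [1, 2, 3])
second = aggregate(avg, [5])
```

Only the small accumulator is kept per window rather than every element, which is why incremental aggregation is cheaper than buffering the whole window.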
Function gets an Iterable containing all the elements of the window, and a Context object with access to time and state information, which enables it to provide more flexibility than other window functions.
Window Function can be combined with either a Reduce Function, or an Aggregate Function to incrementally aggregate elements as they arrive in the window. When the window is closed, the Process Window Function will be provided with the aggregated result. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
window (as formed by the window assigner) is ready to be processed by the window function. • It comes with a default Trigger. Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/ 1. The onElement() method is called for each element that is added to a window. 2. The onEventTime() method is called when a registered event-time timer fires. 3. The onProcessingTime() method is called when a registered processing-time timer fires. 4. The onMerge() method is relevant for stateful triggers and merges the states of two triggers when their corresponding windows merge, e.g. when using session windows. 5. The clear() method performs any action needed upon removal of the corresponding window.
optional Evictor in addition to the Window Assigner and the Trigger. This can be done using the evictor(...) method (shown in the beginning of this document). The evictor has the ability to remove elements from a window after the trigger fires and before and/or after the window function is applied. Flink comes with three pre-implemented evictors. These are: • Count Evictor: keeps up to a user-specified number of elements from the window and discards the remaining ones from the beginning of the window buffer. • Delta Evictor: takes a Delta Function and a threshold, computes the delta between the last element in the window buffer and each of the remaining ones, and removes the ones with a delta greater or equal to the threshold. • Time Evictor: takes as argument an interval in milliseconds and for a given window, it finds the maximum timestamp max_ts among its elements and removes all the elements with timestamps smaller than max_ts - interval. • By default, all the pre-implemented evictors apply their logic before the window function Source: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/operators/windows/
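The three eviction rules can be written as small list filters. These are plain-Python sketches of the rules as described above, not Flink's evictor classes: the window buffer is modelled as a list in arrival order, and the function names are hypothetical.

```python
def count_evictor(elements, max_count):
    """Keep up to max_count elements, discarding from the beginning of the buffer."""
    return list(elements[-max_count:]) if max_count > 0 else []


def delta_evictor(elements, delta_fn, threshold):
    """Remove elements whose delta to the last element is >= threshold."""
    if not elements:
        return []
    last = elements[-1]
    return [e for e in elements if delta_fn(e, last) < threshold]


def time_evictor(elements, interval_ms):
    """elements: (timestamp, value) pairs. Remove those older than max_ts - interval."""
    if not elements:
        return []
    max_ts = max(ts for ts, _ in elements)
    return [(ts, v) for ts, v in elements if ts >= max_ts - interval_ms]


kept_by_count = count_evictor([1, 2, 3, 4, 5], 3)
kept_by_delta = delta_evictor([1, 5, 9, 10], lambda a, b: abs(a - b), 3)
kept_by_time = time_evictor([(100, "a"), (250, "b"), (400, "c")], 200)
```

In each case the filter would run on the window contents after the trigger fires and, by default, before the window function is applied.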
“Unbounded” (stream) and “bounded” (batch) data
• Processes recorded (offline) and live (real-time) data
• Batch is just a special case of streaming data
[Figure: event log timeline (start of the stream, past, now, future) showing bounded and unbounded streams]
Source: Flink Forward 2021: https://www.youtube.com/watch?v=vLLn5PxF2Lw
& Transform, Window (State Read & Write), Sink
• keyBy() partitions the raw events across parallel tasks: keyBy(R1, R3, R5) and keyBy(R2, R4, R6)
• Each parallel task has its own scalable local state
• High-performance in-memory computing; the tasks are parallelized
• Raw events go in, a new aggregated event comes out
• External storage is used for snapshots
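The keyBy() partitioning with per-task local state can be approximated in plain Python. This is a simplified sketch, not how Flink assigns keys internally: the CRC32-modulo routing, the class `KeyedSumTask`, and the sample events are all assumptions chosen to show that every event for a given key lands on the same task and updates that task's local state only.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def partition_for(key: str, parallelism: int) -> int:
    # Deterministic routing: the same key always maps to the same task.
    return zlib.crc32(key.encode("utf-8")) % parallelism


class KeyedSumTask:
    """One parallel task holding local state for the keys routed to it."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = defaultdict(int)  # local, in memory

    def process(self, key: str, amount: int) -> Tuple[str, int]:
        self.state[key] += amount      # read and write local state
        return key, self.state[key]    # emit the new aggregated event


def run(events: List[Tuple[str, int]], parallelism: int):
    tasks = [KeyedSumTask() for _ in range(parallelism)]
    output = []
    for key, amount in events:
        task = tasks[partition_for(key, parallelism)]
        output.append(task.process(key, amount))
    return tasks, output


events = [("R1", 5), ("R2", 1), ("R1", 7)]
tasks, output = run(events, parallelism=2)
```

Since no task ever needs another task's state, the tasks can run in parallel without coordination on the hot path.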
State in external (remote) storage
• Independent from processing
• Low performance due to remote storage
• Hard to get "Exactly-Once" guarantees
State in internal (local) storage with snapshots
• Highly consistent distributed snapshotting
• Faster access with local storage
• Stream processor needs to handle scaling and storage
HashMap State Backend
• Stores the state in memory (HashMap)
• Faster access with memory storage
• Subject to garbage collection
• Suited to: jobs with large state, long windows, large key/value states; all high-availability setups
RocksDB State Backend
• Stores the state in a local RocksDB key-value store
• Limited only by local disk size
• Slower than memory storage (10x slower)
• Serializes on write and deserializes on read
• Suited to: jobs with large state, long windows, large key/value states; all high-availability setups
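The practical difference between the two backends, object access in memory versus serialize-on-write and deserialize-on-read, can be shown with a small plain-Python sketch. Neither class is a Flink class: `InMemoryBackend` and `SerializingBackend` are hypothetical names, and a dict of bytes stands in for the on-disk RocksDB store.

```python
import pickle
from typing import Any, Dict, Optional


class InMemoryBackend:
    """Keeps live objects on the heap, like a HashMap-style backend."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._store[key] = value         # no copy, no serialization

    def get(self, key: str) -> Optional[Any]:
        return self._store.get(key)      # returns the same object


class SerializingBackend:
    """Stores bytes only; a dict stands in for a RocksDB-like store."""

    def __init__(self) -> None:
        self._store: Dict[bytes, bytes] = {}

    def put(self, key: str, value: Any) -> None:
        # Serialize on every write.
        self._store[key.encode("utf-8")] = pickle.dumps(value)

    def get(self, key: str) -> Optional[Any]:
        # Deserialize on every read; the caller gets a fresh copy.
        raw = self._store.get(key.encode("utf-8"))
        return None if raw is None else pickle.loads(raw)


value = {"count": 1}
memory, serialized = InMemoryBackend(), SerializingBackend()
memory.put("k", value)
serialized.put("k", value)
same_object = memory.get("k") is value
copied_object = serialized.get("k") is not value
```

The extra encode and decode step on every access is the cost the slide attributes to RocksDB, traded for state that is bounded by disk rather than by heap size.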
2. Infor: Compliance Violation (Banking)
3. Biogen: Centralized Log Management
4. Viber: Massive Data Handling - 300 Msgs / Second
5. AWS: IoT Data using Firehose and Data Analytics
6. Nordstrom: Ledger with Multi Data Views
Data comes Kinesis
• Using Lambdas, data is stored in DynamoDB (sequential ops)
• Firehose stores raw data in S3
• Enriched data is stored in Aurora, Elastic Search and S3
• Glue is used for batch processing
Source: https://www.youtube.com/watch?v=KM5ONS2fnG0
& Tx Data is sent to Kinesis Data Stream
• Services in Fargate pick up the data from KDS and send it to Aurora & S3
• The Scheduler (5) invokes a service for EMR processing
• EMR fetches data from Aurora & S3 and sends data to EventBridge
• EventBridge (10) sends data to SQS
• A service in Fargate picks up the data from SQS and sends out email
Source: https://www.youtube.com/watch?v=0gNMEyei-co
VPC Logs sent to Kinesis Firehose
• Firehose (4) sends data to Lambda
• Lambda (5) enriches / normalizes the data and stores it in S3
• Lambda (7) picks up the data from S3 and stores it in Elastic Search
• Kibana is used for data visualization
Source: https://www.youtube.com/watch?v=m8xtR3-ZQs8
• Events from the Viber BE are batched and sent to Kinesis.
• Using KCL in Apache Storm, events are picked up from Kinesis; using Firehose, events are stored in S3.
• Aggregated data is sent to another Kinesis stream, and a Lambda sends the event to the Viber BE based on rules.
Source: https://www.youtube.com/watch?v=7i1tj59pvYw
EMR – Elastic MapReduce
Data is stored in Kinesis Data Stream as raw data (Ledger)
• Firehose (4) stores raw data in an S3 bucket
• Lambda (5.1-5.3) transforms the data and stores it in different databases, in different formats, for various read usages
Source: https://www.youtube.com/watch?v=O7PTtm_3Os4
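The ledger-with-multiple-views idea can be sketched in plain Python: one append-only list of raw events, and separate projection functions that each build a read-optimized view. This is only an illustration of the pattern; the event fields (`account`, `type`, `amount`) and the two views are hypothetical and are not taken from the referenced talk.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical raw events; the ledger is append-only and never updated.
ledger: List[dict] = [
    {"account": "A", "type": "credit", "amount": 100},
    {"account": "B", "type": "credit", "amount": 50},
    {"account": "A", "type": "debit", "amount": 30},
]


def balance_view(events: List[dict]) -> Dict[str, int]:
    # View 1: current balance per account (suits key-value lookups).
    balances: Dict[str, int] = defaultdict(int)
    for event in events:
        sign = 1 if event["type"] == "credit" else -1
        balances[event["account"]] += sign * event["amount"]
    return dict(balances)


def history_view(events: List[dict]) -> Dict[str, List[Tuple[str, int]]]:
    # View 2: ordered transaction history per account (suits audits).
    history: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for event in events:
        history[event["account"]].append((event["type"], event["amount"]))
    return dict(history)


balances = balance_view(ledger)
history = history_view(ledger)
```

Because every view is derived from the same raw events, a view can be dropped and rebuilt, or a new one added later, without touching the ledger itself.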
DynamoDB
• MQTT-based data from IoT
• Firehose stores the data in S3
• Kinesis Data Analytics gets the data from Firehose, analyzes it, and sends the results to Firehose to store in S3
• Using Lambda, the data is enriched and stored in DynamoDB
• Using a web-based app, the user gets the data from DynamoDB
Source: https://www.youtube.com/watch?v=uWUAcc68MWI
framework. It was designed to be easy to extend and most of the important components are pluggable.
2. Pumba: A chaos testing and network emulation tool for Docker.
3. Chaos Lemur: Self-hostable application to randomly destroy virtual machines in a BOSH-managed environment, as an aid to resilience testing of high-availability systems.
4. Chaos Lambda: Randomly terminate AWS ASG instances during business hours.
5. Blockade: Docker-based utility for testing network failures and partitions in distributed applications.
6. Chaos-http-proxy: Introduces failures into HTTP requests via a proxy server.
7. Monkey-ops: A simple service implemented in Go, which is deployed into OpenShift V3.X and generates some chaos within it. Monkey-Ops seeks some OpenShift components like Pods or Deployment Configs and randomly terminates them.
8. Chaos Dingo: Currently supports performing operations on Azure VMs and VMSS deployed to an Azure Resource Manager-based resource group.
9. Tugbot: Testing in Production (TiP) framework for Docker.
Testing tools