Slide 1

Slide 1 text

Building Data Pipelines on Apache NiFi with Shuhsi Lin 20190921 at PyCon TW

Slide 2

Slide 2 text

About Me
Shuhsi Lin
sucitw gmail.com
https://medium.com/@suci/
Lurking in PyHug, Taipei.py and various Meetups
Scrum Master + Data Engineer in a manufacturing company
Working with
● data and people
Focus on
● Agile/Engineering culture
● Streaming process
● IoT applications
● Data visualization

Slide 3

Slide 3 text

Agenda
What we will focus on
1. Data Pipelines/Data Flow
2. Key Concepts of NiFi
3. Why you may need NiFi
4. Control NiFi with Python
5. Work in NiFi with Python
6. A path to dive into NiFi

Slide 4

Slide 4 text

What we will not focus on
1. Deployment/Operation/Monitoring/Administration
2. Infrastructure Details
3. NiFi Registry
4. MiNiFi
5. Advanced Data Flow Design

Slide 5

Slide 5 text

Data pipeline
Hello ETL/ELT World
● ETL: Extract, Transform, Load
● ELT: Extract, Load, Transform
[Figure: Input Data -> Pipe (Operation) -> Output Data]
Source: https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini

Slide 6

Slide 6 text

3 Paradigms for Programming Interactive Request/Response https://qconnewyork.com/ny2016/ny2016/presentation/large-scale-stream-processing-apache-kafka.html Batch Stream Processing Input Output

Slide 7

Slide 7 text

Simplistic Data Flow
[Figure: Data Store/Application A -> Acquire/Ingest Data -> Process and Analyze Data -> Data Store/Application B]
● Data movement as flow
● Moving data content from A to B

Slide 8

Slide 8 text

Data Pipeline
[Figure: diverse sources (Application, Data Store) -> Extract/Acquire Data -> Transform/Process (Parse, Filter, Split/Merge, Route) -> Load/Ingest/Move -> diverse targets (BI tool, Target data store)]
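
The transform steps named on this slide (parse, filter, route) can be sketched in plain Python, without NiFi. This is only an illustration: the `sensor,value` record format, the threshold, and all function names are assumptions made for the example, not something from the talk.

```python
import csv
import io


def parse(lines):
    """Parse CSV lines of the (assumed) form 'sensor,value' into dicts."""
    for row in csv.reader(lines):
        if len(row) == 2:
            yield {"sensor": row[0], "value": float(row[1])}


def keep_valid(records):
    """Filter step: drop records with negative readings."""
    return (r for r in records if r["value"] >= 0)


def route(records, threshold=50.0):
    """Routing step: send each record to one of two named targets."""
    targets = {"alerts": [], "archive": []}
    for r in records:
        key = "alerts" if r["value"] > threshold else "archive"
        targets[key].append(r)
    return targets


# Extract (in-memory source) -> transform -> load (into per-target lists).
raw = io.StringIO("a,10\nb,75\nc,-1\n")
result = route(keep_valid(parse(raw)))
```

Each stage only consumes the output of the previous one, which is roughly the shape a NiFi flow gives you with processors and connections instead of function calls.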

Slide 9

Slide 9 text

Many Flow-like Data in the Real World
Across organizations / business units / geographic locations

Slide 10

Slide 10 text

What Often Happens in Complex (Large Data) Data Pipelines
● Data governance
● An architecture/platform, not just pipelines
● Data quality issues of all kinds
● Custom -> Competing standards
● Data auditing and Security
● Project/Requirement management
● Maintenance/operation
Data
● Standards
● Schemas/Formats/Protocols
● Veracity
● Validity
● Data auditing
● Partitioning/Bundling
● Increase in data velocity
● Unclear/uncontrollable data source
Infrastructure/Operation support
● “Exactly Once” Delivery
● Ensuring Security
● Overcoming Security
● Credential Management
● Network
● Long-term maintenance
● Scalability
People
● Compliance
● Person | team | group
● Consumers
Changes
● Requirements Unclear/Change
● “Exactly Once” Delivery Req
Source: Apache NiFi Crash Course, DataWorks Summit 2018

Slide 11

Slide 11 text

So, What Is NiFi, and Why?
Preferred pronunciation: "nye fye" (nī fī)

Slide 12

Slide 12 text

NiFi’s Features
● Server/Data Center class
● Web UI & REST API
● Highly configurable
● Data Provenance
● Designed for extension
● Secure
Created in 2006 at the NSA, donated to the ASF in 2014
Similar in concept to Flow-Based Programming (FBP)

Slide 13

Slide 13 text

Process and Distribute Data
● Moves data around systems and gives you tools to process this data
● Deals with a great variety of data sources and formats
● Takes data from one source, transforms it, and pushes it to a different data sink
Source: https://medium.com/free-code-camp/nifi-surf-on-your-dataflow-4f3343c50aa2

Slide 14

Slide 14 text

Why NiFi
● Single ingestion platform for
○ Various data sources and targets
○ Resilience to component updates (300+ processors, data sources, transformation logic, job scheduling, ...)
○ Access control (Groups, Users, LDAP)
● Scalable (Clustering)
● Reliability
○ Guaranteed delivery (no data loss)
○ Data provenance/lineage (auditing pipeline)
● Fast develop/deploy cycle
● Better data flow visibility (across multi-disciplinary teams/peers)
● Data processing without programming
[Figure: The four Vs of Big Data]
Source: https://medium.com/free-code-camp/nifi-surf-on-your-dataflow-4f3343c50aa2

Slide 15

Slide 15 text

Let’s See Some Cases Photo by Jared Erondu on Unsplash

Slide 16

Slide 16 text

https://nifi.apache.org/powered-by-nifi.html

Slide 17

Slide 17 text

NiFi Values for Renault
Source: “Best practices and lessons learnt from Running Apache NiFi at Renault”, DataWorks Summit 2018

Slide 18

Slide 18 text

Source: “Best practices and lessons learnt from Running Apache NiFi at Renault”, DataWorks Summit 2018

Slide 19

Slide 19 text

Kylo uses Apache NiFi for orchestrating data pipelines.
https://kylo.readthedocs.io/en/v0.10.0/

Slide 20

Slide 20 text

Pipeline Builder: Micron’s Journey Automating the Global Data Warehouse, DataWorks Summit 2019

Slide 21

Slide 21 text

How NiFi Works
Photo by Tomas Robertson on Unsplash

Slide 22

Slide 22 text

NiFi Web UI
● Drag and drop processors to build a flow
● Start, stop, and configure components in real time
● View errors and corresponding error messages
● View statistics and health of the data flow
● Create templates of common processors & connections
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#versioning_dataflow

Slide 23

Slide 23 text

Hello-World NiFi
● Drag and drop two components onto the canvas.
● Two processors linked together by one connection (queue)
Simple data flow: Processor 1 - Connection - Processor 2
[Figure: Processor 1, GenerateFlowFile (generates content) -> connection carrying a FlowFile ("I am flowfile") -> Processor 2, PutFile (stores content)]
https://medium.com/@suci/hello-world-nifi-dcafcba0fdb0

Slide 24

Slide 24 text

NiFi Architecture/Terminology
● Dataflow (NiFi pipeline)
● FlowFile
● Processor / Process Group
● Connection
● Flow Controller
● Provenance

Slide 25

Slide 25 text

FlowFile
● Attributes
○ key/value pairs
○ Held in JVM heap memory (swappable)
○ E.g., the file name, file path, and a unique identifier are standard attributes
● Content
○ a reference to the stream of bytes that composes the FlowFile content
A FlowFile object is immutable
Source: How Apache Nifi works — surf on your dataflow, don’t drown in it
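
As a rough mental model of the slide above, a FlowFile can be pictured as an immutable pair of attributes plus a pointer to content. The sketch below is plain Python, not NiFi's Java implementation; the class names, the `ContentClaim` fields, and the `with_attribute` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, replace
import uuid


@dataclass(frozen=True)
class ContentClaim:
    """Pointer to bytes stored elsewhere (hypothetical fields)."""
    container: str
    offset: int
    length: int


@dataclass(frozen=True)
class FlowFile:
    attributes: tuple       # sorted (key, value) pairs, kept immutable
    content: ContentClaim   # a reference, not the bytes themselves

    def get(self, key):
        return dict(self.attributes).get(key)

    def with_attribute(self, key, value):
        """Return a new FlowFile; the original is left unchanged."""
        merged = dict(self.attributes)
        merged[key] = value
        return replace(self, attributes=tuple(sorted(merged.items())))


def new_flowfile(filename, path, claim):
    # File name, path, and a unique identifier as the standard attributes.
    attrs = {"filename": filename, "path": path, "uuid": str(uuid.uuid4())}
    return FlowFile(tuple(sorted(attrs.items())), claim)


original = new_flowfile("hello.txt", "./", ContentClaim("claim-1", 0, 13))
updated = original.with_attribute("greeting", "I am flowfile")
```

Changing an attribute yields a new object that still points at the same content, which is why attribute updates are cheap compared with content rewrites.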

Slide 26

Slide 26 text

A FlowFile compared with an HTTP message
● FlowFile: Attributes + Content/payload
● HTTP: Header + Message body
Header:
HTTP/1.1 200 OK
Date: Sun, 10 Oct 2010 23:26:07 GMT
Server: Apache/2.2.8 (Ubuntu) mod_ssl/2.2.8 OpenSSL/0.9.8g
Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
ETag: "45b6-834-49130cc1182c0"
Accept-Ranges: bytes
Content-Length: 12
Connection: close
Content-Type: text/html
Message body:
Hello world!
Sources: https://en.wikipedia.org/wiki/HTTP_message_body and Hands-on with Apache NiFi and MiNiFi | Berlin Buzzwords 2017

Slide 27

Slide 27 text

FlowFile Processor
● Black box / high-level abstraction of a data operation
[Figure: Three different kinds of processors, linked by connections]
Source: How Apache Nifi works — surf on your dataflow, don’t drown in it

Slide 28

Slide 28 text

Backpressure
● Two processors linked by a connector with its limit respected: the number of FlowFiles is below the threshold.
● Once the limit is reached, P1 is not scheduled until the connector goes back below its threshold.
● The Flow Controller then schedules P1 for execution again.
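
The backpressure behaviour described on this slide can be imitated with a small simulation. It is a toy model, not NiFi's scheduler: the `Connection` class, the tick loop, and the chosen threshold and rates are all assumptions for illustration.

```python
from collections import deque


class Connection:
    """A queue between two processors with a FlowFile-count threshold."""

    def __init__(self, threshold):
        self.queue = deque()
        self.threshold = threshold

    def is_full(self):
        return len(self.queue) >= self.threshold


def run(ticks, threshold=3, consume_every=2):
    """P1 produces on every tick unless backpressure applies; P2 is slower."""
    conn = Connection(threshold)
    produced = 0
    skipped = 0
    for tick in range(ticks):
        if conn.is_full():
            skipped += 1              # P1 is not scheduled on this tick
        else:
            conn.queue.append(f"flowfile-{produced}")
            produced += 1
        if tick % consume_every == consume_every - 1 and conn.queue:
            conn.queue.popleft()      # P2 drains one FlowFile
    return produced, skipped, len(conn.queue)


produced, skipped, queued = run(ticks=10)
```

With these numbers the queue never grows past the threshold: the producer is simply skipped until the consumer has made room again.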

Slide 29

Slide 29 text

Process Group: building a new processor from existing processors
Source: How Apache Nifi works — surf on your dataflow, don’t drown in it

Slide 30

Slide 30 text

FlowFile Content in the Content Repository
● FlowFiles hold pointer references to content stored in the repository
https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html#deeper-view-content-claim
Source: How Apache Nifi works — surf on your dataflow, don’t drown in it

Slide 31

Slide 31 text

Copy-on-write in NiFi
● Example: a processor compresses the content
● The original content is still present in the repository after a FlowFile modification.
Source: How Apache Nifi works — surf on your dataflow, don’t drown in it
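
The copy-on-write idea can be sketched with an append-only store: a modification writes new content under a new claim and leaves the old bytes in place. This is a simplified stand-in for illustration; `ContentRepository` and its integer claim ids are hypothetical, and real NiFi content claims are more involved.

```python
import zlib


class ContentRepository:
    """Append-only store: existing content is never overwritten."""

    def __init__(self):
        self._claims = {}
        self._next_id = 0

    def write(self, data):
        claim_id = self._next_id
        self._claims[claim_id] = bytes(data)
        self._next_id += 1
        return claim_id

    def read(self, claim_id):
        return self._claims[claim_id]


repo = ContentRepository()
original_claim = repo.write(b"hello " * 100)

# "Compressing the content" creates a second claim instead of editing the first.
compressed_claim = repo.write(zlib.compress(repo.read(original_claim)))
```

Because the original claim survives, an earlier state of the data stays available, which is the property that provenance replay relies on.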

Slide 32

Slide 32 text

FlowFile Repository
● Contains metadata about the FlowFiles currently in the flow.
Source: How Apache Nifi works — surf on your dataflow, don’t drown in it

Slide 33

Slide 33 text

Provenance Repository
Provenance Event
● Tracks the complete history of all the FlowFiles in the flow
● Records the metadata and context information of each FlowFile
● Allows replaying the data from any point in time
Source: How Apache Nifi works — surf on your dataflow, don’t drown in it

Slide 34

Slide 34 text

WebCrawler Template
Source: Apache NiFi In Depth

Slide 35

Slide 35 text

https://github.com/bbende/nifi-streaming-examples

Slide 36

Slide 36 text

https://github.com/bbende/nifi-streaming-examples

Slide 37

Slide 37 text

NiFi flow design is like software development
Stages: Design, Code/develop, Test, Deploy, Monitor, Evolve/Refactor
Programming language
● Integrated Development Environment (IDE)
● Algorithm, develop
● Functions/Modules (arguments, results)
● Libraries, packages
● ...
Apache NiFi
● GUI
● Flow design, drag and drop
● Process groups (input/output)
● Templates
● ...
Source: “Best practices and lessons learnt from Running Apache NiFi”, DataWorks Summit 2018

Slide 38

Slide 38 text

NiFi Positioning
[Figure: Apache NiFi positioned relative to]
● Processing frameworks (Spark, Storm, ...)
● Messaging buses (Kafka, JMS, ...)
● Enterprise service buses (Fuse (Camel), Mule, ...)
● ETL tools (SSIS, Informatica, ...)

Slide 39

Slide 39 text

So, should I use it?
● Review what you need and choose accordingly
● Do you need an enterprise dataflow platform?
● Do you need to integrate with many different big data solutions?
● You should be able to build the same pipeline without NiFi
● You may need to enable administration functions for large-scale usage

Slide 40

Slide 40 text

Another way to access NiFi
Photo by Lachlan Donald on Unsplash

Slide 41

Slide 41 text

REST API
A programmatic way to command and control a NiFi instance in real time
https://nifi.apache.org/docs/nifi-docs/rest-api/index.html

Slide 42

Slide 42 text

Get NiFi instance information
Request: http://NIFI_instance/nifi-api/flow/about
Response:
{
  "about": {
    "title": "NiFi",
    "version": "1.9.2",
    "uri": "http://localhost:38080/nifi-api/",
    "contentViewerUrl": "../nifi-content-viewer/",
    "timezone": "UTC",
    "buildTag": "nifi-1.9.2-RC2",
    "buildRevision": "ff01ff6",
    "buildBranch": "NIFI-6169-RC2",
    "buildTimestamp": "04/03/2019 15:25:53 UTC"
  }
}
https://nifi.apache.org/docs/nifi-docs/rest-api/
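
A minimal sketch of the same call from Python, using only the standard library. The helper names are hypothetical, and it assumes an unsecured instance reachable over plain HTTP; `get_about` is defined but not called here, so the snippet runs without a NiFi server.

```python
import json
import urllib.request


def about_url(base_url):
    """Build the /flow/about endpoint URL from the instance base URL."""
    return base_url.rstrip("/") + "/nifi-api/flow/about"


def parse_about(body):
    """Extract title and version from the JSON response body."""
    about = json.loads(body)["about"]
    return about["title"], about["version"]


def get_about(base_url, timeout=10):
    """Perform the GET request; needs a reachable NiFi instance."""
    with urllib.request.urlopen(about_url(base_url), timeout=timeout) as resp:
        return parse_about(resp.read().decode("utf-8"))


# Offline check against an abbreviated copy of the response on the slide.
sample = '{"about": {"title": "NiFi", "version": "1.9.2"}}'
title, version = parse_about(sample)
```

Against a running instance, `get_about("http://localhost:38080")` would be the call to try, with the port taken from the slide's example.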

Slide 43

Slide 43 text

Search
Request: http://NIFI_instance/nifi-api/flow/search-results?q=pycon
Response:
{
  "searchResultsDTO": {
    "processorResults": [],
    "connectionResults": [],
    "processGroupResults": [
      {
        "id": "2a5d288c-016d-1000-7ede-e736a588559b",
        "groupId": "382a2072-016c-1000-4c24-50b6a90ec204",
        "parentGroup": {
          "id": "382a2072-016c-1000-4c24-50b6a90ec204",
          "name": "NiFi Flow"
        },
        "name": "PyconTW 2019",
        "matches": [ "Name: PyconTW 2019" ]
      }
    ],
    "inputPortResults": [],
    "outputPortResults": [],
    "remoteProcessGroupResults": [],
    "funnelResults": []
  }
}
https://nifi.apache.org/docs/nifi-docs/rest-api/
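
The search response can be reduced to the matching process groups with a few lines of Python. The function names are hypothetical, and the sample body is a shortened copy of the response shown on the slide.

```python
import json
from urllib.parse import urlencode


def search_url(base_url, term):
    """Build the search endpoint URL with a URL-encoded query term."""
    query = urlencode({"q": term})
    return base_url.rstrip("/") + "/nifi-api/flow/search-results?" + query


def process_group_matches(body):
    """Return (id, name) pairs for every process group in the results."""
    results = json.loads(body)["searchResultsDTO"]["processGroupResults"]
    return [(pg["id"], pg["name"]) for pg in results]


sample = json.dumps({
    "searchResultsDTO": {
        "processorResults": [],
        "processGroupResults": [
            {"id": "2a5d288c-016d-1000-7ede-e736a588559b",
             "name": "PyconTW 2019"}
        ],
    }
})
matches = process_group_matches(sample)
url = search_url("http://NIFI_instance", "pycon")
```

The id returned here is what later calls, such as starting or stopping components, would take as input.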

Slide 44

Slide 44 text

Stop a processor
processor_id = 2b5da9d5-016d-1000-d67f-bcd140eccb41
Request:
PUT /nifi-api/processors/2b5da9d5-016d-1000-d67f-bcd140eccb41 HTTP/1.1
Host: localhost:38080
Content-Type: application/json
Cache-Control: no-cache

{
  "revision": {
    "clientId": "2a4758dc-016d-1000-56b9-2fc8baffbc07",
    "version": 7
  },
  "component": {
    "id": "2b5da9d5-016d-1000-d67f-bcd140eccb41",
    "state": "STOPPED"
  }
}
https://nifi.apache.org/docs/nifi-docs/rest-api/
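
The PUT request above could be assembled in Python as follows. This sketch only builds the request object and does not send it; `build_stop_request` is a hypothetical helper, and the ids and revision number are the ones shown on the slide. In practice the revision is expected to match the component's current revision on the server, so it would normally be fetched first.

```python
import json
import urllib.request


def build_stop_request(base_url, processor_id, client_id, version):
    """Build (but do not send) the PUT request that stops a processor."""
    payload = {
        "revision": {"clientId": client_id, "version": version},
        "component": {"id": processor_id, "state": "STOPPED"},
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/nifi-api/processors/{processor_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_stop_request(
    "http://localhost:38080",
    "2b5da9d5-016d-1000-d67f-bcd140eccb41",
    "2a4758dc-016d-1000-56b9-2fc8baffbc07",
    7,
)
# urllib.request.urlopen(request) would send it to a running instance.
```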

Slide 45

Slide 45 text

NiPyApi
Nifi-Python-Api: A rich Apache NiFi Python Client SDK, by Dan Chaffelson
● Detailed documentation of the full SDK at all levels
● CRUD wrappers for common task areas like Processor Groups, Processors, Templates, Registry Clients, Registry Buckets, Registry Flows, etc.
● Convenience functions for inventory tasks, such as recursively retrieving the entire canvas, or a flat list of all Process Groups
● Support for scheduling and purging flows, controller services, and connections
● Support for fetching and updating Variable Registries
● Support for import/export of Versioned Flows from NiFi-Registry
● Docker Compose configurations for testing and deployment
● A scripted deployment of an interactive environment, and a secured configuration, for testing and demonstration purposes

Slide 46

Slide 46 text

NiPyApi Client SDK modules
● Canvas: Interactions with the NiFi Canvas
● Config: A set of defaults and parameters
● Security: Secure connectivity management
● System: For system and cluster level functions in NiFi
● Templates: For managing flow deployments
● Utils: Convenience utility functions for NiPyApi (not intended for external use)
● Versioning: For interactions with the NiFi Registry Service and related functions
https://github.com/sucitw/python-script-in-NiFi/blob/master/RESTAPI/nipyapi_demo.ipynb
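
With NiPyApi, stopping a processor might look like the sketch below. It assumes the third-party `nipyapi` package is installed and that an unsecured NiFi instance is listening at the given URL; the wrapper function and its default URL are hypothetical, and the linked notebook remains the reference for the author's own usage.

```python
def stop_processor_by_name(name, nifi_api_url="http://localhost:38080/nifi-api"):
    """Stop the first processor whose name matches; returns True on success."""
    import nipyapi  # third-party package: pip install nipyapi

    # Point the client at the instance (part of the Config module).
    nipyapi.config.nifi_config.host = nifi_api_url

    # Canvas lookup: None for no match, one object, or a list of objects.
    processor = nipyapi.canvas.get_processor(name, identifier_type="name")
    if processor is None:
        return False
    if isinstance(processor, list):
        processor = processor[0]

    # scheduled=False asks NiFi to stop the processor.
    return nipyapi.canvas.schedule_processor(processor, scheduled=False)
```

Compared with the raw REST call, the client takes care of the revision bookkeeping, so only the processor name is needed.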

Slide 47

Slide 47 text

Choose what you need from many processors
Dynamic vs. compiled processors
Photo by Mitch Lensink on Unsplash

Slide 48

Slide 48 text

300+ Processors for Data Ecosystem Integration
“Swiss Army Knife of Data Movement”
Source: Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi, DataWorks Summit 2019

Slide 49

Slide 49 text

Processors
Data Ingestion ● GenerateFlowFile ● GetFile ● GetFTP ● GetJMSQueue ● GetHDFS ● GetKafka ● GetMongo ● GetTwitter ● ListHDFS/FetchHDFS
Data Transformation ● ReplaceText ● ConvertRecord ● UpdateRecord ● ConvertJSONToSQL ● CompressContent ● ConvertCharacterSet ● TransformXml ● ...
Data Egress / Sending Data ● PutEmail ● PutFile ● PutFTP ● PutSQL ● PutMongo ● PutHDFS ● PutHiveQL ● ...
Routing and Mediation ● ControlRate ● RouteOnAttribute ● RouteOnContent ● ValidateCsv ● DetectDuplicate ● DistributeLoad ● ...
Database Access ● ExecuteSQL ● ListDatabaseTables ● ...
Attribute Extraction ● EvaluateJsonPath ● ExtractText ● UpdateAttribute ● ...
System Interaction ● ExecuteProcess ● ExecuteStreamCommand ● ExecuteScript ● ...
Splitting and Aggregation ● SplitText ● SplitJson ● MergeContent ● ...
HTTP ● GetHTTP ● ListenHTTP ● InvokeHTTP ● ...
AWS ● FetchS3Object ● GetSQS ● GetDynamoDB ● ...
There are more...

Slide 50

Slide 50 text

https://nifi.apache.org/docs.html

Slide 51

Slide 51 text

Still looking for what you really need?

Slide 52

Slide 52 text

Have you heard about Custom Development? Photo by Jan Kopřiva on Unsplash

Slide 53

Slide 53 text

Capabilities of Custom Development
● ExecuteProcess / ExecuteStreamCommand
● ExecuteScript / InvokeScriptedProcessor
○ ExecuteGroovyScript
● Custom Processor (Java)
Source: BYOP: Custom Processor Development with Apache NiFi, Andy LoPresto, DataWorks Summit 2019

Slide 54

Slide 54 text

ExecuteProcess & ExecuteStreamCommand
ExecuteProcess
● Executes an arbitrary process and captures its output to a FlowFile
● No incoming FlowFile (no upstream connections)
ExecuteStreamCommand
● Executes an external command on the contents of a FlowFile
○ Pipes FlowFile content to STDIN
● Creates a new FlowFile with the results of the command
○ Populates outgoing FlowFile content from STDOUT
○ Can also direct the output to a named attribute
● E.g., an external Python/Shell script
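
An external Python script meant for ExecuteStreamCommand only has to read STDIN and write STDOUT. The sketch below uses upper-casing as a placeholder for real logic; the function names are hypothetical, and in-memory streams stand in for the real pipes so the snippet runs on its own.

```python
import io
import sys


def transform(text):
    """Upper-case every line; stands in for real business logic."""
    return "".join(line.upper() for line in text.splitlines(keepends=True))


def main(stdin, stdout):
    """Read FlowFile content from stdin, write the new content to stdout."""
    stdout.write(transform(stdin.read()))


# Demonstration with in-memory streams. In a script launched by NiFi,
# the call would be main(sys.stdin, sys.stdout) instead.
out = io.StringIO()
main(io.StringIO("hello\nnifi\n"), out)
result = out.getvalue()
```

Because the script runs in a regular CPython process, it can use C-based libraries, which the Jython-based ExecuteScript cannot.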

Slide 55

Slide 55 text


Slide 56

Slide 56 text

Deliver custom logic
● ExecuteScript
● InvokeScriptedProcessor (ISP)
Photo by Thabang Mokoena on Unsplash

Slide 57

Slide 57 text

ExecuteScript (ES)
● Clojure, Groovy, Ruby, Python (Jython, no C libs), Lua, JavaScript
○ JSR-223 compatible languages
● Easy to use, fast development
● Script source is inline or loaded from the file system
● ExecuteScript => evaluates the script in “onTrigger()”
● Only two relationships available (REL_SUCCESS and REL_FAILURE)
● Poor performance
https://nifi.apache.org/docs.html (ExecuteScript)
https://github.com/sucitw/python-script-in-NiFi

Slide 58

Slide 58 text

ExecuteScript in Python (Jython)
Basic NiFi APIs you need to know
● ProcessSession
● ProcessContext
● ComponentLog
● Relationship
● ...
https://nifi.apache.org/developer-guide.html

Slide 59

Slide 59 text

Hello-World ExecuteScript in Python
Update Attribute
https://github.com/sucitw/python-script-in-NiFi/blob/master/hello_world(update_attribute).py
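
A hello-world script of this kind might look like the sketch below; the linked file is the author's actual version, and this one is an independent illustration. It relies on the `session` and `REL_SUCCESS` variables that ExecuteScript binds for the script, while the attribute name, its value, and the guard that lets the file run outside NiFi are assumptions of this example.

```python
def update_attribute(session, rel_success):
    """Add a 'greeting' attribute to one incoming FlowFile, if any."""
    flow_file = session.get()
    if flow_file is None:
        return False                      # nothing queued on this trigger
    # putAttribute returns the updated FlowFile; keep the returned reference.
    flow_file = session.putAttribute(flow_file, "greeting", "Hello, PyCon TW")
    session.transfer(flow_file, rel_success)
    return True


# Inside ExecuteScript, `session` and `REL_SUCCESS` are provided by NiFi.
# Outside NiFi they are undefined, so the call is skipped.
try:
    _bindings = (session, REL_SUCCESS)
except NameError:
    _bindings = None

if _bindings:
    update_attribute(*_bindings)
```

ExecuteScript commits the session for the script, so there is no explicit commit here.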

Slide 60

Slide 60 text

InvokeScriptedProcessor (ISP)
● Faster than ExecuteScript (ES)
● Differences from ExecuteScript
○ Supports custom properties and relationships
○ ES handles session.commit() for you, but ISP does not.
○ ES provides a "session" variable, whereas the ISP onTrigger() method must call sessionFactory.createSession()
https://nifi.apache.org/docs.html (InvokeScriptedProcessor)
https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-scripting-bundle/nifi-scripting-processors/src/test/resources/jython

Slide 61

Slide 61 text

Hello-World InvokeScriptedProcessor in Python
https://github.com/sucitw/python-script-in-NiFi/blob/master/InvokeScripted_hello_world(update_attribute).py

Slide 62

Slide 62 text

Photo by Ben Weber on Unsplash Tips and Tricks Suggestions

Slide 63

Slide 63 text

Suggestions
● Be able to do the same ETL/ELT without NiFi
● Do it like software development
○ Focus on business logic first
○ Do it after design
○ Keep it simple
○ Separate environments: Dev, Test, Production
○ Versioning
○ Build once, deploy many
● Extract - Load - Transform (with external tools)
● Design for failure (fault tolerance)
○ Good retry/error handling mechanism
○ Monitoring
○ Clustering, at least in Production
● Use process groups (PGs)
● PGs => Ingestion, test, and monitoring
● Naming convention
● Put labels/comments
● Use variables, use funnels
● Scheduling -> Automation
● Define SLAs with your clients
● Do monitoring and alerting
● Use NiFi for handling routing and formatting
● Try not to queue data
● Control the FlowFile stream rate
● ...

Slide 64

Slide 64 text

Recap
● ETL/ELT pipelines and why they are hard
● How NiFi may help you
● Hello-World NiFi
● Key concepts and terms of NiFi (FlowFile, Processors, Connections)
● REST API with examples (NiPyApi)
● ExecuteProcess/ExecuteStreamCommand
● ExecuteScript/InvokeScriptedProcessor with Python
● Suggestions on NiFi

Slide 65

Slide 65 text

References
1. How Apache Nifi works — surf on your dataflow, don’t drown in it, by François Paupier
2. Pipeline Builder: Micron’s Journey Automating the Global Data Warehouse, DataWorks Summit 2019
3. Best practices for running Apache NiFi in production - 3 takeaways from real world projects / “Best practices and lessons learnt from Running Apache NiFi”, DataWorks Summit 2018
4. BYOP: Custom Processor Development in Apache NiFi, DataWorks Summit Barcelona 2019
5. Hello-World in Apache NiFi
6. Nifi 開發小技巧(Funnel) [NiFi development tips (Funnel)]

Slide 66

Slide 66 text

More about NiFi
● Apache NiFi In Depth
● Play with more processors
○ Choose what you need and do it right
● Learn from Dataflow Templates
● NiFi Expression Language
● Build Your Own Processor
○ BYOP: Custom Processor Development with Apache NiFi
● MiNiFi
● NiFi Registry