About Me
Shuhsi Lin (sucitw gmail.com, https://medium.com/@suci/)
● Scrum Master + Data Engineer in a manufacturing company
● Lurking in PyHug, Taipei.py and various Meetups
● Working with data and people
● Focus on
○ Agile/Engineering culture
○ Streaming process
○ IoT applications
○ Data visualization
Agenda: What we will focus on
1. Data Pipelines/Data Flow
2. Key Concepts of NiFi
3. Why you may need NiFi
4. Control NiFi with Python
5. Work in NiFi with Python
6. A path to dive into NiFi
What we will not focus on
1. Deployment/Operation/Monitoring/Administration
2. Infrastructure Details
3. NiFi Registry
4. MiNiFi
5. Advanced Data Flow Design
Hello ETL/ELT World
● Data pipeline: Input Data → Pipe Operation → Output Data
● ETL: Extract, Transform, Load
● ELT: Extract, Load, Transform
https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini
Simplistic Data Flow
Data Store/Application A → Acquire/Ingest Data → Process and Analyze Data → Data Store/Application B
● Data movement as flow
● Moving data content from A to B
Data Pipeline
● Diverse sources (Application, Data Store): Extract/Acquire Data
● Transform/Process: parse, filter, split/merge, route
● Diverse targets (BI tool, Target data store): Load/Ingest/Move
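The parse, filter and route steps named above can be sketched in plain Python. This is only an illustration of the pipeline idea, not NiFi code: the sensor records, the `temp_c` field and the 30 °C routing rule are made-up assumptions.

```python
import json


def extract(lines):
    """Acquire: parse each raw JSON line, skipping malformed input."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def transform(records, min_temp):
    """Filter out cold readings and derive a Fahrenheit field."""
    for rec in records:
        temp = rec.get("temp_c")
        if temp is None or temp < min_temp:
            continue
        yield {**rec, "temp_f": temp * 9 / 5 + 32}


def route(records):
    """Load: route each record to one of two hypothetical targets."""
    targets = {"hot": [], "normal": []}
    for rec in records:
        key = "hot" if rec["temp_c"] >= 30 else "normal"
        targets[key].append(rec)
    return targets


# Hypothetical input: three valid readings and one malformed line.
raw = [
    '{"sensor": "a", "temp_c": 35}',
    "not json",
    '{"sensor": "b", "temp_c": 20}',
    '{"sensor": "c", "temp_c": -5}',
]
result = route(transform(extract(raw), min_temp=0))
```

Even this toy version has to decide what happens to malformed or filtered records, which is the kind of concern a dataflow tool makes visible.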
NiFi’s Features
● Server/Data Center class
● Web UI & REST API
● Highly configurable
● Data Provenance
● Designed for extension
● Secure
Created in 2006 at the NSA, donated to the ASF in 2014
Concepts similar to Flow-Based Programming (FBP)
Process and Distribute Data
● Moves data around systems and gives you tools to process this data
● Deals with a great variety of data sources and formats
● Takes data from one source, transforms it, and pushes it to a different data sink
https://medium.com/free-code-camp/nifi-surf-on-your-dataflow-4f3343c50aa2
Why NiFi
● Single ingestion platform for
○ Various data sources and targets
○ Resilience to component updates (300+ processors, data sources, transformation logic, job scheduling ...)
○ Access control (Groups, Users, LDAP)
● Scalable (Clustering)
● Reliability
○ Guaranteed delivery (no data loss)
○ Data provenance/lineage (auditing pipeline)
● Fast develop/deploy cycle
● Better data flow visibility (across multi-disciplinary teams/peers)
● Data processing without programming
Figure: The four Vs of Big Data (https://medium.com/free-code-camp/nifi-surf-on-your-dataflow-4f3343c50aa2)
NiFi Web UI
● Drag and drop processors to build a flow
● Start, stop and configure components in real time
● View errors and corresponding error messages
● View statistics and health of the data flow
● Create templates of common processors & connections
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#versioning_dataflow
Hello-World NiFi
● Drag and drop two components onto the canvas.
● Two processors linked together by one connection (queue)
Simple data flow: Processor 1 → Connection → Processor 2
● Processor 1: GenerateFlowFile (generates content)
● Processor 2: PutFile (stores content)
● Figure label: "I am flowfile"
https://medium.com/@suci/hello-world-nifi-dcafcba0fdb0
FlowFile
● Attributes
○ Key/value pairs
○ Held in JVM heap memory space (swappable)
○ E.g., the file name, file path, and a unique identifier are standard attributes
● Content
○ A reference to the stream of bytes that compose the FlowFile content
● The FlowFile object is immutable
Source: Reference 1
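The attribute/content split above can be mimicked in a few lines of Python. This is a conceptual sketch, not NiFi's actual classes: `ContentClaim`, `new_flowfile` and `put_attribute` are hypothetical names, and only the three standard attributes (`filename`, `path`, `uuid`) come from NiFi itself.

```python
import uuid
from dataclasses import dataclass, replace
from types import MappingProxyType


@dataclass(frozen=True)
class ContentClaim:
    """Pointer to bytes stored elsewhere (hypothetical layout)."""
    container: str
    offset: int
    length: int


@dataclass(frozen=True)
class FlowFile:
    """Immutable pair of attributes and a reference to the content."""
    attributes: MappingProxyType
    content: ContentClaim


def new_flowfile(filename, path, claim):
    """Create a FlowFile carrying the three standard attributes."""
    attrs = {"filename": filename, "path": path, "uuid": str(uuid.uuid4())}
    return FlowFile(MappingProxyType(attrs), claim)


def put_attribute(flowfile, key, value):
    """Return a new FlowFile with one more attribute; the input is untouched."""
    attrs = dict(flowfile.attributes)
    attrs[key] = value
    return replace(flowfile, attributes=MappingProxyType(attrs))


claim = ContentClaim(container="default", offset=0, length=11)
ff1 = new_flowfile("hello.txt", "./", claim)
ff2 = put_attribute(ff1, "mime.type", "text/plain")
```

Changing an attribute yields a new object that still points at the same content claim, so no bytes are copied.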
FlowFile Processor
● Black box/high-level abstraction of a data operation
● Figure: three different kinds of processors, linked by connections
Source: Reference 1
Backpressure
● Two processors linked by a connection with its limit respected: the number of FlowFiles is below the threshold.
● When the connection reaches its threshold, P1 is not scheduled until the connection goes back below it.
● Once the number of FlowFiles is below the threshold again, the Flow Controller schedules P1 for execution again.
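The scheduling rule described above can be simulated with a bounded queue. This is a toy model under stated assumptions (one FlowFile per tick, a consumer that runs every third tick), not how NiFi's Flow Controller is implemented; `Connection` and `simulate` are hypothetical names.

```python
from collections import deque


class Connection:
    """A queue between two processors with an object-count threshold."""

    def __init__(self, threshold):
        self.queue = deque()
        self.threshold = threshold

    def is_full(self):
        return len(self.queue) >= self.threshold


def simulate(ticks, threshold, consume_every):
    """Return (tick, P1 status, queue depth) for each tick."""
    conn = Connection(threshold)
    log = []
    for tick in range(ticks):
        # P1 (producer) is scheduled only while the connection is below threshold.
        if conn.is_full():
            p1_status = "skipped"
        else:
            conn.queue.append("flowfile-%d" % tick)
            p1_status = "ran"
        # P2 (consumer) is slower: it takes one FlowFile every few ticks.
        if tick % consume_every == consume_every - 1 and conn.queue:
            conn.queue.popleft()
        log.append((tick, p1_status, len(conn.queue)))
    return log


log = simulate(ticks=6, threshold=2, consume_every=3)
```

The queue depth never exceeds the threshold, because the producer simply stops being scheduled instead of dropping data.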
Process Group
● Building a new processor from existing processors
Source: Reference 1
FlowFile Content in the Content Repository
● FlowFiles hold pointer references to content stored in the Content Repository
https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html#deeper-view-content-claim
Source: Reference 1
Copy-on-write in NiFi
● The original content is still present in the repository after a FlowFile modification.
● Example in the figure: a processor compresses the content.
Source: Reference 1
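The compression example can be sketched with an append-only store. This is a simplified model, assuming an in-memory dict instead of NiFi's on-disk content claims; `ContentRepository` and `compress_content` are hypothetical names.

```python
import zlib


class ContentRepository:
    """Append-only store: a claim is written once and never changed in place."""

    def __init__(self):
        self._claims = {}
        self._next_id = 0

    def write(self, data):
        """Store the bytes under a fresh claim id and return that id."""
        claim_id = self._next_id
        self._claims[claim_id] = bytes(data)
        self._next_id += 1
        return claim_id

    def read(self, claim_id):
        return self._claims[claim_id]


def compress_content(repo, flowfile):
    """Copy-on-write: write compressed bytes to a new claim and point to it."""
    original = repo.read(flowfile["claim"])
    new_claim = repo.write(zlib.compress(original))
    return {**flowfile, "claim": new_claim}


repo = ContentRepository()
payload = b"hello " * 100
ff = {"filename": "a.txt", "claim": repo.write(payload)}
ff_compressed = compress_content(repo, ff)
```

Because the old claim is left in place, the earlier version of the content stays readable, which is what makes replay from provenance possible.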
FlowFile Repository
● The FlowFile Repository contains metadata about the files currently in the flow.
Source: Reference 1
Provenance Repository
Provenance Event
● Tracks the complete history of all the FlowFiles in the flow
● Records the metadata and context information of each FlowFile
● Lets you replay the data from any point in time
Source: Reference 1
So, should I use it?
● Review what you need and choose what you need
● Do you need an enterprise dataflow platform?
● Do you need to integrate with many different big data solutions?
● You should be able to build the same pipeline without NiFi
● You may need to enable administration functions for large-scale usage
NiPyApi
Nifi-Python-Api: A rich Apache NiFi Python Client SDK, by Dan Chaffelson
● Detailed documentation of the full SDK at all levels
● CRUD wrappers for common task areas like Process Groups, Processors, Templates, Registry Clients, Registry Buckets, Registry Flows, etc.
● Convenience functions for inventory tasks, such as recursively retrieving the entire canvas, or a flat list of all Process Groups
● Support for scheduling and purging flows, controller services, and connections
● Support for fetching and updating Variable Registries
● Support for import/export of Versioned Flows from NiFi-Registry
● Docker Compose configurations for testing and deployment
● A scripted deployment of an interactive environment, and a secured configuration, for testing and demonstration purposes
NiPyApi Client SDK modules
● Canvas: Interactions with the NiFi Canvas
● Config: A set of defaults and parameters
● Security: Secure connectivity management
● System: For system and cluster level functions in NiFi
● Templates: For managing flow deployments
● Utils: Convenience utility functions for NiPyApi (not intended for external use)
● Versioning: For interactions with the NiFi Registry Service and related functions
https://github.com/sucitw/python-script-in-NiFi/blob/master/RESTAPI/nipyapi_demo.ipynb
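A minimal sketch of the Config and Canvas modules in use follows. It assumes an unsecured NiFi instance at `http://localhost:8080/nifi-api` and that `nipyapi` is installed. The names `build_demo_flow`, `nipyapi_demo` and `generate_demo` are made up for illustration, and call signatures may differ between NiPyApi versions, so check the SDK documentation before relying on it.

```python
def build_demo_flow(api_url="http://localhost:8080/nifi-api"):
    """Create a process group holding one GenerateFlowFile processor and start it.

    Assumes an unsecured NiFi instance is reachable at api_url.
    """
    import nipyapi  # third-party: pip install nipyapi

    # Config module: point the client at the NiFi REST API.
    nipyapi.config.nifi_config.host = api_url

    # Canvas module: fetch the root process group of the canvas.
    root_id = nipyapi.canvas.get_root_pg_id()
    root_pg = nipyapi.canvas.get_process_group(root_id, "id")

    # Create a child process group (the name is hypothetical).
    demo_pg = nipyapi.canvas.create_process_group(
        root_pg, "nipyapi_demo", location=(400.0, 400.0)
    )

    # Add a GenerateFlowFile processor that auto-terminates its output.
    gen_type = nipyapi.canvas.get_processor_type("GenerateFlowFile")
    nipyapi.canvas.create_processor(
        parent_pg=demo_pg,
        processor=gen_type,
        location=(400.0, 400.0),
        name="generate_demo",
        config=nipyapi.nifi.ProcessorConfigDTO(
            scheduling_period="5s",
            auto_terminated_relationships=["success"],
        ),
    )

    # Start everything inside the new process group.
    nipyapi.canvas.schedule_process_group(demo_pg.id, scheduled=True)

    # Return a flat list of process group names for a quick inventory.
    return [pg.component.name for pg in nipyapi.canvas.list_all_process_groups()]
```

Calling `build_demo_flow()` against a running instance should leave a new group on the canvas; the import sits inside the function so the module loads even where `nipyapi` is missing.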
300+ Processors for Data Ecosystem Integration
● “Swiss Army Knife of Data Movement”
Source: DataWorks Summit 2019, "Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi"
ExecuteProcess & ExecuteStreamCommand
Both execute an external command, e.g., an external Python/shell script
● ExecuteProcess
○ Executes an arbitrary process and captures its output to a FlowFile
○ No incoming FlowFile (no upstream connections)
● ExecuteStreamCommand
○ Operates on the contents of a FlowFile: pipes the FlowFile content to STDIN
○ Creates a new FlowFile with the results of the command: populates the outgoing FlowFile content from STDOUT
○ Can also direct the output to a named attribute
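An external script for ExecuteStreamCommand only needs to read STDIN and write STDOUT. The sketch below assumes the FlowFile content is newline-delimited JSON and that a label is passed through the processor's Command Arguments property. The file name `enrich.py`, the `source` field and the label are hypothetical.

```python
import json
import sys


def enrich(record, source_label):
    """Return a copy of the record tagged with where it came from."""
    out = dict(record)
    out["source"] = source_label
    return out


def main(source_label, stdin, stdout):
    """Read JSON lines from stdin, write enriched JSON lines to stdout."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        stdout.write(json.dumps(enrich(json.loads(line), source_label)) + "\n")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.stderr.write("usage: enrich.py SOURCE_LABEL < input.jsonl\n")
    else:
        # NiFi pipes the FlowFile content to STDIN and reads STDOUT back.
        main(sys.argv[1], sys.stdin, sys.stdout)
```

Outside NiFi the same script can be tried with `python enrich.py line-1 < input.jsonl`, which keeps the logic testable without a running flow. Because it runs in a normal CPython interpreter, it can also use libraries with C extensions.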
ExecuteScript (ES)
● Clojure, Groovy, Ruby, Python (Jython, no C libs), Lua, JavaScript
○ Any JSR-223 compatible language
● Easy to use, fast development
● Source is inline or on the file system
● The script body is evaluated on each onTrigger()
● Only two relationships available (REL_SUCCESS and REL_FAILURE)
● Poor performance
https://nifi.apache.org/docs.html (ExecuteScript)
https://github.com/sucitw/python-script-in-NiFi
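A minimal Jython sketch for ExecuteScript follows, assuming the usual bindings (`session`, `REL_SUCCESS`) that the processor provides to the script. The `transform` function and the `transformed` attribute are made-up examples. The guard around `session` is an addition so the pure-Python part can be imported and tested outside NiFi.

```python
def transform(text):
    """Pure-Python business logic: upper-case the content."""
    return text.upper()


try:
    session  # bound by ExecuteScript; undefined outside NiFi
except NameError:
    session = None

if session is not None:
    # Java classes are only importable under Jython inside NiFi.
    from java.nio.charset import StandardCharsets
    from org.apache.commons.io import IOUtils
    from org.apache.nifi.processor.io import StreamCallback

    class TransformCallback(StreamCallback):
        """Reads the incoming content and writes the transformed content."""

        def process(self, inputStream, outputStream):
            text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
            outputStream.write(bytearray(transform(text).encode("utf-8")))

    flowFile = session.get()
    if flowFile is not None:
        # Each session call returns a new FlowFile reference.
        flowFile = session.write(flowFile, TransformCallback())
        flowFile = session.putAttribute(flowFile, "transformed", "true")
        session.transfer(flowFile, REL_SUCCESS)
```

There is no explicit `session.commit()` here because ExecuteScript handles the commit for the script.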
InvokeScriptedProcessor (ISP)
● Faster than ExecuteScript (ES)
● Differences from ExecuteScript
○ Supports custom properties and relationships
○ ES handles the session.commit() for you, but ISP does not
○ ES provides a "session" variable, whereas the ISP onTrigger() method must call sessionFactory.createSession()
https://nifi.apache.org/docs.html (InvokeScriptedProcessor)
https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-scripting-bundle/nifi-scripting-processors/src/test/resources/jython
Suggestions
● Be able to do the same ETL/ELT without NiFi
● Do it like software development
○ Focus on business logic first
○ Do it after design
○ Keep it simple
○ Separate environments: Dev, Test, Production
○ Versioning
○ Build once, deploy many
● Extract - Load - Transform (with external tools)
● Design for failure (fault tolerance)
○ Good retry/error handling mechanism
○ Monitoring
○ Clustering, at least in Production
● Use process groups (PGs)
● PGs => ingestion, test, and monitoring
● Naming convention
● Put labels/comments
● Use variables, use funnels
● Scheduling -> Automation
● Define SLAs with your clients
● Do monitoring and alerting
● Use it for handling routing and formatting
● Try not to queue data
● Control the FlowFile stream rate
● ...
Recap
● ETL/ELT pipelines and why they are hard
● How NiFi may help you
● Hello-World NiFi
● Key concepts and terms of NiFi (FlowFile, Processors, Connections)
● REST API with examples (NiPyApi)
● ExecuteProcess/ExecuteStreamCommand
● ExecuteScript/InvokeScriptedProcessor with Python
● Suggestions on NiFi
Reference
1. How Apache Nifi works: surf on your dataflow, don’t drown in it, by François Paupier
2. Pipeline Builder: Micron’s Journey Automating the Global Data Warehouse, DataWorks Summit 2019
3. Best practices for running Apache NiFi in production - 3 takeaways from real world projects / “Best practices and lessons learnt from Running Apache NiFi”, DataWorks Summit 2018
4. BYOP: Custom Processor Development in Apache NiFi, DataWorks Summit Barcelona 2019
5. Hello-World in Apache NiFi
6. NiFi Development Tips (Funnel)
More about NiFi
● Apache NiFi In Depth
● Play with more processors
○ Choose what you need and do it right
● Learn from Dataflow Templates
● NiFi Expression Language
● Build Your Own Processor
○ BYOP: Custom Processor Development with Apache NiFi
● MiNiFi
● NiFi Registry