Building Data Pipelines on Apache NiFi with Python

suci
September 21, 2019

What is ETL
What is Apache NiFi
How do Apache NiFi and Python work together

Transcript

  1. About Me

    Shuhsi Lin (sucitw gmail.com, https://medium.com/@suci/)
    Scrum Master + Data Engineer in a manufacturing company
    Lurking in PyHug, Taipei.py and various meetups
    Working with
    • data and people
    Focus on
    • Agile/Engineering culture
    • Streaming process
    • IoT applications
    • Data visualization
  2. Agenda: What we will focus on

    1. Data Pipelines/Data Flow
    2. Key Concepts of NiFi
    3. Why you may need NiFi
    4. Control NiFi with Python
    5. Work in NiFi with Python
    6. A path to dive into NiFi
  3. What we will not focus on

    1. Deployment/Operation/Monitoring/Administration
    2. Infrastructure Details
    3. NiFi Registry
    4. MiNiFi
    5. Advanced Data Flow Design
  4. Simplistic Data Flow

    Data Store/Application A -> Acquire/Ingest Data -> Process and Analyze Data -> Data Store/Application B
    • Data movement as flow
    • Moving data content from A to B
  5. Data Pipeline

    Diverse sources (Data Store, Application) -> Data Pipeline -> Diverse targets (BI tool, Target data store)
    Stages inside the data pipeline:
    • Extract/Acquire Data
    • Transform/Process: Parse, Filter, Split/merge, Route
    • Load/Ingest/Move
  6. What Often Happens in (Large Data) Complex Data Pipelines

    • Data governance
    • An architecture/platform, not just pipelines
    • Data quality issues of all kinds
    • Custom -> Competing standards
    • Data auditing and Security
    • Project/Requirement management
    • Maintenance/operation
    Data
    • Standards
    • Schemas/Formats/Protocols
    • Veracity
    • Validity
    • Data auditing
    • Partitioning/Bundling
    • Increase in data velocity
    • Unclear/uncontrollable data source
    Infrastructure/Operation support
    • “Exactly Once” Delivery
    • Ensuring Security
    • Overcoming Security
    • Credential Management
    • Network
    • Long-term maintenance
    • Scalability
    People
    • Compliance
    • Person | team | group
    • Consumers
    Changes
    • Requirements Unclear/Change
    • “Exactly Once” Delivery Req
    Source: Apache NiFi Crash Course, DataWorks Summit 2018
  7. NiFi’s Features

    • Server/Data Center class
    • Web UI & REST API
    • Highly configurable
    • Data Provenance
    • Designed for extension
    • Secure
    Created in 2006 at NSA, donated to ASF in 2014.
    Similar in concept to Flow-Based Programming (FBP).
  8. Process and Distribute Data

    • Moves data around systems and gives you tools to process this data
    • Deals with a great variety of data sources and formats
    • Takes data from one source, transforms it, and pushes it to a different data sink
    Source: https://medium.com/free-code-camp/nifi-surf-on-your-dataflow-4f3343c50aa2
  9. Why NiFi

    • Single ingestion platform for
      ◦ Various data sources and targets
      ◦ Resilience to updates of components (300+ processors, data sources, transformation logic, job scheduling ...)
      ◦ Access control (Groups, Users, LDAP)
    • Scalable (Clustering)
    • Reliability
      ◦ Guaranteed delivery (no data loss)
      ◦ Data provenance/lineage (auditing pipeline)
    • Fast develop/deploy cycle
    • Better data flow visibility (across multi-disciplinary teams/peers)
    • Data processing without programming
    Figure: The four Vs of Big Data
    Source: https://medium.com/free-code-camp/nifi-surf-on-your-dataflow-4f3343c50aa2
  10. NiFi Values for Renault

    Source: “Best practices and lessons learnt from Running Apache NiFi at Renault”, DataWorks Summit 2018
  11. NiFi Web UI

    • Drag and drop processors to build a flow
    • Start, stop and configure components in real time
    • View errors and corresponding error messages
    • View statistics and health of the data flow
    • Create templates of common processors & connections
    Source: https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#versioning_dataflow
  12. Hello-World NiFi

    • Drag and drop two components onto the canvas
    • Two processors linked together by one connection (queue)
    Simple data flow: Processor 1 - Connection - Processor 2
    • Processor 1: GenerateFlowFile (generates content, here "I am flowfile")
    • Processor 2: PutFile (stores content)
    Source: https://medium.com/@suci/hello-world-nifi-dcafcba0fdb0
  13. NiFi Architecture/terminology

    • Dataflow (NiFi pipeline)
    • FlowFile
    • Processor/Process Group
    • Connection
    • Flow Controller
    • Provenance
  14. FlowFile

    • Attributes
      ◦ Key/value pairs
      ◦ Held in JVM heap memory space (swappable)
      ◦ E.g., the file name, file path, and a unique identifier are standard attributes
    • Content
      ◦ A reference to the stream of bytes that composes the FlowFile content
    A FlowFile object is immutable.
    Source: How Apache Nifi works — surf on your dataflow, don’t drown in it
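
The attribute/content split on this slide can be modelled in a few lines of plain Python. This is only an illustrative sketch, not NiFi's implementation: `FlowFileSketch`, `ContentClaim` and `with_attribute` are hypothetical names, and the only details taken from the slide are the key/value attributes, the content being a reference rather than the bytes themselves, and immutability.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping
import uuid


@dataclass(frozen=True)
class ContentClaim:
    """Hypothetical pointer to bytes stored elsewhere (not the bytes themselves)."""
    container: str
    offset: int
    length: int


@dataclass(frozen=True)
class FlowFileSketch:
    """Immutable record: key/value attributes plus a reference to the content."""
    attributes: Mapping[str, str]
    content: ContentClaim

    def with_attribute(self, key: str, value: str) -> "FlowFileSketch":
        # "Modifying" returns a new object; the original is left untouched.
        updated = dict(self.attributes)
        updated[key] = value
        return FlowFileSketch(MappingProxyType(updated), self.content)


original = FlowFileSketch(
    MappingProxyType({"filename": "hello.txt", "path": "./", "uuid": str(uuid.uuid4())}),
    ContentClaim(container="default", offset=0, length=12),
)
renamed = original.with_attribute("filename", "hello-renamed.txt")
```

Changing an attribute yields a second object that shares the same content reference, which is the property the copy-on-write slide later relies on.
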
  15. FlowFile (Attributes, Content/payload) vs. HTTP (Header, Message body)

    Header:
    HTTP/1.1 200 OK
    Date: Sun, 10 Oct 2010 23:26:07 GMT
    Server: Apache/2.2.8 (Ubuntu) mod_ssl/2.2.8 OpenSSL/0.9.8g
    Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
    ETag: "45b6-834-49130cc1182c0"
    Accept-Ranges: bytes
    Content-Length: 12
    Connection: close
    Content-Type: text/html

    Message body:
    Hello world!

    Sources: https://en.wikipedia.org/wiki/HTTP_message_body; Hands-on with Apache NiFi and MiNiFi | Berlin Buzzwords 2017
  16. FlowFile Processor

    • Black box/high-level abstraction of a data operation
    • Three different kinds of processors
    • Processors are linked by connections
    Source: How Apache Nifi works — surf on your dataflow, don’t drown in it
  17. Backpressure

    • Two processors linked by a connector with its limit respected: the number of FlowFiles is below the threshold.
    • P1 is not scheduled until the connector goes back below its threshold.
    • The Flow Controller then schedules P1 for execution again.
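
The scheduling behaviour described on this slide can be imitated with a small queue simulation. It is a deliberately simplified sketch under stated assumptions: `Connection`, `run_upstream` and the object-count threshold of 3 are hypothetical, and the upstream processor is modelled as a loop that simply stops when the queue is full.

```python
from collections import deque


class Connection:
    """Hypothetical queue between two processors with an object-count threshold."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.queue = deque()

    def is_full(self) -> bool:
        return len(self.queue) >= self.threshold


def run_upstream(conn: Connection, pending: list) -> int:
    """Move items from `pending` into the connection until it is full.

    Returns how many items were enqueued; the rest stay in `pending`.
    """
    moved = 0
    while pending and not conn.is_full():
        conn.queue.append(pending.pop(0))
        moved += 1
    return moved


conn = Connection(threshold=3)
work = ["ff-1", "ff-2", "ff-3", "ff-4", "ff-5"]

first_run = run_upstream(conn, work)    # fills the queue up to the threshold
blocked_run = run_upstream(conn, work)  # queue is full, so nothing moves
conn.queue.popleft()                    # downstream processor consumes one item
resumed_run = run_upstream(conn, work)  # below the threshold again, P1 runs
```

The upstream side never drops data; it only waits, which is why the remaining items stay in `work` until the downstream side catches up.
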
  18. Process Group

    Building a new processor from the existing processors.
    Source: How Apache Nifi works — surf on your dataflow, don’t drown in it
  19. Copy-on-write in NiFi

    The original content is still present in the repository after a FlowFile modification.
    Diagram example: a processor compresses the content.
    Source: How Apache Nifi works — surf on your dataflow, don’t drown in it
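
A toy model may make the copy-on-write idea concrete. This is a sketch only: `ContentRepositorySketch` and its integer claim ids are hypothetical and do not reflect how NiFi lays out its content repository. The point it illustrates is that a modification writes new content and leaves the old bytes readable.

```python
import zlib


class ContentRepositorySketch:
    """Hypothetical append-only store: existing entries are never overwritten."""

    def __init__(self) -> None:
        self._claims = {}
        self._next_id = 0

    def write(self, data: bytes) -> int:
        # Every write gets a fresh claim id, even for derived content.
        claim_id = self._next_id
        self._claims[claim_id] = data
        self._next_id += 1
        return claim_id

    def read(self, claim_id: int) -> bytes:
        return self._claims[claim_id]


repo = ContentRepositorySketch()
original_claim = repo.write(b"Hello world! " * 10)

# A compression step: read the old content, write the result as a NEW claim.
compressed_claim = repo.write(zlib.compress(repo.read(original_claim)))
```

After the compression step both claims exist side by side, so the original content can still be read back unchanged.
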
  20. FlowFile Repository

    The FlowFile Repository contains metadata about the files currently in the flow.
    Source: How Apache Nifi works — surf on your dataflow, don’t drown in it
  21. Provenance Repository

    Provenance Event
    • Tracks the complete history of all the FlowFiles in the flow
    • Records the metadata and context information of each FlowFile
    • Allows replaying the data from any point in time
    Source: How Apache Nifi works — surf on your dataflow, don’t drown in it
  22. NiFi flow design is like software development

    Design -> Code/develop -> Test -> Deploy -> Monitor -> Evolve/Refactor
    Programming language
    • Integrated Development Environment (IDE)
    • Algorithm, develop
    • Functions/Modules (arguments, results)
    • Libraries, packages
    • ...
    Apache NiFi
    • GUI
    • Flow design, drag and drop
    • Process groups (input/output)
    • Templates
    • ...
    Source: “Best practices and lessons learnt from Running Apache NiFi”, DataWorks Summit 2018
  23. NiFi Positioning

    Apache NiFi positioned among:
    • Processing Frameworks (Spark, Storm, ...)
    • Messaging Buses (Kafka, JMS, ...)
    • Enterprise Service Buses (Fuse (Camel), Mule, ...)
    • ETL tools (SSIS, Informatica, ...)
  24. So, should I use it?

    • Review what you need and choose what you need
    • Do you need an enterprise dataflow platform?
    • Do you need to integrate with many different big data solutions?
    • You should be able to build the same pipeline without NiFi
    • You may need to enable administration functions for large-scale usage
  25. Get NiFi instance information

    Request: GET http://NIFI_instance/nifi-api/flow/about
    Response:
    {
      "about": {
        "title": "NiFi",
        "version": "1.9.2",
        "uri": "http://localhost:38080/nifi-api/",
        "contentViewerUrl": "../nifi-content-viewer/",
        "timezone": "UTC",
        "buildTag": "nifi-1.9.2-RC2",
        "buildRevision": "ff01ff6",
        "buildBranch": "NIFI-6169-RC2",
        "buildTimestamp": "04/03/2019 15:25:53 UTC"
      }
    }
    REST API reference: https://nifi.apache.org/docs/nifi-docs/rest-api/
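
The same call can be sketched from Python with only the standard library. The helper names (`about_url`, `summarize_about`, `fetch_about`) are hypothetical, the sample body is a shortened copy of the response on this slide, and the sketch assumes an unsecured instance that needs no authentication. `fetch_about` is defined but not called, since it needs a reachable NiFi.

```python
import json
from urllib.request import urlopen

# Shortened copy of the response shown on the slide.
SAMPLE_BODY = """
{"about": {"title": "NiFi", "version": "1.9.2",
           "uri": "http://localhost:38080/nifi-api/",
           "timezone": "UTC", "buildTag": "nifi-1.9.2-RC2"}}
"""


def about_url(base_url: str) -> str:
    """Build the /flow/about URL for a NiFi base URL such as http://localhost:38080."""
    return f"{base_url.rstrip('/')}/nifi-api/flow/about"


def summarize_about(body: str) -> dict:
    """Pick a few fields out of the JSON body returned by /flow/about."""
    about = json.loads(body)["about"]
    return {
        "title": about["title"],
        "version": about["version"],
        "build_tag": about["buildTag"],
    }


def fetch_about(base_url: str, timeout: float = 5.0) -> dict:
    """Call the endpoint on a reachable instance (not executed in this sketch)."""
    with urlopen(about_url(base_url), timeout=timeout) as response:
        return summarize_about(response.read().decode("utf-8"))


info = summarize_about(SAMPLE_BODY)
```
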
  26. Search

    Request: GET http://NIFI_instance/nifi-api/flow/search-results?q=pycon
    Response:
    {
      "searchResultsDTO": {
        "processorResults": [],
        "connectionResults": [],
        "processGroupResults": [
          {
            "id": "2a5d288c-016d-1000-7ede-e736a588559b",
            "groupId": "382a2072-016c-1000-4c24-50b6a90ec204",
            "parentGroup": { "id": "382a2072-016c-1000-4c24-50b6a90ec204", "name": "NiFi Flow" },
            "name": "PyconTW 2019",
            "matches": [ "Name: PyconTW 2019" ]
          }
        ],
        "inputPortResults": [],
        "outputPortResults": [],
        "remoteProcessGroupResults": [],
        "funnelResults": []
      }
    }
    REST API reference: https://nifi.apache.org/docs/nifi-docs/rest-api/
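
A short sketch of how the search response might be consumed from Python. `search_url` and `process_group_hits` are hypothetical helper names, and the sample dictionary is a trimmed copy of the response on this slide; no request is actually sent.

```python
from urllib.parse import urlencode

# Trimmed copy of the response shown on the slide.
SAMPLE_RESPONSE = {
    "searchResultsDTO": {
        "processorResults": [],
        "connectionResults": [],
        "processGroupResults": [
            {
                "id": "2a5d288c-016d-1000-7ede-e736a588559b",
                "groupId": "382a2072-016c-1000-4c24-50b6a90ec204",
                "name": "PyconTW 2019",
                "matches": ["Name: PyconTW 2019"],
            }
        ],
    }
}


def search_url(base_url: str, query: str) -> str:
    """Build the search URL; urlencode escapes spaces and special characters."""
    return f"{base_url.rstrip('/')}/nifi-api/flow/search-results?{urlencode({'q': query})}"


def process_group_hits(response: dict) -> list:
    """Return (id, name) pairs for every process group in a search response."""
    results = response["searchResultsDTO"]["processGroupResults"]
    return [(hit["id"], hit["name"]) for hit in results]


url = search_url("http://localhost:38080", "pycon")
hits = process_group_hits(SAMPLE_RESPONSE)
```

The returned ids are what later calls, such as updating a component, take as their path parameter.
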
  27. Stop a processor

    processor_id = 2b5da9d5-016d-1000-d67f-bcd140eccb41
    Request:
    PUT /nifi-api/processors/2b5da9d5-016d-1000-d67f-bcd140eccb41 HTTP/1.1
    Host: localhost:38080
    Content-Type: application/json
    Cache-Control: no-cache

    {
      "revision": {
        "clientId": "2a4758dc-016d-1000-56b9-2fc8baffbc07",
        "version": 7
      },
      "component": {
        "id": "2b5da9d5-016d-1000-d67f-bcd140eccb41",
        "state": "STOPPED"
      }
    }
    REST API reference: https://nifi.apache.org/docs/nifi-docs/rest-api/
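
The PUT request above can be assembled with the standard library as follows. This is a sketch: `build_stop_request` is a hypothetical helper, the ids and revision number are the ones from the slide, and in practice the `version` has to match the processor's current revision, which would normally be read from the processor first. The request is only built here, not sent.

```python
import json
from urllib.request import Request


def build_stop_request(base_url: str, processor_id: str,
                       client_id: str, version: int) -> Request:
    """Build (but do not send) the PUT request that sets a processor to STOPPED."""
    payload = {
        "revision": {"clientId": client_id, "version": version},
        "component": {"id": processor_id, "state": "STOPPED"},
    }
    return Request(
        url=f"{base_url.rstrip('/')}/nifi-api/processors/{processor_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_stop_request(
    base_url="http://localhost:38080",
    processor_id="2b5da9d5-016d-1000-d67f-bcd140eccb41",
    client_id="2a4758dc-016d-1000-56b9-2fc8baffbc07",
    version=7,
)
# Sending it would be urllib.request.urlopen(request) against a running instance.
```
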
  28. NiPyApi

    Nifi-Python-Api: A rich Apache NiFi Python Client SDK, by Dan Chaffelson
    • Detailed documentation of the full SDK at all levels
    • CRUD wrappers for common task areas like Processor Groups, Processors, Templates, Registry Clients, Registry Buckets, Registry Flows, etc.
    • Convenience functions for inventory tasks, such as recursively retrieving the entire canvas, or a flat list of all Process Groups
    • Support for scheduling and purging flows, controller services, and connections
    • Support for fetching and updating Variable Registries
    • Support for import/export of Versioned Flows from NiFi-Registry
    • Docker Compose configurations for testing and deployment
    • A scripted deployment of an interactive environment, and a secured configuration, for testing and demonstration purposes
  29. NiPyApi Client SDK modules

    • Canvas: Interactions with the NiFi Canvas
    • Config: A set of defaults and parameters
    • Security: Secure connectivity management
    • System: For system and cluster level functions in NiFi
    • Templates: For managing flow deployments
    • Utils: Convenience utility functions for NiPyApi (not intended for external use)
    • Versioning: For interactions with the NiFi Registry Service and related functions
    Demo notebook: https://github.com/sucitw/python-script-in-NiFi/blob/master/RESTAPI/nipyapi_demo.ipynb
  30. Choose what you need from many processors

    Dynamic vs Compiled processor
    Photo by Mitch Lensink on Unsplash
  31. 300+ Processors for Data Ecosystem Integration

    “Swiss Army Knife of Data Movement”
    Source: Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi, DataWorks Summit 2019
  32. Processors

    • Data Ingestion: GenerateFlowFile, GetFile, GetFTP, GetJMSQueue, GetHDFS, GetKafka, GetMongo, GetTwitter, ListHDFS/FetchHDFS
    • Data Transformation: ReplaceText, ConvertRecord, UpdateRecord, ConvertJSONToSQL, CompressContent, ConvertCharacterSet, TransformXml, ...
    • Data Egress/Sending Data: PutEmail, PutFile, PutFTP, PutSQL, PutMongo, PutHDFS, PutHiveQL, ...
    • Routing and Mediation: ControlRate, RouteOnAttribute, RouteOnContent, ValidateCsv, DetectDuplicate, DistributeLoad, ...
    • Database Access: ExecuteSQL, ListDatabaseTables, ...
    • Attribute Extraction: EvaluateJsonPath, ExtractText, UpdateAttribute, ...
    • System Interaction: ExecuteProcess, ExecuteStreamCommand, ExecuteScript, ...
    • Splitting and Aggregation: SplitText, SplitJson, MergeContent, ...
    • HTTP: GetHTTP, ListenHTTP, InvokeHTTP, ...
    • AWS: FetchS3Object, GetSQS, GetDynamoDB, ...
    There are more...
  33. Capabilities of Custom development

    • ExecuteProcess / ExecuteStreamCommand
    • ExecuteScript / InvokeScriptedProcessor
      ◦ ExecuteGroovyScript
    • Custom Processor (Java)
    Source: BYOP: Custom Processor Development with Apache NiFi, Andy LoPresto, DataWorks Summit 2019
  34. ExecuteProcess & ExecuteStreamCommand

    ExecuteProcess
    • Executes an arbitrary process and captures its output to a FlowFile
    • No incoming FlowFile (no upstream connections)
    ExecuteStreamCommand
    • Executes an external command on the contents of a FlowFile
      ◦ Pipes FlowFile content to STDIN
    • Creates a new FlowFile with the results of the command
      ◦ Populates outgoing FlowFile content from STDOUT
      ◦ Can also direct the output to a named attribute
    E.g., an external Python/Shell script
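
An external Python script for ExecuteStreamCommand only needs to read STDIN and write STDOUT, as the slide describes. The sketch below is a hypothetical example (the upper-casing transform and the names `transform` and `main` are illustrative, not from the talk); when saved as a script, `main()` would be the entry point. Here the same function is exercised with in-memory streams so it can run without NiFi.

```python
import io
import sys


def transform(stdin, stdout) -> int:
    """Copy text from stdin to stdout, upper-casing each line.

    Returns the number of lines processed.
    """
    count = 0
    for line in stdin:
        stdout.write(line.upper())
        count += 1
    return count


def main() -> None:
    """Entry point when run as an external command: content arrives on
    STDIN and whatever is written to STDOUT becomes the outgoing content."""
    transform(sys.stdin, sys.stdout)


# Demo with in-memory streams standing in for STDIN/STDOUT.
demo_out = io.StringIO()
line_count = transform(io.StringIO("id,name\n1,nifi\n"), demo_out)
```

Keeping the logic in a function that takes streams makes the script testable outside NiFi, which fits the deck's advice to be able to build the same pipeline without it.
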
  36. ExecuteScript (ES)

    • Clojure, Groovy, Ruby, Python (Jython - no C libs), Lua, JavaScript
      ◦ JSR-223 compatible languages
    • Easy to use, fast development
    • Script source is inline or on the file system
    • ExecuteScript => the script is evaluated in “onTrigger()”
    • Only two relationships available (REL_SUCCESS and REL_FAILURE)
    • Poor performance
    References: https://nifi.apache.org/docs.html (ExecuteScript), https://github.com/sucitw/python-script-in-NiFi
  37. ExecuteScript in Python (Jython)

    Basic NiFi APIs you need to know
    • ProcessSession
    • ProcessContext
    • ComponentLog
    • Relationship
    • ...
    Reference: https://nifi.apache.org/developer-guide.html
  38. InvokeScriptedProcessor (ISP)

    • Faster than ExecuteScript (ES)
    • Differences from ExecuteScript
      ◦ Supports custom properties and relationships
      ◦ ES handles the session.commit() for you, but ISP does not
      ◦ ES has a "session" variable, whereas the ISP onTrigger() method must call sessionFactory.createSession()
    References: https://nifi.apache.org/docs.html (InvokeScriptedProcessor), https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-scripting-bundle/nifi-scripting-processors/src/test/resources/jython
  39. Suggestions

    • Be able to do the same ETL/ELT without NiFi
    • Do it like software development
      ◦ Focus on business logic first
      ◦ Do it after design
      ◦ Keep it simple
      ◦ Separate environments: Dev, Test, Production
      ◦ Versioning
      ◦ Build once, deploy many
    • Extract - Load - Transform (with external tools)
    • Design for failure (fault tolerance)
      ◦ Good retry/error handling mechanism
      ◦ Monitoring
      ◦ Clustering, at least in Production
    • Use process groups (PGs)
    • PGs => Ingestion, test, and monitoring
    • Naming convention
    • Put labels/comments
    • Use variables, use funnels
    • Scheduling -> Automation
    • Define SLAs with your clients
    • Do monitoring and alerting
    • Use NiFi for handling routing and formatting
    • Try not to queue data
    • Control the FlowFile stream rate
    • ...
  40. Recap

    • ETL/ELT pipelines and why they are hard
    • How NiFi may help you
    • Hello-World NiFi
    • Key concepts and terms of NiFi (FlowFile, Processors, Connections)
    • REST API with examples (NiPyApi)
    • ExecuteProcess/ExecuteStreamCommand
    • ExecuteScript/InvokeScriptedProcessor with Python
    • Suggestions on NiFi
  41. Reference

    1. How Apache Nifi works — surf on your dataflow, don’t drown in it, by François Paupier
    2. Pipeline Builder: Micron’s Journey Automating the Global Data Warehouse, DataWorks Summit 2019
    3. Best practices for running Apache NiFi in production - 3 takeaways from real world projects / “Best practices and lessons learnt from Running Apache NiFi”, DataWorks Summit 2018
    4. BYOP: Custom Processor Development in Apache NiFi, DataWorks Summit Barcelona 2019
    5. Hello-World in Apache NiFi
    6. NiFi Development Tips (Funnel) (original title: Nifi 開發小技巧(Funnel))
  42. More about NiFi

    • Apache NiFi In Depth
    • Play with more processors
      ◦ Choose what you need and do it right
    • Learn from Dataflow Templates
    • NiFi Expression Language
    • Build Your Own Processor
      ◦ BYOP: Custom Processor Development with Apache NiFi
    • MiNiFi
    • NiFi Registry