
Building Data Pipelines on Apache NiFi with Python

suci
September 21, 2019


What is ETL
What is Apache NiFi
How Apache NiFi and Python work together


Transcript

  1. Building Data Pipelines
    on Apache NiFi
    with Python
    Shuhsi Lin
    20190921 at PyCon TW


  2. Lurking in PyHug, Taipei.py and various Meetups
    About Me
    Scrum Master + Data Engineer
    in a manufacturing company
    Working with
    • data and people
    Focus on
    • Agile/Engineering culture
    • Streaming process
    • IoT applications
    • Data visualization
    Shuhsi Lin
    sucitw gmail.com
    https://medium.com/@suci/


  3. Agenda: What we will focus on
    1. Data Pipelines/Data Flow
    2. Key Concepts of NiFi
    3. Why you may need NiFi
    4. Control NiFi with Python
    5. Work in NiFi with Python
    6. A path to dive into NiFi


  4. What we will not focus on
    1. Deployment/Operation/Monitoring/Administration
    2. Infrastructure Details
    3. NiFi Registry
    4. MiNiFi
    5. Advanced Data Flow Design


  5. Data pipeline
    https://www.linkedin.com/pulse/data-pipeline-hadoop-part-1-2-birender-saini
    Hello ETL/ELT World
    ETL: Extract -> Transform -> Load
    ELT: Extract -> Load -> Transform
    Input Data -> Pipe (Operation) -> Output Data
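
The difference between ETL and ELT can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sensor records, the Celsius-to-Fahrenheit step, and the list that stands in for a target store are all hypothetical and do not come from the talk.

```python
import json


def extract(raw_lines):
    """Extract: parse raw JSON lines into dictionaries."""
    return [json.loads(line) for line in raw_lines]


def transform(records):
    """Transform: drop empty readings and convert Celsius to Fahrenheit."""
    return [
        {"sensor": r["sensor"], "temp_f": r["temp_c"] * 9 / 5 + 32}
        for r in records
        if r.get("temp_c") is not None
    ]


def load(records, store):
    """Load: append records to the target store (a list stands in for a database)."""
    store.extend(records)
    return store


# Hypothetical input data.
RAW = ['{"sensor": "a", "temp_c": 20}', '{"sensor": "b", "temp_c": null}']

# ETL: transform before loading, so the target only ever sees clean data.
etl_store = load(transform(extract(RAW)), [])

# ELT: load the raw records first, then transform later inside the target.
elt_store = load(extract(RAW), [])
elt_view = transform(elt_store)
```

In the ELT variant the raw records stay available in the target, which is why the transformation can be redone later with different rules.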


  6. 3 Paradigms for Programming
    ● Request/Response (interactive)
    ● Batch
    ● Stream Processing
    Input -> Output
    https://qconnewyork.com/ny2016/ny2016/presentation/large-scale-stream-processing-apache-kafka.html


  7. Simplistic Data Flow
    Data Store/Application A -> Acquire/Ingest Data -> Process and Analyze Data -> Data Store/Application B
    ● Data movement as flow
    ● Moving data content from A to B


  8. Data Pipeline
    Diverse sources (Data Store, Application) -> Data Pipeline -> diverse targets (Target data store, BI tool)
    Stages of a data pipeline:
    ● Extract/Acquire Data
    ● Transform/Process: Parse, Filter, Split/merge, Route
    ● Load/Ingest/Move
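
The parse, filter, and route operations named on this slide can be chained as plain Python generators. This is a minimal sketch, not NiFi code: the "level,message" line format and the two targets are assumptions made up for the example.

```python
def parse(lines):
    """Parse: turn 'level,message' lines into dictionaries."""
    for line in lines:
        level, message = line.split(",", 1)
        yield {"level": level.strip(), "message": message.strip()}


def keep_levels(records, levels):
    """Filter: drop records whose level is not of interest."""
    return (r for r in records if r["level"] in levels)


def route(records):
    """Route: send each record to a target chosen by its level."""
    targets = {"ERROR": [], "INFO": []}
    for r in records:
        targets[r["level"]].append(r["message"])
    return targets


# Hypothetical input lines.
lines = ["INFO, started", "DEBUG, noisy detail", "ERROR, disk full"]
targets = route(keep_levels(parse(lines), {"INFO", "ERROR"}))
```

Each stage only knows about its input and output, which is the same idea NiFi expresses with processors joined by connections.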


  9. Many Flow-like Data in the Real World
    Across Organizations/ Business unit/ Geographic locations


  10. What Often Happens in (Large Data) Complex Data Pipelines
    ● Data governance
    ● An architecture/platform, not just pipelines
    ● Data quality issues of all kinds.
    ● Custom -> Competing standards
    ● Data auditing and Security
    ● Project/Requirement management
    ● Maintenance/operation
    Apache Nifi Crash Course, DataWorks Summit 2018
    Data
    ● Standards
    ● Schemas/Formats/Protocols
    ● Veracity
    ● Validity
    ● Data auditing
    ● Partitioning/ Bundling
    ● Increase in data velocity
    ● Unclear/uncontrollable data source
    Infrastructure/Operation support
    ● “Exactly Once” Delivery
    ● Ensuring Security
    ● Overcoming Security
    ● Credential Management
    ● Network
    ● Long-term maintenance
    ● Scalability
    People
    ● Compliance
    ● Person| team | group
    ● Consumers Changes
    ● Requirements Unclear/Change
    ● “Exactly Once” Delivery Requirement


  11. So, What is NiFi, and Why?
    Preferred pronunciation: "nye fye" (nī fī)


  12. NiFi’s Features
    ● Server/Data Center class
    ● Web UI & REST API
    ● Highly configurable
    ● Data Provenance
    ● Designed for extension
    ● Secure
    Created in 2006 at NSA, donated to ASF in 2014
    Concepts similar to Flow-Based Programming (FBP)


  13. Process and Distribute Data
    ● Moves data around systems and gives you tools to process this data
    ● Deals with a great variety of data sources and formats
    ● Takes data from one source, transforms it, and pushes it to a different data sink
    https://medium.com/free-code-camp/nifi-surf-on-your-dataflow-4f3343c50aa2


  14. Why NiFi
    ● Single ingestion platform for
    ○ Various data sources and targets
    ○ Resilient to component updates (300+ processors, data sources, transformation logic, job scheduling ...)
    ○ Access control (Group, Users, LDAP)
    ● Scalable (Clustering)
    ● Reliability
    ○ Guaranteed delivery (no data loss)
    ○ Data provenance/lineage (auditing pipeline)
    ● Fast develop/deploy cycle
    ● Better data flow visibility (across multi-disciplinary teams/peers)
    ● Data Processing without Programming
    The four Vs of Big Data
    https://medium.com/free-code-camp/nifi-surf-on-your-dataflow-4f3343c50aa2


  15. Let’s See Some Cases
    Photo by Jared Erondu on Unsplash


  16. https://nifi.apache.org/powered-by-nifi.html


  17. NiFi Values for Renault
    “Best practices and lessons learnt from Running Apache NiFi at Renault”, DataWorks Summit 2018


  18. “Best practices and lessons learnt from Running Apache NiFi at Renault”, DataWorks Summit 2018


  19. Kylo uses Apache NiFi for orchestrating data pipelines.
    https://kylo.readthedocs.io/en/v0.10.0/


  20. Pipeline Builder: Micron’s Journey Automating the Global Data Warehouse, DataWorks Summit 2019


  21. How NiFi Works
    Photo by Tomas Robertson on Unsplash


  22. NiFi Web UI
    https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#versioning_dataflow
    ● Drag and drop processors to build a flow
    ● Start, stop, and configure components in real time
    ● View errors and corresponding error messages
    ● View statistics and health of the data flow
    ● Create templates of common processors and connections


  23. Hello-World NiFi
    ● Drag and drop two components onto the canvas.
    ● Two processors linked together by one connection (queue)
    Simple data flow: Processor 1 -> Connection -> Processor 2
    ● Processor 1: GenerateFlowFile (generates content)
    ● Connection (connector)
    ● Processor 2: PutFile (stores content)
    FlowFile: "I am flowfile"
    https://medium.com/@suci/hello-world-nifi-dcafcba0fdb0


  24. NiFi Architecture/Terminology
    ● Dataflow (NiFi pipeline)
    ● FlowFile
    ● Processor / Process Group
    ● Connection
    ● Flow Controller
    ● Provenance


  25. FlowFile
    ● Attributes
    ○ Key/value pairs
    ○ Held in JVM heap memory (swappable)
    ○ E.g., the file name, file path, and a unique identifier are standard attributes
    ● Content
    ○ A reference to the stream of bytes that composes the FlowFile content
    A FlowFile object is immutable
    How Apache Nifi works — surf on your dataflow, don’t drown in it
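
The FlowFile structure described on this slide (attributes, a reference to content, immutability) can be modelled in plain Python. This is a conceptual sketch only, not NiFi's Java implementation: the class, the `content_claim` string, and the helper names are assumptions made for illustration.

```python
import uuid
from dataclasses import dataclass, replace
from types import MappingProxyType


@dataclass(frozen=True)
class FlowFile:
    """Toy FlowFile: read-only attributes plus a reference to the content."""
    attributes: MappingProxyType
    content_claim: str  # hypothetical id pointing at bytes stored elsewhere


def new_flowfile(filename, path, content_claim):
    """Create a FlowFile with the three standard attributes named on the slide."""
    attrs = {"filename": filename, "path": path, "uuid": str(uuid.uuid4())}
    return FlowFile(MappingProxyType(attrs), content_claim)


def put_attribute(flowfile, key, value):
    """Return a new FlowFile with one more attribute; the original is untouched."""
    attrs = dict(flowfile.attributes)
    attrs[key] = value
    return replace(flowfile, attributes=MappingProxyType(attrs))


original = new_flowfile("hello.txt", "./", content_claim="claim-1")
updated = put_attribute(original, "greeting", "hi")
```

Because the object is frozen, every change produces a new FlowFile, while both versions still point at the same content reference.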


  26. FlowFile vs. HTTP message
    ● FlowFile attributes correspond to the HTTP header
    ● FlowFile content/payload corresponds to the HTTP message body
    HTTP header:
    HTTP/1.1 200 OK
    Date: Sun, 10 Oct 2010 23:26:07 GMT
    Server: Apache/2.2.8 (Ubuntu) mod_ssl/2.2.8 OpenSSL/0.9.8g
    Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
    ETag: "45b6-834-49130cc1182c0"
    Accept-Ranges: bytes
    Content-Length: 12
    Connection: close
    Content-Type: text/html
    HTTP message body:
    Hello world!
    https://en.wikipedia.org/wiki/HTTP_message_body
    Hands-on with Apache NiFi and MiNiFi | Berlin Buzzwords 2017


  27. FlowFile Processor
    ● Black box/high-level abstractions of data operation
    Three different kinds of processors
    How Apache Nifi works — surf on your dataflow, don’t drown in it
    connection


  28. Backpressure
    ● Two processors linked by a connector with its limit respected.
    ● P1 is not scheduled until the connector goes back below its threshold.
    ● Once the number of FlowFiles is below the threshold, the Flow Controller schedules P1 for execution again.
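
The back-pressure behaviour on this slide can be simulated with a bounded queue. This is a toy model under stated assumptions, not NiFi's scheduler: the `Connection` class, the threshold of 2, and the function names are invented for the example.

```python
from collections import deque


class Connection:
    """A queue between two processors with a back-pressure object threshold."""

    def __init__(self, threshold):
        self.queue = deque()
        self.threshold = threshold

    def is_full(self):
        return len(self.queue) >= self.threshold


def schedule_p1(connection, produced):
    """Flow-controller step: run P1 only while the connection is below its threshold."""
    if connection.is_full():
        return False  # P1 is not scheduled
    connection.queue.append(produced)
    return True


def run_p2(connection):
    """P2 consumes one FlowFile, which frees space in the connection."""
    return connection.queue.popleft() if connection.queue else None


conn = Connection(threshold=2)
results = [schedule_p1(conn, f"flowfile-{i}") for i in range(3)]  # third attempt is refused
consumed = run_p2(conn)                    # P2 drains one FlowFile
resumed = schedule_p1(conn, "flowfile-3")  # P1 can be scheduled again
```

The upstream processor is never told to slow down explicitly; it simply stops being scheduled until the downstream side catches up.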


  29. Process Group
    Process Group: building a new processor from existing processors
    How Apache Nifi works — surf on your dataflow, don’t drown in it


  30. FlowFile Content in Content Repository
    The FlowFile holds a pointer that references its content in the Content Repository.
    https://nifi.apache.org/docs/nifi-docs/html/nifi-in-depth.html#deeper-view-content-claim
    How Apache Nifi works — surf on your dataflow, don’t drown in it


  31. Copy-on-write in NiFi
    The original content is still present in the repository after a FlowFile modification.
    How Apache Nifi works — surf on your dataflow, don’t drown in it
    Example: a processor compresses the content; the compressed bytes are written as new content.
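
Copy-on-write can be illustrated with a dictionary that stands in for the content repository. This is a conceptual sketch, not how NiFi stores content claims on disk: the claim ids, the dictionary-based FlowFile, and the function names are assumptions for the example.

```python
import zlib

# claim id -> bytes; a stand-in for the content repository
content_repository = {}


def write_content(data):
    """Store bytes under a new claim id and return that id."""
    claim = f"claim-{len(content_repository) + 1}"
    content_repository[claim] = data
    return claim


def compress(flowfile):
    """Return a new FlowFile pointing at a new claim; the old claim is left as is."""
    original = content_repository[flowfile["claim"]]
    new_claim = write_content(zlib.compress(original))
    return {**flowfile, "claim": new_claim}


before = {"filename": "hello.txt", "claim": write_content(b"Hello world! " * 10)}
after = compress(before)
```

After the call, both the original and the compressed bytes exist in the repository, which is what makes replay from an earlier point possible.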


  32. FlowFile Repository
    The FlowFile Repository contains metadata about the files currently in the flow.
    How Apache Nifi works — surf on your dataflow, don’t drown in it


  33. Provenance Repository
    Provenance Event
    ● Track the complete history of all the FlowFiles in the flow
    ● The metadata and context information of each FlowFile
    ● Replay the data from any point in time
    How Apache Nifi works — surf on your dataflow, don’t drown in it


  34. WebCrawler Template
    Apache NiFi In Depth


  35. https://github.com/bbende/nifi-streaming-examples


  36. https://github.com/bbende/nifi-streaming-examples


  37. NiFi flow design is like software development
    “Best practices and lessons learnt from Running Apache NiFi”, DataWorks Summit 2018
    Design -> Code/develop -> Test -> Deploy -> Monitor -> Evolve/Refactor
    Programming language
    ● Integrated Development Environment (IDE)
    ● Algorithm, develop
    ● Functions/ Module (arguments, results)
    ● Libraries, packages
    ● ...
    Apache NiFi
    ● GUI
    ● Flow design, drag and drop
    ● Process groups (input/ output)
    ● Templates
    ● ...


  38. NiFi Positioning
    Apache NiFi positioned relative to:
    ● Processing Framework (Spark, Storm, ...)
    ● Messaging Bus (Kafka, JMS, ...)
    ● Enterprise Service Bus (Fuse (Camel), Mule, ...)
    ● ETL tools (SSIS, Informatica, ...)


  39. So, should I use it?
    ● Review what you need and choose accordingly
    ● Do you need an enterprise dataflow platform?
    ● Do you need to integrate with many different big data solutions?
    ● You should be able to build the same pipeline without NiFi
    ● You may need to enable administration functions for large-scale usage


  40. Another way to access NiFi
    Photo by Lachlan Donald on Unsplash


  41. REST API
    A programmatic way to command and control a NiFi instance in real time
    https://nifi.apache.org/docs/nifi-docs/rest-api/index.html


  42. Get NiFi instance information
    GET http://NIFI_instance/nifi-api/flow/about
    Response
    {
      "about": {
        "title": "NiFi",
        "version": "1.9.2",
        "uri": "http://localhost:38080/nifi-api/",
        "contentViewerUrl": "../nifi-content-viewer/",
        "timezone": "UTC",
        "buildTag": "nifi-1.9.2-RC2",
        "buildRevision": "ff01ff6",
        "buildBranch": "NIFI-6169-RC2",
        "buildTimestamp": "04/03/2019 15:25:53 UTC"
      }
    }
    https://nifi.apache.org/docs/nifi-docs/rest-api/
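
A minimal Python sketch of calling this endpoint with only the standard library could look like the following. It assumes an unsecured NiFi instance reachable over plain HTTP; the helper names and the shortened sample body are made up for illustration, and only the `/nifi-api/flow/about` path and the `about`, `title`, and `version` fields come from the slide.

```python
import json
from urllib.request import urlopen


def about_url(base_url):
    """Build the URL of the 'about' endpoint for a NiFi base URL."""
    return base_url.rstrip("/") + "/nifi-api/flow/about"


def parse_about(body):
    """Pull the title and version out of the JSON response body."""
    about = json.loads(body)["about"]
    return about["title"], about["version"]


def fetch_about(base_url, timeout=5):
    """Call the endpoint on a live instance (not executed in this sketch)."""
    with urlopen(about_url(base_url), timeout=timeout) as response:
        return parse_about(response.read().decode("utf-8"))


# Shortened copy of the response shown on the slide.
SAMPLE = '{"about": {"title": "NiFi", "version": "1.9.2", "timezone": "UTC"}}'
title, version = parse_about(SAMPLE)
```

Against a running instance, `fetch_about("http://localhost:38080")` would perform the actual request.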


  43. Search
    GET http://NIFI_instance/nifi-api/flow/search-results?q=pycon
    Response
    {
      "searchResultsDTO": {
        "processorResults": [],
        "connectionResults": [],
        "processGroupResults": [
          {
            "id": "2a5d288c-016d-1000-7ede-e736a588559b",
            "groupId": "382a2072-016c-1000-4c24-50b6a90ec204",
            "parentGroup": {
              "id": "382a2072-016c-1000-4c24-50b6a90ec204",
              "name": "NiFi Flow"
            },
            "name": "PyconTW 2019",
            "matches": ["Name: PyconTW 2019"]
          }
        ],
        "inputPortResults": [],
        "outputPortResults": [],
        "remoteProcessGroupResults": [],
        "funnelResults": []
      }
    }
    https://nifi.apache.org/docs/nifi-docs/rest-api/
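
The search response can be reduced to a name-to-id lookup with a few lines of Python. This is a sketch under assumptions: the helper names are hypothetical, and the sample body is a trimmed copy of the response on the slide rather than live output.

```python
import json
from urllib.parse import urlencode


def search_url(base_url, query):
    """Build the search URL, URL-encoding the query term."""
    query_string = urlencode({"q": query})
    return base_url.rstrip("/") + "/nifi-api/flow/search-results?" + query_string


def process_group_ids(body):
    """Map each matching process group name to its id."""
    results = json.loads(body)["searchResultsDTO"]["processGroupResults"]
    return {r["name"]: r["id"] for r in results}


# Trimmed copy of the response shown on the slide.
SAMPLE = json.dumps({
    "searchResultsDTO": {
        "processorResults": [],
        "processGroupResults": [
            {"id": "2a5d288c-016d-1000-7ede-e736a588559b", "name": "PyconTW 2019"}
        ],
    }
})
ids = process_group_ids(SAMPLE)
```

The id found this way is what later calls, such as starting or stopping components, need as input.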


  44. Stop a processor
    processor_id = 2b5da9d5-016d-1000-d67f-bcd140eccb41
    PUT /nifi-api/processors/2b5da9d5-016d-1000-d67f-bcd140eccb41 HTTP/1.1
    Host: localhost:38080
    Content-Type: application/json
    Cache-Control: no-cache
    {
      "revision": {
        "clientId": "2a4758dc-016d-1000-56b9-2fc8baffbc07",
        "version": 7
      },
      "component": {
        "id": "2b5da9d5-016d-1000-d67f-bcd140eccb41",
        "state": "STOPPED"
      }
    }
    https://nifi.apache.org/docs/nifi-docs/rest-api/
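
The same PUT request can be assembled in Python with the standard library. This sketch only builds the request object and does not send it; the function name is hypothetical, and it assumes an unsecured instance. The ids and revision version are the ones shown on the slide, and on a real instance the revision version has to match the processor's current revision.

```python
import json
from urllib.request import Request


def stop_processor_request(base_url, processor_id, client_id, version):
    """Build (but do not send) the PUT request that stops a processor."""
    body = {
        "revision": {"clientId": client_id, "version": version},
        "component": {"id": processor_id, "state": "STOPPED"},
    }
    return Request(
        url=base_url.rstrip("/") + "/nifi-api/processors/" + processor_id,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = stop_processor_request(
    "http://localhost:38080",
    "2b5da9d5-016d-1000-d67f-bcd140eccb41",
    "2a4758dc-016d-1000-56b9-2fc8baffbc07",
    7,
)
```

Passing `request` to `urllib.request.urlopen` would send it to a running instance.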


  45. NiPyApi
    Nifi-Python-Api: A rich Apache NiFi Python Client SDK
    By Dan Chaffelson
    ● Detailed documentation of the full SDK at all levels
    ● CRUD wrappers for common task areas like Processor Groups, Processors, Templates, Registry Clients, Registry Buckets, Registry Flows, etc.
    ● Convenience functions for inventory tasks, such as recursively retrieving the entire canvas, or a flat list of all Process Groups
    ● Support for scheduling and purging flows, controller services, and connections
    ● Support for fetching and updating Variable Registries
    ● Support for import/export of Versioned Flows from NiFi-Registry
    ● Docker Compose configurations for testing and deployment
    ● A scripted deployment of an interactive environment, and a secured configuration, for testing and demonstration purposes


  46. NiPyApi Client SDK modules
    ● Canvas: interactions with the NiFi Canvas
    ● Config: a set of defaults and parameters
    ● Security: secure connectivity management
    ● System: system and cluster level functions in NiFi
    ● Templates: managing flow deployments
    ● Utils: convenience utility functions for NiPyApi (not intended for external use)
    ● Versioning: interactions with the NiFi Registry Service and related functions
    https://github.com/sucitw/python-script-in-NiFi/blob/master/RESTAPI/nipyapi_demo.ipynb
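
As a rough sketch of how the Config and Canvas modules fit together, the function below stops a processor by name. It is written from memory of the NiPyApi 0.x interface and is not taken from the linked notebook, so the calls (`nipyapi.config.nifi_config.host`, `nipyapi.canvas.get_processor`, `nipyapi.canvas.schedule_processor`) should be checked against the installed version. The function name and the example URL are assumptions.

```python
def stop_processor_by_name(nifi_api_url, processor_name):
    """Stop a processor found by name; returns False when no processor matches."""
    # Third-party dependency, imported lazily: pip install nipyapi
    import nipyapi

    # Config module: point the client at the NiFi REST API.
    nipyapi.config.nifi_config.host = nifi_api_url

    # Canvas module: look the processor up by name.
    processor = nipyapi.canvas.get_processor(processor_name, identifier_type="name")
    if processor is None:
        return False
    if isinstance(processor, list):
        # Several processors share the name; this sketch takes the first one.
        processor = processor[0]

    # scheduled=False asks NiFi to stop the processor.
    return nipyapi.canvas.schedule_processor(processor, scheduled=False)
```

A call such as `stop_processor_by_name("http://localhost:38080/nifi-api", "GenerateFlowFile")` needs a running NiFi instance, so it is not executed here.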


  47. Choose what you need from many processors
    Dynamic vs Compiled processor
    Photo by Mitch Lensink on Unsplash


  48. 300+ Processors for Data Ecosystem Integration
    2019, DataWorks Summit, Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
    “Swiss Army Knife of Data Movement”


  49. Processors
    Data Ingestion
    ● GenerateFlowFile
    ● GetFile
    ● GetFTP
    ● GetJMSQueue
    ● GetHDFS
    ● GetKafka
    ● GetMongo
    ● GetTwitter
    ● ListHDFS/FetchHDFS
    Data Transformation
    ● ReplaceText
    ● ConvertRecord
    ● UpdateRecord
    ● ConvertJSONToSQL
    ● CompressContent
    ● ConvertCharacterSet
    ● TransformXml
    ● ...
    Data Egress/Send Data
    ● PutEmail
    ● PutFile
    ● PutFTP
    ● PutSQL
    ● PutMongo
    ● PutHDFS
    ● PutHiveQL
    ● ...
    Routing and Mediation
    ● ControlRate
    ● RouteOnAttribute
    ● RouteOnContent
    ● ValidateCsv
    ● DetectDuplicate
    ● DistributeLoad
    ● ...
    Database Access
    ● ExecuteSQL
    ● ListDatabaseTables
    ● ...
    Attribute Extraction
    ● EvaluateJsonPath
    ● ExtractText
    ● UpdateAttribute
    ● ...
    System Interaction
    ● ExecuteProcess
    ● ExecuteStreamCommand
    ● ExecuteScript
    ● ...
    Splitting and Aggregation
    ● SplitText
    ● SplitJson
    ● MergeContent
    ● ...
    HTTP
    ● GetHTTP
    ● ListenHTTP
    ● InvokeHTTP
    ● ...
    AWS
    ● FetchS3Object
    ● GetSQS
    ● GetDynamoDB
    ● ...
    There are more...


  50. https://nifi.apache.org/docs.html


  51. Still looking for
    What you really need


  52. Have you heard about
    Custom Development?
    Photo by Jan Kopřiva on Unsplash


  53. Capabilities of Custom Development
    ● ExecuteProcess / ExecuteStreamCommand
    ● ExecuteScript / InvokeScriptedProcessor
    ○ ExecuteGroovyScript
    ● Custom Processor (Java)
    BYOP: Custom Processor Development with Apache NiFi, Andy LoPresto, DataWorks Summit, 2019


  54. ExecuteProcess & ExecuteStreamCommand
    Both execute an external command
    ● E.g., an external Python/Shell script
    ExecuteProcess
    ● Executes an arbitrary process and captures its output to a FlowFile
    ● No incoming FlowFile (no upstream connections)
    ExecuteStreamCommand
    ● Runs the command on the contents of a FlowFile
    ○ Pipes FlowFile content to STDIN
    ● Creates a new FlowFile with the results of the command
    ○ Populates outgoing FlowFile content from STDOUT
    ○ Can also direct output to a named attribute
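
An external Python script suitable for ExecuteStreamCommand only has to read STDIN and write STDOUT. The sketch below is an assumption-laden example: the upper-casing step is placeholder logic, and the `--stdin` flag is a hypothetical argument (it would be passed through the processor's command arguments) used here so the script does nothing when merely imported or run without it.

```python
import sys


def transform(text):
    """Placeholder business logic: upper-case the incoming FlowFile content."""
    return text.upper()


def main(stdin, stdout):
    """Read the FlowFile content from STDIN and write the result to STDOUT."""
    stdout.write(transform(stdin.read()))


# "--stdin" is a hypothetical flag chosen for this sketch.
if __name__ == "__main__" and "--stdin" in sys.argv[1:]:
    main(sys.stdin, sys.stdout)
```

Because the script runs in a regular CPython interpreter outside NiFi, it can use any library installed on the host, including ones with C extensions.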




  56. ExecuteScript and InvokeScriptedProcessor (ISP)
    Deliver custom logic
    Photo by Thabang Mokoena on Unsplash


  57. ExecuteScript (ES)
    ● Clojure, Groovy, Ruby, Python (Jython - no C libs), Lua, JavaScript
    ○ JSR-223 compatible languages
    ● Easy to use, fast development
    ● Script source is inline or on the file system
    ● ExecuteScript => evaluates the script each time “onTrigger()” is called
    ● Only two relationships available (REL_SUCCESS and REL_FAILURE)
    ● Poor performance
    https://nifi.apache.org/docs.html (ExecuteScript)
    https://github.com/sucitw/python-script-in-NiFi


  58. ExecuteScript in Python (Jython)
    Basic NiFi APIs you need to know
    ● ProcessSession
    ● ProcessContext
    ● ComponentLog
    ● Relationship
    ● ...
    https://nifi.apache.org/developer-guide.html


  59. Hello-World ExecuteScript in Python
    Update Attribute
    https://github.com/sucitw/python-script-in-NiFi/blob/master/hello_world(update_attribute).py
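
A minimal update-attribute script for ExecuteScript might look like the sketch below. It is not a copy of the linked file. Inside ExecuteScript, NiFi binds `session` and `REL_SUCCESS` as script variables; the `on_trigger` wrapper, the attribute name `greeting`, and the guard that skips the call outside NiFi are assumptions added so the logic can also be exercised in a plain Python interpreter. The syntax is kept compatible with Jython 2.7.

```python
def on_trigger(session, rel_success):
    """Add a 'greeting' attribute to one incoming FlowFile and route it to success."""
    flow_file = session.get()
    if flow_file is None:
        return None  # nothing queued upstream
    # putAttribute returns the updated FlowFile reference, so keep using the result.
    flow_file = session.putAttribute(flow_file, "greeting", "Hello, World!")
    session.transfer(flow_file, rel_success)
    return flow_file


# Inside ExecuteScript the names below are provided by NiFi.
# Outside NiFi they do not exist, so the call is skipped.
try:
    _session, _rel_success = session, REL_SUCCESS
except NameError:
    _session = None

if _session is not None:
    on_trigger(_session, _rel_success)
```

ExecuteScript commits the session after the script finishes, so the script itself does not call `session.commit()`.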


  60. InvokeScriptedProcessor (ISP)
    ● Faster than ExecuteScript (ES)
    ● Differences from ExecuteScript
    ○ Supports custom properties and relationships
    ○ ES handles the session.commit() for you, but ISP does not.
    ○ ES has a "session" variable, whereas the ISP onTrigger() method must call sessionFactory.createSession()
    https://nifi.apache.org/docs.html (InvokeScriptedProcessor)
    https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-scripting-bundle/nifi-scripting-processors/src/test/resources/jython


  61. Hello-World InvokeScriptedProcessor in Python
    https://github.com/sucitw/python-script-in-NiFi/blob/master/InvokeScripted_hello_world(update_attribute).py
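
The shape of the `onTrigger()` logic in an InvokeScriptedProcessor script can be sketched as follows. This is a skeleton and not a copy of the linked file. In a real Jython script the class would additionally implement NiFi's `org.apache.nifi.processor.Processor` interface (with its remaining methods), build its relationship with NiFi's `Relationship` class, and be assigned to a variable named `processor`. Here the relationship is passed in as a plain argument, and the class name and attribute name are hypothetical.

```python
class HelloProcessor(object):
    """Skeleton of the onTrigger logic for an InvokeScriptedProcessor script."""

    def __init__(self, rel_success):
        self.rel_success = rel_success

    def onTrigger(self, context, sessionFactory):
        # ISP has no pre-bound `session`; the script creates one itself.
        session = sessionFactory.createSession()
        try:
            flow_file = session.get()
            if flow_file is None:
                return
            flow_file = session.putAttribute(flow_file, "greeting", "Hello, World!")
            session.transfer(flow_file, self.rel_success)
            # ISP does not commit for you, so the script commits explicitly.
            session.commit()
        except Exception:
            session.rollback()
            raise
```

Compared with ExecuteScript, the extra work is creating the session and committing or rolling it back.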


  62. Photo by Ben Weber on Unsplash
    Tips and Tricks
    Suggestions


  63. Suggestions
    ● Be able to do the same ETL/ELT without NiFi
    ● Do it like software development
    ○ Focus on business logic first
    ○ Do it after design
    ○ Keep it simple
    ○ Separate environments: Dev, Test, Production
    ○ Versioning
    ○ Build once, deploy many
    ● Extract - Load - Transform (with external tools)
    ● Design for Failure (fault tolerance)
    ○ Good retry/error handling mechanism
    ○ Monitoring
    ○ Clustering in Production at least
    ● Use process groups (PGs)
    ● PGs => Ingestion, test, and monitoring
    ● Naming convention
    ● Put labels/comments
    ● Use variables, use funnel
    ● Scheduling -> Automation
    ● Define SLAs with your clients
    ● Do monitoring and alerting
    ● Use NiFi for handling routing and formatting
    ● Try not to queue data
    ● Control FlowFile stream rate
    ● ...


  64. Recap
    ● ETL/ELT pipelines and why they are hard
    ● How NiFi may help you
    ● Hello-World NiFi
    ● Key concepts and terms of NiFi (FlowFile, Processors, Connections)
    ● REST API with examples (NiPyApi)
    ● ExecuteProcess/ExecuteStreamCommand
    ● ExecuteScript/InvokeScriptedProcessor with Python
    ● Suggestions on NiFi


  65. Reference
    1. How Apache Nifi works — surf on your dataflow, don’t drown in it by François Paupier
    2. Pipeline Builder: Micron’s Journey Automating the Global Data Warehouse, DataWorks Summit 2019
    3. Best practices for running Apache NiFi in production - 3 takeaways from real world projects / “Best practices and lessons learnt from Running Apache NiFi”, DataWorks Summit 2018
    4. BYOP: Custom Processor Development in Apache NiFi, DataWorks Summit Barcelona 2019
    5. Hello-World in Apache NiFi
    6. NiFi development tips (Funnel) (original title: Nifi 開發小技巧(Funnel))


  66. More about NiFi
    ● Apache NiFi In Depth
    ● Play with more processors
    ○ Choose what you need and do it right
    ● Learn from Dataflow Templates
    ● NiFi Expression Language
    ● Build Your Own Processor
    ○ BYOP: Custom Processor Development with Apache NiFi
    ● MiNiFi
    ● NiFi Registry
