Data Engineering without borders

Presentation covering the essentials of data engineering, given at a GOTO Night in Berlin.

https://www.meetup.com/GOTO-Nights-Berlin/events/240727991/

Pere Urbón

July 04, 2017

Transcript

  1. Data Engineering without
    borders
    Pere Urbon Bayes
    Data Wrangler
    pere.urbon @ { gmail.com, acm.org, springernature.com}

  2. About me
    Architecting and building data-centric applications since the early 2000s
    All things around search and data processing
    Free and Open Source contributor and enthusiast
    Wannabe speaker
    #Dadops practitioner
    Can talk faster than you think!
    Long distance runner
    TV series aficionado

  3. Working at Springer Nature @ Berlin
    We are looking for:
    Lead QA
    Java/Scala Developers
    IT Agile Business Analyst
    Frontend Developers
    Infrastructure & Cloud Engineer
    Platform Engineer

  4. Topics of today
    Data engineering, What is this?
    The data centric organization
    From the past to the present day
    Facts and Skills for Data Engineers
    Hiring and developing in-house talent

  5. Data Engineering?

  6. Data Engineering, What is this?
    Nothing new under the sun: common roles have extended to Data Engineer, Data Scientist,
    BI and Data Analyst.
    Roles tend to be blurry, with no exact boundaries between them.
    Asking How? or Why? tells you whether you are one or the other.
    With the advent of “Big Data”, the term data engineer is used in the industry for the
    specialized Software Engineers who take care of the infrastructure required to fulfill the
    requirements of data-intensive applications.
    Data Engineers usually deal with things such as databases of several kinds (relational and
    NoSQL), Hadoop, Spark, Flink, Elasticsearch, Redis, Kafka, Java, Scala and others.

  7. Data Engineering, What is this?
    Data Engineers find themselves more often than not dealing with (big) data—
    from acquisition over cleaning, conversion, disambiguation, de-duplication—and
    also developing & deploying solutions.
    The role of a Data Scientist undoubtedly requires a solid grounding in
    mathematics.
    https://medium.com/@mhausenblas/got-a-data-in-your-job-title-9b4c8919973b
    The Data Engineer, in turn, needs a solid understanding of several topics such as
    distributed systems, databases, operating systems and software engineering, without
    forgetting human interaction.

  8. Data Engineering, What is this?
    Data Engineer | Data Scientist
    Software background with touches of systems | Strong background in mathematics
    Databases, NoSQL, Big Data, Batch Processing, Streams, Distributed Systems, Scalability, HA, ... | Statistics, Models, Machine Learning, Predictions, Inferences, ...
    Data Ingestion, Cleansing, ETL, Warehousing, Modeling, ... | Data analysis, Machine Learning, Research, Business
    Basic understanding of Scientist tasks is a desired skill | Communication with Engineers is crucial
    Architecture and Infrastructure | Data analysis

  9. The data
    organization

  10. It all started with...

  11. ... and continued with...

  12. The era of the monolith application

  13. Everything changed in the 2000s
    The advent of NoSQL and Big Data

  15. The rise of data democracy

  16. On the human side of things….

  17. The data anarchism

  18. Democracy should be governed

  19. From Anarchism to Governance
    • The Single Application principle
    • Everyone is living a happy life
    • The rise of multiple applications
    • The service-oriented architecture
    • Introduction of an Analytics Team (BI / Data Science / …)
    • Data needs to be prepared and integrated for analytics
    • Integration with external data providers
    • Data import, normalization, validation at scale
    • [Semantic] Data Integration
    • Company mergers
    • Unification of similar or heterogeneous datasets

  20. A framework for Data Governance
    ISO 8000-150 as a framework for data governance

  21. Squads, Chapters, Tribes, …. for Data
    These ideas lead to the following set of data-centric roles:
    Infrastructure specialist (database, brokers, virtualization and cloud)
    The data technicians
    Data Processing
    Data Modeling and Administration
    Data Discovery
    Data Architecture
    The data quality control
    How do you organize these specialities? There is usually no perfect solution for everybody, however ...

  22. Squads, Chapters, Tribes, …. for Data
    Role | Organization
    Infrastructure specialists | Specific teams are created ad hoc, highly specialized.
    Processing | Specific team, with broader scope, embedded in specific product teams.
    Modeling and Administration | Part of specific product teams.
    Discovery | Specific team, highly specialized.
    Architecture | Part of the global architecture team as domain experts.
    Quality Control | Global chapter, part of the product team.

  23. Have all responsibilities extended?
    • Relational databases remain strong and show no sign
    of decline.
    • SQL is not going away
    • Mass adoption of NoSQL solutions.
    • Schema on write vs Schema on read
    • Data Integration and Standardization
    • Atomicity, Consistency, Isolation and Durability
    (The ACID properties)
    • The CAP theorem
    • Ever increasing need to process data
    • Batch processing, Stream Processing,
    Hybrid approaches
    • Interaction with external systems (SOAP,
    REST, JSON, XML,...)
    • Machine learning is everywhere
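
The contrast between schema on write and schema on read listed above can be illustrated with a minimal Python sketch. The `SCHEMA` mapping, the `validate` helper and both store classes are hypothetical names introduced only for this illustration; they are not taken from the talk or from any particular database.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
SCHEMA = {"id": int, "title": str}


def validate(record):
    """Return the record unchanged, or raise ValueError if it violates SCHEMA."""
    if not isinstance(record, dict) or set(record) != set(SCHEMA):
        raise ValueError(f"record does not match schema: {record!r}")
    for field, expected in SCHEMA.items():
        if not isinstance(record[field], expected):
            raise ValueError(f"{field} must be of type {expected.__name__}")
    return record


class SchemaOnWriteStore:
    """Validates at write time, so everything stored is known to be well formed."""

    def __init__(self):
        self._rows = []

    def write(self, record):
        # Bad data is rejected here, before it is stored.
        self._rows.append(validate(record))

    def read(self):
        return list(self._rows)


class SchemaOnReadStore:
    """Stores raw text as-is; the schema is applied only when reading."""

    def __init__(self):
        self._raw = []

    def write(self, raw):
        # Anything is accepted at write time.
        self._raw.append(raw)

    def read(self):
        good, bad = [], []
        for raw in self._raw:
            try:
                # json.JSONDecodeError is a subclass of ValueError.
                good.append(validate(json.loads(raw)))
            except ValueError:
                bad.append(raw)
        return good, bad
```

In this sketch the write-time store pushes the cost of bad data onto the producer, while the read-time store accepts everything and leaves each consumer to decide what to do with records that do not parse. That is roughly the trade-off behind the warehouse and data lake comparison.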

  24. Have all responsibilities extended?
    • From Data warehouse to Data Lake
    • schema on write vs schema on read
    • Relational and bare metal vs NoSQL
    and cloud
    • ….

  25. Architecture patterns and tips

  26. Architecture patterns and tips

  27. Architecture patterns and tips
    Dedicated data ingestion platform (dedicated but generic)
    Have at least a common format (for example, Avro)
    Schema information is propagated continuously
    Apply data governance principles by introducing explicit data dependencies
    Data is managed by the entity with the most experience of it
    Data is either pushed or streamed into the next step in the flow
    Model flow orchestration explicitly
    Facilitate error handling with immutability
    Set up with versatility in mind
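
Three of the tips above (explicit data dependencies, explicit flow orchestration, and immutability for error handling) can be sketched together in a few lines of Python. This is only an illustration under assumptions: the `Record` type, the step functions, the sample DOIs and the `DEPENDS_ON` mapping are all hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from dataclasses import dataclass, replace
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class Record:
    """Immutable record: a step returns new copies instead of mutating its input."""
    doi: str
    title: str


def ingest(_inputs):
    # Hypothetical raw input containing a duplicate that differs only in formatting.
    return (
        Record("10.1000/A1", "  data engineering "),
        Record("10.1000/a1", "Data Engineering"),
    )


def clean(inputs):
    return tuple(
        replace(r, doi=r.doi.lower(), title=r.title.strip().title())
        for r in inputs["ingest"]
    )


def deduplicate(inputs):
    unique = {}
    for r in inputs["clean"]:
        unique.setdefault(r.doi, r)  # keep the first record seen per DOI
    return tuple(unique.values())


STEPS = {"ingest": ingest, "clean": clean, "deduplicate": deduplicate}

# Explicit data dependencies: each step names the steps whose output it reads.
DEPENDS_ON = {"ingest": set(), "clean": {"ingest"}, "deduplicate": {"clean"}}


def run_flow():
    """Run every step after its dependencies and keep all intermediate outputs."""
    outputs = {}
    for name in TopologicalSorter(DEPENDS_ON).static_order():
        inputs = {dep: outputs[dep] for dep in DEPENDS_ON[name]}
        outputs[name] = STEPS[name](inputs)
    return outputs
```

Because no step modifies its input, the output of every earlier step is still available unchanged when a later step fails, so the failing step can be fixed and re-run without ingesting the data again.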

  28. Facts and Skills for
    Data Engineers

  29. Number of data engineers over time

  30. Prior role for data engineers

  31. Worldwide location

  32. Distribution by industry

  33. Distribution per company

  34. Top Skills

  35. Skills based on company size

  36. Data Engineering vs Data Scientists (Skills)

  37. Hiring and in-house
    talent

  38. Hiring and Retaining Data Engineers
    Some factors that will help you recruit and build a good team:
    Build an enviable team; smart engineers want to work with people they can learn
    from.
    Remove the walls. Data Engineers and Data Scientists should work in close
    collaboration.
    Have an environment where creativity is free to flow.

  39. Hiring and Retaining Data Engineers
    Focus on the right kind of engineers:
    • BI, Warehouse, Data lake, Ingestion: strong focus on data processing
    and analytics.
    • Tools: Specialized in a certain tool (Spark, Hadoop, …)
    • Architecture: End-to-end thinker, going from data processing to how teams use
    it.
    • Infrastructure: Focused on setting up databases and tools.

  40. References
    [1] Asim Jalis. The State of Data Engineering. stitchdata.com.
    [https://www.stitchdata.com/resources/reports/the-state-of-data-engineering/]
    [2] Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald.
    Magic Quadrant for Operational Database Management Systems. Gartner. October 5, 2016.
    [3] Thomas W. Oestreich, Andrew White. Must-Have Roles for Data and Analytics, 2017.
    Gartner, 2017.
    [4] Jay Kreps. Questioning the Lambda Architecture. O'Reilly Media, 2014.
    [https://www.oreilly.com/ideas/questioning-the-lambda-architecture]

  41. Thanks, questions?
