Accelerate Insights and Streamline Ingestion with a Data Lake on AWS

Slides from a webinar with AWS looking at big data, data warehouses, data lakes and lake houses, and at how different types of data are treated to extract insights that lead to outcomes.

Helen Beal

April 09, 2021

Transcript

  1. © 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Accelerate insights and streamline ingestion with a data lake on AWS
  2. Agenda: Learn how to get the full benefits of cloud data lakes without compromising the productivity of your DevOps team. Presenters will explain how a modern data lake can:
    ◦ Realize the elasticity of the cloud and reduce costs through consumption-based utilization
    ◦ Prepare your structured, unstructured, and semi-structured data for integration with analytics and machine learning tools
    ◦ Support data democratization and enable collaboration while maintaining security and data governance
    Presenters: Kanchan Waikar, Solutions Architect, AWS; Helen Beal, Chief Ambassador, DevOps Institute
  3. Helen Beal is a DevOps and Ways of Working coach, Chief Ambassador at DevOps Institute and an ambassador for the Continuous Delivery Foundation. She is the Chair of the Value Stream Management Consortium and provides strategic advisory services to DevOps industry leaders such as Plutora and Moogsoft. She is also an analyst at Accelerated Strategies Group. She hosts the Day-to-Day DevOps webinar series for BrightTalk, speaks regularly on DevOps topics, is a DevOps editor for InfoQ and also writes for a number of other online platforms. She regularly appears in TechBeacon’s DevOps Top 100 lists and was recognized as the Top DevOps Evangelist 2020 in the DevOps Dozen awards. Herder of Humans. @bealhelen
  4. Talk map (flow): Data is everywhere → 3rd-party data → Extracting insights → The 3 Vs → Processing (batch and stream) → ML, AI and DataOps → DevOps practices
  5. Data is Everywhere: every application, every service, every environment. (Icons by Eucalyp and Freepik from Flaticon)
  6. Data from Third Parties: find, subscribe to, and use third-party data in the cloud. (Icons by Freepik from Flaticon)
  7. Extracting Insights: Data → Insights → Outcomes. An insight is only of value when it leads to a positive outcome. Why is this so hard to do? (Icons by Freepik from Flaticon)
  8. Different Data, Different Needs: the 3 Vs of big data are (1) Volume, (2) Velocity and (3) Variety. It’s a scaling problem.
  9. Stream Processing: the goal is to process big data volumes and provide useful insights into the data prior to saving it to long-term storage.
    Dimension                | Batch             | Stream
    History                  | Traditional       | Modern
    Data processing location | System of record  | Source
    In event of failure      | Restart batch     | Retry increment
    Pros                     | Simple and robust | Live, scalable, fault tolerant
    Cons                     | Latency           | Complex, expensive?
    Use cases for stream processing are found where systems handle big data volumes and real-time results matter. If the value of the information contained in the data stream decreases rapidly as it gets older, stream processing is appropriate (a minimal consumer sketch follows this list), e.g.:
    • Real-time analytics
    • Anomaly, fraud or pattern detection
    • Complex event processing
    • Real-time statistics and dashboards
    • Real-time extract, transform, load (ETL)
    • Implementing event-driven architectures
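To make the streaming model concrete, below is a minimal Python sketch that tails a single shard of an Amazon Kinesis stream with boto3 and processes each record while its value is still fresh. The stream name "clickstream" is a hypothetical placeholder; a production consumer would normally use the Kinesis Client Library for checkpointing, multi-shard coordination and failure handling.

```python
# Minimal sketch: consume a Kinesis stream with boto3.
# The stream name "clickstream" is hypothetical; error handling
# and checkpointing are omitted for brevity.
import json
import time

import boto3

kinesis = boto3.client("kinesis")

# Read the first shard of the (hypothetical) stream, starting from
# records that arrive after this point.
shard_id = kinesis.describe_stream(StreamName="clickstream")[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream",
    ShardId=shard_id,
    ShardIteratorType="LATEST",
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        event = json.loads(record["Data"])
        # Act on the event while it is fresh: the value of a stream
        # record decays rapidly as it ages.
        print(event)
    iterator = resp.get("NextShardIterator")
    time.sleep(1)  # stay under the per-shard GetRecords limits
```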
  10. Aggregation and Batch Processing:
    • Use batch processing jobs to prepare large, bulk datasets for downstream analytics
    • Avoid just lifting and shifting batch processing to AWS; use the opportunity to improve the service
    • Automate and orchestrate everywhere (see the sketch after this list)
    • Use Spot Instances to save on flexible batch processing jobs
    • Continuously monitor and improve batch processing
    • Use Amazon Redshift for data warehousing needs
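As one illustration of automating a batch preparation step, here is a minimal boto3 sketch that triggers a pre-defined AWS Glue job and waits for it to finish. The job name "prepare-daily-dataset" is a hypothetical placeholder, and in practice the run would be orchestrated (Step Functions, Airflow, EventBridge) rather than polled inline.

```python
# Minimal sketch: run a pre-defined AWS Glue batch job via boto3.
# "prepare-daily-dataset" is a hypothetical job assumed to exist,
# e.g. a Glue Spark job writing curated Parquet for analytics.
import time

import boto3

glue = boto3.client("glue")

run_id = glue.start_job_run(JobName="prepare-daily-dataset")["JobRunId"]

# Poll until the run reaches a terminal state.
while True:
    state = glue.get_job_run(
        JobName="prepare-daily-dataset", RunId=run_id
    )["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

print(f"Batch run {run_id} finished with state {state}")
```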
  11. Structured and Unstructured Data: semi-structured data uses tagging systems or other markers, separating different elements and enabling search (self-describing); think JSON, CSV, XML (a small parsing example follows this slide).
    Dimension            | Structured                              | Unstructured
    Format               | Defined                                 | Undefined
    Type                 | Quantitative                            | Qualitative
    Usually stored in    | Data warehouses                         | Data lakes
    Search and analyze   | Easy                                    | More work
    Database             | RDBMS                                   | NoSQL
    Programming language | SQL                                     | Various
    Analysis             | Regression, classification, clustering  | Data mining, data stacking
    Customer insights    | High level                              | Deeper insights into sentiment and behavior
    Unstructured data (typically 80%+ of all data, versus roughly 20% structured) includes:
    • Documents • Publications • Reports • Emails • Social media • Videos • Images • Audio • Mobile activity • Satellite imagery • Sensors
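The following small Python example illustrates why tagged formats such as JSON are called self-describing: the record's keys carry the schema, so it can be parsed and searched ad hoc without a predefined table definition. The record itself is invented for illustration.

```python
# Semi-structured data is self-describing: field names travel
# with the values, so no up-front table schema is needed.
import json

raw = """
{
  "user": {"id": 42, "name": "Ada"},
  "events": [
    {"type": "view",  "item": "sku-123"},
    {"type": "click", "item": "sku-123"}
  ]
}
"""

record = json.loads(raw)

# The tags (keys) both separate the elements and describe them,
# which is what makes ad hoc search over this data possible.
clicks = [e["item"] for e in record["events"] if e["type"] == "click"]
print(record["user"]["name"], clicks)  # Ada ['sku-123']
```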
  12. Centrally Managed Data: data warehouses, lakes and lake houses are key to enabling analytics (a query sketch follows this list).
    • Predictive analysis
    • Join data across lines of business (LoBs)
    • Cross-organizational insights
    • Make better business decisions
    • Automate decision support systems (DSS)
    • Improve customer interactions
    • Improve R&D innovation choices
    • Increase operational efficiencies
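To show how centrally managed lake data becomes queryable for better business decisions, here is a minimal boto3 sketch using Amazon Athena. The database ("lakehouse"), table ("orders") and the S3 results location are all hypothetical placeholders.

```python
# Minimal sketch: query centrally managed lake data with Athena.
# Database, table and results bucket are hypothetical placeholders.
import time

import boto3

athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString="SELECT region, SUM(amount) FROM orders GROUP BY region",
    QueryExecutionContext={"Database": "lakehouse"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Wait for the query to finish, then print the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)[
        "ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```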
  13. ML, AI and DataOps: a DevOps team quickly builds a high-quality, device-friendly app on a cloud-based platform designed with developer and consumer usability in mind, and makes it available via self-service. Business users don’t need to be data engineers or scientists to search, using natural language, for the answers they need and gain the insights that will enable them to make intelligent, data-driven business decisions.
    Capabilities: deep exploration • personalized insights • real-time queries • transparency • context • anomaly detection (a minimal sketch follows) • causal relationships • trend isolation • noise reduction • segmentation
    (Icons by Smashicons, Freepik, Dimitry Miroliubov, Eucalyp from Flaticon)
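As a toy illustration of the anomaly detection capability listed above, the sketch below flags points that deviate sharply from a sliding-window baseline using a simple z-score. Real analytics platforms use far richer models; the series and threshold here are invented for the example.

```python
# Toy anomaly detection: flag points more than `threshold` standard
# deviations from the mean of the preceding `window` observations.
from statistics import mean, stdev

def zscore_anomalies(values, window=20, threshold=3.0):
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append((i, values[i]))
    return anomalies

# Steady metric with one injected spike at index 25.
series = [100.0 + (i % 3) for i in range(40)]
series[25] = 250.0
print(zscore_anomalies(series))  # [(25, 250.0)]
```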
  14. Leveraging DevOps Practices (DevOps → DataOps):
    • Incremental, continuous change: data needs to be mined, and business intelligence analyzed, at speed and with adaptability too. Systems from backlog to deployment must handle data needs.
    • CI/CD and DevOps toolchains: teams working with data need to leverage the power of automation to maximize throughput and stability, and to provide CI/CD capabilities with a limited blast radius.
    • The Three Ways: we want to accelerate flow, amplify feedback and use our data to drive experiments too. Monitoring and observability are key, with AI for feedback.
    • A high-trust, collaborative culture: to build trust in a DevOps culture we have data-driven, not opinion-driven, conversations. Data must be available in real time, on demand and via self-service.
    • Value stream-centric working: truly understanding flow means all people in the value stream have a profound understanding of the end-to-end system, and this is driven by data insights.
    • “We build it, we own it”: teams must be multifunctional, cross-skilling must be standard practice, and it must be quick and easy to get results from tools, so choose those designed with usability in mind.
    • Focus on value outcomes: insights lead to decisions, which lead to measurable experience improvements for the customer; AI accelerates mean time to outcome (MTTO).
  15. Key Takeaways: Accelerate Insights and Streamline Ingestion with a Data Lake on AWS
    The 3 Vs:
    • There is a LOT of data, coming from many different sources, in multiple formats, at variable speeds
    • Remember the 3 Vs: Volume, Velocity and Variety; these demand scalability
    • Different data has different needs
    Data as a Service:
    • Making data available centrally is key for efficient processing and access
    • The objective is to make better business decisions
    • Those business decisions must result in sublime customer experiences
    Augmented Analytics:
    • AI/ML and predictive analytics accelerate time to insight and time to outcome
    • This makes more innovation time available to build differentiating features
    • DataOps accelerates the data pipeline