Accelerate Insights and Streamline Ingestion with a Data Lake on AWS
Slides from a webinar with AWS looking at big data, data warehouses, data lakes, and lake houses, and how different types of data are treated to extract insights that lead to business outcomes.
This webinar explores how to accelerate insights and streamline ingestion with data lakes without compromising the productivity of your DevOps team. Presenters explain how a modern data lake can:
◦ Realize the elasticity of the cloud and reduce costs through consumption-based utilization
◦ Prepare your structured, unstructured, and semi-structured data for integration with analytics and machine learning tools
◦ Support data democratization and enable collaboration while maintaining security and data governance

Presenters:
Kanchan Waikar, Solutions Architect, AWS
Helen Beal, Chief Ambassador, DevOps Institute
Helen Beal is a ways of working coach, Chief Ambassador at DevOps Institute, and an ambassador for the Continuous Delivery Foundation. She is the Chair of the Value Stream Management Consortium and provides strategic advisory services to DevOps industry leaders such as Plutora and Moogsoft. She is also an analyst at Accelerated Strategies Group. She hosts the Day-to-Day DevOps webinar series for BrightTalk, speaks regularly on DevOps topics, is a DevOps editor for InfoQ, and writes for a number of other online platforms. She regularly appears in TechBeacon's DevOps Top 100 lists and was recognized as Top DevOps Evangelist 2020 in the DevOps Dozen awards. Herder of Humans. @bealhelen
Stream processing handles large data volumes and provides useful insights into the data prior to saving it to long-term storage.

Dimension                | Batch             | Stream
History                  | Traditional       | Modern
Data processing location | System of record  | Source
In event of failure      | Restart batch     | Retry increment
Pros                     | Simple and robust | Live, scalable, fault tolerant
Cons                     | Latency           | Complex, possibly expensive

Use cases for stream processing are found where systems handle big data volumes and where real-time results matter. If the value of the information contained in the data stream decreases rapidly as it ages, stream processing is appropriate, e.g.:
• Real-time analytics
• Anomaly, fraud, or pattern detection
• Complex event processing
• Real-time statistics and dashboards
• Real-time extract, transform, load (ETL)
• Implementing event-driven architectures (a minimal producer sketch follows this list)
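To make the streaming side concrete, here is a minimal sketch of publishing events to an Amazon Kinesis data stream with boto3. The stream name, region, and event shape are assumptions for illustration; the stream itself would be created separately.

```python
import json
import time
import boto3

# Kinesis client; the region is an assumption for this sketch
kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish_event(user_id: str, action: str) -> None:
    """Publish one clickstream event to a hypothetical 'clickstream' stream."""
    record = {"user_id": user_id, "action": action, "ts": time.time()}
    kinesis.put_record(
        StreamName="clickstream",              # assumed, pre-created stream
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=user_id,                  # keeps each user's events ordered
    )

publish_event("u-123", "add_to_cart")
```

Downstream, a consumer such as Kinesis Data Analytics or a Lambda function would read these records for real-time analytics, anomaly detection, or streaming ETL.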
• Use batch jobs to prepare large, bulk datasets for downstream analytics
• Avoid just lifting and shifting batch processing to AWS; use the opportunity to improve the service
• Automate and orchestrate everywhere
• Use Spot Instances to save on flexible batch processing jobs (see the sketch after this list)
• Continuously monitor and improve batch processing
• Use Redshift for data warehousing needs
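As a minimal sketch of the automation point, the snippet below submits a containerized job to AWS Batch with boto3. The job queue and job definition names are hypothetical; in practice the queue would be backed by a Spot compute environment to realize the cost savings mentioned above.

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")  # region assumed

# Submit a job to a hypothetical Spot-backed queue; the job definition
# points at a container image that runs the actual preparation script.
response = batch.submit_job(
    jobName="nightly-dataset-prep",
    jobQueue="spot-etl-queue",             # assumed Spot-backed job queue
    jobDefinition="prep-bulk-dataset:1",   # assumed registered job definition
    containerOverrides={
        "command": ["python", "prepare.py", "--date", "2021-06-01"],
    },
)
print("Submitted job:", response["jobId"])
```

A scheduler such as EventBridge or Step Functions could invoke this nightly, covering the "automate and orchestrate everywhere" guidance.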
Semi-structured data does not fit the formal structure of relational database systems but contains tags or other markers separating different elements and enabling search (self-describing); think JSON, CSV, XML.

Dimension                | Structured                              | Unstructured
Format                   | Defined                                 | Undefined
Type                     | Quantitative                            | Qualitative
Usually stored in        | Data warehouses                         | Data lakes
Search and analyze       | Easy                                    | More work
Database                 | RDBMS                                   | NoSQL
Programming language     | SQL                                     | Various
Analysis                 | Regression, classification, clustering  | Data mining, data stacking
Customer insights        | High level                              | Deeper insights into sentiment and behavior
Share of enterprise data | ~20%                                    | 80%+

Unstructured data includes:
• Documents
• Publications
• Reports
• Emails
• Social media
• Videos
• Images
• Audio
• Mobile activity
• Satellite imagery
• Sensors

(A sketch of querying semi-structured data in place follows this list.)
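One common way to make semi-structured data in a lake searchable is to define a table over raw JSON in S3 and query it with Amazon Athena. A minimal sketch with boto3 follows; the database, table, and results bucket are assumptions for illustration.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # region assumed

# Standard SQL over a table defined on JSON files in S3
# (e.g. via a Glue crawler); table and column names are hypothetical.
query = """
    SELECT sentiment, COUNT(*) AS mentions
    FROM social_media_events
    GROUP BY sentiment
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "datalake"},        # assumed database
    ResultConfiguration={
        "OutputLocation": "s3://my-athena-results/",       # assumed bucket
    },
)
print("Query execution id:", execution["QueryExecutionId"])
```

The results can then be polled with get_query_execution and fetched with get_query_results, feeding dashboards or ML feature pipelines.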
The team quickly builds a high-quality, device-friendly app on a cloud-based platform designed with developer and consumer usability in mind, and makes it available via self-service.

[Diagram: deep exploration, personalized insights, real-time queries, transparency]

Business users don't need to be data engineers or scientists to search, using natural language, for the answers they need and gain the insights that will enable them to make intelligent, data-driven business decisions.

[Diagram: context, anomaly detection, causal relationships, trend isolation, noise reduction, segmentation]
• Data needs to be mined and business intelligence analyzed at speed and with adaptability; systems from backlog to deployment must handle data needs.
• CI/CD and DevOps toolchains: Teams working with data need to leverage the power of automation to maximize throughput and stability, provide CI/CD capabilities, and limit blast radius.
• The Three Ways: We want to accelerate flow, amplify feedback, and use our data to drive experiments too. Monitoring and observability are key, with AI for feedback.
• A high-trust, collaborative culture: To build trust in a DevOps culture we have data-driven, not opinion-driven, conversations. Data must be available in real time, on demand, and via self-service.
• Value-stream-centric working: Truly understanding flow means all people in the value stream have a profound understanding of the end-to-end system, and this is driven by data insights.
• "We build it, we own it": Teams must be multifunctional, cross-skilling must be standard practice, and it must be quick and easy to get results from tools; choose those designed with usability in mind.
• Focus on value outcomes: Insights lead to decisions, which lead to measurable experience improvements for the customer; AI accelerates mean time to outcome (MTTO).
The 3 Vs:
• Big data comes from many different sources, in multiple formats, at variable speeds
• Remember the 3 Vs: volume, velocity, and variety; these demand scalability
• Different data has different needs

Data as a Service:
• Making data available centrally is key for efficient processing and access (see the sketch below)
• The objective is to make better business decisions
• Those business decisions must result in sublime customer experiences

Augmented Analytics:
• AI/ML and predictive analytics accelerate time to insight and time to outcome
• This makes more innovation time available to build differentiating features
• DataOps accelerates the data pipeline
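Making data available centrally usually starts with landing raw records in a partitioned S3 data lake so that Glue, Athena, and Redshift Spectrum can all find them. Below is a minimal ingestion sketch; the bucket name, prefix layout, and record shape are assumptions.

```python
import json
import uuid
import datetime
import boto3

s3 = boto3.client("s3")

def land_raw_record(source: str, payload: dict) -> None:
    """Write one raw record to a hypothetical central lake bucket,
    partitioned Hive-style by source and date for easy discovery."""
    today = datetime.date.today()
    key = (
        f"raw/source={source}/year={today.year}/"
        f"month={today.month:02d}/day={today.day:02d}/{uuid.uuid4()}.json"
    )
    s3.put_object(
        Bucket="my-data-lake",                  # assumed central bucket
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
    )

land_raw_record("crm", {"customer_id": 42, "event": "signup"})
```

Hive-style key prefixes (key=value) let a Glue crawler infer partitions automatically, which keeps downstream queries cheap by pruning to the relevant sources and dates.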