EXTENT Trading Technology Trends & Quality Assurance Conference in Obninsk, 2 March, 2013 Managing Uncertain Data at Scale Nikolay Marin
Managing Uncertain Data at Scale
Nikolay Marin
© 2013 IBM Corporation

Managing Uncertain Data at Scale

Trend: Most of the world’s analyzed data will be uncertain
- By 2015, 80% of the world’s data will be uncertain
- Uncertain data management requires new techniques
- These techniques are necessary for real-world Big Data Analytics

Opportunity: Business leadership using Big Data Analytics
- Robust, business-aware uncertain data management
- Use analytics over uncertain web, sensor, and human-generated data
- Enable good business decisions by understanding analysis confidence

Challenge: Taking Big Data Analytics into an uncertain world
- Analysis of text is highly nuanced; sensor-based data is imprecise
- Timely business decisions require efficient large-scale analytics
- It is more difficult to obtain insight about an individual than a group, especially if the source data is uncertain

The fourth dimension of Big Data: Veracity – handling data in doubt

Volume (Data at Rest): Terabytes to exabytes of existing data to process
Velocity (Data in Motion): Streaming data, milliseconds to seconds to respond
Variety (Data in Many Forms): Structured, unstructured, text, multimedia
Veracity* (Data in Doubt): Uncertainty due to data inconsistency & incompleteness, ambiguities, latency, deception, model approximations

* Truthfulness, accuracy or precision, correctness

Uncertainty arises from many sources

Process Uncertainty: Processes contain “randomness”
- Uncertain travel times
- Semiconductor yield

Data Uncertainty: Data input is uncertain
- Text entry: intended spelling vs. actual spelling
- GPS uncertainty
- Rumors
- Contaminated?
- Ambiguity: {Paris Airport}
- Conflicting data: {John Smith, Dallas} vs. {John Smith, Kansas}
- Testimony

Model Uncertainty: All modeling is approximate
- Forecasting a hurricane (www.noaa.gov)
- Fitting a curve to data

By 2015, 80% of all available data will be uncertain

[Chart: Global Data Volume in Exabytes (axis 0 to 9000) and Aggregate Uncertainty % (axis 0 to 100) for 2005, 2010, and 2015, with series for Enterprise Data, VoIP, Social Media, and Sensors. Multiple sources: IDC, Cisco]

Enterprise Data
- Data quality solutions exist for enterprise data like customer, product, and address data, but this is only a fraction of the total enterprise data.

Sensors (Internet of Things)
- By 2015 the number of networked devices will be double the entire global population. All sensor data has uncertainty.

Social Media (video, audio and text)
- The total number of social media accounts exceeds the entire global population. This data is highly uncertain in both its expression and content.

How to reduce uncertainty in processes, models, and data
Requires specific business process and industry context

Constructing context for better understanding
- Extract as much information as feasible from each source
- Combine (condense) data from multiple sources
- More data from more sources is better
  – Gathers more evidence for statistical methods

Using statistical methods scaled for Big Data
- Stochastic techniques efficiently reason about uncertainty
- Monte Carlo techniques explore many possible scenarios in order to gain insight

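To make the Monte Carlo bullet concrete, here is a minimal sketch that is not taken from the slides. It assumes a hypothetical three-segment route where each segment's travel time is only known as a (best, likely, worst) range, samples many possible trips, and reports how often the trip meets a deadline. The route data, the triangular distribution, and the function names are all illustrative assumptions.

```python
import random
import statistics

def simulate_trip(segments, rng):
    """Draw one possible total travel time (minutes) for a route.

    Each segment is (best, likely, worst). The triangular distribution is an
    assumption chosen for simplicity, not something the slides prescribe.
    """
    # random.triangular takes (low, high, mode)
    return sum(rng.triangular(best, worst, likely)
               for best, likely, worst in segments)

def monte_carlo_on_time(segments, deadline, n_scenarios=10_000, seed=42):
    """Estimate P(total time <= deadline) plus summary statistics by sampling."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    totals = [simulate_trip(segments, rng) for _ in range(n_scenarios)]
    on_time = sum(t <= deadline for t in totals) / n_scenarios
    return {
        "p_on_time": on_time,
        "mean": statistics.fmean(totals),
        # last of the 19 cut points for n=20 is the 95th percentile
        "p95": statistics.quantiles(totals, n=20)[-1],
    }

# Hypothetical route: (best, likely, worst) minutes per segment.
ROUTE = [(10, 12, 20), (25, 30, 50), (5, 6, 15)]

if __name__ == "__main__":
    print(monte_carlo_on_time(ROUTE, deadline=60))
```

The point of the sketch is that the answer is a probability and a spread rather than a single number, which is the kind of "analysis confidence" the earlier slides ask for. Adding more scenarios tightens the estimate at the cost of more computation.
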
Statistical techniques reduce uncertainty in analytical models

Trouble tickets: help agent find similar tickets
- Improve suggestions for similar problems using corroborating data and better mathematical techniques
- Analyze all the data – do not subset
- Use related techniques to automate Level 1 support, finding problem clusters, etc.

Use stochastic search to find trouble tickets that are similar

Trouble ticket attributes
- Some attributes such as server type are precise
- Other attributes such as words in trouble tickets may be imprecise indicators of the problem

Model approximation
- Treat N attributes as N dimensions in space
- Model similarity as closeness in the N dimensional space

Prediction
- Improve predictability by getting agent feedback

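The "N attributes as N dimensions" idea can be sketched as follows. This is an illustration under stated assumptions, not the method from the talk: tickets are hypothetical dictionaries with a `server_type` and a free-text `text` field, closeness is measured with cosine similarity, the weight of 2.0 on the precise attribute is arbitrary, and the search is exhaustive rather than stochastic to keep the example short.

```python
import math
from collections import Counter

def ticket_vector(ticket):
    """Map a ticket to a sparse vector: one dimension per attribute value or word."""
    vec = Counter()
    # Precise attribute: server type gets its own dimension with a higher
    # weight (the 2.0 is an assumed value, chosen only for illustration).
    vec["server_type=" + ticket["server_type"]] += 2.0
    # Imprecise attributes: each word in the description is a weak indicator.
    for word in ticket["text"].lower().split():
        vec["word=" + word] += 1.0
    return vec

def cosine_similarity(a, b):
    """Closeness of two sparse vectors: 1.0 means same direction, 0.0 no overlap."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(query, tickets, top_k=2):
    """Return the top_k (similarity, ticket id) pairs, most similar first."""
    q = ticket_vector(query)
    scored = [(cosine_similarity(q, ticket_vector(t)), t["id"]) for t in tickets]
    return sorted(scored, reverse=True)[:top_k]

# Hypothetical ticket data.
TICKETS = [
    {"id": "T1", "server_type": "linux", "text": "disk full on database server"},
    {"id": "T2", "server_type": "windows", "text": "user cannot login after password reset"},
    {"id": "T3", "server_type": "linux", "text": "database server out of disk space"},
]
QUERY = {"id": "Q", "server_type": "linux", "text": "database disk full"}

if __name__ == "__main__":
    for score, ticket_id in most_similar(QUERY, TICKETS):
        print(ticket_id, round(score, 3))
```

Comparing the query against every ticket does not scale to very large ticket stores; a randomized or approximate nearest-neighbor search would be one way to follow the slide's "stochastic search" suggestion while keeping the same vector model. Agent feedback could then be used to adjust the attribute weights.
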
Analytics is broadly defined as the use of data and computation to make smart decisions

Data
- Historical, Simulated
- Text, Video, Images, Audio

- Data instances
- Reports and queries on data aggregates
- Predictive models
- Answers and confidence
- Feedback and learning

Decision point
- Possible outcomes: Option 1, Option 2, Option 3

Future of Analytics

Explosion of unstructured data
- Creates new analytics opportunities
- Addresses new enterprise needs

Consistent, extensible, and consumable analytics platform
- Reduces cost-to-value for enterprises
- Increases analytics solution coverage with limited supply of skills

Optimizing across the stack to deploy analytics at scale
- Analytics becomes a dominant IT workload and drives HW design
- Opportunity to seamlessly scale from terascale to exascale

Analytics toolkits will be expanded to support ingestion and interpretation of unstructured data, and enable adaptation and learning
Extended from: Competing on Analytics, Davenport and Harris, 2007

Chart labels: Traditional, New Methods, New Data

Report
- Standard Reporting: Real time, visualizations, user interaction
- Ad hoc Reporting: Query by example, user defined reports
- Query/Drill Down: In memory data, fuzzy search, geo spatial
- Alerts: Rules/triggers, context sensitive, complex events

Understand and Predict
- Forecasting: Larger data sets, nonlinear regression
- Simulation: High fidelity, games, data farming
- Predictive Modeling: Causality, probabilistic, confidence levels

Decide and Act
- Optimization: Decision complexity, solution speed
- Optimization under Uncertainty: Quantifying or mitigating risk

Learn (in the context of the decision process)
- Adaptive Analysis: Responding to context
- Continual Analysis: Responding to local change/feedback

Collect and Ingest/Interpret (decide what to count; enable accurate counting)
- Entity Resolution: People, roles, locations, things
- Relationship, Feature Extraction: Rules, semantic inferencing, matching
- Annotation and Tokenization: Automated, crowd sourced

Finally... what about a longer-term view, say the next 10-50 years?

1. Artificial Intelligence
2. Nano "everything"
3. Cognitive Computing
4. Deep (Exascale) Computing
5. Atomic & Quantum Computing
6. Human / Computer Interaction
7. Machine to Machine Interaction
8. BioTech / Human Augmentation
9. Robots & Robotics
10. Advanced / Predictive Analytics
11. Security & Privacy
12. 3-D Printing
13. Video-enabled Business Processes
14. Personalized Web/Assistants
15. Ubiquitous Computing
16. Gaming
17. Simulation
18. Virtual Computing (including virtual worlds, tele-presence, etc.)
19. Augmented Reality

IBM Academy of Technology and Global Technology Outlook can help you find some answers