Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Data Science Discovery with SQL-MapReduce

Data Science Discovery with SQL-MapReduce

Talk by Donal Daly and Chris Hillman @Terdata at Data Science London meetup 21/02/13

Data Science London

February 26, 2013
Tweet

More Decks by Data Science London

Other Decks in Technology

Transcript

  1. DATA SCIENCE DISCOVERY WITH SQL-MAPREDUCE AND SQL Donal Daly Director,

    Teradata Aster CoE EMEA Chris Hillman Principal Data Scientist, Teradata 21st February 2013
  2. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 2 • 

    Questions… -  What are the key attributes of a Discovery platform? •  Why should you care? •  What does it mean from a data scientist perspective? -  How to run MapReduce jobs using standard SQL with Aster SQL- MapReduce -  Running SQL queries on Hadoop data with Aster SQL-H -  Integrating Aster metadata with Hadoop HCatalog •  Unified Data Architecture •  Donal’s Toolbox •  Teradata Aster in Practice -  IPTV Quality of Service Analysis -  Predictive Models -  Web Site Path Analysis •  Q & A Agenda Some Customer sensitive slides removed
  3. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 4 Big

    Data: From Transactions to Interactions BIG DATA WEB Petabytes CRM Terabytes Gigabytes ERP Exabytes Increasing Data Variety and Complexity User Generated Content Mobile Web SMS/MMS Sentiment External Demographics HD Video Speech to Text Product/ Service Logs Social Network Business Data Feeds User Click Stream Web Logs Offer History A/B Testing Dynamic Pricing Affiliate Networks Search Marketing Behavioral Targeting Dynamic Funnels Payment Record Support Contacts Customer Touches Purchase Detail Purchase Record Offer Details Segmentation Behavioral Analytics
  4. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 5 Big

    Data Comes with BIG HEADACHES Even free software like Hadoop is causing companies to spend more money…Many CIOs believe data is inexpensive because storage has become inexpensive. But data is inherently messy— it can be wrong, it can be duplicative, and it can be irrelevant—which means it requires handling, which is where the real expenses come in. “ ” Through 2015, 85% of Fortune 500 organizations will be unable to exploit big data for competitive advantage. “ ” Source: The Wall Street Journal. “CIOs’ Big Problem with Big Data”. Aug 2012 Source: Gartner. “Information Innovation: Innovation Key Initiative Overview”. April 2012
  5. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 6 Need

    for a Unified Data Architecture for New Insights Enabling Any User for Any Data Type from Data Capture to Analysis Discover and Explore Reporting and Execution in the Enterprise Capture, Store and Refine Audio/ Video Images Docs Text Web & Social Machine Logs CRM SCM ERP Java, C/C++, Python, R, SAS, SQL, Excel, BI, Visualization
  6. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 7 AUDIO

    & VIDEO IMAGES TEXT WEB & SOCIAL MACHINE LOGS CRM SCM ERP UNIFIED DATA ARCHITECTURE LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS Engineers Data Scientists Business Analysts Front-Line Workers Customers / Partners Quants Operational Systems Marketing
  7. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 8 AUDIO

    & VIDEO IMAGES TEXT WEB & SOCIAL MACHINE LOGS CRM SCM ERP INTEGRATED DATA WAREHOUSE UNIFIED DATA ARCHITECTURE LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS Engineers Data Scientists Business Analysts Front-Line Workers Customers / Partners Quants Operational Systems Marketing
  8. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 9 • 

    Single View of Your Business •  Cross-Functional Analysis •  Shared Source for Analytics •  Balance the Books •  Customer/Product Profitability •  Fast Time-to-Market For New Apps •  Mission Critical Service Levels Integrated Data Warehouse Requirements
  9. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 10 AUDIO

    & VIDEO IMAGES TEXT WEB & SOCIAL MACHINE LOGS CRM SCM ERP DISCOVERY PLATFORM INTEGRATED DATA WAREHOUSE UNIFIED DATA ARCHITECTURE LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS Engineers Data Scientists Business Analysts Front-Line Workers Customers / Partners Quants Operational Systems Marketing
  10. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 11 Discovery

    Platform Requirements All Data Structured Data Multi- Structured Data Non- Relational Data DISCOVERY DISCOVERY TOOLS USERS Discovery Platform Data Scientist SQL MapReduce Statistical Functions OLTP DBMS’s •  Doesn’t require extensive modeling •  Doesn’t balance the books •  Data completeness can be good enough •  No stringent SLAs Behavioral Analytics •  Fraud •  Customer •  Supply chain Data Analyst
  11. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 12 AUDIO

    & VIDEO IMAGES TEXT WEB & SOCIAL MACHINE LOGS CRM SCM ERP DISCOVERY PLATFORM INTEGRATED DATA WAREHOUSE UNIFIED DATA ARCHITECTURE LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS Engineers Data Scientists Business Analysts Front-Line Workers Customers / Partners Quants Operational Systems Marketing CAPTURE | STORE | REFINE
  12. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 13 • 

    Land/source operational data -  Only one extract from source system •  History or long term storage -  Low cost storage •  Preprocess data -  Sessionize data, remove XML tags •  Transformations -  Structured and semi-structured data Big Data Management Requirements Machine Generated Social Media
  13. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 14 AUDIO

    & VIDEO IMAGES TEXT WEB & SOCIAL MACHINE LOGS CRM SCM ERP DISCOVERY PLATFORM INTEGRATED DATA WAREHOUSE UNIFIED DATA ARCHITECTURE LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS Engineers Data Scientists Business Analysts Front-Line Workers Customers / Partners Quants Operational Systems Marketing CAPTURE | STORE | REFINE Big Data Analytics Big Data Management
  14. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 15 AUDIO

    & VIDEO IMAGES TEXT WEB & SOCIAL MACHINE LOGS CRM SCM ERP DISCOVERY PLATFORM INTEGRATED DATA WAREHOUSE UNIFIED DATA ARCHITECTURE LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS Engineers Data Scientists Business Analysts Front-Line Workers Customers / Partners Quants Operational Systems Marketing CAPTURE | STORE | REFINE Big Data Analytics Big Data Management
  15. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 18 A

    Look Inside the Aster Analytic Platform •  Integrated hardware and software appliance •  Also available as software only or cloud •  MPP Relational-data architecture •  SQL-MapReduce framework •  Analyze both non-relational + relational data •  Linear, incremental scalability •  70+ pre-built analytic modules •  Visual IDE; develop apps in hours •  Many programming languages Data Scientist Analyst Customers Business Users Your Analytics and Advanced Reporting Applications Store Process Develop Pattern Matching Java, C, Python, Perl… Graph Statistical ELT SQL Platform Services (e.g. query planning, dynamic workload management, security…) SQL-MapReduce Row Store Column Store
  16. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 19 Golden

    Path Analysis of Top Site Paths Identifying Top Pathing Occurrences (for any event of interest) 1. Select Name of Operator/Function 2. Select Data Sets for Input 3. Identify Pattern of Interest 4. Provide Pattern Definition 5. Define Output desired SELECT click_path, count(*) as path_frequency FROM nPath( ON clicks PARTITION BY user_id ORDER BY timestamp MODE( overlapping )‏ PATTERN(‘(RELEVANT|IGNORE)*.BUY’)‏ SYMBOLS( page_type IN (‘help.asp’) AS IGNORE, page_type NOT IN (‘help.asp’) AS RELEVANT, page_type = ‘checkout’ as BUY)‏ RESULT( accum( page_id of RELEVANT) as click_path ) ) T GROUP BY click_path ORDER BY count(*) desc LIMIT 10;
  17. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 20 Golden

    Path Analysis of Top Site Paths Identifying Top Pathing Occurrences (for any event of interest) 1. Select Name of Operator/Function 2. Select Data Sets for Input 3. Identify Pattern of Interest 4. Provide Pattern Definition 5. Define Output desired SELECT click_path, count(*) as path_frequency FROM nPath( ON clicks PARTITION BY user_id ORDER BY timestamp MODE( overlapping )‏ PATTERN(‘(RELEVANT|IGNORE)*.BUY’)‏ SYMBOLS( page_type IN (‘help.asp’) AS IGNORE, page_type NOT IN (‘help.asp’) AS RELEVANT, page_type = ‘checkout’ as BUY)‏ RESULT( accum( page_id of RELEVANT) as click_path ) ) T GROUP BY click_path ORDER BY count(*) desc LIMIT 10; 1. Select Name of Operator/Function FROM nPath
  18. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 21 Golden

    Path Analysis of Top Site Paths Identifying Top Pathing Occurrences (for any event of interest) 1. Select Name of Operator/Function 2. Select Data Sets for Input 3. Identify Pattern of Interest 4. Provide Pattern Definition 5. Define Output desired SELECT click_path, count(*) as path_frequency FROM nPath( ON clicks PARTITION BY user_id ORDER BY timestamp MODE( overlapping )‏ PATTERN(‘(RELEVANT|IGNORE)*.BUY’)‏ SYMBOLS( page_type IN (‘help.asp’) AS IGNORE, page_type NOT IN (‘help.asp’) AS RELEVANT, page_type = ‘checkout’ as BUY)‏ RESULT( accum( page_id of RELEVANT) as click_path ) ) T GROUP BY click_path ORDER BY count(*) desc LIMIT 10; 2.  Select Data Sets for Input ON clicks PARTITION BY user_id ORDER BY timestamp
  19. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 22 Golden

    Path Analysis of Top Site Paths Identifying Top Pathing Occurrences (for any event of interest) 1. Select Name of Operator/Function 2. Select Data Sets for Input 3. Identify Pattern of Interest 4. Provide Pattern Definition 5. Define Output desired SELECT click_path, count(*) as path_frequency FROM nPath( ON clicks PARTITION BY user_id ORDER BY timestamp MODE( overlapping )‏ PATTERN(‘(RELEVANT|IGNORE)*.BUY’)‏ SYMBOLS( page_type IN (‘help.asp’) AS IGNORE, page_type NOT IN (‘help.asp’) AS RELEVANT, page_type = ‘checkout’ as BUY)‏ RESULT( accum( page_id of RELEVANT) as click_path ) ) T GROUP BY click_path ORDER BY count(*) desc LIMIT 10; 3.  Identify Pattern of Interest PATTERN(‘(RELEVANT|IGNORE)*.BUY’)‏
  20. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 23 Golden

    Path Analysis of Top Site Paths Identifying Top Pathing Occurrences (for any event of interest) 1. Select Name of Operator/Function 2. Select Data Sets for Input 3. Identify Pattern of Interest 4. Provide Pattern Definition 5. Define Output desired SELECT click_path, count(*) as path_frequency FROM nPath( ON clicks PARTITION BY user_id ORDER BY timestamp MODE( overlapping )‏ PATTERN(‘(RELEVANT|IGNORE)*.BUY’)‏ SYMBOLS( page_type IN (‘help.asp’) AS IGNORE, page_type NOT IN (‘help.asp’) AS RELEVANT, page_type = ‘checkout’ as BUY)‏ RESULT( accum( page_id of RELEVANT) as click_path ) ) T GROUP BY click_path ORDER BY count(*) desc LIMIT 10; 4.  Provide Pattern Definition SYMBOLS( page_type IN (‘help.asp’) AS IGNORE, page_type NOT IN (‘help.asp’) AS RELEVANT, page_type = ‘checkout’ as BUY)‏
  21. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 24 Golden

    Path Analysis of Top Site Paths Identifying Top Pathing Occurrences (for any event of interest) 1. Select Name of Operator/Function 2. Select Data Sets for Input 3. Identify Pattern of Interest 4. Provide Pattern Definition 5. Define Output desired SELECT click_path, count(*) as path_frequency FROM nPath( ON clicks PARTITION BY user_id ORDER BY timestamp MODE( overlapping )‏ PATTERN(‘(RELEVANT|IGNORE)*.BUY’)‏ SYMBOLS( page_type IN (‘help.asp’) AS IGNORE, page_type NOT IN (‘help.asp’) AS RELEVANT, page_type = ‘checkout’ as BUY)‏ RESULT( accum( page_id of RELEVANT) as click_path ) ) T GROUP BY click_path ORDER BY count(*) desc LIMIT 10; 5.  Define Output desired GROUP BY click_path ORDER BY count(*) desc LIMIT 10;
  22. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 25 Example:

    Network Security Risk Analysis Proxy Log Pattern Detection (customized as needed) CREATE TABLE npath_results DISTRIBUTE BY HASH(asset_tag) AS SELECT *, last_ts-first_ts AS pattern_duration FROM nPath( ON ( select p.*, d.mac_address, d.asset_tag FROM proxy P JOIN dhcp d ON p.ip = d.ip AND p.timestamp BETWEEN d.lease_start AND d.lease_end WHERE p.timestamp BETWEEN ‘2011-10-10’ AND ‘2011-10-17’) PARTITION BY asset_tag ORDER BY timestamp MODE( nonoverlapping )‏ PATTERN(‘SN{1,3}.A*.BT{3,}.A*.BIGMAIL’)‏ SYMBOLS( ( url LIKE ‘%twitter.com%’ OR url LIKE ‘%facebook.com%’ ) AS SN, -- social network true AS A, (url LIKE ‘%piratebay%’ OR url LIKE ‘%torrentz%’ ) AS BT, -- bit torrent (url LIKE ‘%mail.yahoo.com%’ AND bytes_sent > 50000 ) AS BIGMAIL ) RESULT( FIRST (asset_tag OF SN) AS asset_tag, ACCUMULATE (p.ip OF ANY(SN,BT,BIGMAIL)) AS ips, ACCUMULATE (url OF SN) AS sn_urls, ACCUMULATE (url OF BT) AS bt_urls, FIRST (timestamp OF SN) AS first_ts, LAST (timestamp OF BIGMAIL) AS last_ts ) ) T ; SYMBOLS( ( url LIKE ‘%twitter.com%’ OR url LIKE ‘%facebook.com%’ OR url LIKE ‘%linkedin.com%’) AS SN, -- social network PATTERN(‘SN{1,3}.A*.BT{3,}.A*.BIGMAIL’) ‏
  23. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 26 70+

    SQL-MapReduce Packaged Functions Modules SQL-MapReduce Analytic Functions Path Analysis Discover patterns in rows of sequential data • nPath: complex sequential analysis for time series and behavioral patterns • Sessionization: identifies sessions from time series data in single pass • Attribution: operator to help ad networks and websites to distribute “credit”; options such as Uniform, Weighted and Exponential by occurrence or time. Graph and Relational Analysis Analyze patterns across rows of data • Graph analysis: finds shortest path from distinct node to all other nodes in graph • nTree: new function for performing operations on tree hierarchies. • Other: triangle finding, square finding, clustering coefficient
  24. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 27 70+

    SQL-MapReduce Packaged Functions Modules SQL-MapReduce Analytic Functions Text Analysis Derive patterns in textual data • Sentiment Analysis: classify content is positive or negative (for product review, customer feedback) • Text Categorization: used to label content as spam/not spam • Entity Extraction/Rules Engine: identify addresses, phone number, names from textual data • Text Processing: counts occurrences of words, identifies roots, & tracks relative positions of words & multi-word phrases • nGram: split an input stream of text into individual words and phrases • Levenshtein Distance: computes the distance between two words Data Transformation Transform data for more advanced analysis • Pivot: convert columns to rows or rows to columns • Log parser: Generalized tool for parsing Apache logs • Unpack: extracts nested data for further analysis • Pack: compress multi-column data into a single column • Antiselect: returns all columns except for specified column • Multicase: case statement that supports row match for multiple cases
  25. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 28 70+

    SQL-MapReduce Packaged Functions Modules SQL-MapReduce Analytic Functions Statistical Analysis High-performance processing of common statistical calculations • GLM: generalized linear model function that supports logistic, linear, log-linear regression models. Returns all parameters similar to R/SAS • Naïve Bayes Classifier: simple probabilistic classifier; applies Bayes Theorem to data sets. • Support Vector Machines: a supervised learning method for classification and regression analysis • PCA: Principal Component Analysis -transforms a set of observations into a set of uncorrelated variables. • Histogram: function to assign values to bins • Decision Trees: creates model of decisions and their possible implications • Approximate percentiles and distinct counts: calculates within specific variance • Correlation: characterizes the strength of the relation between different columns • Regression: linear/logistic regression btwn output variable & set of input variables • Averages: moving, weighted, exponential or volume-weighted averages
  26. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 29 70+

    SQL-MapReduce Packaged Functions Modules SQL-MapReduce Analytic Functions Cluster Analysis Discover natural groupings of data points • k-Means: clusters data into a specified number of groupings • Canopy: partitions data into overlapping subsets where k-means is performed • Minhash: buckets highly-dimensional items for cluster analysis • Basket analysis: creates configurable groupings of related items from transaction records in single pass • Collaborative Filter: predicts the interests of a user by collecting interest information from many users
  27. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 30 Aster

    SQL-H™ A Business User’s Bridge to Analyze Hadoop Data Aster SQL-H Gives Analysts and Data Scientists a Better Way to Analyze Data Stored in Hadoop •  Allow standard ANSI SQL access to Hadoop data •  Leverage existing BI tool and enable self service •  Enable 70+ prebuilt SQL-MapReduce Apps and IDE Hadoop Layer: HDFS Pig Hive Hadoop MR Aster: SQL-H HCatalog Data Data Filtering
  28. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 31 Aster

    SQL-H™ Analytic Query on Hadoop Data Data Discovery SQL-H™ Makes Data Accessible for Multi-Channel Analysis in Aster Discovery Platform ORIGINAL CALL CENTER DATA STORED IN HADOOP IVR Log Files RECORDS TRANSFORMED IN HADOOP Converted to HCatalog SQL-H
  29. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 32 SQL-H

    In Action Join Teradata, Hadoop, Aster tables; feed into Map Reduce SELECT qrd_focus_area, count(*) FROM nPath( ON ( SELECT * FROM ( SELECT * FROM load_from_teradata( ON mr_driver TDPID(‘dbc’) USERNAME(‘name1’) PASSWORD(‘password1’) QUERY(‘SELECT * FROM owner.prod_own_fact’) ) ) AS td JOIN owner.prod_dim proddim ON td.prod_id = proddim.product_id JOIN ( SELECT * FROM load_from_hadoop( ON mr_driver SERVER ('10.10.3.139') USERNAME (‘name2') DBNAME (‘repair') TABLENAME ('transaction') ) ) AS sqlh ON sqlh.prod_ident_nbr = proddim.id ) PARTITION BY party_id, prod_id ORDER BY repair_dt MODE (OVERLAPPING) PATTERN ( ‘REPAIR{3}' ) SYMBOLS ( event = ‘REPAIR’ AS REPAIR ) RESULT (ACCUMULATE(qrd_focus_area OF ANY(REPAIR)) AS qrd_focus_area_path ) ) n GROUP BY 1 ORDER BY 2 desc ; SQL manipulation TD Connector to get OWNERSHIP data Any path analytics you want, specified with the power of regular expressions! Hadoop Connector to get WARRANTY data JOIN in local Aster tables nPath MapReduce
  30. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 33 AUDIO

    & VIDEO IMAGES TEXT WEB & SOCIAL MACHINE LOGS CRM SCM ERP DISCOVERY PLATFORM INTEGRATED DATA WAREHOUSE UNIFIED DATA ARCHITECTURE LANGUAGES MATH & STATS DATA MINING BUSINESS INTELLIGENCE APPLICATIONS Engineers Data Scientists Business Analysts Front-Line Workers Customers / Partners Quants Operational Systems Marketing CAPTURE | STORE | REFINE Aster Loader Teradata Loader Aster to Teradata SQL-H SQL-H
  31. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 34 • 

    Tightly aligned with core Apache code lines •  All code committed back to open source •  Engineered integration with Teradata Viewpoint and Ambari •  HCatalog - centralized metadata services for easy data sharing •  Dependable full stack high availability •  Capacity scheduler for better multi-tenancy •  Intuitive graphical data integration tools The ONLY 100% open source data platform for Hadoop Hortonworks Data Platform (HDP)
  32. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 36 • 

    nCluster_loader to load data into Aster nCluster -  Load data in its original format as much as possible -  Mac OSX, Linux, Windows and Unix clients •  Teradata Studio 14.10 to work easily with SQL / SQL-MR •  An Internet Browser to visualize the final result -  Administration – Aster Management Console / Ganglia •  My Sandbox -  Teradata Aster Express running on my MacBook Pro (8GB RAM) -  HDP Sandbox Donal’s Toolbox
  33. Advanced Analytics •  IPTV Quality of Service Analysis •  Predictive

    Models •  Web Site Path Analysis EXAMPLES…
  34. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 41 create

    table wrk.cih_dshb_ads as SELECT srv_id, sav_flag, offer, inseecode, code_postal, libelle, nom_dep, nom_region, longitude, latitude, coalesce(topo_nra, 'Unknown') as topo_nra, topo_dslam, coalesce(iad_hardwareversion, 'Unknown') as iad_hardwareversion, coalesce(iad_manufacturer, 'Unknown') as iad_manufacturer, coalesce(iad_modelname , 'Unknown') as iad_modelname, coalesce(iad_modemfirmwareversion , 'Unknown') as iad_modemfirmwareversion, coalesce(iad_productclass , 'Unknown') as iad_productclass, coalesce(iad_provisioningcode , 'Unknown') as iad_provisioningcode, coalesce(iad_softwareversion , 'Unknown') as iad_softwareversion, coalesce(iad_vendorconfigfiledescription_1 , 'Unknown') as iad_vendorconfigfiledescription_1, coalesce(iad_vendorconfigfilename_1 , 'Unknown') as iad_vendorconfigfilename_1, coalesce(iad_vendorconfigfilenumbofentries , 0) as iad_vendorconfigfilenumbofentries, coalesce(iad_vendorconfigfileversion_1 , 'Unknown') as iad_vendorconfigfileversion_1, coalesce(iad_x_000e50_boardversion , 'Unknown') as iad_x_000e50_boardversion, coalesce(stb_description , 'Unknown') as stb_description, coalesce(stb_devicestatus , 'Unknown') as stb_devicestatus, coalesce(stb_gwinfoproductclass , 'Unknown') as stb_gwinfoproductclass, coalesce(stb_hardwareversion , 'Unknown') as stb_hardwareversion, coalesce(stb_manufacturer , 'Unknown') as stb_manufacturer, coalesce(stb_productclass , 'Unknown') as stb_productclass, coalesce( stb_softwareversion, 'Unknown') as stb_softwareversion, dev_iad_uptime_diff,dsl_showtime_diff,dev_stb_uptime_diff, kpi_iad_uptime,kpi_iad_synctime,kpi_stb_uptime, dev_iad_uptime,dsl_showtime,dev_stb_uptime, dsl_downstr_att,dsl_downstr_cur,dsl_downstr_max, kpi_voip_nb_dropped_calls_diff,kpi_voip_nb_dropped_calls,kpi_dsl_nb_crc,kpi_dsl_dscurrate_ratio_qualite, kpi_voip_tx_appels_coupes,kpi_voip_qualite,kpi_voip_qualite_diff,kpi_iptv_plr_nb_bon,kpi_iptv_plr_nb_moyen, ,kpi_iptv_conso_heures,kpi_iptv_packetslosts,kpi_iptv_packetsreceived, kpi_dsl_dscurrate_before,kpi_dsl_dscurrate_after, FROM wrk.cih_dshb_bis where network = 'BYT' and stb_manufacturer is not null and topo_dslam is not null Input Data
  35. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 42 SELECT

    * FROM forest_drive (ON (SELECT 1) PARTITION BY 1 DATABASE('beehive') USERID('beehive') PASSWORD('beehive') INPUTTABLE('wrk.cih_dshb_tree_in') OUTPUTTABLE('wrk.cih_dshb_tree_out') RESPONSE('sav_flag') NUMERICINPUTS(‘KPI_SIGNAL') CATEGORICALINPUTS('offer', 'nom_dep', 'nom_region', 'topo_nra','topo_dslam' , 'iad_modemfirmwareversion', 'iad_vendorconfigfiledescription_1', 'iad_x_000e50_boardversion', 'stb_description', 'stb_productclass', 'stb_softwareversion', 'topo_dslam_brand') NUMTREES(4) ) Decision Trees
  36. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 43 CREATE

    TABLE wrk.cih_dshb_model (PARTITION KEY(class)) AS SELECT * FROM naiveBayesReduce( ON(SELECT * FROM naiveBayesMap( ON (select * from wrk.cih_dshb_ads_in_11 where kpi_iad_uptime is not null) RESPONSE('sav_flag') NUMERICINPUTS('dev_iad_uptime','dsl_showtime','dev_stb_uptime', 'dsl_downstr_att','dsl_downstr_cur','dsl_downstr_max', 'kpi_voip_nb_dropped_calls_diff','kpi_voip_nb_dropped_calls','kpi_dsl_nb_crc','kpi_dsl_d scurrate_ratio_qualite','kpi_voip_tx_appels_coupes','kpi_voip_qualite','kpi_voip_qualite_ diff','kpi_iptv_plr_nb_bon','kpi_iptv_plr_nb_moyen','kpi_iptv_plr_nb_mauvais', 'kpi_iptv_packetslosts','kpi_iptv_packetsreceived','kpi_stb_uptime','kpi_iad_synctime','kp i_iad_uptime') CATEGORICALINPUTS('offer', 'nom_dep', 'nom_region', 'topo_nra','topo_dslam' , 'iad_modemfirmwareversion','iad_vendorconfigfiledescription_1','iad_x_000e50_boardver sion', 'stb_description', 'stb_productclass', 'stb_softwareversion', 'topo_dslam_brand') ) )PARTITION BY class ); Naïve Bayes
  37. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 44 create

    table wrk.cih_svm_train2 distribute by hash(srv_id) as select srv_id, 'topo_nra_insee' as attr, topo_nra_insee::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train union all select srv_id, 'code_postal' as attr, code_postal::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train union all select srv_id, 'kpi_iad_uptime_avg' as attr, kpi_iad_uptime_avg::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train union all select srv_id, 'dev_iad_uptime_diff_avg' as attr, dev_iad_uptime_diff_avg::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train union all select srv_id, 'kpi_voip_nb_dropped_calls_diff_avg' as attr, kpi_voip_nb_dropped_calls_diff_avg::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train union all select srv_id, 'sav_nb_contacts' as attr, sav_nb_contacts::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train union all select srv_id, 'nb_tr' as attr, nb_tr::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train union all select srv_id, 'kpi_dsl_nb_crc_avg' as attr, kpi_dsl_nb_crc_avg::varchar as attr_value, sav_all_tgt FROM wrk.cih_sav_train; /*Run SVM*/ CREATE TABLE wrk.cih_svm_model3 (PARTITION KEY(vec_index)) AS SELECT vec_index, avg(vec_value) as vec_value FROM svm( ON wrk.cih_svm_train2 PARTITION BY srv_id OUTCOME( 'sav_flag' ) ATTRIBUTE_NAME( 'attr' ) ATTRIBUTE_VALUE( 'attr_value' ) )GROUP BY vec_index; Support Vector Machine
  38. Confidential and proprietary. Copyright © 2012 Teradata Corporation. 47 Teradata

    Confidential 21/02/2013 Questions and Answers Thank you! www.asterdata.com/download_aster_express/ [email protected]