Securing and Personalizing Commerce Using Identity Data Mining

Securing & Personalizing Commerce Using Identity Data Mining
Jonathan LeBlanc, Developer Evangelist (PayPal)
Github: http://github.com/jcleblanc
Twitter: @jcleblanc

The Problem
Commerce Relies on Static Data Contributions

Premise
You can determine the personality profile of a person based on their usage habits
Personalization == Security

Technology was the Solution!

Then I Read This…
Us & Them: The Science of Identity
By David Berreby

The Different States of Knowledge
What a person knows
What a person knows they don’t know
What a person doesn’t know they don’t know

Technology was NOT the Solution
Identity and discovery are NOT a technology solution

Our Subject Material

Our Subject Material
HTML content is poorly structured
There are some pretty bad web practices on the interwebz
You can’t trust that anything semantically valid will be present

How We’ll Capture This Data
Start with base linguistics
Extend with available extras

The Basic Pieces
Page Data: Scrapey Scrapey
Keywords: Without all the fluff
Weighting: Word diets FTW

Capture Raw Page Data
Semantic data on the web is sucktastic
Assume 5 year olds built the sites
Language is the key

Extract Keywords
We now have a big jumble of words. Let’s extract
Why is “and” a top word? Stop words = sad panda

Weight Keywords
All content is not created equal
Meta and headers and semantics oh my!
This is where we leech off the work of others

Questions to Keep in Mind
Should I use regex to parse web content?
How do users interact with page content?
What key identifiers can be monitored to detect interest?

Fetching the Data: The Request
The Simple Way
    $html = file_get_contents('URL');
The Controlled Way
    $c = curl_init('URL');

Fetching the Data: cURL
    //$url and $header (bool: include response headers in the output) are set by the caller
    $req = curl_init($url);
    $options = array(
        CURLOPT_URL => $url,
        CURLOPT_HEADER => $header,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_AUTOREFERER => true,
        CURLOPT_TIMEOUT => 15,
        CURLOPT_MAXREDIRS => 10
    );
    curl_setopt_array($req, $options);
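The options above configure the request but never send it. A minimal sketch of the missing step, assuming a hypothetical helper named fetch_page() (the name, its signature, and the null-on-failure convention are not from the slides):

```php
<?php
// Hypothetical helper, not from the slides: applies the deck's cURL options,
// executes the request, and returns the body (or null on failure).
function fetch_page(string $url, bool $header = false): ?string
{
    $req = curl_init();
    curl_setopt_array($req, array(
        CURLOPT_URL            => $url,
        CURLOPT_HEADER         => $header, // include response headers in the output
        CURLOPT_RETURNTRANSFER => true,    // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_AUTOREFERER    => true,
        CURLOPT_TIMEOUT        => 15,
        CURLOPT_MAXREDIRS      => 10,
    ));

    // curl_exec() returns false on transport errors (DNS, timeout, bad URL).
    $page_content = curl_exec($req);

    return $page_content === false ? null : $page_content;
}
```

Note that an HTTP error page (404, 500) still comes back as a string here; checking curl_getinfo($req, CURLINFO_HTTP_CODE) before returning would be one way to reject those.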

    //list of findable / replaceable string characters
    $find = array('/\r/', '/\n/', '/\s\s+/');
    $replace = array(' ', ' ', ' ');

    //perform page content modification: drop scripts and styles, strip tags, normalize
    $mod_content = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $page_content);
    $mod_content = preg_replace('#<style(.*?)>(.*?)</style>#is', '', $mod_content);
    $mod_content = strip_tags($mod_content);
    $mod_content = strtolower($mod_content);
    $mod_content = preg_replace($find, $replace, $mod_content);
    $mod_content = trim($mod_content);

    //split into individual words and sort them
    $mod_content = explode(' ', $mod_content);
    natcasesort($mod_content);

    //set up list of stop words and the final found stopped list
    $common_words = array('a', ..., 'zero');
    $searched_words = array();

    //extract list of keywords with number of occurrences
    foreach ($mod_content as $word) {
        $word = trim($word);
        //discard anything containing non-letter characters
        if (preg_match('/[^a-zA-Z]/', $word) == 1) {
            $word = '';
        }
        if (strlen($word) > 2 && !in_array($word, $common_words)) {
            //start the counter at 0 on first sight to avoid an undefined-key warning
            $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
        }
    }
    arsort($searched_words, SORT_NUMERIC);

Scraping Site Meta Data
    //load scraped page data as a valid DOM document
    $dom = new DOMDocument();
    @$dom->loadHTML($page_content);

    //scrape title (guard against pages without a title tag)
    $titles = $dom->getElementsByTagName("title");
    $title = $titles->length > 0 ? $titles->item(0)->nodeValue : '';

    //loop through all found meta tags
    $metas = $dom->getElementsByTagName("meta");
    for ($i = 0; $i < $metas->length; $i++) {
        $meta = $metas->item($i);
        if ($meta->getAttribute("property")) {
            //Open Graph tags use the "property" attribute
            if ($meta->getAttribute("property") == "og:description") {
                $dataReturn["description"] = $meta->getAttribute("content");
            }
        } else {
            //standard meta tags use the "name" attribute
            if ($meta->getAttribute("name") == "description") {
                $dataReturn["description"] = $meta->getAttribute("content");
            } else if ($meta->getAttribute("name") == "keywords") {
                $dataReturn["keywords"] = $meta->getAttribute("content");
            }
        }
    }

Weighting Important Data
Tags you should care about: meta (include OG), title, description, h1+, header
Bonus points for adding in content location modifiers
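One way to turn tag importance into numbers is to add a per-source weight instead of a flat +1 for each accepted word. The sketch below assumes a hypothetical score_keywords() helper and a map of source name to extracted text; the weight values in the usage note are the ones this deck uses for its keyword weights, and treating unlisted sources (such as body text) as 1.0 is an assumption.

```php
<?php
// Hypothetical helper, not from the slides.
// $sources: source name => raw text found there, e.g. "header1" => "Running Shoes"
// $weights: source name => multiplier; sources without an entry count as 1.0
function score_keywords(array $sources, array $weights, array $common_words): array
{
    $scores = array();
    foreach ($sources as $source => $text) {
        $weight = (float) ($weights[$source] ?? 1.0);
        $words = preg_split('/\s+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            // Same filter as the keyword extraction step: letters only.
            if (preg_match('/[^a-z]/', $word) === 1) {
                continue;
            }
            if (strlen($word) > 2 && !in_array($word, $common_words, true)) {
                $scores[$word] = ($scores[$word] ?? 0.0) + $weight;
            }
        }
    }
    arsort($scores, SORT_NUMERIC);
    return $scores;
}
```

A call might look like score_keywords(array("keywords" => "shoes", "header1" => "Running Shoes", "body" => $body_text), array("keywords" => 3.0, "meta" => 2.0, "header1" => 1.5, "header2" => 1.2), $common_words). Comma-separated keyword lists would need their punctuation stripped first, since the letters-only filter drops tokens such as "shoes,".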

Weighting Important Tags
    //our keyword weights
    $weights = array("keywords" => 3.0, "meta" => 2.0, "header1" => 1.5, "header2" => 1.2);

    //add modifier here
    if (strlen($word) > 2 && !in_array($word, $common_words)) {
        $searched_words[$word] = ($searched_words[$word] ?? 0) + 1;
    }

Expanding to Phrases
2-3 adjacent words, making up a direct relevant callout
Seems easy right? Just like single words
Language gets wonky without stop words
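A rough sketch of phrase extraction, assuming a hypothetical extract_phrases() helper. To keep phrases readable, it leaves stop words inside a phrase and only skips phrases that start or end with one; that rule is an assumption, not something the slides specify.

```php
<?php
// Hypothetical helper, not from the slides: counts runs of $size adjacent words.
function extract_phrases(string $text, array $common_words, int $size = 2): array
{
    if ($size < 1) {
        return array();
    }

    // Split on anything that is not a letter; this also joins words across
    // punctuation, which a stricter version would treat as a phrase boundary.
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    $phrases = array();
    for ($i = 0; $i + $size <= count($words); $i++) {
        $chunk = array_slice($words, $i, $size);

        // Skip phrases that begin or end with a stop word ("shoes for", "are light").
        if (in_array($chunk[0], $common_words, true)
            || in_array($chunk[$size - 1], $common_words, true)) {
            continue;
        }

        $phrase = implode(' ', $chunk);
        $phrases[$phrase] = ($phrases[$phrase] ?? 0) + 1;
    }
    arsort($phrases, SORT_NUMERIC);
    return $phrases;
}
```

Calling it once with $size = 2 and once with $size = 3 covers the 2-3 word range; the resulting counts can be weighted the same way as single keywords.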

Working with Unknown Users
The majority of users won’t be immediately targetable
Use HTML5 LocalStorage & Cookie backup

Adding in Time Interactions
Interaction with a site does not necessarily mean interest in it
Time needs to also include an interaction component
Gift buying seasons see interest variations
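One possible way to combine time with interaction is to count only "active" time: the seconds between consecutive interaction events (scrolls, clicks, key presses), with long idle gaps capped. The sketch below assumes a hypothetical active_seconds() helper and a 30-second idle cap; neither the helper nor the threshold comes from the slides.

```php
<?php
// Hypothetical helper, not from the slides.
// $event_times: Unix timestamps (seconds) of interaction events on one page.
// $idle_after: a gap longer than this counts only up to the cap, so a tab
//              left open in the background does not look like interest.
function active_seconds(array $event_times, int $idle_after = 30): int
{
    $times = array_map('intval', $event_times);
    sort($times, SORT_NUMERIC);

    $active = 0;
    for ($i = 1; $i < count($times); $i++) {
        $gap = $times[$i] - $times[$i - 1];
        $active += min($gap, $idle_after);
    }
    return $active;
}
```

A page with no events, or a single event, scores 0 under this scheme, which matches the idea that time alone does not signal interest. Seasonal variation is not modeled here.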

Grouping Using Commonality
Interests: User A
Interests: User B
Interests: Common
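The overlap between two users can be sketched as a key intersection of their keyword maps. This assumes each user's interests are stored as keyword => score, as produced by the keyword extraction step, and uses a hypothetical common_interests() helper; taking the lower of the two scores as the shared strength is an assumption.

```php
<?php
// Hypothetical helper, not from the slides.
// $user_a, $user_b: keyword => score maps for two users.
function common_interests(array $user_a, array $user_b): array
{
    $common = array();
    // array_intersect_key() keeps entries of $user_a whose keys also exist in $user_b.
    foreach (array_intersect_key($user_a, $user_b) as $keyword => $score_a) {
        $common[$keyword] = min($score_a, $user_b[$keyword]);
    }
    arsort($common, SORT_NUMERIC);
    return $common;
}
```

Users whose common map is large relative to their individual maps are candidates for the same group.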

Thank You! Questions?
www.slideshare.com/jcleblanc
Jonathan LeBlanc, Developer Evangelist (PayPal)
Github: http://github.com/jcleblanc
Twitter: @jcleblanc
