Slide 1

Slide 1 text

Understanding Language & Fixing WP Search WordCamp Providence 2014

Slide 2

Slide 2 text

Xiao Yu, Code Wrangler, Automattic @HypertextRanch [email protected] xyu.io xyu

Slide 3

Slide 3 text

Jetpack Related Posts

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

Another Yet Another Related Posts Plugin?

Slide 8

Slide 8 text

Because finding related content is a hard problem to solve.

Slide 9

Slide 9 text

Natural Language Processing & Search Ranking

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

WordPress on NGINX + HHVM with Heroku Buildpacks

It’s been a year since I last made any major changes to my WordPress on Heroku build, and in tech years that’s a lifetime. Since then, Heroku has released a new PHP buildpack with nginx and HHVM built in. Much progress has also been made in both HHVM and WordPress to make them compatible with each other. So now seems as good a time as any to update the stack this site is running on. Without further ado, I’d like to introduce Heroku WP, a template for HHVM-powered WordPress served by nginx.

The Goal

There are numerous other templates out there for running WordPress on Heroku. My main goals for this template are:

It should be simple: use the default buildpack provided by Heroku so there’s no other third-party dependency to implicitly trust or maintain.
It should be fast: use the latest technologies available to squeeze every last ounce of performance out of each Heroku Dyno.
It should be secure: security is not an add-on; admin pages should be secure by default and database connections need to be encrypted.
It should scale: just because we can serve millions of page hits a day off a single Heroku Dyno does not mean we’ll stop there. The template should be made with cloud architecture in mind so that the number of Dynos can scale up and down without breaking.

The Stack

Standing on the shoulders of giants, I was able to use the latest Heroku buildpack and get WordPress running on:

NGINX: an event-driven web server that was engineered for the modern day to replace Apache. This high-performance web server is preferred by more of the top 1,000 sites than any other, and it’s what’s used by the largest WordPress install out there, WordPress.com.
HHVM: HipHop Virtual Machine, a JIT (just-in-time) compiler developed by Facebook to run PHP scripts, which when tested with WordPress showed up to a 2x improvement.

I have yet to run any statistical analysis on performance; however, anecdotally it feels a lot faster navigating WP admin, and page generation times look much better. I’m looking forward to running more tests and performance tuning this build in the coming weeks.

Update: While still not a head-to-head test, the response times reported by StatusCake for this site running on Heroku-WP, compared with a mirror of this site running on the old Heroku LAMP stack with no load other than StatusCake pings, show a dramatic improvement.

Slide 12

Slide 12 text

SELECT COUNT(*)
 FROM wp_posts
 WHERE post_content LIKE "%WordPress%";

SELECT COUNT(*)
 FROM wp_posts
 WHERE post_content LIKE "%on%";

SELECT COUNT(*)
 FROM wp_posts
 WHERE post_content LIKE "%NGINX%";

…

SELECT COUNT(*)
 FROM wp_posts
 WHERE post_content LIKE "%improvement%";

Slide 13

Slide 13 text

NOPE

Slide 14

Slide 14 text

Natural Language Processing & Search Ranking

Slide 15

Slide 15 text

(Blog post excerpt repeated from Slide 11.)

Slide 16

Slide 16 text

(Blog post excerpt repeated from Slide 11.)

Slide 17

Slide 17 text

SELECT *
 FROM wp_posts
 WHERE
 post_content LIKE "%WordPress%" OR
 post_content LIKE "%NGINX%" OR
 post_content LIKE "%HHVM%" OR
 post_content LIKE "%Heroku%" OR
 post_content LIKE "%performance%"
 


Slide 18

Slide 18 text

SELECT *
 FROM wp_posts
 WHERE
 post_content LIKE "%WordPress%" OR
 post_content LIKE "%NGINX%" OR
 post_content LIKE "%HHVM%" OR
 post_content LIKE "%Heroku%" OR
 post_content LIKE "%performance%"
 ORDER BY
 !?

Slide 19

Slide 19 text

Natural Language Processing & Search Ranking

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

“Let’s just use tags & categories.” – Some Plugin Developer

Slide 22

Slide 22 text

“Everyone tags & categorizes everything perfectly.” – Some SEO Consultant’s Dream

Slide 23

Slide 23 text

NOPE

Slide 24

Slide 24 text

Ok, it's a hard problem to solve.

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

elasticsearch

Slide 27

Slide 27 text

“Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine.” – elasticsearch.org

Slide 28

Slide 28 text

2 Data Stores

Slide 29

Slide 29 text

2 Data Stores — more complexity

Slide 30

Slide 30 text

2 Data Stores — more complexity 2 Data Stores — more points of failure

Slide 31

Slide 31 text

2 Data Stores — more complexity 2 Data Stores — more points of failure 2 Data Stores — more cost

Slide 32

Slide 32 text

Worth it?

Slide 33

Slide 33 text

No content

Slide 34

Slide 34 text

 ❤️

Slide 35

Slide 35 text

Open & Inclusive  ❤️

Slide 36

Slide 36 text

Allows Us to Tinker & Inclusive  ❤️

Slide 37

Slide 37 text

Allows Us to Tinker & Optimized to Run Anywhere*  ❤️ * Almost

Slide 38

Slide 38 text

Openness & Inclusiveness  ❤️

Slide 39

Slide 39 text

Expressing Ourselves with Written Language  ❤️

Slide 40

Slide 40 text

howdy  ❤️

Slide 41

Slide 41 text

مرحبا (Arabic for “hello”)  ❤️

Slide 42

Slide 42 text

 ❤️

Slide 43

Slide 43 text

Written Language  ❤️

Slide 44

Slide 44 text

Understands Strings

Slide 45

Slide 45 text

“I almost ran into a swarm of baby ducks this morning…” – My Blog to a Human

Slide 46

Slide 46 text

0000000 49 20 61 6c 6d 6f 73 74 20 72 61 6e 20 69 6e 74
0000010 6f 20 61 20 73 77 61 72 6d 20 6f 66 20 62 61 62
0000020 79 20 64 75 63 6b 73 20 74 68 69 73 20 6d 6f 72
0000030 6e 69 6e 67 e2 80 a6 …
– My Blog to MySQL
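A minimal Python sketch of the same point: to a string-matching data store, the post is only a sequence of UTF-8 bytes. The helper name `hexdump` is an assumption made for illustration, and its output is formatted to resemble the dump on the slide.

```python
# The sentence from the slide, without the surrounding quotation marks.
text = "I almost ran into a swarm of baby ducks this morning…"
raw = text.encode("utf-8")


def hexdump(data, width=16):
    """Return hex dump lines: a 7-digit hex offset followed by the bytes."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_bytes = " ".join(f"{byte:02x}" for byte in chunk)
        lines.append(f"{offset:07x} {hex_bytes}")
    return lines


for line in hexdump(raw):
    print(line)
```

The last three bytes, `e2 80 a6`, are the UTF-8 encoding of the single ellipsis character that ends the sentence.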

Slide 47

Slide 47 text

“Man, I wonder what fantastic insights Xiao has about ducks.” – Some Awesome Human

Slide 48

Slide 48 text

SELECT *
 FROM wp_posts
 WHERE post_content LIKE "%ducks%"

Slide 49

Slide 49 text

0000000 49 20 61 6c 6d 6f 73 74 20 72 61 6e 20 69 6e 74
 0000010 6f 20 61 20 73 77 61 72 6d 20 6f 66 20 62 61 62
 0000020 79 20 64 75 63 6b 73 20 74 68 69 73 20 6d 6f 72
 0000030 6e 69 6e 67 e2 80 a6 …

Slide 50

Slide 50 text

0000000 49 20 61 6c 6d 6f 73 74 20 72 61 6e 20 69 6e 74
 0000010 6f 20 61 20 73 77 61 72 6d 20 6f 66 20 62 61 62
 0000020 79 20 64 75 63 6b 73 20 74 68 69 73 20 6d 6f 72
 0000030 6e 69 6e 67 e2 80 a6 …

Slide 51

Slide 51 text

“I almost ran into a swarm of baby ducks this morning…”

Slide 52

Slide 52 text

“I almost ran into a swarm of baby ducks this morning…”

SELECT *
 FROM wp_posts
 WHERE post_content LIKE "%running%"

Slide 53

Slide 53 text

“I almost ran into a swarm of baby ducks this morning…”

SELECT *
 FROM wp_posts
 WHERE post_content LIKE "%running%"

Slide 54

Slide 54 text

RUNNING != RAN
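The mismatch can be reproduced in a few lines. This is a sketch only: it assumes an in-memory SQLite database as a stand-in for MySQL, a cut-down `wp_posts` table with just two columns, and a hypothetical helper named `count_like`.

```python
import sqlite3

# In-memory SQLite database standing in for the WordPress MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY, post_content TEXT)")
conn.execute(
    "INSERT INTO wp_posts (post_content) VALUES (?)",
    ("I almost ran into a swarm of baby ducks this morning…",),
)


def count_like(term):
    """Count posts whose content contains `term` as a raw substring."""
    row = conn.execute(
        "SELECT COUNT(*) FROM wp_posts WHERE post_content LIKE ?",
        (f"%{term}%",),
    ).fetchone()
    return row[0]


print(count_like("ducks"))    # substring is present in the post
print(count_like("running"))  # substring is absent, even though the post says "ran"
```

Substring matching finds "ducks" because those exact characters are stored, but it has no notion that "running" and "ran" are forms of the same word.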

Slide 55

Slide 55 text

Understands Language

Slide 56

Slide 56 text

Analyzing Textual Content

Slide 57

Slide 57 text

Elasticsearch Analyzer Chain
Raw Text → Character Filters → Tokenizer → Token Filters → Terms

Slide 58

Slide 58 text

Elasticsearch Analyzer Chain
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
“The über-quick brown fox jumps over the lazy dogs.”

Slide 59

Slide 59 text

Elasticsearch Analyzer Chain Raw Text → Character Filters → Tokenizer → Token Filters → Terms


 The über-quick brown fox
 jumps over the lazy dogs.


Slide 60

Slide 60 text

Elasticsearch Analyzer Chain Raw Text → Character Filters → Tokenizer → Token Filters → Terms


 The über-quick brown fox
 jumps over the lazy dogs.


Slide 61

Slide 61 text

Elasticsearch Analyzer Chain Raw Text → Character Filters → Tokenizer → Token Filters → Terms 
 The über-quick brown fox
 jumps over the lazy dogs.


Slide 62

Slide 62 text

Elasticsearch Analyzer Chain Raw Text → Character Filters → Tokenizer → Token Filters → Terms 
 The über—quick brown fox 
 jumps over the lazy dogs.


Slide 63

Slide 63 text

Elasticsearch Analyzer Chain Raw Text → Character Filters → Tokenizer → Token Filters → Terms 
 The über quick brown fox
 jumps over the lazy dogs


Slide 64

Slide 64 text

Elasticsearch Analyzer Chain
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
[The] [über] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dogs]


Slide 65

Slide 65 text

Elasticsearch Analyzer Chain
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
[The] [über] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dogs]


Slide 66

Slide 66 text

Elasticsearch Analyzer Chain
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
[the] [über] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dogs]


Slide 67

Slide 67 text

Elasticsearch Analyzer Chain
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
[the] [über] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dogs]


Slide 68

Slide 68 text

Elasticsearch Analyzer Chain
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
[the] [uber] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dogs]


Slide 69

Slide 69 text

Elasticsearch Analyzer Chain
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
[the] [uber] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dogs]


Slide 70

Slide 70 text

Elasticsearch Analyzer Chain
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
[the] [uber] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]


Slide 71

Slide 71 text

Elasticsearch Analyzer Chain
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
[the] [uber] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]


Slide 72

Slide 72 text

Elasticsearch Analyzer Chain
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
[uber] [quick] [brown] [fox] [jump] [over] [lazy] [dog]


Slide 73

Slide 73 text

Elasticsearch Analyzer Chain
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
[uber] [quick] [brown] [fox] [jump] [over] [lazy] [dog]


Slide 74

Slide 74 text

Elasticsearch Analyzer Chain
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
[uber] [quick] [brown] [fox, vulpes] [jump] [over] [lazy] [dog, canis]


Slide 75

Slide 75 text

Elasticsearch Analyzer Chain
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
[uber] [quick] [brown] [fox, vulpes] [jump] [over] [lazy] [dog, canis]
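The walk-through above can be imitated in plain Python. This is a rough sketch, not Elasticsearch's implementation: every function name, the one-word stop list, the two-entry synonym map, and the suffix-stripping `toy_stem` are assumptions chosen only so the example sentence produces the terms shown on the slides.

```python
import re
import unicodedata

STOP_WORDS = {"the"}                               # assumed stop list
SYNONYMS = {"fox": ["vulpes"], "dog": ["canis"]}   # assumed synonym map


def char_filter(text):
    """Character filter: rewrite the raw text before tokenizing."""
    return text.replace("-", " ")


def tokenize(text):
    """Tokenizer: split on anything that is not a word character."""
    return re.findall(r"\w+", text)


def fold_ascii(token):
    """Token filter: strip diacritics, so 'über' becomes 'uber'."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def toy_stem(token):
    """Token filter: crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the chain: character filter, tokenizer, then token filters."""
    terms = []
    for token in tokenize(char_filter(text)):
        token = toy_stem(fold_ascii(token.lower()))
        if not token or token in STOP_WORDS:
            continue                               # drop stop words
        terms.append(token)
        terms.extend(SYNONYMS.get(token, []))      # add synonyms next to the term
    return terms


print(analyze("The über-quick brown fox jumps over the lazy dogs."))
```

Running the same `analyze` function over query text is what lets a search for different word forms land on the same terms as the indexed document.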

Slide 76

Slide 76 text

Elasticsearch Analyzer Chain

Terms    Doc IDs
brown    1
canis    1
dog      1
fox      1
jump     1
lazy     1
…
over     1
quick    1
uber     1
vulpes   1

Slide 77

Slide 77 text

Elasticsearch Analyzer Chain

Terms    Doc IDs
brown    1, 3, 6, …
canis    1, 2, …
dog      1, 2, 12…
fox      1, 5, 7, …
jump     1, 6, …
lazy     1, 7, …
…        3, 6, 7, …
over     1, 3, 5, 6, …
quick    1, 4, …
uber     1, …
vulpes   1, 5, 7, …
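A table like this is an inverted index: each term maps to the documents that contain it. The sketch below shows one way to build such a mapping in Python. The function name `build_inverted_index` is an assumption, document 1 carries the terms from the slides, and documents 2 and 5 are invented for illustration.

```python
from collections import defaultdict


def build_inverted_index(analyzed_docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, terms in analyzed_docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical documents that have already been through the analyzer chain.
analyzed_docs = {
    1: ["uber", "quick", "brown", "fox", "vulpes", "jump", "over", "lazy", "dog", "canis"],
    2: ["dog", "canis", "bark"],
    5: ["fox", "vulpes", "over"],
}

index = build_inverted_index(analyzed_docs)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Looking up a term is then a dictionary access rather than a scan over every post's text.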

Slide 78

Slide 78 text

Elasticsearch Analyzer Chain: On Query Text
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
“Jumping Foxes”

Slide 79

Slide 79 text

Elasticsearch Analyzer Chain: On Query Text
Raw Text → Character Filters → Tokenizer → Token Filters → Terms
[jump] [fox, vulpes]

Slide 80

Slide 80 text

Elasticsearch Analyzer Chain: On Query Text

Terms    Doc IDs
brown    1, 3, 6, …
canis    1, 2, …
dog      1, 2, 12…
fox      1, 5, 7, …
jump     1, 6, …
lazy     1, 7, …
…        3, 6, 7, …
over     1, 3, 5, 6, …
quick    1, 4, …
uber     1, …
vulpes   1, 5, 7, …
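Looking up the analyzed query terms in that table might look like the sketch below. It uses only the document IDs visible on the slide (the real lists continue past each "…"), and the helper name `match_counts` is an assumption.

```python
from collections import Counter

# Only the doc IDs visible on the slide; the real posting lists are longer.
index = {
    "fox": [1, 5, 7],
    "jump": [1, 6],
    "vulpes": [1, 5, 7],
}


def match_counts(index, query_terms):
    """Count how many of the query terms each document contains."""
    counts = Counter()
    for term in query_terms:
        for doc_id in index.get(term, []):
            counts[doc_id] += 1
    return counts


# "Jumping Foxes" after analysis becomes these three terms.
counts = match_counts(index, ["jump", "fox", "vulpes"])

# Most matching terms first, ties broken by document ID.
ranked = sorted(counts, key=lambda doc_id: (-counts[doc_id], doc_id))
print(ranked)
```

Counting matched terms is only a crude ordering; a relevancy score is what gives a more meaningful ranking.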

Slide 81

Slide 81 text

Understands Language via Customized Analyzers

Slide 82

Slide 82 text

Natural Language Processing & Search Ranking

Slide 83

Slide 83 text

Queries & Relevancy

Slide 84

Slide 84 text

Elasticsearch Filters & Queries

            Filters               Queries
Speed       Fast                  Slow(er)
Cached      Yes, With Bitsets!    No
Matching    Boolean Yes/No        Relevancy Score

Slide 85

Slide 85 text

Relevancy Score? TF-IDF

Slide 86

Slide 86 text

Relevancy Score? Term Frequency
 ×
 Inverse Document Frequency

Slide 87

Slide 87 text

Relevancy Score? Term Frequency
 ×
 Inverse Document Frequency

Slide 88

Slide 88 text

Relevancy Score? Term Frequency
 ×
 Inverse Document Frequency

Slide 89

Slide 89 text

Relevancy Score? Term Frequency
 ×
 Inverse Document Frequency
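As a worked illustration of the product above, here is a textbook TF-IDF calculation in Python. It is a sketch under stated assumptions: the three tiny documents are invented, and the formula (raw term count times the natural log of total documents over documents containing the term) is the simplest common variant. The scoring that Lucene, which Elasticsearch builds on, uses in practice differs in its details.

```python
import math

# Hypothetical analyzed documents: doc ID -> list of terms.
docs = {
    1: ["quick", "fox", "jump"],
    2: ["lazy", "dog"],
    3: ["fox", "fox", "dog"],
}


def tf(term, doc_id):
    """Term frequency: how often the term occurs in one document."""
    return docs[doc_id].count(term)


def idf(term):
    """Inverse document frequency: rarer terms get a higher weight."""
    df = sum(1 for terms in docs.values() if term in terms)
    return math.log(len(docs) / df) if df else 0.0


def tf_idf(term, doc_id):
    return tf(term, doc_id) * idf(term)


print(tf_idf("fox", 1))    # "fox" once, in 2 of 3 docs
print(tf_idf("fox", 3))    # "fox" twice, so twice the score
print(tf_idf("quick", 1))  # "quick" appears in only 1 of 3 docs
```

A term that appears often in one document but rarely across the whole collection scores highest, which is why "quick" outweighs "fox" in document 1.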

Slide 90

Slide 90 text

Relevancy Score?
Filter to Reduce Possible Documents, then Query to Calculate Match Relevancy
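The two-step idea can be sketched in plain Python. Everything here is hypothetical: the four posts, the `search` function, and the use of a raw term count as a stand-in for a real relevancy score. It is meant only to show the order of operations, not how Elasticsearch executes filters and queries internally.

```python
# Hypothetical posts with a filterable field and already-analyzed terms.
posts = [
    {"id": 1, "post_format": "image", "terms": ["fox", "jump", "dog"]},
    {"id": 2, "post_format": "standard", "terms": ["fox", "fox", "fox", "jump"]},
    {"id": 3, "post_format": "video", "terms": ["lazy", "dog"]},
    {"id": 4, "post_format": "gallery", "terms": ["fox", "fox", "fox"]},
]


def search(posts, allowed_formats, query_terms):
    # Step 1: filter. A cheap yes/no check that shrinks the candidate set.
    candidates = [p for p in posts if p["post_format"] in allowed_formats]

    # Step 2: query. Score only the remaining candidates.
    scored = []
    for post in candidates:
        score = sum(post["terms"].count(term) for term in query_terms)
        if score > 0:
            scored.append((score, post["id"]))

    # Highest score first, ties broken by post ID.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [post_id for _, post_id in scored]


print(search(posts, {"image", "gallery", "video"}, ["fox", "jump"]))
```

Post 2 would have scored highest, but the filter removes it before any scoring work is done.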

Slide 91

Slide 91 text

(Blog post excerpt repeated from Slide 11.)

Slide 92

Slide 92 text

(Blog post excerpt repeated from Slide 11.)

Slide 93

Slide 93 text

Natural Language Processing & Search Ranking

Slide 94

Slide 94 text

No content

Slide 95

Slide 95 text

No content

Slide 96

Slide 96 text

Slide 97

Slide 97 text

 

Slide 98

Slide 98 text

No content

Slide 99

Slide 99 text

Related Posts github.com/Automattic/jetpack/blob/master/modules/related-posts

Slide 100

Slide 100 text

curl -XPOST https://public-api.wordpress.com/rest/v1/sites/www.xyu.io/posts/2361/related -d '{
  "size": 5,
  "filter": {
    "and": [
      { "terms": { "post_format": [ "image", "gallery", "video" ] } },
      { "geo_distance": { "distance": "25mi", "location": [ 41.8236, -71.4222 ] } }
    ]
  }
}'

developer.wordpress.com/docs/elasticsearch

Slide 101

Slide 101 text

http://automattic.com/work-with-us/

Slide 102

Slide 102 text

Thanks! Code Wrangler, Automattic @HypertextRanch [email protected] xyu.io xyu