
Understanding Language and Fixing WP Search

xyu
September 27, 2014

Let's face it: search in WordPress is kinda terrible, and finding related content is even worse. The current WordPress tech stack restricts us to some extremely outdated search methods. Thankfully, help is on the way. In this talk, Xiao explains what makes Elasticsearch such a powerful tool and how it complements almost any WordPress install.


Transcript

  1. Understanding Language & Fixing WP Search WordCamp Providence 2014

  2. Xiao Yu Code Wrangler — Automattic @HypertextRanch me@xyu.io xyu.io xyu

        
  3. Jetpack Related Posts

  4. None
  5. None
  6. None
  7. Another Yet Another Related Posts Plugin?

  8. Because finding related content is a hard problem to solve.

  9. Natural Language Processing & Search Ranking    

          
  10. None
  11. WordPress on NGINX + HHVM with Heroku Buildpacks

    WordPress on NGINX + HHVM

    It’s been a year since I last made any major changes to my WordPress on Heroku build, and in tech years that’s a lifetime. Since then Heroku has released a new PHP buildpack with nginx and HHVM built in. Much progress has also been made on both HHVM and WordPress to make them compatible with each other. So it seems like now is as good a time as any to update the stack this site is running on. So without further ado I’d like to introduce Heroku WP, a template for HHVM-powered WordPress served by nginx.

    The Goal

    There are numerous other templates out there for running WordPress on Heroku, and my main goals for this template are:

    It should be simple: use the default buildpack provided by Heroku so there’s no other 3rd party dependency to implicitly trust or to maintain.
    It should be fast: use the latest technologies available to squeeze every last ounce of performance out of each Heroku Dyno.
    It should be secure: security is not an add-on, admin pages should be secure by default, and database connections need to be encrypted.
    It should scale: just because we can serve millions of page hits a day off a single Heroku Dyno does not mean we’ll stop there. The template should be made with cloud architecture in mind so that the number of Dynos can scale up and down without breaking.

    The Stack

    Standing on the shoulders of giants, I was able to use the latest Heroku buildpack and get WordPress running on:

    NGINX: an event-driven web server that was engineered for the modern day to replace Apache. This high-performance web server is preferred by more of the top 1,000 sites than any other, and it’s what’s used by the largest WordPress install out there, WordPress.com.
    HHVM: HipHop Virtual Machine, a JIT (just-in-time) compiler developed by Facebook to run PHP scripts, which showed up to a 2x improvement when tested with WordPress.

    I have yet to run any statistical analysis on performance; however, anecdotally it feels a lot faster navigating WP admin, and page generation times look much better. I’m looking forward to running more tests and performance tuning this build in the coming weeks.

    Update: While still not a head-to-head test, the response times reported by StatusCake for this site running on Heroku-WP and for a mirror of this site running on the old Heroku LAMP stack, with no load other than StatusCake pings, show a dramatic improvement:
  12. SELECT COUNT(*) FROM wp_posts WHERE post_content LIKE "%WordPress%";
      SELECT COUNT(*) FROM wp_posts WHERE post_content LIKE "%on%";
      SELECT COUNT(*) FROM wp_posts WHERE post_content LIKE "%NGINX%";
      …
      SELECT COUNT(*) FROM wp_posts WHERE post_content LIKE "%improvement%";
  13. NOPE

  14. Natural Language Processing & Search Ranking    

          
  15. WordPress on NGINX + HHVM with Heroku Buildpacks (same blog post excerpt as slide 11)

  16. WordPress on NGINX + HHVM with Heroku Buildpacks (same blog post excerpt as slide 11)
  17. SELECT *
      FROM wp_posts
      WHERE
        post_content LIKE "%WordPress%" OR
        post_content LIKE "%NGINX%" OR
        post_content LIKE "%HHVM%" OR
        post_content LIKE "%Heroku%" OR
        post_content LIKE "%performance%"

  18. SELECT *
      FROM wp_posts
      WHERE
        post_content LIKE "%WordPress%" OR
        post_content LIKE "%NGINX%" OR
        post_content LIKE "%HHVM%" OR
        post_content LIKE "%Heroku%" OR
        post_content LIKE "%performance%"
      ORDER BY
        !?
  19. Natural Language Processing & Search Ranking    

          
  20. None
  21. –Some Plugin Developer “Let’s just use tags & categories.”

  22. –Some SEO Consultant’s Dream “Everyone tags & categorizes everything perfectly.”

  23. NOPE

  24. Ok, it's a hard problem to solve.

  25. None
  26. elasticsearch

  27. –elasticsearch.org “Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine.”
  28. 2 Data Stores

  29. 2 Data Stores: more complexity

  30. 2 Data Stores: more complexity
      2 Data Stores: more points of failure

  31. 2 Data Stores: more complexity
      2 Data Stores: more points of failure
      2 Data Stores: more cost
  32. Worth it?

  33. None
  34.  ❤️

  35. Open & Inclusive  ❤️

  36. Allows Us to Tinker & Inclusive  ❤️

  37. Allows Us to Tinker & Optimized to Run Anywhere*  ❤️
      * Almost
  38. Openness & Inclusiveness  ❤️

  39. Expressing Ourselves with Written Language  ❤️

  40. howdy  ❤️

  41. مرحبا (“hello” in Arabic)  ❤️

  42.  ❤️

  43. Written Language  ❤️

  44. Understands Strings

  45. –My Blog to a Human “I almost ran into a swarm of baby ducks this morning…”
  46. –My Blog to MySQL
      0000000 49 20 61 6c 6d 6f 73 74 20 72 61 6e 20 69 6e 74
      0000010 6f 20 61 20 73 77 61 72 6d 20 6f 66 20 62 61 62
      0000020 79 20 64 75 63 6b 73 20 74 68 69 73 20 6d 6f 72
      0000030 6e 69 6e 67 e2 80 a6 …
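
The byte view on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch: the `hexdump` helper is a made-up name, and it assumes the post is stored as UTF-8, which is what the trailing `e2 80 a6` (the "…" character) suggests.

```python
# Sketch: show a post the way a byte-oriented data store sees it.
# `hexdump` is a hypothetical helper, not part of any library.
text = "I almost ran into a swarm of baby ducks this morning…"
raw = text.encode("utf-8")  # assumption: the content is stored as UTF-8


def hexdump(data: bytes, width: int = 16) -> list[str]:
    """Format bytes as lines of '<hex offset> <hex bytes>'."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        lines.append(f"{offset:07x} {chunk.hex(' ')}")
    return lines


for line in hexdump(raw):
    print(line)
```

Nothing in those bytes says where a word starts or ends, which is the point the following slides build on.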
  47. –Some Awesome Human “Man, I wonder what fantastic insights Xiao has about ducks.”
  48. SELECT * FROM wp_posts WHERE post_content LIKE "%ducks%"

  49. 0000000 49 20 61 6c 6d 6f 73 74 20 72 61 6e 20 69 6e 74
      0000010 6f 20 61 20 73 77 61 72 6d 20 6f 66 20 62 61 62
      0000020 79 20 64 75 63 6b 73 20 74 68 69 73 20 6d 6f 72
      0000030 6e 69 6e 67 e2 80 a6 …

  50. 0000000 49 20 61 6c 6d 6f 73 74 20 72 61 6e 20 69 6e 74
      0000010 6f 20 61 20 73 77 61 72 6d 20 6f 66 20 62 61 62
      0000020 79 20 64 75 63 6b 73 20 74 68 69 73 20 6d 6f 72
      0000030 6e 69 6e 67 e2 80 a6 …
  51. “I almost ran into a swarm of baby ducks this morning…”

  52. “I almost ran into a swarm of baby ducks this morning…”
      SELECT * FROM wp_posts WHERE post_content LIKE "%running%"

  53. “I almost ran into a swarm of baby ducks this morning…”
      SELECT * FROM wp_posts WHERE post_content LIKE "%running%"
  54. RUNNING != RAN
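
A small sketch of why substring matching falls short. `like_match` is a hypothetical stand-in for SQL `LIKE "%needle%"`; it is written as case-insensitive here, which is an assumption (in MySQL that depends on the column's collation).

```python
def like_match(content: str, needle: str) -> bool:
    """Rough stand-in for SQL LIKE '%needle%': a plain substring test."""
    return needle.lower() in content.lower()


post = "I almost ran into a swarm of baby ducks this morning…"

print(like_match(post, "ducks"))    # True: the exact string is present
print(like_match(post, "running"))  # False: "ran" is the same verb, but not the same string
print(like_match(post, "arm"))      # True: a false positive from inside "swarm"
```

The same test both misses a relevant post ("running" vs. "ran") and matches an irrelevant fragment ("arm"), because it compares characters rather than words.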

  55. Understands Language

  56. Analyzing Textual Content

  57. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
  58. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      “The über-quick brown fox jumps over the lazy dogs.”

  59. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      <p>The &uuml;ber-quick brown fox jumps over the lazy dogs.</p>

  60. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      <p>The &uuml;ber-quick brown fox jumps over the lazy dogs.</p>

  61. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      The über-quick brown fox jumps over the lazy dogs.

  62. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      The über-quick brown fox jumps over the lazy dogs.

  63. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      The über quick brown fox jumps over the lazy dogs

  64. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      The | über | quick | brown | fox | jumps | over | the | lazy | dogs

  65. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      The | über | quick | brown | fox | jumps | over | the | lazy | dogs

  66. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      the | über | quick | brown | fox | jumps | over | the | lazy | dogs

  67. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      the | über | quick | brown | fox | jumps | over | the | lazy | dogs

  68. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      the | uber | quick | brown | fox | jumps | over | the | lazy | dogs

  69. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      the | uber | quick | brown | fox | jumps | over | the | lazy | dogs

  70. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      the | uber | quick | brown | fox | jump | over | the | lazy | dog

  71. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      the | uber | quick | brown | fox | jump | over | the | lazy | dog

  72. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      uber | quick | brown | fox | jump | over | lazy | dog

  73. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      uber | quick | brown | fox | jump | over | lazy | dog

  74. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      uber | quick | brown | fox, vulpes | jump | over | lazy | dog, canis

  75. Elasticsearch Analyzer Chain: Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      uber | quick | brown | fox, vulpes | jump | over | lazy | dog, canis
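
The chain walked through on these slides can be sketched in plain Python. This is not how Elasticsearch implements its analyzers; every function name below is hypothetical, the stemmer is a toy that only covers the suffixes in this example, and the stop-word and synonym lists are assumptions taken from the slide content.

```python
import html
import re
import unicodedata

# Assumptions for illustration only: one stop word, two synonym pairs.
STOP_WORDS = {"the"}
SYNONYMS = {"fox": ["vulpes"], "dog": ["canis"]}


def char_filter(raw: str) -> str:
    """Character filter: strip HTML tags, decode entities such as &uuml;."""
    without_tags = re.sub(r"<[^>]+>", " ", raw)
    return html.unescape(without_tags)


def tokenize(text: str) -> list[str]:
    """Tokenizer: split on anything that is not a letter, digit or underscore."""
    return re.findall(r"\w+", text)


def fold_ascii(token: str) -> str:
    """Token filter: drop diacritics, e.g. 'über' becomes 'uber'."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token: str) -> str:
    """Token filter: toy stemmer, NOT a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(raw: str) -> list[str]:
    """Run the whole chain: raw text in, index terms out."""
    terms = []
    for token in tokenize(char_filter(raw)):
        token = toy_stem(fold_ascii(token.lower()))
        if token in STOP_WORDS:
            continue  # stop words never reach the index
        terms.append(token)
        terms.extend(SYNONYMS.get(token, []))  # add synonyms next to the term
    return terms


print(analyze("<p>The &uuml;ber-quick brown fox jumps over the lazy dogs.</p>"))
print(analyze("Jumping Foxes"))
```

Because the same function is applied to documents and to query text, "Jumping Foxes" ends up as terms that also appear in the analyzed sentence.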
  76. Elasticsearch Analyzer Chain
      Terms    Doc IDs
      brown    1
      canis    1
      dog      1
      fox      1
      jump     1
      lazy     1
      …
      over     1
      quick    1
      uber     1
      vulpes   1

  77. Elasticsearch Analyzer Chain
      Terms    Doc IDs
      brown    1, 3, 6, …
      canis    1, 2, …
      dog      1, 2, 12, …
      fox      1, 5, 7, …
      jump     1, 6, …
      lazy     1, 7, …
      …        3, 6, 7, …
      over     1, 3, 5, 6, …
      quick    1, 4, …
      uber     1, …
      vulpes   1, 5, 7, …
  78. Elasticsearch Analyzer Chain (On Query Text): Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      “Jumping Foxes”

  79. Elasticsearch Analyzer Chain (On Query Text): Raw Text → Character Filters → Tokenizer → Token Filters → Terms
      jump | fox | vulpes

  80. Elasticsearch Analyzer Chain (On Query Text)
      Terms    Doc IDs
      brown    1, 3, 6, …
      canis    1, 2, …
      dog      1, 2, 12, …
      fox      1, 5, 7, …
      jump     1, 6, …
      lazy     1, 7, …
      …        3, 6, 7, …
      over     1, 3, 5, 6, …
      quick    1, 4, …
      uber     1, …
      vulpes   1, 5, 7, …
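
The terms-to-doc-IDs table is an inverted index. A minimal sketch of building one and looking up analyzed query terms might look like this; the documents other than ID 1 are invented for illustration, and `build_index` and `lookup` are hypothetical names.

```python
from collections import defaultdict

# Hypothetical, already-analyzed documents (doc ID -> terms).
# Only document 1 comes from the slides; the rest are made up.
docs = {
    1: ["uber", "quick", "brown", "fox", "vulpes", "jump", "over", "lazy", "dog", "canis"],
    2: ["dog", "canis", "bark"],
    5: ["fox", "vulpes", "over", "hill"],
    6: ["brown", "bear", "jump", "over"],
}


def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def lookup(index, query_terms):
    """Return sorted IDs of documents containing at least one query term."""
    matches = set()
    for term in query_terms:
        matches |= index.get(term, set())
    return sorted(matches)


index = build_index(docs)
print(sorted(index["fox"]))                      # [1, 5]
print(lookup(index, ["jump", "fox", "vulpes"]))  # [1, 5, 6]
```

A lookup is now a dictionary access per query term instead of a scan over every post's content.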
  81. Understands Language via Customized Analyzers

  82. Natural Language Processing & Search Ranking    

          
  83. Queries & Relevancy

  84. Elasticsearch Filters & Queries
                  Filters               Queries
      Speed       Fast                  Slow(er)
      Cached      Yes, With Bitsets!    No
      Matching    Boolean Yes/No        Relevancy Score
  85. Relevancy Score? TF-IDF

  86. Relevancy Score? Term Frequency × Inverse Document Frequency

  87. Relevancy Score? Term Frequency × Inverse Document Frequency

  88. Relevancy Score? Term Frequency × Inverse Document Frequency

  89. Relevancy Score? Term Frequency × Inverse Document Frequency

  90. Relevancy Score? Filter to Reduce Possible Documents then Query to Calculate Match Relevancy
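
A textbook TF-IDF sketch of "filter first, then score". Elasticsearch's actual scoring is more involved than this; the formula, the sample documents and the function names below are assumptions for illustration.

```python
import math

# Hypothetical, already-analyzed documents (doc ID -> terms).
docs = {
    1: ["quick", "brown", "fox", "jump", "over", "lazy", "dog"],
    2: ["lazy", "dog", "sleep", "dog"],
    3: ["brown", "bear", "jump"],
}


def tf(term, terms):
    """Term frequency: how much of this document is the term."""
    return terms.count(term) / len(terms)


def idf(term, documents):
    """Inverse document frequency: rarer terms weigh more."""
    containing = sum(1 for terms in documents.values() if term in terms)
    return math.log(len(documents) / containing) if containing else 0.0


def score(query_terms, doc_terms, documents):
    return sum(tf(t, doc_terms) * idf(t, documents) for t in query_terms)


def search(query_terms, documents, allowed_ids):
    """Score only documents that passed the yes/no filter, best match first."""
    scored = [
        (doc_id, score(query_terms, documents[doc_id], documents))
        for doc_id in allowed_ids
    ]
    ranked = [(doc_id, s) for doc_id, s in scored if s > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# No filtering: document 2 mentions "dog" more densely, so it ranks first.
print(search(["dog"], docs, allowed_ids=[1, 2, 3]))
# A filter removed document 2 before any scoring happened.
print(search(["dog"], docs, allowed_ids=[1, 3]))
```

The filter step is a cheap set-membership decision, so the more expensive scoring only runs on the documents that survive it.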
  91. WordPress on NGINX + HHVM with Heroku Buildpacks (same blog post excerpt as slide 11)

  92. WordPress on NGINX + HHVM with Heroku Buildpacks (same blog post excerpt as slide 11)
  93. Natural Language Processing & Search Ranking    

          
  94. None
  95. None
  96.  

  97. None
  98. Related Posts github.com/Automattic/jetpack/blob/master/modules/related-posts

  99. curl -XPOST https://public-api.wordpress.com/rest/v1/sites/www.xyu.io/posts/2361/related -d '{
        "size": 5,
        "filter": {
          "and": [
            { "terms": { "post_format": [ "image", "gallery", "video" ] } },
            { "geo_distance": { "distance": "25mi", "location": [ 41.8236, -71.4222 ] } }
          ]
        }
      }'

      developer.wordpress.com/docs/elasticsearch
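
The same request can be assembled from Python. This sketch only builds the request object and does not send it. The URL and payload are copied from the slide, while the `Content-Type` header and the helper name `build_related_request` are assumptions, and the endpoint may have changed since the talk.

```python
import json
import urllib.request

# Endpoint copied from the slide; it may no longer behave as shown in 2014.
URL = "https://public-api.wordpress.com/rest/v1/sites/www.xyu.io/posts/2361/related"


def build_related_request(size, post_formats, distance, location):
    """Build (but do not send) the related-posts request from the slide."""
    body = {
        "size": size,
        "filter": {
            "and": [
                {"terms": {"post_format": post_formats}},
                {"geo_distance": {"distance": distance, "location": location}},
            ]
        },
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption, not from the slide
        method="POST",
    )


request = build_related_request(5, ["image", "gallery", "video"], "25mi", [41.8236, -71.4222])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending it would be: urllib.request.urlopen(request)  (requires network access)
```

Both entries under `"and"` are filters, so they narrow the candidate posts without affecting how the remaining ones are ranked.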
  100. http://automattic.com/work-with-us/

  101. Thanks! Code Wrangler — Automattic @HypertextRanch me@xyu.io xyu.io xyu 

       