Serverless SEO - SMX Advanced Europe 2021

My talk from SMX Advanced 2021 outlining how to use Cloudflare Workers to overcome challenges and limitations with popular CMS and ecommerce platforms.

Bastian Grimm

November 07, 2022



  1. Before we talk about Workers, we need to talk HTTP

    requests – and CDNs: Establishing a common ground
  2. pa.ag @peakaceag 13 A (very) simplified request lifecycle: your computer / your browser → DNS server (e.g. to translate domain<>IP) → web server (aka “origin server”) → database server (in most cases).
  3. Let's introduce a CDN to the mix. If you’re not familiar with the term CDN: “A content delivery network (CDN) is a globally distributed network of servers deployed in multiple data centers around the globe.”
  4. Using a CDN, all requests pass through “edge servers”. Ignoring DNS, databases etc. for a minute, the very first request looks like this: peakace.js is not yet cached on the edge server, so the edge forwards the request to the origin server; the response (peakace.js) is delivered from the origin and then cached on the edge server.
  5. On the second request (independent of the user), peakace.js is already cached on the edge server, so the response is delivered straight from the edge.
  6. Especially for global businesses, CDNs can be a great help. Use CDNPerf.com to find the one that suits you best, depending on where you are and which regions/countries you serve most. This will positively impact TTFB! Give it a try: https://www.cdnperf.com/
  7. CDNs at a glance: some of the most popular CDN providers out there.
  8. Back in Sep. 2017, Cloudflare introduced their “Workers”, which ultimately became publicly available in March 2018. Source: https://blog.cloudflare.com/introducing-cloudflare-workers/
  9. So… what‘s a Worker? Workers use the V8 JavaScript engine built by Google and run globally on Cloudflare's edge servers. A typical Worker script executes in <1ms – that’s fast!
  10. Send requests to 3rd-party servers: you can also make multiple requests, in series or in parallel, and combine the results.
  11. Intercept and modify HTTP request and response URLs, status, headers,

    and body content. Seriously though, this is WILD!
  12. Inject/remove (body) content: essentially, you can do almost anything – because you have full access to the request and response objects!
  13. However, does this only work with Cloudflare? Similar implementations are also available from other CDN providers: Lambda@Edge (AWS), Compute@Edge (Fastly) and EdgeWorkers (Akamai) are comparable to Cloudflare Workers.
  14. But today it‘s all about Cloudflare, because the top 3 providers (CF, AWS, Akamai) have 89% of all customers; Cloudflare alone is used by 81% of all sites that rely on a CDN (according to W3Techs). Source: https://pa.ag/2U9kvAh
  15. Excited? Let‘s go! A practical and hands-on guide to setting up and running Cloudflare Workers for your SEO.
  16. Go create your own (free) account over at cloudflare.com. Once your account is activated, you can add your first site/domain – it can be registered anywhere, as long as you can change the DNS settings at your current provider.
  17. To play with it, the free account + $0 plan is sufficient – good enough for testing things out!
  18. Next, you‘ll get to see the current DNS configuration. Yours should look a little like this: at least two records, one for the root domain and one for the www sub-domain, both pointing to the IP address of your hosting provider. On to the next screen!
  19. Now, CF will show you which nameservers to use instead: the nameservers at your current provider (in my case nsX.inwx.de) are to be replaced with the new Cloudflare nameservers.
  20. Switching existing nameservers over to Cloudflare: at my hosting provider, this means entering the new nameservers Cloudflare told me to use (see previous screen).
  21. Switch back to Cloudflare to tell them you’re all set.
  22. Cloudflare is going to email you when things are ready. Beware, this can take up to 24 hrs depending on the registrars and nameservers. Your CF dashboard should look like this after the successful NS change.
  23. Speaking of nameservers – are you already using 1.1.1.1? Cloudflare runs the fastest DNS resolver available. Why wouldn‘t you use it? More: https://pa.ag/3zueHRX
  24. Can‘t wait – or just want to check DNS records? Free tool recommendation: MxToolbox > DNS Lookup. Source: https://pa.ag/3vuBObV
  25. A Worker, in its simplest form: a function that defines the triggers for a Worker script to execute. In this case, we intercept the request and send a (custom) response which, for now, simply: (6) logs the request object, (7) fetches the requested URL from the origin server, (8) logs the response object and (10) sends the (unmodified) response back to the client.
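A minimal sketch of such a pass-through Worker, assuming the classic Service-Worker-style syntax; the numbered comments mirror the line references above, and the event-listener registration only exists in the Workers runtime, so it is shown as a comment:

```javascript
// In a deployed Worker, this registration wires up the trigger:
//   addEventListener("fetch", event => {
//     event.respondWith(handleRequest(event.request));
//   });

async function handleRequest(request) {
  console.log("request:", request.url);      // (6) log the request object
  const response = await fetch(request);     // (7) fetch the requested URL from the origin server
  console.log("response:", response.status); // (8) log the response object
  return response;                           // (10) send the (unmodified) response back to the client
}
```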
  26. Let‘s live-deploy the Worker to Cloudflare's edge servers: select your domain > Workers > Manage Workers.
  27. Here’s how to add a Worker: you‘ll be redirected from the “all Workers“ overview to the following screen. Give your Worker a unique name, then copy & paste the Worker code you just tested on the Playground.
  28. Comparison: left (original) vs. right (Worker-enabled). Double-check live! Also, don‘t fall victim to caching: use “Disable Cache“ (see Network tab in Chrome DevTools) to be sure you‘re seeing the latest version.
  29. Warning: (maybe) not production-ready! Please understand that all scripts/source codes are meant as examples only. Ensure you know what you’re doing when using them in a production environment!
  30. Redirects on the edge using the Response API. To execute any type of HTTP redirect, we need to use the Response Runtime API, which – conveniently – also provides a static method called “redirect()”: either let response = new Response(body, options) – or just: return Response.redirect(destination, status). Source: https://pa.ag/3gvXYoL
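As a sketch of the two variants just mentioned (the destination URL is a placeholder):

```javascript
// Variant 1: build the redirect by hand via the Response constructor
function manualRedirect(destination, status = 301) {
  return new Response(null, {
    status,
    headers: { Location: destination },
  });
}

// Variant 2: the static convenience method
function shortRedirect(destination, status = 301) {
  return Response.redirect(destination, status);
}
```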
  31. The Cloudflare Workers docs are a solid starting point. More: https://pa.ag/3gNd8Gn
  32. Different types of implementations at a glance – (#18): a 302 redirect, (#22): a 301 redirect, (#26): a reverse proxy call and (#31-36): multiple redirects, selecting a single destination from a map based on a URL parameter.
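A sketch of the map-based variant, with made-up IDs and destinations:

```javascript
// Select a redirect destination based on a URL parameter, e.g. /go?id=docs
const redirectMap = new Map([
  ["docs", "https://example.com/documentation/"],
  ["blog", "https://example.com/blog/"],
]);

function handleRedirect(request) {
  const id = new URL(request.url).searchParams.get("id");
  const destination = redirectMap.get(id);
  if (destination) {
    return Response.redirect(destination, 301);
  }
  // ID is not configured in redirectMap: just pass through to the origin
  return fetch(request);
}
```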
  33. A quick overview to see how things are working (source: https://httpstatus.io): correct – in this case it’s not a redirect, because the ID is not configured in redirectMap.
  34. To “reverse proxy” a request, you can use the Fetch API. It provides an interface for (asynchronously) fetching resources via HTTP requests inside of a Worker script: const response = await fetch(URL, options). Asynchronous tasks such as fetch are not executed at the top level in a Worker script and must be executed within a FetchEvent handler. Source: https://pa.ag/3wpS3YT
  35. return await fetch(“https://example.com”) – easily “migrate” a blog hosted on a sub-domain to a sub-folder on your main domain, without actually moving it: the content is shown from example.com while the request is sent from bastiangrimm.dev. Great tutorial: https://pa.ag/2Tw7LD8
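A sketch of that sub-domain-to-sub-folder proxying, with hypothetical hostnames and paths:

```javascript
// Serve blog.example.com transparently under example.com/blog/
async function proxyBlog(request) {
  const url = new URL(request.url);
  if (url.pathname.startsWith("/blog")) {
    // Rewrite the request to the origin that actually hosts the blog
    url.hostname = "blog.example.com";
    url.pathname = url.pathname.replace(/^\/blog/, "") || "/";
    return fetch(url.toString(), request);
  }
  return fetch(request); // everything else goes straight to the origin
}
```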
  36. Verifying that this all happens “on the edge“: zoom into any of the response headers for an originally requested URL such as bastiangrimm.dev/redirects/302.
  37. Which version would you like to wake up to? Preventing “SEO heart attacks“ by using a Worker to monitor and safeguard your robots.txt file is one of many use cases that are super easy to implement: on the left, the robots.txt file I uploaded to my test server; on the right, the output the Worker running in the background changed it to.
  38. Preventing a global “Disallow: /“ in robots.txt – (#5-6): define defaults, (#15-16): if robots.txt returns 200, read its content, (#19-24): replace a global disallow if it exists, (#27-29): return the default “allow all” if the file doesn’t exist.
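The guard logic described above can be sketched as a small pure function; the default rules and the exact replacement are assumptions, and in a Worker you would feed it the status and body of the fetched origin robots.txt:

```javascript
// Default to be served whenever robots.txt is missing or fully blocking
const DEFAULT_ROBOTS = "User-agent: *\nAllow: /";

function safeguardRobots(status, body) {
  if (status !== 200) {
    return DEFAULT_ROBOTS; // file doesn't exist: return the default "allow all"
  }
  // Neutralise a global "Disallow: /" while keeping specific rules intact
  return body.replace(/^Disallow:\s*\/\s*$/gim, "Disallow:");
}
```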
  39. For demonstration purposes only, UA-based delivery – (#10): get the User-Agent, (#16-17): add a dynamic Sitemap link if the UA contains “googlebot”.
  40. Live-test & compare robots.txt using technicalseo.com. The left screen shows bastiangrimm.dev/robots.txt being requested using a Googlebot User-Agent string, the right screen is the default output. Free testing tool: https://technicalseo.com/tools/robots-txt/
  41. Easily overwrite files which are “not meant” to be changed? Some systems cause endless headaches for SEOs – routing them through Cloudflare and using a Worker works very well!
  42. Say hello to the HTMLRewriter class! HTMLRewriter allows you to build comprehensive and expressive HTML parsers inside of a Cloudflare Workers application: new HTMLRewriter().on("*", new ElementHandler()).onDocument(new DocumentHandler()). Source: https://pa.ag/2RTpqEt
  43. Let's give it a try and work with <head> and <meta> first – (#24-25): pass the tags to an ElementHandler, (#9-11): if it’s <meta name=“robots”>, set it to “index,nofollow”, (#14-16): if it’s <head>, add another directive for bingbot.
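A sketch of such a handler; the HTMLRewriter wiring only runs on the edge and is shown as a comment, and the exact bingbot directive is an assumption about what “another directive” looks like:

```javascript
class ElementHandler {
  element(element) {
    // <meta name="robots">: force the content attribute to "index,nofollow"
    if (element.tagName === "meta" && element.getAttribute("name") === "robots") {
      element.setAttribute("content", "index,nofollow");
    }
    // <head>: append an extra directive aimed at bingbot (hypothetical value)
    if (element.tagName === "head") {
      element.append('<meta name="bingbot" content="nofollow">', { html: true });
    }
  }
}

// In a Worker:
// new HTMLRewriter().on("meta", new ElementHandler())
//   .on("head", new ElementHandler()).transform(response)
```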
  44. I mean, this should be clear – but just in case: verifying the presence of Worker-modified robots meta directives via GSC.
  45. If you want to work with/on every HTML element: this selector passes every HTML element to your ElementHandler; by using element.tagName, you can then identify which element has been passed along: return new HTMLRewriter().on("*", new ElementHandler()).transform(response)
  46. 4. Title and meta description – of course, updating, changing or entirely replacing both elements is also possible!
  47. Using element selectors in HTMLRewriter: often, you only want to process very specific elements, e.g. <meta> tags – but not all of them. Maybe it’s just the meta description you care about? new HTMLRewriter().on('meta[name="description"]', new ElementHandler()).transform(response) – more on selectors: https://pa.ag/35xw073
  48. Updating or replacing titles and descriptions is easy – (#10): a forced <title> overwrite, (#14-22): conditional changes to the meta description. Element selectors are super powerful yet easy to use.
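The two handlers described above could be sketched like this; the replacement strings and the “weak description” length threshold are placeholders, and the HTMLRewriter wiring is shown as a comment:

```javascript
class TitleHandler {
  element(element) {
    // Forced overwrite of the <title> (replacement text is a placeholder)
    element.setInnerContent("New title, set on the edge");
  }
}

class DescriptionHandler {
  element(element) {
    const current = element.getAttribute("content") || "";
    // Conditional change: only replace short descriptions (threshold made up)
    if (current.length < 50) {
      element.setAttribute("content", "A better meta description, served from the edge.");
    }
  }
}

// new HTMLRewriter().on("title", new TitleHandler())
//   .on('meta[name="description"]', new DescriptionHandler()).transform(response)
```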
  49. Maybe you should have listened to me in the first place!? Check out my presentation over at SlideShare: http://pa.ag/migration_search_y
  50. HTMLRewriter listening to <a> and <img> tags – (#29-30): passing href/src attributes to a class which (#20): replaces oldURL with newUrl and (#16-18): ensures https availability. Based on: https://pa.ag/35llTSo
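A sketch of such an attribute rewriter; the hostnames stand in for the oldURL/newUrl values from the slide:

```javascript
// Rewrite the old host to the new host and upgrade http:// to https://
class AttributeRewriter {
  constructor(attributeName) {
    this.attributeName = attributeName;
  }
  element(element) {
    const value = element.getAttribute(this.attributeName);
    if (value) {
      element.setAttribute(
        this.attributeName,
        value.replace("blog.example.com", "www.example.com")
             .replace("http://", "https://")
      );
    }
  }
}

// new HTMLRewriter().on("a", new AttributeRewriter("href"))
//   .on("img", new AttributeRewriter("src")).transform(response)
```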
  51. HTTP hreflang annotations on the edge: we‘ve just had plenty of HTML, so let‘s use HTTP headers instead – of course, both ways work just fine.
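A sketch of the header-based approach, with placeholder URLs; the origin response is re-wrapped so its headers become mutable:

```javascript
// Append hreflang annotations as HTTP Link headers
async function addHreflangHeaders(request) {
  const response = await fetch(request);
  const modified = new Response(response.body, {
    status: response.status,
    headers: response.headers, // copy, so we can append to them
  });
  modified.headers.append("Link", '<https://example.com/en/>; rel="alternate"; hreflang="en"');
  modified.headers.append("Link", '<https://example.com/de/>; rel="alternate"; hreflang="de"');
  return modified;
}
```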
  52. Verify, e.g. using the Chrome Developer Console: Network > %URL (disable cache) > Headers > Response Headers.
  53. Before you ask: X-Robots-Tag directives are also possible… and the same is true for rel-canonical annotations via HTTP header.
  54. Combining an HTTP 503 error with a Retry-After header: Retry-After indicates how long the UA should wait before making a follow-up request. “The server is currently unable to handle the request due to a temporary overloading or maintenance of the server […]. If known, the length of the delay MAY be indicated in a Retry-After header.”
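A minimal sketch of such a maintenance response; the body text and the retry delay are placeholders:

```javascript
// Maintenance mode on the edge: 503 plus Retry-After
function maintenanceResponse() {
  return new Response("Down for maintenance - please retry shortly.", {
    status: 503,
    headers: {
      "Retry-After": "3600",    // tell the UA to wait an hour before retrying
      "Content-Type": "text/plain",
    },
  });
}
```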
  55. 8. Injecting content – just in case I’ve somehow not made my point yet: you can do REALLY cool stuff and have control over the full HTML response, so adding content is easy.
  56. You could also (dynamically) read from an external feed. Feeding in content from other sources is simple; the example reads a JSON feed, parses the input and injects it into the <h1> of the target page.
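A sketch of that feed injection; the feed URL and field name are made up, and HTMLRewriter only exists in the Workers runtime:

```javascript
// Inject text from an external JSON feed into the page's <h1>
class H1Handler {
  constructor(text) {
    this.text = text;
  }
  element(element) {
    element.setInnerContent(this.text);
  }
}

async function injectHeadline(response) {
  const feed = await fetch("https://example.com/feed.json"); // hypothetical feed
  const { headline } = await feed.json();                    // hypothetical field
  return new HTMLRewriter().on("h1", new H1Handler(headline)).transform(response);
}
```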
  57. 9. Collecting logfiles – one of the key challenges when using CDNs: logfiles are literally everywhere, and a lot of requests don‘t even make it to the origin server…
  58. Cloudflare provides extensive possibilities for logfiles. What I really love about this: direct integration with Google Cloud products! Note: you need the Enterprise plan for this. More: https://pa.ag/3gnj8GF
  59. The Peak Ace log file auditing stack (interested? > [email protected]): log files are stored in Google Cloud Storage, processed in Dataprep, exported to BigQuery and visualised in Data Studio via the BigQuery Connector, with GSC (API v3) and GA (API v4) data imported via the Google Apps Script API.
  60. New to logfile auditing? No worries, I‘ve got you covered – check out my presentation over at SlideShare: http://pa.ag/slides
  61. 10. Web performance – yeah… actually, this is how it all started, and it‘s still (one of) the most powerful tools to use for it!
  62. Add native lazy loading for images to your HTML mark-up. Keep in mind: you don‘t want to lazy load all of your images (e.g. not the hero image); also, if you‘re using iframes, you might want to pass “iframe“ to the HTMLRewriter.
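A sketch of such a handler; the “skip images that already declare a loading behaviour” rule is one simple way to keep the hero image untouched, assuming it explicitly sets loading="eager":

```javascript
class LazyLoadHandler {
  element(element) {
    // Don't touch images that already declare a loading behaviour,
    // e.g. a hero image explicitly marked loading="eager"
    if (!element.getAttribute("loading")) {
      element.setAttribute("loading", "lazy");
    }
  }
}

// new HTMLRewriter().on("img", new LazyLoadHandler()).transform(response)
// (also pass "iframe" to .on() if you want iframes lazy-loaded too)
```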
  63. Cleaning up HTML code for performance reasons, e.g. by removing unwanted pre*-stages, or by adding async/defer to JS calls. More clean-up Worker scripts: https://gist.github.com/Nooshu
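The async/defer part could be sketched like this; whether defer is appropriate for a given script is site-specific, so treat the condition as an assumption:

```javascript
// Add defer to external scripts that declare neither async nor defer
// (inline scripts have no src and are left alone)
class DeferScriptHandler {
  element(element) {
    if (element.getAttribute("src") && !element.hasAttribute("async")) {
      element.setAttribute("defer", "");
    }
  }
}

// new HTMLRewriter().on("script", new DeferScriptHandler()).transform(response)
```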
  64. A detailed guide on how to cache HTML with CF Workers. More: https://pa.ag/3xk8rdt
  65. And tons of other things… you can use Workers to fix broken tracking, allow for better accessibility, and much more.
  66. Tool recommendations – some stuff to make your (Worker) life just a bit easier…
  67. Sloth: an advanced CF Worker code generator & CMS. A very handy (and free) UI to manage Workers for changing robots.txt, titles & descriptions, redirects, hreflang, and much more. Check it out: https://sloth.cloud
  68. Tool recommendation: Lil Redirector. “Lil Redirector works by persisting and querying redirects inside of Workers KV, and includes an administrator UI for creating, modifying, and deleting redirects.” More: https://pa.ag/3q3EZGx
  69. Workers KV – wait, what? “Workers KV is a global, low-latency, key-value data store. It supports exceptionally high read volumes […] Workers KV is generally good for use cases where you need to write relatively infrequently, but read quickly and frequently. It is optimised for these high-read applications.” Source: https://pa.ag/3vmTiXB
  70. A web scraper based on Cloudflare Workers: “Web Scraper makes it effortless to scrape websites. You provide a URL & CSS selector, and it will return you JSON containing the text contents of the matching elements.” More: https://pa.ag/3woCv7T
  71. Technically not a tool, but a very comprehensive guide. More: https://pa.ag/3xnWDqy
  72. “With great power comes great responsibility.” [This] dates back to the time of the French Revolution – at least, if you believe Wikipedia, that is… Source: https://pa.ag/35nQSx6
  73. A great summary over at ContentKing, well worth a read: what are the downsides, and what risks are involved? Source: https://pa.ag/3xhYUUk
  74. Risk of costs: 10 million requests are included; every additional 1 million currently costs $0.50 – not crazy expensive, but in larger-scale setups it certainly means additional costs.
  75. PCI compliance: this might interfere with current processes – at the very least, ensure Workers become part of a standardised process (e.g. deployment).
  76. Potential conflicts in code: the underlying codebase might do/require something that could accidentally be overwritten on the edge.
  77. Potential to introduce frontend bugs: additional modifications on the edge could result in a massive debugging effort. Again: proper documentation and processes are crucial!
  78. Yep, you can do evil things with Workers, for sure. Source: https://pa.ag/3cFq0Nq
  79. Dynamically creating links to “Baccarat Sites”: “[…] at the CF Workers management area, there was a suspicious Worker listed called hang. It had been set to run on any URL route requests to the website.” After further investigation [by Sucuri], it was found that the website was actually loading SEO spam content through Cloudflare’s Workers service. This service allows someone to load external third-party JavaScript that’s not on their website’s hosting server. Source: https://pa.ag/3cFq0Nq
  80. The suspicious “hang” Worker injection in detail (source: https://pa.ag/3cFq0Nq): ▪ The JavaScript Worker first checks the HTTP request’s user-agent and identifies whether it contains Google/Googlebot or naver within the string. ▪ If the user-agent string contains either of these keywords, the JavaScript makes a request to the malicious domain naverbot[.]live to generate the SEO spam links to be injected into the victim’s website. ▪ After this step, the Worker injects the retrieved SEO spam link data right before the final </body> tag of the infected website’s HTML source. ▪ The malicious JavaScript can also be triggered if the user-agent matches a crawler that is entirely separate from Googlebot: naver.
  81. If you‘re now wondering how to distribute Workers… Source: https://pa.ag/3zq0Mwd