Scraping the Web with AWS Lambda and PhantomJS

A talk given at the Greater Philadelphia AWS User Group meetup on May 25, 2016. The talk summarizes our experience creating a scalable website scraper and the many iterations of technology we went through to arrive at our final product.

You can find the source code of the PhantomJS/Node.js web scraper for AWS Lambda at https://github.com/akrylysov/lambda-phantom-scraper.

Artem Krylysov

May 25, 2016

Transcript

  1. A look at the experience of creating a scalable website

    scraper and the many iterations of technology we went through to achieve our final product.
  2. blockbust.io Blockbust is a service which scans a website and

    helps to identify if any of its HTML elements may be blocked by an ad blocker like Adblock, Adblock Plus, uBlock or many others. Blockbust fetches the content of the web page and checks it against the list of known rules.
  3. First attempt The Blockbust backend is written in Go, so

    our first decision was to use the easiest and fastest way: http.Get. http.Get makes an HTTP request to a web server and lets you read the response.
  4. First attempt It was very fast but didn't work well.

    Almost all modern websites use JavaScript to generate some additional content, and http.Get returns only the HTML sent by the server without running any scripts.
  5. First attempt For example, for facebook.com, the size of the

    initial HTML page returned by the server is 300 KB. If you open the same page in a browser and wait until it’s completely loaded, the size of the HTML content doubles to 600 KB.
  6. PhantomJS PhantomJS is a headless WebKit-based browser. It

    provides a JavaScript API for navigating web pages, interacting with the DOM and taking screenshots. http://phantomjs.org/
  7. PhantomJS

    var page = require('webpage').create();
    page.open('http://phantomjs.org/documentation/', function(status) {
        if (status === 'success') {
            var title = page.evaluate(function() {
                return document.title;
            });
            console.log('Title: ' + title);
            phantom.exit(0);
        } else {
            phantom.exit(1);
        }
    });
  8. PhantomJS PhantomJS is not a Node.js module; it is a

    standalone application. If you want to use it from your Node.js script you have to execute the phantomjs binary as a child process. You can communicate with it using stdin and stdout.
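
A minimal sketch of this pattern using Node's built-in child_process module. The helper name runScript is hypothetical, and the Node.js binary (process.execPath) stands in for phantomjs so that the snippet runs without PhantomJS installed; in a real scraper you would pass the path to the phantomjs binary and a script file instead.

```javascript
var spawnSync = require('child_process').spawnSync;

// Hypothetical helper: runs a binary, sends `input` to its stdin,
// and returns whatever the child wrote to stdout.
function runScript(binaryPath, args, input) {
  var result = spawnSync(binaryPath, args, { input: input, encoding: 'utf8' });
  if (result.error) {
    throw result.error; // e.g. the binary could not be started
  }
  if (result.status !== 0) {
    throw new Error('child exited with code ' + result.status + ': ' + result.stderr);
  }
  return result.stdout;
}

// Stand-in child program: reads a URL from stdin and echoes a fake title.
var childCode = [
  "var data = '';",
  "process.stdin.on('data', function (chunk) { data += chunk; });",
  "process.stdin.on('end', function () {",
  "  process.stdout.write('Title for ' + data.trim());",
  "});"
].join('\n');

var output = runScript(process.execPath, ['-e', childCode], 'http://phantomjs.org/\n');
console.log(output);
```

spawnSync blocks until the child exits, which keeps the sketch short. Code that must stay responsive while PhantomJS is working would more likely use the asynchronous execFile or spawn.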
  9. PhantomJS We use the phantomjs-prebuilt package, which provides an easy way

    to install the PhantomJS binary using npm and use it from Node.js. https://github.com/Medium/phantomjs
  10. Docker and ECS ECS is a service for running and

    managing Docker containers on an EC2 cluster.
  11. Docker and ECS ECS didn’t provide a way to scale the

    number of instances of a specified Docker container automatically. You could place your EC2 container cluster into an auto scaling group and it would add a new EC2 instance to the cluster, but it wouldn’t launch a new Docker container.
  12. Queue Our next option was to add a queue in

    front of the scraper and limit the number of concurrent requests. In this case, if we had a huge spike in traffic, users would have to wait in line for their results, and it might take a long time. We wanted to avoid this kind of poor user experience.
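
To make the trade-off concrete, here is a rough sketch of such a queue: a limiter that runs at most a fixed number of tasks at once and makes the rest wait. The names createLimiter and fakeScrape are hypothetical, and a timer stands in for the actual scraping; this is not the code from the talk.

```javascript
// Hypothetical limiter: runs at most `maxConcurrent` tasks at once,
// the rest wait in a FIFO queue.
function createLimiter(maxConcurrent) {
  var active = 0;
  var waiting = [];

  function next() {
    if (active >= maxConcurrent || waiting.length === 0) {
      return;
    }
    active++;
    var job = waiting.shift();
    // Promise.resolve().then(...) also catches synchronous throws from the task.
    Promise.resolve()
      .then(job.task)
      .then(job.resolve, job.reject)
      .then(function () {
        active--;
        next();
      });
  }

  return function enqueue(task) {
    return new Promise(function (resolve, reject) {
      waiting.push({ task: task, resolve: resolve, reject: reject });
      next();
    });
  };
}

// Stand-in for a scrape that takes some time; tracks the peak concurrency.
var running = 0;
var peak = 0;
function fakeScrape() {
  running++;
  peak = Math.max(peak, running);
  return new Promise(function (resolve) {
    setTimeout(function () {
      running--;
      resolve('done');
    }, 20);
  });
}

var enqueue = createLimiter(2);
var allDone = Promise.all([1, 2, 3, 4, 5].map(function () {
  return enqueue(fakeScrape);
}));
allDone.then(function (results) {
  console.log(results.length + ' scrapes finished, peak concurrency ' + peak);
});
```

With a limit of 2, the fifth request only starts after two earlier batches have finished, which is exactly the waiting time described above.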
  13. AWS Lambda We already had experience using AWS

    Lambda in a few internal projects:
    • Amazon S3 event handlers.
    • Amazon SNS notification handlers.
    • Slack bots.
  14. What is Lambda? AWS Lambda is a service which runs,

    manages and automatically scales your code.
  15. AWS Lambda You can run Lambda:

    • As a handler for S3, SNS, CloudWatch and many other AWS events.
    • On a schedule using cron-like syntax.
    • As an HTTP request handler using API Gateway.
  16. AWS Lambda

    exports.handler = function(event, context, callback) {
        callback(null, 'hello');
    }

    handler - the main function which is called by AWS Lambda.
    event - AWS uses this parameter to pass the event data (e.g. POST data for API Gateway).
    context - AWS uses this parameter to pass runtime information about the invocation.
    callback - you can use this parameter to return the data.
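
As an illustration of how the callback reports both failures and results, here is a small sketch of a handler that validates its input. The url field of the event is an assumption made for this example, not a Lambda convention.

```javascript
// Sketch of a handler that validates its input before doing any work.
// The `url` field of the event is an assumption made for this example.
function handler(event, context, callback) {
  if (!event || typeof event.url !== 'string') {
    // The first callback argument reports a failure to Lambda.
    return callback(new Error('event.url is required'));
  }
  // The second callback argument is the successful result.
  callback(null, { requested: event.url });
}

exports.handler = handler;

// Local smoke test: call the handler directly with a fake event.
handler({ url: 'http://phantomjs.org/' }, {}, function (err, result) {
  console.log(err, result);
});
```

Because the handler is a plain function, it can be called this way from any script or test without deploying it first.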
  17. Testing Lambda Amazon doesn't provide any tools for testing lambdas

    in a local development environment. You can easily emulate the Lambda runtime using a simple Express.js server.
  18. Testing Lambda

    var express = require('express');
    var bodyParser = require('body-parser');
    var lambda = require('./lambda');

    var app = express();
    app.use(bodyParser.json());
    app.post('/', function(req, res) {
        lambda.handler(req.body, {}, function(err, result) {
            if (err) {
                return res.send(err);
            }
            res.send(result);
        });
    });
    app.listen(3000);
  19. Deploying First of all, to deploy your code on Amazon

    servers you need to build a deployment package. The deployment package is a ZIP archive which contains the application code and the node_modules directory if you have any dependencies.
  20. Deploying You can deploy the package using the web interface

    or using the awscli tool:

    aws lambda update-function-code \
        --region us-east-1 \
        --function-name lambda-phantom-scraper \
        --zip-file fileb://$PWD/lambda-phantom-scraper.zip
  21. Tweaking PhantomJS We successfully moved our code to Lambda and

    stopped seeing stability or performance issues, but we were still sometimes getting inaccurate results for some websites.
  22. Tweaking PhantomJS: issue 1 By default, Node.js lets a child process started with execFile

    write at most 200 KB to stdout and kills it if the output is larger, which can cut off large pages. Use maxBuffer to increase the buffer size when you create a child process:

    var phantom = childProcess.execFile(phantomJsPath, childArgs, {
        env: {URL: url},
        maxBuffer: 2048*1024
    });
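
The effect of the limit can be reproduced without PhantomJS. In this sketch the Node.js binary stands in for phantomjs and prints 300 KB to stdout, and spawnSync is used instead of execFile only to keep the example short; both accept the same maxBuffer option. The helper name capture is hypothetical.

```javascript
var spawnSync = require('child_process').spawnSync;

// Stand-in for PhantomJS: a child program that prints 300 KB to stdout.
var printLargePage = "process.stdout.write('x'.repeat(300 * 1024));";

// Hypothetical helper: runs the child with the given stdout limit in bytes.
function capture(maxBuffer) {
  return spawnSync(process.execPath, ['-e', printLargePage], {
    encoding: 'utf8',
    maxBuffer: maxBuffer
  });
}

var tooSmall = capture(200 * 1024);      // the 200 KB limit from the slide
var largeEnough = capture(2048 * 1024);  // the 2 MB value used in the talk

console.log('200 KB limit failed:', Boolean(tooSmall.error));
console.log('2 MB limit captured bytes:', largeEnough.stdout.length);
```

With the smaller limit the result carries an error and the output is incomplete, while the larger limit returns the whole page.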
  23. Tweaking PhantomJS: issue 2 In PhantomJS the default viewport size

    is 400 by 300 pixels, which makes some websites think that our scraper is a mobile browser. You can change it using the viewportSize property:

    page.viewportSize = {width: 1366, height: 768};
  24. Tweaking PhantomJS: issue 3 The default PhantomJS user agent is

    Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1. Change it to a user agent from a real browser:

    page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36';
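
The viewport and user agent tweaks can be grouped into one helper that is called before page.open. This is only a sketch: configurePage is a hypothetical name, and a plain object stands in for the PhantomJS page so that the snippet also runs under Node.js.

```javascript
// Hypothetical helper that applies both tweaks to a PhantomJS page object.
function configurePage(page) {
  // Desktop-sized viewport so sites do not serve a mobile layout.
  page.viewportSize = { width: 1366, height: 768 };
  // User agent string copied from a real desktop browser.
  page.settings.userAgent =
    'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) ' +
    'Chrome/50.0.2661.94 Safari/537.36';
  return page;
}

// Inside a PhantomJS script this would be require('webpage').create();
// a plain object stands in here so the snippet also runs under Node.js.
var page = configurePage({ settings: {} });
console.log(page.viewportSize.width + 'x' + page.viewportSize.height);
```

In a PhantomJS script, the helper would be applied to the page right after it is created and before the first page.open call, so that the settings are in place when the request is made.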
  25. Debugging Lambda Amazon doesn’t provide any way to debug your

    code. You have to use console.log, as in the early days of JavaScript; the output is written to CloudWatch Logs.
  26. Lambda Limits The maximum number of concurrent executions is 100

    by default. You need to ask AWS support if you want to increase the limit.
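
A rough way to check whether the default limit is enough is the common rule of thumb that concurrency is about requests per second multiplied by the average duration in seconds. The sketch below assumes this rule; the traffic figure of 30 requests per second is hypothetical, and the 4-second duration is the average page load time quoted later in the talk.

```javascript
// Rule of thumb: concurrency ~= requests per second * average duration in seconds.
function estimateConcurrency(requestsPerSecond, avgDurationSeconds) {
  return Math.ceil(requestsPerSecond * avgDurationSeconds);
}

var DEFAULT_LIMIT = 100; // default limit quoted in the talk

// Hypothetical traffic: 30 requests per second, 4 seconds per page.
var needed = estimateConcurrency(30, 4);
console.log(needed + ' concurrent executions needed, limit exceeded: ' +
  (needed > DEFAULT_LIMIT));
```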
  27. Lambda cold start The execution time is usually 1

    to 4 seconds longer for a cold start, the moment when AWS internally spins up a new “container” for your application. It happens, for example, when the function hasn’t been called in a while or when AWS needs to scale your Lambda function. There is no solution for this problem at the moment.
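
Cold starts cannot be avoided this way, but they can at least be made visible in the logs. The sketch below relies on the assumption that module-level code runs once per container and that its state survives between invocations served by the same container; the isColdStart flag is a hypothetical name.

```javascript
// Module-level code runs once per container, so this flag is true
// only for the first invocation after a cold start.
var isColdStart = true;

function handler(event, context, callback) {
  var cold = isColdStart;
  isColdStart = false;
  console.log(cold ? 'cold start' : 'warm start');
  callback(null, { coldStart: cold });
}

exports.handler = handler;
```

Comparing the duration of invocations logged as cold and warm gives a measurement of the overhead for a particular function.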
  28. How much does it cost? The Lambda cost depends on

    3 major factors:
    • Number of executions
    • Amount of allocated memory
    • Execution time
  29. How much does it cost?

    128MB - $0.000000208 per 100ms
    512MB - $0.000000834 per 100ms
    1024MB - $0.000001667 per 100ms
    1536MB - $0.000002501 per 100ms
    Full pricing table: https://aws.amazon.com/lambda/pricing/
  30. How much does it cost? We configured our Lambda function

    to use 1GB of memory. On average, a website takes about 4 seconds to load, which is 40 billing units of 100ms. For 100000 calls we paid: 0.000001667 * 40 * 100000 = $6.668
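
The same arithmetic can be wrapped in a small calculator. This is a sketch under stated assumptions: the prices are the 2016 figures from the talk, duration is assumed to be billed in 100ms units rounded up, and the per-request fee and the free tier are ignored. The names PRICE_PER_100MS and estimateCost are hypothetical.

```javascript
// Price per 100 ms for each memory size, from the table in the talk (2016 prices).
var PRICE_PER_100MS = {
  128: 0.000000208,
  512: 0.000000834,
  1024: 0.000001667,
  1536: 0.000002501
};

// Compute charge only; the per-request fee and the free tier are ignored.
function estimateCost(memoryMb, avgDurationMs, calls) {
  var price = PRICE_PER_100MS[memoryMb];
  if (price === undefined) {
    throw new Error('no price known for ' + memoryMb + ' MB');
  }
  // Assumption: duration is billed in 100 ms units, rounded up.
  var billedUnits = Math.ceil(avgDurationMs / 100);
  return price * billedUnits * calls;
}

// 1 GB of memory, 4 seconds per page, 100000 calls.
console.log('$' + estimateCost(1024, 4000, 100000).toFixed(3));
```

Changing the memory size or the average duration shows how quickly the bill moves, for example when a slow website takes longer than the 4-second average.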
  31. Lambda Pros

    • Don’t pay for it if you don’t use it
    • Easy to manage and deploy
    • Scales automatically