Slide 1

Slide 1 text

Scraping the Web with AWS Lambda and PhantomJS
Artem Krylysov

Slide 2

Slide 2 text

A look at the experience of creating a scalable website scraper and the many iterations of technology we went through to achieve our final product.

Slide 3

Slide 3 text

blockbust.io

Slide 4

Slide 4 text

blockbust.io Blockbust is a service which scans a website and helps to identify if any of its HTML elements may be blocked by an ad blocker like Adblock, Adblock Plus, uBlock or many others. Blockbust fetches the content of the web page and checks it against the list of known rules.

Slide 5

Slide 5 text

blockbust.io

Slide 6

Slide 6 text

First attempt

Slide 7

Slide 7 text

First attempt The Blockbust backend is written in Go, so our first decision was to take the easiest and fastest route: http.Get. http.Get makes an HTTP request to a web server and lets you read the response.

Slide 8

Slide 8 text

First attempt It was very fast but didn't work well. Almost all modern websites use JavaScript to generate some additional content, and a plain HTTP request never sees that content.

Slide 9

Slide 9 text

First attempt For example, for facebook.com, the size of the initial HTML page returned by the server is 300 KB. If you open the same page in a browser and wait until it’s completely loaded, the size of the HTML content doubles to 600 KB.

Slide 10

Slide 10 text

PhantomJS

Slide 11

Slide 11 text

PhantomJS PhantomJS is a headless WebKit-based browser. It provides a JavaScript API for navigating web pages, interacting with the DOM and taking screenshots. http://phantomjs.org/

Slide 12

Slide 12 text

PhantomJS
var page = require('webpage').create();
page.open('http://phantomjs.org/documentation/', function(status) {
  if (status === 'success') {
    var title = page.evaluate(function() {
      return document.title;
    });
    console.log('Title: ' + title);
    phantom.exit(0);
  } else {
    phantom.exit(1);
  }
});

Slide 13

Slide 13 text

PhantomJS
$ phantomjs example1.js
Title: Documentation | PhantomJS

Slide 14

Slide 14 text

PhantomJS PhantomJS is not a Node.js module, it is a standalone application. If you want to use it from your Node.js script you have to execute the phantomjs binary as a child process. You can communicate with it using stdin and stdout.

Slide 15

Slide 15 text

PhantomJS We use the phantomjs-prebuilt package, which provides an easy way to install the PhantomJS binaries with npm and use them from Node.js. https://github.com/Medium/phantomjs
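The child-process approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: runPhantom is a hypothetical helper name, and the binary path and arguments are supplied by the caller.

```javascript
var childProcess = require('child_process');

// Hypothetical helper: runs a binary (for example phantomjs) as a child
// process and passes whatever it wrote to stdout to the callback.
function runPhantom(binaryPath, args, callback) {
  childProcess.execFile(binaryPath, args, function(err, stdout, stderr) {
    if (err) {
      return callback(err);
    }
    callback(null, stdout.trim());
  });
}
```

With phantomjs-prebuilt, binaryPath would typically be require('phantomjs-prebuilt').path, and args would hold the path to a PhantomJS script such as example1.js.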

Slide 16

Slide 16 text

Docker and ECS

Slide 17

Slide 17 text

Docker and ECS ECS is a service for running and managing Docker containers on an EC2 cluster.

Slide 18

Slide 18 text

Docker and ECS ECS didn’t provide a way to automatically scale the number of instances of a given Docker container. You could place your EC2 container cluster into an Auto Scaling group, and it would add a new EC2 instance to the cluster, but it wouldn’t launch a new Docker container.

Slide 19

Slide 19 text

Queue

Slide 20

Slide 20 text

Queue Our next option was to add a queue in front of the scraper and limit the number of concurrent requests. In that case, if we had a huge spike in traffic, users would have to wait in line for their results, and it might take a long time. We wanted to avoid that kind of bad user experience.
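To make the trade-off concrete, here is a minimal sketch of a queue that caps concurrency. It is an illustration only, assuming an in-memory queue inside a single process; createQueue and its methods are hypothetical names, not part of the Blockbust code.

```javascript
// Hypothetical in-memory queue that runs at most `limit` tasks at once.
// Each task receives a `done` function and must call it when finished.
function createQueue(limit) {
  var running = 0;
  var pending = [];

  function next() {
    // Start waiting tasks while there is a free slot.
    while (running < limit && pending.length > 0) {
      var task = pending.shift();
      running++;
      task(function done() {
        running--;
        next();
      });
    }
  }

  return {
    push: function(task) {
      pending.push(task);
      next();
    },
    stats: function() {
      return {running: running, pending: pending.length};
    }
  };
}
```

With a limit of 2, a third request stays in pending until one of the first two finishes, which is exactly the waiting time users would experience during a traffic spike.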

Slide 21

Slide 21 text

AWS Lambda

Slide 22

Slide 22 text

AWS Lambda We already had experience using AWS Lambda in a few internal projects:
● Amazon S3 event handlers.
● Amazon SNS notification handlers.
● Slack bots.

Slide 23

Slide 23 text

What is Lambda? AWS Lambda is a service which runs, manages and automatically scales your code.

Slide 24

Slide 24 text

AWS Lambda Supports Node.js, Python and Java.

Slide 25

Slide 25 text

AWS Lambda You can run Lambda:
● As a handler for S3, SNS, CloudWatch and many other AWS events.
● On a schedule using cron-like syntax.
● As an HTTP request handler using API Gateway.

Slide 26

Slide 26 text

AWS Lambda
exports.handler = function(event, context, callback) {
  callback(null, 'hello');
};
handler - the main function which is called by AWS Lambda.
event - AWS uses this parameter to pass the event data (e.g. POST data for API Gateway).
callback - you can use this parameter to return the data.
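As a slightly fuller illustration of how event and callback work together, here is a sketch of a handler that reads a field from the event and reports an error when it is missing. The event.url field and the error message are assumptions made for this example; in a real Lambda module the function would be assigned to exports.handler.

```javascript
// Hypothetical handler: expects event.url (an assumed field name) and
// returns an error through the callback when it is missing.
function handler(event, context, callback) {
  if (!event || typeof event.url !== 'string') {
    return callback(new Error('url is required'));
  }
  callback(null, {url: event.url});
}

// Direct invocation with an empty context object.
handler({url: 'http://example.com/'}, {}, function(err, result) {
  console.log(err ? 'error: ' + err.message : 'result: ' + JSON.stringify(result));
});
```

The first callback argument carries the error, the second carries the result, following the usual Node.js callback convention.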

Slide 27

Slide 27 text

Testing Lambda

Slide 28

Slide 28 text

Testing Lambda Amazon doesn't provide any tools for testing Lambda functions in a local development environment. You can easily emulate the Lambda runtime using a simple Express.js server.

Slide 29

Slide 29 text

Testing Lambda
var express = require('express');
var bodyParser = require('body-parser');
var lambda = require('./lambda');

var app = express();
app.use(bodyParser.json());
app.post('/', function(req, res) {
  lambda.handler(req.body, {}, function(err, result) {
    if (err) {
      return res.send(err);
    }
    res.send(result);
  });
});
app.listen(3000);

Slide 30

Slide 30 text

Deploying

Slide 31

Slide 31 text

Deploying First of all, to deploy your code on Amazon servers you need to build a deployment package. The deployment package is a ZIP archive which contains the application code and the node_modules directory if you have any dependencies.

Slide 32

Slide 32 text

Deploying You can deploy the package using the web interface or the awscli tool:
aws lambda update-function-code \
  --region us-east-1 \
  --function-name lambda-phantom-scraper \
  --zip-file fileb://$PWD/lambda-phantom-scraper.zip

Slide 33

Slide 33 text

Tweaking PhantomJS

Slide 34

Slide 34 text

Tweaking PhantomJS We successfully moved our code to Lambda and stopped seeing stability or performance issues, but we were still sometimes getting inaccurate results for some websites.

Slide 35

Slide 35 text

Tweaking PhantomJS: issue 1 The default stdout buffer size for a child process in Node.js is 200 KB, which is smaller than many fully rendered pages. Use maxBuffer to increase the buffer size when you create a child process:
var phantom = childProcess.execFile(phantomJsPath, childArgs, {
  env: {URL: url},
  maxBuffer: 2048*1024
});
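The effect of the limit can be reproduced without PhantomJS. The sketch below assumes a child Node.js process standing in for phantomjs and printing a large page; captureOutput is a hypothetical helper written for this illustration.

```javascript
var childProcess = require('child_process');

// Hypothetical helper: spawns a child Node.js process that writes `bytes`
// characters to stdout and reports how much output the parent received.
function captureOutput(bytes, maxBuffer, callback) {
  var script = 'process.stdout.write("x".repeat(' + bytes + '))';
  childProcess.execFile(process.execPath, ['-e', script], {maxBuffer: maxBuffer},
    function(err, stdout) {
      callback(err, stdout ? stdout.length : 0);
    });
}

// 300 KB of output against a 200 KB buffer: the callback receives an error.
captureOutput(300 * 1024, 200 * 1024, function(err) {
  console.log('small buffer failed:', Boolean(err));
});

// The same output against a 2 MB buffer is captured in full.
captureOutput(300 * 1024, 2048 * 1024, function(err, length) {
  console.log('large buffer captured bytes:', length);
});
```

When the output exceeds maxBuffer, Node.js terminates the child and passes an error to the callback, so the scraper loses the page instead of receiving a truncated one silently.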

Slide 36

Slide 36 text

Tweaking PhantomJS: issue 2 In PhantomJS the default screen size is 400 by 300 pixels, which makes some websites think that our scraper is a mobile browser. You can change the screen size using the viewportSize property:
page.viewportSize = {width: 1366, height: 768};

Slide 37

Slide 37 text

Tweaking PhantomJS: issue 3 The default PhantomJS user agent is:
Mozilla/5.0 (Macintosh; Intel Mac OS X) AppleWebKit/538.1 (KHTML, like Gecko) PhantomJS/2.1.1 Safari/538.1
Change it to a user agent from a real browser:
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36';

Slide 38

Slide 38 text

Lambda limitations

Slide 39

Slide 39 text

Debugging Lambda Amazon doesn’t provide any way to debug your code. You have to use console.log, like in the early JavaScript days.

Slide 40

Slide 40 text

Lambda Limits The maximum number of concurrent executions is 100 by default. You need to ask AWS support if you want to increase the limit.

Slide 41

Slide 41 text

Lambda cold start The execution time is usually 1 to 4 seconds longer for a cold start, the moment when AWS internally spins up a new “container” for your application. It happens, for example, when the function hasn’t been called in a while or when AWS needs to scale your Lambda function. There is no solution for this problem at the moment.

Slide 42

Slide 42 text

How much does it cost?

Slide 43

Slide 43 text

How much does it cost? The Lambda cost depends on 3 major factors:
● Number of executions
● Amount of allocated memory
● Execution time

Slide 44

Slide 44 text

How much does it cost?
128MB - $0.000000208 per 100ms
512MB - $0.000000834 per 100ms
1024MB - $0.000001667 per 100ms
1536MB - $0.000002501 per 100ms
Full pricing table: https://aws.amazon.com/lambda/pricing/

Slide 45

Slide 45 text

How much does it cost? We configured our Lambda function to use 1GB of memory. On average, a website takes about 4 seconds to load, which is 40 billing units of 100ms. For 100000 calls we paid:
0.000001667 * 40 * 100000 = $6.668
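The same arithmetic can be written as a small function. This is a sketch that covers only the duration charge shown above and assumes billing in 100ms units rounded up; lambdaDurationCost is a hypothetical name.

```javascript
// Duration charge only: price per 100ms unit * billed units * number of calls.
// Rounding up to whole 100ms units is an assumption of this sketch.
function lambdaDurationCost(pricePer100ms, durationMs, invocations) {
  var billedUnits = Math.ceil(durationMs / 100);
  return pricePer100ms * billedUnits * invocations;
}

// 1024MB price, 4 seconds per call, 100000 calls.
console.log(lambdaDurationCost(0.000001667, 4000, 100000).toFixed(3)); // prints 6.668
```

Plugging in a different memory tier from the price list only changes the first argument.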

Slide 46

Slide 46 text

Conclusions

Slide 47

Slide 47 text

Lambda Cons
● Cold start latency
● Debugging

Slide 48

Slide 48 text

Lambda Pros
● Don’t pay for it if you don’t use it
● Easy to manage and deploy
● Scales automatically

Slide 49

Slide 49 text

Questions?

Slide 50

Slide 50 text

Thanks! https://github.com/akrylysov/lambda-phantom-scraper