Advanced Web Scraping - Speaker Deck

Slide 1

Slide 1 text

Advanced Web Scraping or How To Make Internet Your Database by @estevecastells & @NachoMascort

Slide 2

Slide 2 text

I’m Esteve Castells International SEO Specialist @ Softonic You can find me on @estevecastells https://estevecastells.com/ Newsletter: http://bit.ly/Seopatia Hi!

Slide 3

Slide 3 text

Hi! I’m Nacho Mascort SEO Manager @ Grupo Planeta You can find me on: @NachoMascort https://seohacks.es You can see my scripts on: https://github.com/NachoSEO

Slide 4

Slide 4 text

What are we gonna see? 1. What is Web Scraping? 2. Myths about Web Scraping 3. Main use cases a. In our website b. In external websites 4. Understanding the DOM 5. Extraction methods 6. Web Scraping Tools 7. Web Scraping with Python 8. Tips 9. Case studies 10. Bonus by @estevecastells & @NachoMascort

Slide 5

Slide 5 text

1. What is Web Scraping?

Slide 6

Slide 6 text

1.1 What is Web Scraping? The scraping or web scraping, is a technique with which through software, information or content is extracted from a website. There are simple 'scrapers' that parse the HTML of a website, to browsers that render JS and perform complex navigation and extraction tasks.

Slide 7

Slide 7 text

1.2 What are the use cases for Web Scraping? The uses of scraping are infinite, only limited by your creativity and the legality of your actions. The most basic uses can be to check changes in your own or a competitor's website, even to create dynamic websites based on multiple data sources.

Slide 8

Slide 8 text

2. Myths about Web Scraping

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

3. Main use cases

Slide 13

Slide 13 text

3.1 Main use cases in our websites Checking the Value of Certain HTML Tags ➜ Are all elements as defined in our documentation? ○ Deployment checks ➜ Are we sending conflicting signals? ○ HTTP Headers ○ Sitemaps vs goals ○ Duplicity of HTML tags ○ Incorrect label location ➜ Disappearance of HTML tags

Slide 14

Slide 14 text

3.2 Main use cases in external websites ● Automate processes: what a human would do and you can save money ○ Visual changes ● Are you adding new features? ○ Changes in HTML (goals, etc.) ● Are you adding new Schema tagging or changing your indexing strategy? ○ Content changes ● Do you update/cure your content? ○ Monitor ranking changes in Google

Slide 15

Slide 15 text

4. Understanding the DOM

Slide 16

Slide 16 text

DOCUMENT OBJECT MODEL

Slide 17

Slide 17 text

4.1 Document Object Model What is it? It is the structural representation of a document.

Slide 18

Slide 18 text

Defines the hierarchy of each element within each page.

Slide 19

Slide 19 text

Depending on its position a tag can be: ● Child ● Parent ● Sibiling

Slide 20

Slide 20 text

4.1 Document Object Model Components of a website? Our browser makes a get request to the server and it returns several files that the browser renders. These files are usually: ➜ HTML ➜ CSS ➜ JS ➜ Images ➜ ...

Slide 21

Slide 21 text

4.2 Código fuente vs DOM They're two different things. You can consult any HTML of a site by typing in the browser bar: view-source: https://www.domain.tld/path *With CSS and JS it is not necessary because the browser does not render them ** Ctrl / Cmd + u

Slide 22

Slide 22 text

What’s the source code?

Slide 23

Slide 23 text

What’s the source code? >>> view-source:

Slide 24

Slide 24 text

4.2 Source code vs DOM No JS has been executed in the source code. Depending on the behavior of the JS you may obtain "false" data.

Slide 25

Slide 25 text

4.2 Source code vs DOM If the source code doesn't work, what do we do? We can "see an approximation" to the DOM in the "Elements" tab of the Chrome developer tools (and any other browser).

Slide 26

Slide 26 text

4.2 Source code vs DOM

Slide 27

Slide 27 text

4.2 Source code vs DOM Or pressing F12 Shortcuts are cooler!

Slide 28

Slide 28 text

What's on the DOM? >>> F12

Slide 29

Slide 29 text

We can see JS changes in real time

Slide 30

Slide 30 text

4.3 Google, what do you see? Experiment from a little over a year ago: The idea is to modify the Meta Robots tag (via JS) of a URL to deindex the page and see if Google pays attention to the value found in the source code or in the DOM. URL to experiment with: https://seohacks.es/dashboard/

Slide 31

Slide 31 text

4.3 Google, what do you see? The following code is added: jQuery('meta[name="robots"]').remove(); var meta = document.createElement('meta'); meta.name = 'robots'; meta.content = 'noindex, follow'; jQuery('head').append(meta);

Slide 32

Slide 32 text

4.3 Google, what do you see? What he does is: 1. Delete the current meta tag robots

Slide 33

Slide 33 text

4.3 Google, what do you see? What he does is: 1. Delete the current meta tag robots 2. Creates a variable called "meta" that stores the creation of a "meta" type element (worth the redundancy)

Slide 34

Slide 34 text

4.3 Google, what do you see? What he does is: 1. Delete the current meta tag robots 2. Creates a variable called "meta" that stores the creation of a "meta" type element (worth the redundancy) 3. It adds the attributes "name" with value "robots" and "content" with value "noindex, follow".

Slide 35

Slide 35 text

Slide 36

Slide 36 text

4.3 Google, what do you see? Transforms this: In this:

Slide 37

Slide 37 text

4.3 Result DEINDEXED

Slide 38

Slide 38 text

More data https://www.searchviu.com/en/javascript-canonical-tags/

Slide 39

Slide 39 text

5. Methods of extraction

Slide 40

Slide 40 text

5. Methods of extraction We can extract the information from each document using different models that are quite similar to each other.

Slide 41

Slide 41 text

5. Methods of extraction We can extract the information from each document using different models that are quite similar to each other. These are the ones: ➜ Xpath ➜ CSS Selectors ➜ Others such as regex or specific tool selectors

Slide 42

Slide 42 text

5.1 Xpath Use path expressions to define a node or nodes within a document We can get them: ➜ Writing them ourselves ➜ Through developer tools within a browser

Slide 43

Slide 43 text

5.1.1 Xpath Syntax The writing standard is as follows: //tag[@attribute=’value’]

Slide 44

Slide 44 text

5.1.1 Xpath Syntax The writing standard is as follows: //tag[@attribute=’value’] For this tag:

Slide 45

Slide 45 text

5.1.1 Xpath Syntax //tag[@attribute=’value’] For this tag: ➜ Tag: input

Slide 46

Slide 46 text

5.1.1 Xpath Syntax //tag[@attribute=’value’] For this tag: ➜ Tag: input ➜ Attributes: ○ Id ○ Type ○ Value

Slide 47

Slide 47 text

5.1.1 Xpath Syntax //tag[@attribute=’value’] For this tag: ➜ Tag: input ➜ Attributes: ○ Id = seoplus ○ Type = submit ○ Value = Log In

Slide 48

Slide 48 text

5.1.1 Xpath Syntax //input[@id=’seoplus’]

Slide 49

Slide 49 text

5.1.2 Dev Tools

Slide 50

Slide 50 text

5.1.2 Dev Tools

Slide 51

Slide 51 text

5.2 CSS Selectors As its name suggests, these are the same selectors we use to write CSS. We can get them: ➜ Writing them ourselves with the same syntax as modifying the styles of a site ➜ Through developer tools within a browser *tip: to select a label we can use the xpath syntax and remove the @ from the attribute

Slide 52

Slide 52 text

5.2.1 Dev Tools

Slide 53

Slide 53 text

5.3 Xpath vs CSS Xpath CSS Direct Child //div/a div > a Child o Subchild //div//a div a ID //div[@id=”example”] #example Class //div[@clase=”example”] .example Attributes //input[@name='username'] input[name='user name'] https://saucelabs.com/resources/articles/selenium-tips-css-selectors

Slide 54

Slide 54 text

5.4 Others We can access certain nodes of the DOM by other methods such as: ➜ Regex ➜ Specific selectors of python libraries ➜ Adhoc tools

Slide 55

Slide 55 text

6. Web Scraping Tools

Slide 56

Slide 56 text

Some of the tens of tools that exist for Web Scraping Plugins Tools Scraper Jason The Miner Here there are more than 30 if you didn’t like these ones. https://www.octoparse.com/blog/top-30-free-web-scraping-software/

Slide 57

Slide 57 text

From basic tools or plugins that we can use to do basic scrapings, in some cases to get data out faster without having to pull out Python or JS to 'advanced' tools. ➜ Scraper ➜ Screaming Frog ➜ Google Sheets ➜ Grepsr 6.1 Web Scraping Tools

Slide 58

Slide 58 text

Scraper is a Google Chrome plugin that you can use to make small scrapings of elements in a minimally well-structured HTML. It is also useful to remove the XPath when sometimes Google Chrome Dev Tools does not remove it well to use it in other tools. As a plus, it works like Google Chrome Dev Tools, on the DOM 6.1.1 Scraper

Slide 59

Slide 59 text

1. Doble click in the element we want to pull 2. Click on Scrape Similar 3. Done! 6.1.1 Scraper

Slide 60

Slide 60 text

If the elements are well structured, we can get everything pulled extremely easily, without the need to use external programs or programming. 6.1.1 Scraper

Slide 61

Slide 61 text

6.1.1 Scraper Here we have the Xpath

Slide 62

Slide 62 text

6.1.1 Scraper List of elements that we are going to take out. Supports multiple columns

Slide 63

Slide 63 text

6.1.1 Scraper Easily expor to Excel (copypaste)

Slide 64

Slide 64 text

6.1.1 Scraper Or to GDocs (one-click)

Slide 65

Slide 65 text

Screaming Frog is one of the SEO tools par excellence, which can also be used for basic (and even advanced) scraping. As a crawler you can use Text only (pure HTML) or JS rendering, if your website uses client-side rendering. Its extraction mode is simple but with it you can get much of what you need to do, for the other you can use Python or other tools. 6.1.2 Screaming Frog

Slide 66

Slide 66 text

6.1.2 Screaming Frog Configuration > Custom > Extraction

Slide 67

Slide 67 text

6.1.2 Screaming Frog We have various modes - CSS path (CSS selector) - XPath (the main we will use) - Regex

Slide 68

Slide 68 text

6.1.2 Screaming Frog We have up to 10 selectors, which will generally be sufficient. Otherwise, we will have to use Excel with the SEARCHV function to join two or more scrapings.

Slide 69

Slide 69 text

6.1.2 Screaming Frog We will then have to decide whether we want to extract the content into HTML, text only or the entire HTML element

Slide 70

Slide 70 text

6.1.2 Screaming Frog Once we have all the extractors set, we just have to run it, either in crawler mode or ready mode with a sitemap.

Slide 71

Slide 71 text

6.1.2 Screaming Frog Once we have everything configured perfectly (sometimes we will have to test the correct XPath several times), we can leave it crawling and export the data obtained.

Slide 72

Slide 72 text

6.1.2 Screaming Frog Some of the most common uses are, both on original websites and competitors. ➜ Monitor changes/lost data in a deploy ➜ Monitor weekly changes in web content ➜ Check quantity increase or decrease or content/thin content ratios The limit of scraping with Screaming Frog. You can do 99% of the things you want to do and with JS-rendering made easy!

Slide 73

Slide 73 text

6.1.2 Screaming Frog Cutre tip: A 'cutre' use case for removing all URLs from a sitemap index is to import the entire list and then clean it up with Excel. In case you don't (yet) know how to use Python. 1. Go to Download Sitemap index 2. Put the URL of the sitemap index

Slide 74

Slide 74 text

6.1.2 Screaming Frog 3. Wait for all the sitemaps to download (can take mins) 4. Select all, copy paste to Excel

Slide 75

Slide 75 text

6.1.2 Screaming Frog Then we replace "Found " and we'll have all the clean URLs of a sitemap index. In this way we can then clean and pull results by URL patterns those that are interesting to us. Ex: a category, a page type, containing X word in the URL, etc. That way we can segment even more our scraping from either our website or a competitor's website.

Slide 76

Slide 76 text

6.1.3 Cloud version: FandangoSEO If you need to run intensive crawls of millions of pages with pagetype segmentation, with FandangoSEO you can set interesting XPaths with content extraction, count and exists.

Slide 77

Slide 77 text

6.1.4 Google Sheets With Google Sheets we can also import most elements of a web page, from HTML to JSON with a small external script. ➜ Pro's: ○ It imports HTML, CSV, TSV, XML, JSON and RSS. ○ Hosted in the cloud ○ Free and for the whole family ○ Easy to use with familiar functions ➜ Con’s: ○ It gets caught easily and usually takes thousands of rows to process

Slide 78

Slide 78 text

6.1.4 Google Sheets ➜ Easily import feeds to create your own Feedly or news aggregator

Slide 79

Slide 79 text

6.1.5 Grepsr Grepsr is a tool that is based on an extension that facilitates visual extraction, and also offers data export in CSV or API (json) format.

Slide 80

Slide 80 text

First of all we will install the extension in Chrome and run it, loading the desired page to scrape. 6.1.5 Grepsr

Slide 81

Slide 81 text

Then, click on 'Select' and select the exact element you want, by hovering with the mouse you can refine it. 6.1.5 Grepsr

Slide 82

Slide 82 text

Once selected, we will have marked the element and if it is well structured HTML, it will be very easy without having to pull XPath or CSS selectors. 6.1.5 Grepsr

Slide 83

Slide 83 text

Once selected all our fields, we will proceed to save them by clicking on “Next”, we can name each field and extract it in text form or extract the CSS class itself. 6.1.5 Grepsr

Slide 84

Slide 84 text

Finally, we can add pagination for each of our fields, if required, either in HTML with a next link, or if you have load more or infinite scroll (ajax). 6.1.5 Grepsr

Slide 85

Slide 85 text

6.1.5 Grepsr To select the pagination, we will follow the same process as with the elements to scrape. (Optional part, not everything requires pagination)

Slide 86

Slide 86 text

6.1.5 Grepsr Finally, we can also configure a login if necessary, as well as additional fields that are close to the extracted field (images, goals, etc.).

Slide 87

Slide 87 text

6.1.5 Grepsr Finally, we will have the data in both JSON and CSV formats. However, we will need a (free) Grepsr account to export them!

Slide 88

Slide 88 text

7. Web Scraping with Python

Slide 89

Slide 89 text

No content

Slide 90

Slide 90 text

7 Why Python? ➜ It's a very simple language to understand ➜ Easy approach for those starting with programming ➜ Much growth and great community behind it ➜ Core uses for massive data analysis and with very powerful libraries behind it (not just scraping) ➜ We can work on the browser! ○ https://colab.research.google.com

Slide 91

Slide 91 text

7.1 Type of data To start scraping we must know at least these concepts to program in python: ➜ Variables ➜ Lists ➜ Integers, Floats, Strings, Boolean Values.... ➜ For ➜ Conditional ➜ Imports

Slide 92

Slide 92 text

7.2 Scrapping Libraries There are several but I will focus on two: ➜ Requests + BeautifulSoup: To scrape data from the source code of a site. Useful for sites with static data. ➜ Selenium: Tool to automate QA that can help us scrape sites with dynamic content whose values are in the DOM but not in the source code. Colab does not support selenium, we will have to work with Jupyter (or any IDE)

Slide 93

Slide 93 text

No content

Slide 94

Slide 94 text

With 5 lines of code (or less) you can see the parsed HTML

Slide 95

Slide 95 text

Accessing any element of parsed HTML is easy

Slide 96

Slide 96 text

We can create a data frame and process the information as desired

Slide 97

Slide 97 text

Or download it

Slide 98

Slide 98 text

No content

Slide 99

Slide 99 text

7.3 Process We analyze the HTML, looking for patterns We generate the script for an element or URL We extend it to affect all data

Slide 100

Slide 100 text

8. Tips

Slide 101

Slide 101 text

There are many websites that serve their pages on a User-agent basis. Sometimes you will be interested in being a desktop device, sometimes a mobile device. Sometimes a Windows, sometimes a Mac. Sometimes a Googlebot, sometimes a bingbot. Adapt each scraping to what you need to get the desired results! 8.1 User-agent

Slide 102

Slide 102 text

To scrape a website like Google with advanced security mechanisms, it will be necessary to use proxies, among other measures. Proxies act as an intermediary between a request made by an X computer and a Z server. In this way, we leave little trace when it comes to being identified. Depending on the website and number of requests we recommend using one quantity or another. Generally, more than one request per second from the same IP address is not recommended. 8.2 Proxies

Slide 103

Slide 103 text

Generally the use of proxies is more recommended than a VPN, since the VPN does the same thing but under a single IP. It is always advisable to use a VPN with another geo for any kind of tracking on third party websites, to avoid possible problems or identifications. Also, if you are caught by IP (e.g. Cloudflare) you will never be able to access the web again from that IP (if it is static). Recommended service: ExpressVPN 8.3 VPN’s

Slide 104

Slide 104 text

8.4 Concurrency Concurrency consists of limiting the number of requests a network can make per second. We are interested in limiting the requests we always make, in order to avoid saturating the server, be it ours or a competitor's. If we saturate the server, we will have to make the requests again or, depending on the case, start the whole crawling process again. Indicative numbers: ➜ Small websites: 5 req/sec - 5 threads ➜ Large websites: 20 req/sec - 20 threads

Slide 105

Slide 105 text

8.5 Data cleaning It is common that after a data scraping, we find data that does not fit what we need. Normally, we'll have to work on the data to clean it up. Some of the most common corrections: ➜ Duplicates ➜ Format correction/unification ➜ Spaces ➜ Strange characters ➜ Currencies

Slide 106

Slide 106 text

9. Case studies

Slide 107

Slide 107 text

9. Casos prácticos Here are 2 case studies: ➜ Using scraping to automate the curation of content listings ➜ Scraping to generate a product feed for our websites

Slide 108

Slide 108 text

Using scraping to automate the curation of content listings

Slide 109

Slide 109 text

9.1 Using scraping to automate the curation of content listings It can be firmly said that the best search engine at the moment is Google. What if we use Google's results to generate our own listings, based on the ranking (relevancy) that it gives to websites that position for what we want to position?

Slide 110

Slide 110 text

9.1.1 Jason The Miner To do so, we will use Jason The Miner, a scraping library made by Marc Mignonsin, Principal Software Engineer at Softonic (@mawrkus) at Github and (@crossrecursion) at Twitter

Slide 111

Slide 111 text

9.1.1 Jason The Miner Jason The Miner is a versatile and modular Node.js based library that can be adapted to any website and need.

Slide 112

Slide 112 text

9.1.1 Jason The Miner aa

Slide 113

Slide 113 text

9.1.2 Concept We launched a query as 'best washing machines'. We will enter the top 20-30 results, analyze HTML and extract the link ID from the Amazon links. Then we will do a count and we will be automatically validating based on dozens of websites which is the best washing machine.

Slide 114

Slide 114 text

9.1.2 Concept Then, we will have a list of IDs with their URL, which we can scrape directly from Google Play or using their API, and semi-automatically fill our CMS (WordPress, or whatever we have). This allows us to automate content research/curing and focus on delivering real value in what we write. Screenshot is an outcome based on Google Play Store

Slide 115

Slide 115 text

9.1.3 Action First of all we will generate the basis to create the URL, with our user-agent, as well as the language we are interested in.

Slide 116

Slide 116 text

9.1.3 Action Then we are going to generate a maximum concurrence so that Google does not ban our IP or skip captchas.

Slide 117

Slide 117 text

9.1.3 Action Finally, let's define exactly the flow of the crawler. If you need to enter links/websites, and what you need to extract from them.

Slide 118

Slide 118 text

9.1.3 Action Finally, we will transform the output into a.json file that we can use to upload to our CMS.

Slide 119

Slide 119 text

9.1.3 Action And we can even configure it to be automatically uploaded to the CMS once the processes are finished.

Slide 120

Slide 120 text

9.1.3 Acción What does Jason the Miner do? ➜ Load (HTTP, file, json, ....) ➜ Parse (HTML w/ CSS by default) ➜ Transform But this is ok, but we need to do it in bulk for tens or hundreds of cases, we cannot do it one by one.

Slide 121

Slide 121 text

9.1.3 Acción Added functionality to make it work in bulk ➜ Bulk (imported from a CSV) ➜ Load (HTTP, file, json, ....) ➜ Parse (HTML w/ CSS by default) ➜ Transform Creating a variable that would be the query we inserted in Google.

Slide 122

Slide 122 text

9.1.4 CMS Once we have all the data inserted in our CMS, we will have to execute another basic scraping process or with an API such as Amazon to get all the data of each product (logo, name, images, description, etc). Once we have everything, the lists will be sorted and we can add the editorial content we want, with very little manual work to do.

Slide 123

Slide 123 text

9.1.5 Ideas Examples in which it could be applied: ➜ Amazon Products ➜ Listings of restaurants that are on TripAdvisor ➜ Hotel Listings ➜ Netflix Movie Listings ➜ Best PS4 Games ➜ Better android apps ➜ Best Chromecast apps ➜ Best books

Slide 124

Slide 124 text

Scraping to generate a product feed for our websites

Slide 125

Slide 125 text

9.2 Starting point Website affiliated to Casa del Libro. We need to generate a product feed for each of our product pages.

Slide 126

Slide 126 text

9.2 Process We analyze the HTML, looking for patterns We generate the script for an element or URL We extend it to affect all data

Slide 127

Slide 127 text

9.2.0 What do we want to scrape off? We need the following information: ➜ Titles ➜ Author ➜ Editorial ➜ Prices *Only from the category of crime novel

Slide 128