
Advanced Web Scraping

Advanced Web Scraping presentation at the SEOplus event in Alicante with my friend Nacho Mascort

Esteve Castells

August 08, 2018

Transcript

  1. Advanced Web Scraping or How To Make Internet Your Database

    by @estevecastells & @NachoMascort
  2. I’m Esteve Castells International SEO Specialist @ Softonic You can

    find me on @estevecastells https://estevecastells.com/ Newsletter: http://bit.ly/Seopatia Hi!
  3. Hi! I’m Nacho Mascort SEO Manager @ Grupo Planeta You

    can find me on: @NachoMascort https://seohacks.es You can see my scripts on: https://github.com/NachoSEO
  4. What are we gonna see? 1. What is Web Scraping?

    2. Myths about Web Scraping 3. Main use cases a. In our website b. In external websites 4. Understanding the DOM 5. Extraction methods 6. Web Scraping Tools 7. Web Scraping with Python 8. Tips 9. Case studies 10. Bonus by @estevecastells & @NachoMascort
  5. 1. What is Web Scraping?

  6. 1.1 What is Web Scraping? Web scraping is a

    technique in which software extracts information or content from a website. Scrapers range from simple ones that parse a site's HTML to full browsers that render JS and perform complex navigation and extraction tasks.
  7. 1.2 What are the use cases for Web Scraping? The

    uses of scraping are infinite, limited only by your creativity and the legality of your actions. The most basic uses range from checking changes on your own or a competitor's website to creating dynamic websites based on multiple data sources.
  8. 2. Myths about Web Scraping

  9. None
  10. None
  11. None
  12. 3. Main use cases

  13. 3.1 Main use cases in our websites Checking the value

    of certain HTML tags ➜ Are all elements as defined in our documentation? ◦ Deployment checks ➜ Are we sending conflicting signals? ◦ HTTP headers ◦ Sitemaps vs meta tags ◦ Duplicated HTML tags ◦ Incorrectly placed tags ➜ Disappearance of HTML tags
  14. 3.2 Main use cases in external websites • Automate processes:

    what a human would do, so you can save money ◦ Visual changes • Are they adding new features? ◦ Changes in HTML (meta tags, etc.) • Are they adding new Schema markup or changing their indexing strategy? ◦ Content changes • Do they update/curate their content? ◦ Monitor ranking changes in Google
  15. 4. Understanding the DOM

  16. DOCUMENT OBJECT MODEL

  17. 4.1 Document Object Model What is it? It is the

    structural representation of a document.
  18. Defines the hierarchy of each element within each page.

  19. Depending on its position a tag can be: • Child

    • Parent • Sibling
  20. 4.1 Document Object Model Components of a website? Our browser

    makes a GET request to the server, which returns several files that the browser renders. These files are usually: ➜ HTML ➜ CSS ➜ JS ➜ Images ➜ ...
  21. 4.2 Source code vs DOM They're two different things. You

    can view the HTML of any site by typing in the browser bar: view-source:https://www.domain.tld/path *With CSS and JS this is not necessary, because the browser does not render them ** Ctrl / Cmd + U
  22. What’s the source code?

  23. What’s the source code? >>> view-source:

  24. 4.2 Source code vs DOM No JS has been executed

    in the source code. Depending on the behavior of the JS you may obtain "false" data.
  25. 4.2 Source code vs DOM If the source code doesn't

    work, what do we do? We can "see an approximation" to the DOM in the "Elements" tab of the Chrome developer tools (and any other browser).
  26. 4.2 Source code vs DOM

  27. 4.2 Source code vs DOM Or pressing F12 Shortcuts are

    cooler!
  28. What's on the DOM? >>> F12

  29. We can see JS changes in real time

  30. 4.3 Google, what do you see? Experiment from a little

    over a year ago: The idea is to modify the Meta Robots tag (via JS) of a URL to deindex the page and see if Google pays attention to the value found in the source code or in the DOM. URL to experiment with: https://seohacks.es/dashboard/
  31. 4.3 Google, what do you see? The following code is

    added:
    <script>
      jQuery('meta[name="robots"]').remove();
      var meta = document.createElement('meta');
      meta.name = 'robots';
      meta.content = 'noindex, follow';
      jQuery('head').append(meta);
    </script>
  32. 4.3 Google, what do you see? What it does is:

    1. Deletes the current robots meta tag
  33. 4.3 Google, what do you see? What it does is:

    1. Deletes the current robots meta tag 2. Creates a variable called "meta" that stores a newly created "meta" element (redundant as it sounds)
  34. 4.3 Google, what do you see? What it does is:

    1. Deletes the current robots meta tag 2. Creates a variable called "meta" that stores a newly created "meta" element (redundant as it sounds) 3. Adds the attributes "name" with value "robots" and "content" with value "noindex, follow"
  35. 4.3 Google, what do you see? What it does is:

    1. Deletes the current robots meta tag 2. Creates a variable called "meta" that stores a newly created "meta" element (redundant as it sounds) 3. Adds the attributes "name" with value "robots" and "content" with value "noindex, follow" 4. Appends to the head the meta variable that contains the tag with the values that cause deindexation
  36. 4.3 Google, what do you see? Transforms this: Into this:

  37. 4.3 Result DEINDEXED

  38. More data https://www.searchviu.com/en/javascript-canonical-tags/

  39. 5. Methods of extraction

  40. 5. Methods of extraction We can extract the information from

    each document using different methods that are quite similar to each other.
  41. 5. Methods of extraction We can extract the information from

    each document using different methods that are quite similar to each other. These are: ➜ XPath ➜ CSS selectors ➜ Others, such as regex or tool-specific selectors
  42. 5.1 Xpath Use path expressions to define a node or

    nodes within a document We can get them: ➜ Writing them ourselves ➜ Through developer tools within a browser
  43. 5.1.1 Xpath Syntax The writing standard is as follows: //tag[@attribute=’value’]

  44. 5.1.1 Xpath Syntax The writing standard is as follows: //tag[@attribute=’value’]

    For this tag: <input id=”seoplus” type=”submit” value=”Log In”/>
  45. 5.1.1 Xpath Syntax //tag[@attribute=’value’] For this tag: <input id=”seoplus” type=”submit”

    value=”Log In”/> ➜ Tag: input
  46. 5.1.1 Xpath Syntax //tag[@attribute=’value’] For this tag: <input id=”seoplus” type=”submit”

    value=”Log In”/> ➜ Tag: input ➜ Attributes: ◦ Id ◦ Type ◦ Value
  47. 5.1.1 Xpath Syntax //tag[@attribute=’value’] For this tag: <input id=”seoplus” type=”submit”

    value=”Log In”/> ➜ Tag: input ➜ Attributes: ◦ Id = seoplus ◦ Type = submit ◦ Value = Log In
  48. 5.1.1 Xpath Syntax //input[@id=’seoplus’]
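
    A minimal Python sketch of how such an XPath expression can be evaluated, assuming the lxml library and a made-up HTML snippet containing the tag from the slide:

        from lxml import html  # pip install lxml

        # Hypothetical snippet containing the <input> from the slide
        source = '<form><input id="seoplus" type="submit" value="Log In"/></form>'
        tree = html.fromstring(source)

        # Evaluate the XPath from the slide
        nodes = tree.xpath("//input[@id='seoplus']")
        print(nodes[0].get("value"))  # -> "Log In"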

  49. 5.1.2 Dev Tools

  50. 5.1.2 Dev Tools

  51. 5.2 CSS Selectors As its name suggests, these are the

    same selectors we use to write CSS. We can get them: ➜ Writing them ourselves, with the same syntax we'd use to modify the styles of a site ➜ Through the developer tools of a browser *tip: to select a tag we can use the XPath syntax and remove the @ from the attribute
  52. 5.2.1 Dev Tools

  53. 5.3 Xpath vs CSS

                         XPath                        CSS
    Direct child         //div/a                      div > a
    Child or descendant  //div//a                     div a
    ID                   //div[@id="example"]         #example
    Class                //div[@class="example"]      .example
    Attributes           //input[@name='username']    input[name='username']
    https://saucelabs.com/resources/articles/selenium-tips-css-selectors
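
    A small hedged sketch of the CSS side of that comparison, assuming BeautifulSoup and a made-up HTML snippet:

        from bs4 import BeautifulSoup  # pip install beautifulsoup4

        html_doc = '<div id="example" class="example"><a href="/home">Home</a></div>'
        soup = BeautifulSoup(html_doc, "html.parser")

        print(soup.select("div > a"))   # direct child
        print(soup.select("div a"))     # child or descendant
        print(soup.select("#example"))  # by ID
        print(soup.select(".example"))  # by class
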
  54. 5.4 Others We can access certain nodes of the DOM

    by other methods such as: ➜ Regex ➜ Specific selectors of Python libraries ➜ Ad hoc tools
  55. 6. Web Scraping Tools

  56. Some of the dozens of tools that exist for Web

    Scraping: plugins such as Scraper, and tools such as Jason The Miner. Here there are more than 30 if you didn't like these ones: https://www.octoparse.com/blog/top-30-free-web-scraping-software/
  57. From basic tools or plugins that we can use to

    do simple scrapes, in some cases to get data out faster without having to pull out Python or JS, to more 'advanced' tools. ➜ Scraper ➜ Screaming Frog ➜ Google Sheets ➜ Grepsr 6.1 Web Scraping Tools
  58. Scraper is a Google Chrome plugin that you can use

    to make small scrapes of elements in minimally well-structured HTML. It is also useful for extracting the XPath when Google Chrome Dev Tools sometimes does not extract it well, so that you can use it in other tools. As a plus, it works like Google Chrome Dev Tools, on the DOM 6.1.1 Scraper
  59. 1. Double-click the element we want to pull

    2. Click on Scrape Similar 3. Done! 6.1.1 Scraper
  60. If the elements are well structured, we can get everything

    pulled extremely easily, without the need to use external programs or programming. 6.1.1 Scraper
  61. 6.1.1 Scraper Here we have the Xpath

  62. 6.1.1 Scraper List of elements that we are going to

    take out. Supports multiple columns
  63. 6.1.1 Scraper Easily export to Excel (copy-paste)

  64. 6.1.1 Scraper Or to GDocs (one-click)

  65. Screaming Frog is one of the SEO tools par excellence,

    and it can also be used for basic (and even advanced) scraping. As a crawler you can use Text only (pure HTML) or JS rendering, if your website uses client-side rendering. Its extraction mode is simple, but with it you can get much of what you need done; for the rest you can use Python or other tools. 6.1.2 Screaming Frog
  66. 6.1.2 Screaming Frog Configuration > Custom > Extraction

  67. 6.1.2 Screaming Frog We have various modes - CSS path

    (CSS selector) - XPath (the main one we will use) - Regex
  68. 6.1.2 Screaming Frog We have up to 10 selectors, which

    will generally be sufficient. Otherwise, we will have to use Excel with the VLOOKUP function to join two or more scrapes.
  69. 6.1.2 Screaming Frog We will then have to decide whether

    we want to extract the content into HTML, text only or the entire HTML element
  70. 6.1.2 Screaming Frog Once we have all the extractors set,

    we just have to run it, either in crawler mode or in list mode with a sitemap.
  71. 6.1.2 Screaming Frog Once we have everything configured perfectly (sometimes

    we will have to test the correct XPath several times), we can leave it crawling and export the data obtained.
  72. 6.1.2 Screaming Frog Some of the most common uses, both

    on our own websites and on competitors': ➜ Monitor changes/lost data in a deploy ➜ Monitor weekly changes in web content ➜ Check increases or decreases in content volume, or content/thin-content ratios The limit of scraping with Screaming Frog? You can do 99% of the things you want to do, and with JS rendering made easy!
  73. 6.1.2 Screaming Frog Quick-and-dirty tip: a quick-and-dirty use case for

    pulling all the URLs out of a sitemap index is to import the entire list and then clean it up with Excel. In case you don't (yet) know how to use Python. 1. Go to Download Sitemap index 2. Put in the URL of the sitemap index
  74. 6.1.2 Screaming Frog 3. Wait for all the sitemaps to

    download (can take mins) 4. Select all, copy paste to Excel
  75. 6.1.2 Screaming Frog Then we replace "Found " and we'll

    have all the clean URLs of a sitemap index. This way we can then filter and pull results by the URL patterns that interest us. E.g.: a category, a page type, URLs containing X word, etc. That way we can segment our scraping even further, whether on our own website or a competitor's.
  76. 6.1.3 Cloud version: FandangoSEO If you need to run intensive

    crawls of millions of pages with pagetype segmentation, with FandangoSEO you can set interesting XPaths with content extraction, count and exists.
  77. 6.1.4 Google Sheets With Google Sheets we can also import

    most elements of a web page, from HTML to JSON, with a small external script. ➜ Pros: ◦ It imports HTML, CSV, TSV, XML, JSON and RSS ◦ Hosted in the cloud ◦ Free and for the whole family ◦ Easy to use with familiar functions ➜ Cons: ◦ It gets stuck easily and struggles when it has thousands of rows to process
  78. 6.1.4 Google Sheets ➜ Easily import feeds to create your

    own Feedly or news aggregator
  79. 6.1.5 Grepsr Grepsr is a tool that is based on

    an extension that facilitates visual extraction, and also offers data export in CSV or API (json) format.
  80. First of all we will install the extension in Chrome

    and run it, loading the desired page to scrape. 6.1.5 Grepsr
  81. Then, click on 'Select' and select the exact element you

    want, by hovering with the mouse you can refine it. 6.1.5 Grepsr
  82. Once selected, we will have the element marked and, if

    the HTML is well structured, it will be very easy without having to pull out XPath or CSS selectors. 6.1.5 Grepsr
  83. Once all our fields are selected, we proceed to save

    them by clicking on "Next"; we can name each field and extract it as text or extract the CSS class itself. 6.1.5 Grepsr
  84. Finally, we can add pagination for each of our fields,

    if required, either in HTML with a next link, or if you have load more or infinite scroll (ajax). 6.1.5 Grepsr
  85. 6.1.5 Grepsr To select the pagination, we will follow the

    same process as with the elements to scrape. (Optional part, not everything requires pagination)
  86. 6.1.5 Grepsr Finally, we can also configure a login if

    necessary, as well as additional fields that are close to the extracted field (images, meta tags, etc.).
  87. 6.1.5 Grepsr Finally, we will have the data in both

    JSON and CSV formats. However, we will need a (free) Grepsr account to export them!
  88. 7. Web Scraping with Python

  89. None
  90. 7 Why Python? ➜ It's a very simple language to

    understand ➜ An easy approach for those starting with programming ➜ Lots of growth and a great community behind it ➜ Widely used for massive data analysis, with very powerful libraries behind it (not just for scraping) ➜ We can work in the browser! ◦ https://colab.research.google.com
  91. 7.1 Types of data To start scraping we must know

    at least these concepts to program in Python: ➜ Variables ➜ Lists ➜ Integers, floats, strings, boolean values... ➜ For loops ➜ Conditionals ➜ Imports
  92. 7.2 Scraping libraries There are several but I will focus

    on two: ➜ Requests + BeautifulSoup: to scrape data from the source code of a site. Useful for sites with static data. ➜ Selenium: a QA automation tool that can help us scrape sites with dynamic content whose values are in the DOM but not in the source code. Colab does not support Selenium, so we will have to work with Jupyter (or any IDE)
  93. None
  94. With 5 lines of code (or less) you can see

    the parsed HTML
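
    A minimal sketch of those few lines, assuming the requests and BeautifulSoup libraries and an example URL:

        import requests
        from bs4 import BeautifulSoup

        response = requests.get("https://example.com")       # fetch the source code
        soup = BeautifulSoup(response.text, "html.parser")   # parse it
        print(soup.prettify())                               # see the parsed HTML
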
  95. Accessing any element of parsed HTML is easy
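
    For example, a hedged sketch of a few ways to reach specific nodes (same assumptions as above):

        import requests
        from bs4 import BeautifulSoup

        soup = BeautifulSoup(requests.get("https://example.com").text, "html.parser")

        title = soup.find("title").get_text()                 # text of the <title> tag
        first_h1 = soup.select_one("h1")                      # first <h1> via a CSS selector
        links = [a.get("href") for a in soup.find_all("a")]   # every link URL on the page
        print(title, first_h1, links)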

  96. We can create a data frame and process the information

    as desired
  97. Or download it
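
    A hedged sketch of those two steps, assuming pandas, hypothetical scraped rows and, for the download step, a Google Colab environment:

        import pandas as pd

        # Hypothetical scraped rows
        rows = [{"title": "Book A", "price": 9.95}, {"title": "Book B", "price": 12.50}]
        df = pd.DataFrame(rows)                     # process the information as desired
        df.to_csv("scraped_data.csv", index=False)

        from google.colab import files              # only works inside Google Colab
        files.download("scraped_data.csv")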

  98. None
  99. 7.3 Process ➜ We analyze the HTML, looking for patterns ➜

    We generate the script for an element or URL ➜ We extend it to affect all the data
  100. 8. Tips

  101. There are many websites that serve their pages on a

    User-agent basis. Sometimes you will be interested in being a desktop device, sometimes a mobile device. Sometimes a Windows, sometimes a Mac. Sometimes a Googlebot, sometimes a bingbot. Adapt each scraping to what you need to get the desired results! 8.1 User-agent
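
    A small sketch, assuming the requests library, of how the User-Agent can be set per request (the Googlebot string is just one example):

        import requests

        headers = {
            # Pretend to be Googlebot; swap in any desktop, mobile or bingbot UA you need
            "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
        }
        response = requests.get("https://example.com", headers=headers)
        print(response.status_code)
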
  102. To scrape a website like Google with advanced security mechanisms,

    it will be necessary to use proxies, among other measures. Proxies act as an intermediary between a request made by computer X and server Z. In this way, we leave little trace when it comes to being identified. Depending on the website and the number of requests, we recommend using one quantity or another. Generally, more than one request per second from the same IP address is not recommended. 8.2 Proxies
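
    A minimal sketch with requests, assuming a hypothetical proxy address and credentials:

        import requests

        proxies = {
            "http": "http://user:password@10.10.1.10:3128",   # hypothetical proxy
            "https": "http://user:password@10.10.1.10:3128",
        }
        response = requests.get("https://www.google.com", proxies=proxies, timeout=10)
        print(response.status_code)
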
  103. Generally the use of proxies is more recommended than a

    VPN, since a VPN does the same thing but under a single IP. It is always advisable to use a VPN with another geo for any kind of crawling on third-party websites, to avoid possible problems or identification. Also, if your IP gets banned (e.g. by Cloudflare) you will never be able to access the site again from that IP (if it is static). Recommended service: ExpressVPN 8.3 VPN's
  104. 8.4 Concurrency Concurrency consists of limiting the number of requests

    we make per second (and how many run in parallel). We always want to limit our requests in order to avoid saturating the server, be it ours or a competitor's. If we saturate the server, we will have to make the requests again or, depending on the case, start the whole crawling process again. Indicative numbers: ➜ Small websites: 5 req/sec - 5 threads ➜ Large websites: 20 req/sec - 20 threads
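
    One simple way to respect those limits is to throttle each request; a hedged sketch assuming the requests library and hypothetical URLs:

        import time
        import requests

        urls = ["https://example.com/page-1", "https://example.com/page-2"]  # hypothetical
        delay = 1  # seconds between requests; raise it for small websites

        for url in urls:
            response = requests.get(url)
            print(url, response.status_code)
            time.sleep(delay)  # avoid saturating the server
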
  105. 8.5 Data cleaning It is common that after a data

    scraping, we find data that does not fit what we need. Normally, we'll have to work on the data to clean it up. Some of the most common corrections: ➜ Duplicates ➜ Format correction/unification ➜ Spaces ➜ Strange characters ➜ Currencies
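
    A short pandas sketch of the most common corrections, with hypothetical column names and values:

        import pandas as pd

        df = pd.DataFrame({"title": [" Book A ", "Book A", "Book B"],
                           "price": ["9,95 €", "9,95 €", "12,50 €"]})

        df["title"] = df["title"].str.strip()                  # spaces
        df["price"] = (df["price"]
                       .str.replace("€", "", regex=False)      # currencies
                       .str.replace(",", ".", regex=False)     # format unification
                       .astype(float))
        df = df.drop_duplicates()                              # duplicates
        print(df)
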
  106. 9. Case studies

  107. 9. Case studies Here are 2 case studies: ➜ Using

    scraping to automate the curation of content listings ➜ Scraping to generate a product feed for our websites
  108. Using scraping to automate the curation of content listings

  109. 9.1 Using scraping to automate the curation of content listings

    It can be firmly said that the best search engine at the moment is Google. What if we use Google's results to generate our own listings, based on the ranking (relevance) that it gives to the websites that rank for what we want to rank for?
  110. 9.1.1 Jason The Miner To do so, we will use

    Jason The Miner, a scraping library made by Marc Mignonsin, Principal Software Engineer at Softonic (@mawrkus on GitHub and @crossrecursion on Twitter)
  111. 9.1.1 Jason The Miner Jason The Miner is a versatile

    and modular Node.js based library that can be adapted to any website and need.
  112. 9.1.1 Jason The Miner

  113. 9.1.2 Concept We launch a query such as 'best washing machines'.

    We will enter the top 20-30 results, analyze the HTML and extract the link ID from the Amazon links. Then we do a count, and we are automatically validating, based on dozens of websites, which is the best washing machine.
  114. 9.1.2 Concept Then, we will have a list of IDs

    with their URLs, which we can scrape directly from Google Play or using its API, and semi-automatically fill our CMS (WordPress, or whatever we have). This allows us to automate content research/curation and focus on delivering real value in what we write. The screenshot is an outcome based on the Google Play Store
  115. 9.1.3 Action First of all we will generate the basis

    to create the URL, with our user-agent, as well as the language we are interested in.
  116. 9.1.3 Action Then we are going to set a maximum

    concurrency so that Google does not ban our IP or throw captchas at us.
  117. 9.1.3 Action Finally, we define exactly the flow of the

    crawler: whether it needs to enter links/websites, and what it needs to extract from them.
  118. 9.1.3 Action Finally, we will transform the output into a .json

    file that we can use to upload to our CMS.
  119. 9.1.3 Action And we can even configure it to be

    automatically uploaded to the CMS once the processes are finished.
  120. 9.1.3 Action What does Jason the Miner do? ➜ Load

    (HTTP, file, json, ...) ➜ Parse (HTML w/ CSS by default) ➜ Transform This is fine, but we need to do it in bulk for tens or hundreds of cases; we cannot do it one by one.
  121. 9.1.3 Action Added functionality to make it work in bulk

    ➜ Bulk (imported from a CSV) ➜ Load (HTTP, file, json, ...) ➜ Parse (HTML w/ CSS by default) ➜ Transform Creating a variable that would be the query we insert into Google.
  122. 9.1.4 CMS Once we have all the data inserted in

    our CMS, we will have to run another basic scraping process, or use an API such as Amazon's, to get all the data for each product (logo, name, images, description, etc.). Once we have everything, the lists will be sorted and we can add the editorial content we want, with very little manual work left to do.
  123. 9.1.5 Ideas Examples in which it could be applied: ➜

    Amazon products ➜ Listings of restaurants on TripAdvisor ➜ Hotel listings ➜ Netflix movie listings ➜ Best PS4 games ➜ Best Android apps ➜ Best Chromecast apps ➜ Best books
  124. Scraping to generate a product feed for our websites

  125. 9.2 Starting point A website affiliated with Casa del Libro. We

    need to generate a product feed for each of our product pages.
  126. 9.2 Process ➜ We analyze the HTML, looking for patterns ➜

    We generate the script for an element or URL ➜ We extend it to affect all the data
  127. 9.2.0 What do we want to scrape? We need

    the following information: ➜ Titles ➜ Author ➜ Publisher ➜ Prices *Only from the crime novel category
  128. 9.2.0 What do we want to scrape? Title

  129. 9.2.0 What do we want to scrape?

  130. 9.2.0 What do we want to scrape? Author

  131. 9.2.0 What do we want to scrape?

  132. 9.2.0 What do we want to scrape? We need

    the following data: a. Titles b. Author c. Publisher d. Prices Publisher
  133. 9.2.0 What do we want to scrape?

  134. 9.2.0 What do we want to scrape? We need

    the following data: a. Titles b. Author c. Publisher d. Prices Price
  135. 9.2.0 What do we want to scrape?

  136. 9.2.0 What do we want to scrape?

  137. 9.2.0 What do we want to scrape?

  138. 9.2.0 Pagination For each page we will have to iterate

    the same code over and over again. You need to find out how paginated URLs are formed in order to access them: >>>https://www.casadellibro.com/libros/novela-negra/126000000/p + page
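
    A tiny sketch of how those paginated URLs can be built before requesting them (assuming 120 result pages, as in the script later on):

        base = "https://www.casadellibro.com/libros/novela-negra/126000000/p"
        pages = [f"{base}{page}" for page in range(1, 121)]
        print(pages[:3])
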
  139. 9.2.1 We generate the script to extract the first book
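
    A hedged sketch of that first step, assuming requests + BeautifulSoup; the CSS classes below are hypothetical and have to be replaced with the real ones found in the site's HTML:

        import requests
        from bs4 import BeautifulSoup

        url = "https://www.casadellibro.com/libros/novela-negra/126000000/p1"
        soup = BeautifulSoup(requests.get(url).text, "html.parser")

        book = soup.select_one(".book-card")   # hypothetical container class
        title = book.select_one(".title").get_text(strip=True)
        author = book.select_one(".author").get_text(strip=True)
        publisher = book.select_one(".publisher").get_text(strip=True)
        price = book.select_one(".price").get_text(strip=True)
        print(title, author, publisher, price)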

  140. 9.2.1 We iterate over every container
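
    The same (hypothetical) selectors extended to every book container on the page, appending each value to a list:

        import requests
        from bs4 import BeautifulSoup

        url = "https://www.casadellibro.com/libros/novela-negra/126000000/p1"
        soup = BeautifulSoup(requests.get(url).text, "html.parser")

        titles, authors, publishers, prices = [], [], [], []
        for book in soup.select(".book-card"):   # hypothetical container class
            titles.append(book.select_one(".title").get_text(strip=True))
            authors.append(book.select_one(".author").get_text(strip=True))
            publishers.append(book.select_one(".publisher").get_text(strip=True))
            prices.append(book.select_one(".price").get_text(strip=True))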

  141. 9.2.1 Time to finish Now that we have the script

    to scrape all the books on the first page we will generate the final script to affect all the pages.
  142. 9.2.2 Let's do the script We import all the libraries

    we are going to use
  143. 9.2.2 Let's do the script We create the empty lists

    in which to host each of the data.
  144. 9.2.2 Let's do the script We will have a list

    containing the numbers 1 to 120 for the pages
  145. 9.2.2 Let's do the script We create variables to prevent

    the server from banning us due to excessive requests
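
    Putting those setup steps together, a hedged sketch of the setup and the crawling loop (library names from the talk; the delay values and CSS classes are illustrative, not the real ones):

        import time
        import random

        import requests
        import pandas as pd  # used later to build the DataFrame
        from bs4 import BeautifulSoup

        # Empty lists that will host each piece of data
        titles, authors, publishers, prices = [], [], [], []

        # Page numbers 1 to 120
        pages = range(1, 121)

        # Variables to avoid getting banned: wait a random time between requests
        min_delay, max_delay = 1, 3  # seconds

        for page in pages:
            url = f"https://www.casadellibro.com/libros/novela-negra/126000000/p{page}"
            soup = BeautifulSoup(requests.get(url).text, "html.parser")
            for book in soup.select(".book-card"):   # hypothetical container class
                titles.append(book.select_one(".title").get_text(strip=True))
                authors.append(book.select_one(".author").get_text(strip=True))
                publishers.append(book.select_one(".publisher").get_text(strip=True))
                prices.append(book.select_one(".price").get_text(strip=True))
            time.sleep(random.uniform(min_delay, max_delay))
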
  146. 9.2.2 Let's do the script

  147. 9.2.2 Let's do the script

  148. 9.2.2 Let's do the script With Pandas we transform the

    lists into a DataFrame that we can work with
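
    Continuing the sketch above, for instance:

        # 'titles', 'authors', 'publishers' and 'prices' are the lists filled in the loop above
        df = pd.DataFrame({
            "title": titles,
            "author": authors,
            "publisher": publishers,
            "price": prices,
        })
        print(df.head())
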
  149. 9.2.2 Let's do the script

  150. With Pandas we can also transform it into a csv

    or an excel 9.2.2 Let's do the script
  151. And finally, we can download the file thanks to the

    colab library. 9.2.2 Let's do the script
  152. 10. Bonus

  153. 10.1 Sheet2Site

  154. 10.1 Sheet2Site

  155. 10.1 Sheet2Site https://coinmarketcap.com/es/api/

  156. You can use Google Sheets to import data from APIs

    easily. APIs such as the Dandelion APIs, which are used for semantic analysis of texts, can be very useful for the day-to-day of our SEO. ➜ Entity extraction ➜ Semantic similarity ➜ Keyword extraction ➜ Sentiment analysis 10.2 Dandelion API
  157. 10.3 Stacks for scraping ➜ Stack Andreas Niessen (+ WP)

    ➜ Stack for advanced projects
  158. ➜ With this little script you can easily export an

    entire SERP into a CSV. ◦ bit.ly/2uZCXuL 10.4 Scraping Google SERP
  159. 10.5 Scraping Google

  160. 10.6 Web Scraping + NLP

  161. 10.7 Scraping Breadcrumbs

  162. 10.8 Scraping Sitemaps https://github.com/NachoSEO/extract_urls_from_sitemap_index

  163. 10.9 Translating content

  164. Python & Other ➜ Chapter 11 – Web Scraping https://automatetheboringstuff.com/chapter11/

    ➜ https://twitter.com/i/moments/949019183181856769 ➜ Scraping 'People Also Ask' boxes for SEO and content research https://builtvisible.com/scraping-people-also-ask-boxes-for-seo-and-content-research/ ➜ https://stackoverflow.com/questions/3964681/find-all-files-in-a-directory-with-extension-txt-in-python ➜ 6 Actionable Web Scraping Hacks for White Hat Marketers https://ahrefs.com/blog/web-scraping-for-marketers/ ➜ https://saucelabs.com/resources/articles/selenium-tips-css-selectors EXTRA RESOURCES
  165. EXTRA RESOURCES Node.js (Thanks @mawrkus) ➜ Web Scraping With Node.js:

    https://www.smashingmagazine.com/2015/04/web-scraping-with-nodejs/ ➜ X-ray, The next web scraper. See through the noise: https://github.com/lapwinglabs/x-ray ➜ Simple, lightweight & expressive web scraping with Node.js: https://github.com/eeshi/node-scrapy ➜ Node.js Scraping Libraries: http://blog.webkid.io/nodejs-scraping-libraries/ ➜ https://www.scrapesentry.com/scraping-wiki/web-scraping-legal-or-illegal/ ➜ http://blog.icreon.us/web-scraping-and-you-a-legal-primer-for-one-of-its-most-useful-tools/ ➜ Web scraping o rastreo de webs y legalidad: https://www.youtube.com/watch?v=EJzugD0l0Bw
  166. CREDITS ➜ Presentation template by SlidesCarnival ➜ Photographs by Death

    to the Stock Photo (license) ➜ Marc Mignonsin for creating Jason The Miner
  167. Thanks! Any question? Esteve Castells | @estevecastells Newsletter: bit.ly/Seopatia https://estevecastells.com/

    Nacho Mascort | @NachoMascort Scripts: https://github.com/NachoSEO https://seohacks.es