Marshall Simmonds - Log File Analysis

At Define Media Group, Marshall and his team have been spearheading deep reviews of some of the biggest publishers’ raw data files. It’s intense work but what they’ve learned is insightful and applicable to all industries. Marshall’s going to be sharing those insights with you.

Distilled

May 06, 2014

Transcript

  2. mdsimmonds www.DefineMG.com About.com Project 2000

    • Big 12 Search Engines
    • Most Active
    • Pages Requested
    • Pages NOT Requested
    • Too LARGE
  3. Server Logs Answer

    • IP – Who
    • User Agent – Who
    • Method – What
    • Response – What
    • URI_Request – Where
    • Date – When
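The six fields on this slide map onto a single log line. As a minimal sketch, assuming the server writes the common Apache/Nginx "combined" format (the slides do not say which format was used), one line can be split into those fields like this. The pattern, the function name `parse_line`, and the sample line are all illustrative, not taken from the talk.

```python
import re
from datetime import datetime

# Assumed layout: Apache/Nginx "combined" log format. Real servers may differ.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of the six fields from the slide, or None if no match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])  # Response - What
    fields["date"] = datetime.strptime(fields["date"], "%d/%b/%Y:%H:%M:%S %z")
    return fields

# Made-up sample line, for illustration only.
sample = (
    '66.249.66.1 - - [06/May/2014:10:15:32 +0000] '
    '"GET /search?q=veils HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
record = parse_line(sample)
```

Lines that do not fit the assumed format come back as `None`, so they can be counted and inspected rather than silently dropped.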
  4. Case Study #1

    • Bridal Site
    • Requested:
    • Started with one day; be very, very careful
      – Status Codes
      – Number of URLs crawled per directory
      – Type of URL “/[whatever]”
    • Be kind and ask for just Bing
    • Nawww – Ask for anything Google
    • .CSV is fine
    • ~100K Pages
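Two of the requested items, status codes and the number of URLs crawled per directory, can be tallied straight from a CSV export. The sketch below assumes the export has `uri` and `status` columns (hypothetical names; the talk does not describe the file layout) and treats the first path segment as the directory.

```python
import csv
import io
from collections import Counter

def top_directory(uri):
    """Return the first path segment, e.g. '/search' for '/search?q=veils'."""
    path = uri.split("?", 1)[0]
    segments = [s for s in path.split("/") if s]
    return "/" + segments[0] if segments else "/"

def summarize(csv_file):
    """Count hits per status code and distinct URLs per top-level directory."""
    status_counts = Counter()
    urls_by_directory = {}
    for row in csv.DictReader(csv_file):
        status_counts[row["status"]] += 1
        urls_by_directory.setdefault(top_directory(row["uri"]), set()).add(row["uri"])
    url_counts = {d: len(urls) for d, urls in urls_by_directory.items()}
    return status_counts, url_counts

# Made-up rows standing in for a one-day export; not data from the talk.
sample_csv = io.StringIO(
    "uri,status\n"
    "/search?q=veils,200\n"
    "/search?q=rings,200\n"
    "/dresses/lace,301\n"
    "/search?q=veils,200\n"
    "/,404\n"
)
status_counts, url_counts = summarize(sample_csv)
```

Because the reader streams row by row, the same function works on a real file opened with `open(path, newline="")`, which matters once the pull grows past a single day.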
  8. Post Mortem

    • Definitely compare against crawl data
    • Not perfect for auditing and is very rabbit-holey
    • Googlebot spent a lot of time in /Search
    • IP addresses of Spiders are interesting but won’t reveal much so avoid that rabbit hole
    • The actionable item is, that isn’t actionable
    • Crawl budget?
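Comparing log data against crawl data comes down to set arithmetic on URLs. This is a rough sketch under the assumption that both sources have already been reduced to lists of normalized URL paths; the function name and the sample URLs are made up for illustration.

```python
def compare_crawl_to_logs(crawled_urls, logged_urls):
    """Split URLs into three groups by where they appear.

    crawled_urls: URLs found by a site crawler.
    logged_urls:  URLs a spider requested according to the server logs.
    """
    crawled = set(crawled_urls)
    logged = set(logged_urls)
    return {
        "requested": sorted(crawled & logged),         # in both sources
        "not_requested": sorted(crawled - logged),     # crawlable, never hit
        "unknown_to_crawl": sorted(logged - crawled),  # hit, but not in crawl
    }

# Made-up URLs, for illustration only.
report = compare_crawl_to_logs(
    crawled_urls=["/", "/dresses/lace", "/veils"],
    logged_urls=["/", "/search?q=veils", "/veils"],
)
```

The `not_requested` group corresponds to the "Pages NOT Requested" idea from the About.com project, and `unknown_to_crawl` is where something like an unexpected /search directory would surface.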
  15. What Learned?

    • What Googlebot sees and what browsers see = different
    • Crawl ≠ Popularity
    • Be careful about the data pull size
    • Previously unknown directory getting hit
    • Saw that robots.txt got hit 1,394 times
    • Small data sets yield small trends
    • Googs like /search
    • Identified that nothing was tremendously wrong
    • Saw cool stuff, increased value, learned
  16. Case Study #2

    • Enterprise-level site covering many categories
    • Month’s worth of data
    • Requested:
      – Everything from Googlebot (and some other stuff)
    • ~1,000,000 pages
  17.

    • Google accounts for more than 96% of the month’s spider visits
    • Bing is a distant second
    • Few visits from Baidu, Yandex, Yahoo Slurp
    • Some User Agents get tagged both as regular browsers (Chrome, Firefox, etc.) and as spiders
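A share-of-visits breakdown like this one can be approximated by matching user-agent strings against known spider tokens. The sketch below is an assumption about method, not the analysis used in the talk: the token table is a small hand-picked list, and spider tokens are checked before anything else because spider user agents often also contain browser-like text such as "Mozilla/5.0".

```python
from collections import Counter

# Substrings assumed to identify each spider's user agent; extend as needed.
SPIDER_TOKENS = {
    "Googlebot": "Google",
    "bingbot": "Bing",
    "Baiduspider": "Baidu",
    "YandexBot": "Yandex",
    "Slurp": "Yahoo Slurp",
}

def classify(user_agent):
    """Return the spider's name, or None for anything not in the table."""
    ua = user_agent.lower()
    for token, name in SPIDER_TOKENS.items():
        if token.lower() in ua:
            return name
    return None

def spider_share(user_agents):
    """Return each spider's fraction of all spider visits."""
    counts = Counter(name for name in map(classify, user_agents) if name)
    total = sum(counts.values())
    return {name: hits / total for name, hits in counts.items()} if total else {}

# Made-up user agents, for illustration only.
shares = spider_share(
    ["Mozilla/5.0 (compatible; Googlebot/2.1)"] * 3
    + ["Mozilla/5.0 (compatible; bingbot/2.0)", "Mozilla/5.0 Chrome/34.0"]
)
```

User-agent strings can be spoofed, so a match here only says what the client claimed to be.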
  18. Hourly Trend: Heat map

    • Lots of consistency, some sharp drops and recoveries
    • No obvious hourly time trend: most hours show 40,000–70,000 hits
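The data behind an hourly heat map is a day-by-hour grid of hit counts. As a sketch, assuming each log line has already been parsed into a `datetime`, the grid can be built as follows; the function name and the sample timestamps are illustrative.

```python
from collections import Counter
from datetime import datetime

def hourly_grid(timestamps):
    """Return {date: [hits for hour 0..23]} for a list of datetimes."""
    counts = Counter((ts.date(), ts.hour) for ts in timestamps)
    days = sorted({day for day, _ in counts})
    return {day: [counts[(day, hour)] for hour in range(24)] for day in days}

# Made-up timestamps, for illustration only.
hits = [
    datetime(2014, 5, 6, 10, 15),
    datetime(2014, 5, 6, 10, 48),
    datetime(2014, 5, 6, 23, 59),
    datetime(2014, 5, 7, 0, 1),
]
grid = hourly_grid(hits)
```

Each row of `grid` is one day and each column one hour, so sharp drops and recoveries show up as cells that break from their neighbours once the grid is coloured in a spreadsheet or plotting tool.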
  19. Spider Visits by Asset Type

    • Most spider visits did not favor an asset
    • Homepage visits were rare!
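One way to group spider visits by asset type is to bucket each requested URL by its file extension, with the homepage as its own bucket. The extension table and bucket names below are assumptions made for this sketch; the talk does not say how asset types were defined.

```python
import posixpath
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical extension-to-type table; adjust to the site being analysed.
ASSET_TYPES = {
    ".js": "script",
    ".css": "stylesheet",
    ".jpg": "image",
    ".png": "image",
    ".gif": "image",
}

def asset_type(uri):
    """Bucket a requested URI as homepage, a known asset type, or a page."""
    path = urlsplit(uri).path
    if path in ("", "/"):
        return "homepage"
    extension = posixpath.splitext(path)[1].lower()
    return ASSET_TYPES.get(extension, "page")

# Made-up requests, for illustration only.
visits = Counter(
    asset_type(uri)
    for uri in ["/", "/search?q=veils", "/img/a.JPG", "/static/site.css", "/dresses/lace"]
)
```

Query strings are dropped before the extension check, so `/search?q=veils` counts as a page rather than an unknown type.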
  20. Spider Variation

    • Spiders show some distinct behavior
    • Bing IPs appear to index less than Google
    • There are two distinct classes of Google IP addresses.
    • It’s possible spiders function significantly differently across companies
  21. Takeaways

    • Crawl Barriers not Budget
    • Google likes /search
    • IP Addresses aren’t worth the time for small data sets
    • There are preferred crawl times, sections and pages
    • Crawl popularity isn’t traffic popularity
    • Home page activity should be low
    • GWT (Google Webmaster Tools) will only show you a fraction of activity
    • Excel is nice but get some tools