Slide 10
Tools for Cleaning Data
• w3lib
• remove comments or tags from HTML snippets
• extract the base URL from HTML snippets
• translate entities in HTML strings
• convert raw HTTP headers to dicts and vice-versa
• construct HTTP auth headers
• convert HTML pages to Unicode
• sanitize URLs (like browsers do)
• extract arguments from URLs