
URLs: In Plain View


URLs are all around us. But seeing and using them every day doesn't mean that we, as engineers, fully understand the features and flaws inherent in one of the most powerful pieces of technology ever taken for granted.

Mahmoud Hashemi

May 27, 2017


Transcript

  1. URLs:
    In Plain View
    Mahmoud Hashemi
    May, 2017


  2. I love URLs
    The most advanced technology to reach the masses.
    Ever.


  3. A piece of the web for everyone
    ◉ Kids
    ◉ Santa
    ◉ Dads


  4. Locating the URL
    Frontend
    Backend
    Non-web: $ git clone git@github.com:mahmoud/boltons.git


  5. URLs are everywhere
    The Internet is very leaky.



  6. “Some long ass link you are somehow
    suppose to fit into the address bar.”
    fnaffoxy2916
    Defined Feb 17, 2017
    urbandictionary.com


  7. Those Three Words Every
    Browser Understands
    Uniform
    Uniformity means the
    mechanism stays the
    same, even if the
    types of resources
    differ.
    Resource
    A resource can be
    anything, even
    dynamic content,
    representing a
    consistent concept.
    Locator
    Locators are more than
    just identifiers; they
    have directions for
    network lookup.
    URLs are like a treasure
    map every browser can
    read.


  8. The history is long.com
    ◉ 1992 - W3 hypertext names
    ◉ 1994 - RFC 1630, 1736, 1737, 1738
    ◉ 1995 - RFC 1808
    ◉ 1997 - RFC 2141
    ◉ 1998 - RFC 2396, 2368
    ◉ 1999 - RFC 2732
    ◉ 2002 - RFC 3305
    ◉ 2005 - RFC 3986 (the gold standard)
    ◉ 2013 - RFC 6874
    ◉ 2014 - RFC 7320
    ◉ 2017 - WHATWG document (the browser bubble)


  9. >67,000
    Words spent explicitly defining URLs in the RFCs
    #


  10. The overambitious URL
    10 years later, even the W3C had to admit it made some mistakes.


  11. Design intent
    ◉ Simple
    ◉ Transcribable
    ◉ No barrier to entry
    Usable by humans and computers


  12. The knowable URL
    The right amount of URL engineering know-how.


  13. The Scheme
    1
    https://mahmoud:url[email protected]/anatomy/scheme?lang=en&rfc=3986#subtitle-2017
    ◉ Short, case-insensitive
    ◉ Letters, numbers, +, -, .
    ◉ Registered with IANA
    ◉ Determines URL semantics
    http, https, ssh, gopher, rsync, mailto, tel, …
    ~60 in common use
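
The scheme rules above can be checked with Python's standard library. This is a minimal sketch, not part of the original deck; `example.com` and the variable names are placeholders chosen for illustration.

```python
from urllib.parse import urlsplit

# Schemes are case-insensitive, so urlsplit normalizes them to lowercase.
scheme_upper = urlsplit("HTTPS://example.com/anatomy/scheme").scheme
scheme_lower = urlsplit("https://example.com/anatomy/scheme").scheme

# "+", "-" and "." are legal scheme characters, which is what makes
# "plus schemes" such as git+ssh possible.
plus_scheme = urlsplit("git+ssh://example.com/repo.git").scheme

print(scheme_upper, scheme_lower, plus_scheme)
```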


  14. The Userinfo
    2
    https://mahmoud:[email protected]/anatomy/userinfo?lang=en&rfc=3986#subtitle-2017
    ◉ Comes after the scheme
    ◉ ...


  15. The Netloc Slashes!
    1.5
    mailto:[email protected]
    vs.
    http://blog.hatnote.com
    https://mahmoud:url[email protected]/anatomy/netloc?lang=en&rfc=3986#subtitle-2017


  16. The Userinfo
    2
    https://mahmoud:[email protected]/anatomy/userinfo?lang=en&rfc=3986#subtitle-2017
    ◉ username:password@
    ◉ Username and password are base64-encoded into
    the Authorization header in HTTP (Basic auth)
    ◉ Our first percent-encoded field!
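
As a rough illustration of the userinfo bullets, the sketch below pulls the credentials out of a URL, undoes the percent-encoding, and builds the value HTTP Basic auth sends in the `Authorization` request header. The URL, username, and password are made-up placeholders, not values from the slides.

```python
import base64
from urllib.parse import urlsplit, unquote

# Hypothetical URL; "%40" is a percent-encoded "@" inside the password.
parts = urlsplit("https://user:p%40ss@example.com/anatomy/userinfo")

# urlsplit leaves userinfo percent-encoded, so decode it explicitly.
username = unquote(parts.username)
password = unquote(parts.password)

# Basic auth base64-encodes "username:password" as a single token.
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
header = f"Authorization: Basic {token}"
print(header)
```

Base64 is an encoding, not encryption, so credentials sent this way are only as private as the connection carrying them.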


  17. Percent encoding aka quoting
    %
    %20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20
    ◉ URLs are built to support non-ASCII
    ◉ Byte values are replaced with %XX
    ◉ No standard encoding underneath
    ○ UTF-8 conventional now
    ○ Latin-1 one of many before
    ○ Binary-capable
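
A short sketch of these points using the standard library's `urllib.parse`: the same text percent-encodes differently depending on the byte encoding underneath, and arbitrary bytes survive a round trip. The sample string is an arbitrary choice.

```python
from urllib.parse import quote, unquote, quote_from_bytes, unquote_to_bytes

text = "café"

# UTF-8 is the conventional encoding today: "é" becomes two bytes.
utf8_quoted = quote(text)                        # 'caf%C3%A9'

# Latin-1 was one of many encodings used earlier: "é" is a single byte.
latin1_quoted = quote(text, encoding="latin-1")  # 'caf%E9'

# Decoding only works if both sides agree on the encoding.
decoded_ok = unquote(latin1_quoted, encoding="latin-1")

# Percent-encoding is binary-capable: any byte value can be carried.
raw = b"\x00\xff"
raw_quoted = quote_from_bytes(raw)               # '%00%FF'
raw_back = unquote_to_bytes(raw_quoted)
```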


  18. The Host
    3
    https://mahmoud:u[email protected]/anatomy/host?lang=en&rfc=3986#subtitle-2017


  19. The Host
    3
    https://mahmoud:u[email protected]/anatomy/host?lang=en&rfc=3986#subtitle-2017
    ◉ IPv4, [IPv6], or string resolved with DNS
    ◉ Supports Unicode via Punycode
    u'https://bücher.ch' → 'https://xn--bcher-kva.ch'
    'https://xn--ggbla1c4e.xn--ngbc5azd/'
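
The Punycode conversion for the bücher.ch example can be reproduced with Python's built-in `idna` codec. This is a minimal sketch; the built-in codec follows the older IDNA 2003 rules, so newer or unusual domains may need a dedicated library.

```python
# Unicode hostname as a human would type it.
host = "bücher.ch"

# Encode each label to its ASCII-compatible ("xn--") form for DNS.
ascii_host = host.encode("idna").decode("ascii")

# The conversion is reversible.
round_trip = ascii_host.encode("ascii").decode("idna")

print(ascii_host)  # xn--bcher-kva.ch
```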


  20. The Port
    4
    https://mahmoud:[email protected]:8080/anatomy/port?lang=en&rfc=3986#subtitle-2017
    ◉ Positive integers only
    ◉ Usually registered with IANA
    ◉ Not emitted if equal to scheme default
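
One way to picture "not emitted if equal to scheme default" is a small normalizer. This is a sketch under assumptions: `DEFAULT_PORTS` and `drop_default_port` are hypothetical names, and the port table is a tiny illustrative subset rather than the IANA registry.

```python
from urllib.parse import urlsplit

# Illustrative subset of scheme defaults (assumption, not a full registry).
DEFAULT_PORTS = {"http": 80, "https": 443}

def drop_default_port(url: str) -> str:
    """Return url without its port when the port is the scheme's default."""
    parts = urlsplit(url)
    if parts.port is None or parts.port != DEFAULT_PORTS.get(parts.scheme):
        return url
    # The port is everything after the last ":" in the netloc.
    netloc = parts.netloc.rsplit(":", 1)[0]
    return parts._replace(netloc=netloc).geturl()

default_dropped = drop_default_port("https://example.com:443/anatomy/port")
custom_kept = drop_default_port("https://example.com:8080/anatomy/port")
```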


  21. The Path
    5
    https://mahmoud:[email protected]:8080/anatomy/path?lang=en&rfc=3986#subtitle-2017
    ◉ Host-local hierarchy
    ◉ Also percent-encoded
    ◉ Absolute vs. relative
    ◉ Almost anything is a path (and a URL)
    ○ mailto:[email protected]
    ○ this|is|not|a|url
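
To see why "almost anything is a path", the sketch below feeds two such strings to `urlsplit`. The email address is a placeholder; neither input raises an error.

```python
from urllib.parse import urlsplit

# No "//" after the scheme, so there is no authority: the rest is path.
mail = urlsplit("mailto:user@example.com")

# No scheme and no authority: the whole string is treated as a relative path.
odd = urlsplit("this|is|not|a|url")

print(mail.scheme, repr(mail.netloc), mail.path)
print(repr(odd.scheme), odd.path)
```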


  22. The Query String
    6
    https://mahmoud:[email protected]:8080/anatomy/query?lang=en&rfc=3986#subtitle-2017
    ◉ My favorite part
    ◉ Order is preserved
    ◉ Duplicate keys combine
    ◉ An ordered multidict!
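
The "ordered multidict" idea can be sketched with the standard library: `parse_qsl` keeps pairs in order, duplicates included, while `parse_qs` groups values by key. The query string here is an arbitrary example.

```python
from urllib.parse import parse_qsl, parse_qs, urlencode

query = "lang=en&rfc=3986&lang=fr"

# Ordered list of pairs: order and duplicate keys are both preserved.
pairs = parse_qsl(query)

# Grouped view: duplicate keys combine into a list of values.
grouped = parse_qs(query)

# A list of pairs round-trips back to the same query string.
rebuilt = urlencode(pairs)
```

A plain dict would lose either the duplicates or their relative order, which is why a list of pairs is the safer default representation.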


  23. The Fragment
    7
    https://mahmoud:[email protected]:8080/anatomy/fragment?lang=en&rfc=3986#subtitle-2017
    ◉ The frontend developers’ favorite part
    ◉ Not sent to the server
    ◉ Based on apartment numbers
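
Since the fragment is not sent to the server, a client effectively splits it off before making the request. A minimal sketch with `urldefrag`, using a placeholder host:

```python
from urllib.parse import urldefrag

url = "https://example.com/anatomy/fragment?lang=en#subtitle-2017"

# request_url is what would go over the wire; the fragment stays client-side.
request_url, fragment = urldefrag(url)

print(request_url)
print(fragment)
```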


  24. A Pythonic Example
    Let’s look at some Python


  25. core.py
    def func(a1, a2, kw1=None):
        pass
    Python is pretty powerful
    caller.py
    from pkg.mod import func
    # powerful
    func(arg1, arg2, kw1='val1')
    ?


  26. But it seems URLs can keep up!
    !
    py://func.module.pkg/arg1/arg2?kw1=val1#awesome
    \_/  \_____________/\________/ \______/ \_____/
     |          |           |         |        |
    scheme  authority     path      query   fragment
    OK, back to reality.
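
For the curious, the tongue-in-cheek `py://` URL above splits into exactly those five parts with the standard library. This is only a sketch of the parsing step; `py` is not a registered scheme and nothing here actually calls a function.

```python
from urllib.parse import urlsplit, parse_qsl

parts = urlsplit("py://func.module.pkg/arg1/arg2?kw1=val1#awesome")

# Positional arguments live in the path, keyword arguments in the query.
args = [segment for segment in parts.path.split("/") if segment]
kwargs = dict(parse_qsl(parts.query))

print(parts.scheme, parts.netloc, args, kwargs, parts.fragment)
```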


  27. What about urlparse?
    No standard library is perfect...


  28. urlparse design gaps
    ◉ Mostly RFC1738 (1994) and RFC2396 (1998)
    ◉ URLs are “just” tuples of strings
    ◉ Hardcoded schemes (~25)
    ◉ Crufty APIs
    ○ urlparse vs. urlsplit
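
The "tuples of strings" and "urlparse vs. urlsplit" points are easy to see in Python 3, where the old `urlparse` module lives on as `urllib.parse`. A minimal sketch with a placeholder URL:

```python
from urllib.parse import urlparse, urlsplit

url = "http://example.com/anatomy;type=a?lang=en#frag"

# urlparse returns a 6-tuple: it splits ";params" off the last path segment.
parsed = urlparse(url)

# urlsplit returns a 5-tuple: params stay inside the path.
split = urlsplit(url)

print(len(parsed), parsed.path, parsed.params)
print(len(split), split.path)
```

Both results are named tuples of plain strings, so nothing stops invalid values from being swapped in with `_replace`.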


  29. What do we do?
    pip install hyperlink


  30. pip install hyperlink
    ◉ RFC3986+
    ◉ Full-fledged URL type
    ◉ 58 schemes and counting
    ◉ Smart conventions
    ○ Plus schemes (git+ssh, etc.)
    ○ IPv6 validation
    ○ normalization
    ◉ Python 2.6 - 3.6 tested
    ◉ github.com/mahmoud/hyperlink
    ◉ hyperlink.readthedocs.io


  31. Hyperlink API highlights
    ◉ Immutable URL type
    ◉ URIs for computers, IRIs for humans
    >>> url = URL.from_text(u'http://example.com/caf%C3%A9/au láit')
    >>> print(url.to_iri().to_text())
    http://example.com/café/au láit
    >>> print(url.to_uri().to_text())
    http://example.com/caf%C3%A9/au%20l%C3%A1it


  32. Want corner cases?
    Check hyperlink/test


  33. Hyperlink History and Future
    My idea of fun over time:
    ◉ 2013
    ○ Build an IO-agnostic HTTP library and spend
    way too much time reading URL RFCs
    ◉ 2017
    ○ Work with the Twisted project to merge my URL
    (boltons.urlutils) with twisted.python.url
    ◉ Future
    ○ Work on the Hyper project to bring more
    sans-IO web libraries to Python
    ○ https://github.com/python-hyper/


  34. URLs in short
    ◉ Flexible
    ◉ Powerful
    ◉ Becoming even more useful
    URLs are what you make of them.


  35. Any questions?
    ◉ github.com/mahmoud
    ◉ twitter.com/mhashemi
    ◉ sedimental.org
    Thanks!
