PyData Berlin Meetup Nov 2015 - (Some of the) things I wish I knew before starting using Python for Data Science

Lightning talk given at the PyData Berlin meetup, November 2015.

Miguel Cabrera

November 19, 2015

Transcript

  1. (Some of the) Things I
    wish I knew before
    starting using Python
    for Data Science
    Miguel Cabrera

  2. Background
    C/Java Experience
    Python at the University
    Mostly Numpy/Scikit-Learn
    Not Pythonic

  3. From This

  4. To This

  6. Integration Time
    You have to integrate your code into an existing code base.
    You have to make your code maintainable and reusable.
    Sometimes your code deals with semi-structured and textual data.

  7. The Things

  8. Autovivification

  9. One way
    Straight out of Wikipedia:
    from collections import defaultdict

    def tree():
        return defaultdict(tree)

    common_name = tree()
    common_name['Mammalia']['Primates']['Homo']['H. sapiens'] = 'human being'
    print(common_name)

    Output:
    defaultdict(<function tree at 0x100607c80>, {'Mammalia': defaultdict(<function tree at 0x100607c80>, {'Primates': defaultdict(<function tree at 0x100607c80>, {'Homo': defaultdict(<function tree at 0x100607c80>, {'H. sapiens': 'human being'})})})})

  10. Another Way
    This question on Stack Overflow shows an alternative (maybe clearer) way:
    class Vividict(dict):
        def __missing__(self, key):
            # Called by dict for a missing key: create, store and return a nested Vividict.
            value = self[key] = type(self)()
            return value

    common_name = Vividict()
    common_name['Mammalia']['Primates']['Homo']['H. sapiens'] = 'human being'
    print(common_name)

    Output:
    {'Mammalia': {'Primates': {'Homo': {'H. sapiens': 'human being'}}}}
    That is: Mammalia : (Primates : (Homo : (H. sapiens : human being)))
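
One side effect of autovivification is worth illustrating: merely reading a missing key creates it. The sketch below shows this with the `defaultdict`-based tree, and also how `.get()` and `in` avoid it. The helper `to_plain_dict` and the keys `'Aves'` and `'Reptilia'` are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict


def tree():
    """Return a defaultdict whose missing values are themselves trees."""
    return defaultdict(tree)


def to_plain_dict(node):
    """Recursively convert a tree into plain dicts (hypothetical helper)."""
    if isinstance(node, dict):
        return {key: to_plain_dict(value) for key, value in node.items()}
    return node


taxonomy = tree()
taxonomy['Mammalia']['Primates']['Homo']['H. sapiens'] = 'human being'

# Merely reading a missing key with [] creates it as an empty subtree.
_ = taxonomy['Aves']
created_by_lookup = 'Aves' in taxonomy

# .get() and the `in` operator do not trigger autovivification.
missing = taxonomy.get('Reptilia')
still_absent = 'Reptilia' not in taxonomy

# Converting to plain dicts stops further accidental key creation.
plain = to_plain_dict(taxonomy)
```

Converting to plain dicts once the structure is built is one way to make later typos raise `KeyError` instead of silently adding keys.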

  11. What for?
    We have this:
    id-1 a 20 10
    id-2 a 50 2
    id-1 b -1 -5
    id-3 c 10 30
    id-2 d -1 -2
    And let's say we would like to end up with something like:
    {
      "id-1": {
        "a": { "score_1": 20, "score_2": 10 },
        "b": { "score_1": -1, "score_2": -5 }
      }
    }

  12. With a ViviDict
    import pprint

    class Vividict(dict):
        def __missing__(self, key):
            value = self[key] = type(self)()
            return value

    # The rows from the previous slide: (id, label, score_1, score_2)
    table = [
        ('id-1', 'a', 20, 10),
        ('id-2', 'a', 50, 2),
        ('id-1', 'b', -1, -5),
        ('id-3', 'c', 10, 30),
        ('id-2', 'd', -1, -2),
    ]

    zombie = Vividict()
    for row in table:
        zombie[row[0]][row[1]]['score_1'] = row[2]
        zombie[row[0]][row[1]]['score_2'] = row[3]
    pprint.pprint(zombie)

    Output:
    {'id-1': {'a': {'score_1': 20, 'score_2': 10},
              'b': {'score_1': -1, 'score_2': -5}},
     'id-2': {'a': {'score_1': 50, 'score_2': 2},
              'd': {'score_1': -1, 'score_2': -2}},
     'id-3': {'c': {'score_1': 10, 'score_2': 30}}}
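
Because a Vividict is still a `dict`, it can be handed straight to the standard `json` module to produce the nested JSON document sketched earlier. This is a minimal sketch; the variable names (`scores`, `as_json`, `round_trip`) are illustrative assumptions.

```python
import json


class Vividict(dict):
    def __missing__(self, key):
        # Create, store and return a nested Vividict for a missing key.
        value = self[key] = type(self)()
        return value


# Rows in the form (id, label, score_1, score_2).
table = [
    ('id-1', 'a', 20, 10),
    ('id-2', 'a', 50, 2),
    ('id-1', 'b', -1, -5),
    ('id-3', 'c', 10, 30),
    ('id-2', 'd', -1, -2),
]

scores = Vividict()
for row_id, label, score_1, score_2 in table:
    scores[row_id][label]['score_1'] = score_1
    scores[row_id][label]['score_2'] = score_2

# json.dumps accepts dict subclasses; sort_keys makes the output stable.
as_json = json.dumps(scores, sort_keys=True)

# Loading the JSON back yields plain dicts with the same content.
round_trip = json.loads(as_json)
```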

  13. Iterators and Iterables

  14. What?
    source: http://nvie.com/posts/iterators-vs-generators/

  15. Example: A Generator
    generator = (word + '!' for word in 'hit me baby one more time'.split())
    try:
        len(generator)
    except TypeError:
        print("Generators have no length!")
    for w in generator:
        print(w)

    Output:
    Generators have no length!
    hit!
    me!
    baby!
    one!
    more!
    time!
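
A related point the example hints at: a generator can be consumed only once, whereas an iterable object whose `__iter__` builds a fresh generator can be looped over repeatedly. The class name `Shouted` below is a hypothetical illustration, not something from the talk.

```python
class Shouted(object):
    """Iterable that yields each word of a text with '!' appended."""

    def __init__(self, text):
        self.text = text

    def __iter__(self):
        # A fresh generator is created on every call, so the object
        # can be iterated as many times as needed.
        for word in self.text.split():
            yield word + '!'


generator = (word + '!' for word in 'hit me baby'.split())
first_pass = list(generator)
second_pass = list(generator)  # empty: the generator is already exhausted

shouted = Shouted('hit me baby')
first_loop = list(shouted)
second_loop = list(shouted)    # same items again: iteration restarts
```

This distinction matters for algorithms that need several passes over the same data.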

  16. What does it have to do with Data Science?
    Data Streaming through Lazy Evaluation
    Excellent discussion:
    http://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
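
As a rough sketch of what streaming through lazy evaluation can look like, the generators below are chained so that each record is read and tokenised only when the consumer asks for it. The function names and the synthetic, endless data source are assumptions made for illustration.

```python
import itertools


def endless_lines():
    """Simulate an unbounded stream of tab-separated lines."""
    for n in itertools.count(1):
        yield 'id-%d\tReview number %d\n' % (n, n)


def read_records(lines):
    """Lazily split each line into an (id, text) pair."""
    for line in lines:
        record_id, text = line.rstrip('\n').split('\t', 1)
        yield record_id, text


def tokenize(records):
    """Lazily lowercase and split the text of each record."""
    for record_id, text in records:
        yield record_id, text.lower().split()


# Nothing is computed here: the pipeline is only a chain of generators.
pipeline = tokenize(read_records(endless_lines()))

# Only the three requested records are ever produced, even though
# the source is infinite.
first_three = list(itertools.islice(pipeline, 3))
```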

  17. Something more useful
    import codecs

    class HdfsLineSentence(object):
        # Assumed constructor (not shown on the slide): `source` is any object
        # whose open('r') returns an iterable of tab-separated lines.
        def __init__(self, source):
            self.source = source

        def __iter__(self):
            stream = self.source.open('r')
            for line in stream:
                cid, s = line.split('\t')
                # Decode each token (byte strings, as in Python 2), replacing invalid UTF-8.
                s = u" ".join(codecs.decode(word, 'utf-8', 'replace') for word in s.split())
                s = s.split()
                yield s
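
The same pattern can be tried without HDFS. The sketch below is a local-file analogue, assuming a tab-separated file with an id and a sentence per line; the class name `LocalLineSentence` and the demo data are hypothetical.

```python
import io
import os
import tempfile


class LocalLineSentence(object):
    """Stream tokenised sentences from a local tab-separated file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Opening the file inside __iter__ means every loop starts from the
        # top of the file and only one line is held in memory at a time.
        with io.open(self.path, 'r', encoding='utf-8', errors='replace') as stream:
            for line in stream:
                cid, sentence = line.rstrip('\n').split('\t', 1)
                yield sentence.split()


# Small demo file so the example is self-contained.
handle, demo_path = tempfile.mkstemp(suffix='.tsv')
os.close(handle)
with io.open(demo_path, 'w', encoding='utf-8') as out:
    out.write(u'c1\tgreat location and friendly staff\n')
    out.write(u'c2\tsmall room\n')

sentences = LocalLineSentence(demo_path)
first_pass = list(sentences)
second_pass = list(sentences)  # iterating again re-reads the file
os.remove(demo_path)
```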

  18. NamedTuples

  19. Why
    Many Python developers write code around the dict class or tuples
    You never know what to expect
    Code becomes hard to read
    From
    http://stackoverflow.com/questions/2970608/what-are-named-tuples-in-python
    pt1 = (1.0, 5.0)
    pt2 = (2.5, 1.5)

    from math import sqrt
    line_length = sqrt((pt1[0]-pt2[0])**2 + (pt1[1]-pt2[1])**2)

  20. Enter NamedTuples
    Named tuples assign meaning to each position in a tuple and
    allow for more readable, self-documenting code. They can be
    used wherever regular tuples are used, and they add the ability
    to access fields by name instead of position index.
    from collections import namedtuple
    Point = namedtuple('Point', 'x y')
    pt1 = Point(1.0, 5.0)
    pt2 = Point(2.5, 1.5)

    from math import sqrt
    line_length = sqrt((pt1.x-pt2.x)**2 + (pt1.y-pt2.y)**2)
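
To make the "used wherever regular tuples are used" claim concrete, the sketch below checks that a `Point` still unpacks, indexes and compares like a tuple, and that it is immutable. The variable names are illustrative.

```python
from collections import namedtuple
from math import sqrt

Point = namedtuple('Point', 'x y')
pt1 = Point(1.0, 5.0)
pt2 = Point(2.5, 1.5)

# A named tuple still behaves like a regular tuple.
x, y = pt1                         # unpacking
by_index = pt1[0]                  # positional access
same_as_tuple = pt1 == (1.0, 5.0)  # equality with a plain tuple

line_length = sqrt((pt1.x - pt2.x) ** 2 + (pt1.y - pt2.y) ** 2)

# Fields cannot be reassigned; _replace returns a modified copy instead.
try:
    pt1.x = 9.0
    mutated = True
except AttributeError:
    mutated = False

moved = pt1._replace(x=9.0)
```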

  21. NamedTuples provide cool methods
    Some of them:
    Name              Description
    _asdict()         Return a new OrderedDict which maps field names to their
                      values (a plain dict since Python 3.8)
    _make(iterable)   Class method that makes a new instance from an existing
                      sequence or iterable.
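
A short sketch of both methods in use, assuming rows shaped like the score table from the autovivification slides; the `Score` type and its field names are illustrative.

```python
from collections import namedtuple

Score = namedtuple('Score', ['row_id', 'label', 'score_1', 'score_2'])

raw_rows = [
    ('id-1', 'a', 20, 10),
    ('id-2', 'a', 50, 2),
]

# _make builds an instance from any sequence or iterable.
scores = [Score._make(row) for row in raw_rows]

# _asdict maps field names to values; dict() normalises the result
# regardless of whether the Python version returns an OrderedDict.
as_dicts = [dict(score._asdict()) for score in scores]

# _fields lists the field names in order.
fields = Score._fields
```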

  22. You can extend a NamedTuple
    from collections import namedtuple

    # NotEnoughDataForRanking and _compute_prior are defined elsewhere (not shown on the slide).
    _HotelBase = namedtuple(
        'HotelDescriptor',
        ['cluster_id', 'trust_score', 'reviews_count', 'category_scores', 'intensity_factors'],
    )

    class HotelDescriptor(_HotelBase):
        def compute_prior(self):
            if not self.trust_score or not self.reviews_count:
                raise NotEnoughDataForRanking("Cannot compute prior without tyscore and reviews")
            return _compute_prior(self.trust_score, self.reviews_count)

        (...)
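
A self-contained version of the same pattern, using the `Point` example instead of the hotel code. The method `distance_to` is a hypothetical illustration; setting `__slots__ = ()` is a common choice that keeps instances as lightweight as plain tuples.

```python
from collections import namedtuple
from math import sqrt

_PointBase = namedtuple('Point', ['x', 'y'])


class Point(_PointBase):
    # Empty __slots__ prevents a per-instance __dict__, so instances stay
    # as small as the underlying tuple.
    __slots__ = ()

    def distance_to(self, other):
        """Euclidean distance between this point and another one."""
        return sqrt((self.x - other.x) ** 2 + (self.y - other.y) ** 2)


origin = Point(0.0, 0.0)
corner = Point(3.0, 4.0)
distance = origin.distance_to(corner)
```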

  23. Conclusion
    (Aspiring) Data Scientists / Engineers should learn:
    Standard library (the collections module in particular)
    Iterables and Iterators
    Object oriented practices
    Documenting your code
    How to package
    Exposing your models (e.g. via an API)

  24. Questions?

  25. Created by Miguel Cabrera.
