PyData Berlin Meetup Nov 2015 - (Some of the) things I wish I knew before starting using Python for Data Science

Lightning talk given at the PyData Berlin meetup, November 2015.

Miguel Cabrera

November 19, 2015

Transcript

  1. (Some of the) Things I
    wish I knew before
    starting using Python
    for Data Science
    Miguel Cabrera

  2. Background
    C/Java Experience
    Python at the University
    Mostly Numpy/Scikit-Learn
    Not Pythonic

  3. Integration Time
    You have to integrate your code into an existing code base.
    You have to make your code maintainable and reusable.
    Sometimes your code deals with semi-structured and textual data.

  4. Autovivification

  5. One way
    Straight out of Wikipedia:

        from collections import defaultdict

        def tree():
            # Every missing key creates another nested tree.
            return defaultdict(tree)

        common_name = tree()
        common_name['Mammalia']['Primates']['Homo']['H. sapiens'] = 'human being'
        print(common_name)

    Output:

        defaultdict(<function tree at 0x100607c80>, {'Mammalia': defaultdict(<function tree at 0x100607c80>, {'Primates': defaultdict(<function tree at 0x100607c80>, {'Homo': defaultdict(<function tree at 0x100607c80>, {'H. sapiens': 'human being'})})})})

  6. Another Way
    This question on Stack Overflow shows an alternative (maybe clearer) way:

        class Vividict(dict):
            def __missing__(self, key):
                # Called by dict lookup when the key is absent:
                # store and return a new nested Vividict.
                value = self[key] = type(self)()
                return value

        common_name = Vividict()
        common_name['Mammalia']['Primates']['Homo']['H. sapiens'] = 'human being'
        print(common_name)

    Result (as shown on the slide):

        Mammalia : (Primates : (Homo : (H. sapiens : human being)))
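
    One caveat applies to both approaches: merely reading a missing key creates it. The sketch below reuses the Vividict class from the slide to show that side effect; to_plain_dict is a hypothetical helper added here for illustration, not something from the talk.

```python
class Vividict(dict):
    """dict that creates nested Vividicts for missing keys."""

    def __missing__(self, key):
        value = self[key] = type(self)()
        return value


def to_plain_dict(d):
    """Hypothetical helper: recursively convert nested Vividicts to plain dicts."""
    return {k: to_plain_dict(v) if isinstance(v, dict) else v for k, v in d.items()}


names = Vividict()
names['Mammalia']['Primates']['Homo']['H. sapiens'] = 'human being'

# Reading a missing key with [] silently inserts an empty entry.
_ = names['Aves']
assert 'Aves' in names

# `in` and .get() do not go through __missing__, so they insert nothing.
assert names.get('Reptilia') is None
assert 'Reptilia' not in names

# Freeze the structure so later lookups raise KeyError as usual.
plain = to_plain_dict(names)
```

    Converting to plain dicts once the structure is built is one way to keep typos in later lookups from quietly adding keys.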

  7. What for?
    We have this:
    id-1 a 20 10
    id-2 a 50 2
    id-1 b -1 -5
    id-3 c 10 30
    id-2 d -1 -2
    And let's say we would like to end up with something like:

        {
            "id-1": {
                "a": {"score_1": 20, "score_2": 10},
                "b": {"score_1": -1, "score_2": -5}
            }
        }

  8. With a ViviDict

        import pprint

        class Vividict(dict):
            def __missing__(self, key):
                value = self[key] = type(self)()
                return value

        # table holds the rows from the previous slide, e.g. ('id-1', 'a', 20, 10).
        zombie = Vividict()
        for row in table:
            zombie[row[0]][row[1]]['score_1'] = row[2]
            zombie[row[0]][row[1]]['score_2'] = row[3]

        pprint.pprint(zombie)

    Output:

        {'id-1': {'a': {'score_1': 20, 'score_2': 10},
                  'b': {'score_1': -1, 'score_2': -5}},
         'id-2': {'a': {'score_1': 50, 'score_2': 2},
                  'd': {'score_1': -1, 'score_2': -2}},
         'id-3': {'c': {'score_1': 10, 'score_2': 30}}}
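
    When the nesting depth is fixed and known, as it is here (id, then key), a plain defaultdict(dict) from the standard library may be enough. This is a different technique from the Vividict on the slide, sketched with the rows from the "What for?" slide written out as tuples (that literal layout is an assumption):

```python
import json
from collections import defaultdict

# Rows from the "What for?" slide: (id, key, score_1, score_2).
table = [
    ('id-1', 'a', 20, 10),
    ('id-2', 'a', 50, 2),
    ('id-1', 'b', -1, -5),
    ('id-3', 'c', 10, 30),
    ('id-2', 'd', -1, -2),
]

# Only the first level needs to be created on demand.
scores = defaultdict(dict)
for row_id, key, score_1, score_2 in table:
    scores[row_id][key] = {'score_1': score_1, 'score_2': score_2}

# defaultdict is a dict subclass, so json.dumps can serialize it directly.
as_json = json.dumps(scores, indent=2, sort_keys=True)
print(as_json)
```

    Unpacking each row into names avoids the row[0], row[1] indexing, which is the same readability concern the named tuple slides address later.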

  9. Iterators and Iterables

  10. What?
    source: http://nvie.com/posts/iterators-vs-generators/
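
    The distinction can be sketched in a few lines: an iterable hands out a fresh iterator every time iter() is called on it, while an iterator is consumed as it goes. The variable names below are illustrative only.

```python
words = ['hit', 'me', 'baby']   # a list is an iterable, not an iterator

it = iter(words)                # iter() asks the iterable for a fresh iterator
first = next(it)                # takes the first element
rest = list(it)                 # consumes whatever is left
exhausted = list(it)            # an exhausted iterator yields nothing more

# An iterator is its own iterator; an iterable returns a new one each time.
assert iter(it) is it
assert iter(words) is not iter(words)

# The iterable itself can be looped over again and again.
again = list(words)
```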

  11. Example: A Generator

        generator = (word + '!' for word in 'hit me baby one more time'.split())

        # A generator produces values on demand, so it has no len().
        try:
            len(generator)
        except TypeError:
            print("Generators have no length!")

        for w in generator:
            print(w)

    Output:

        Generators have no length!
        hit!
        me!
        baby!
        one!
        more!
        time!

  12. What does it have to do with Data Science?
    Data Streaming through Lazy Evaluation
    Excellent discussion:
    http://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
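
    As a rough illustration of lazy evaluation, the sketch below chains two generators over an infinite source and pulls only five results through. The function names are made up for this example; the point is that no stage ever materializes the whole stream.

```python
from itertools import count, islice


def squares(numbers):
    """Lazily yield the square of each incoming number."""
    for n in numbers:
        yield n * n


def only_even(numbers):
    """Lazily keep only the even values."""
    return (n for n in numbers if n % 2 == 0)


# count(1) is an infinite stream; nothing is computed until a value is requested.
pipeline = only_even(squares(count(1)))

# islice pulls just the first five results through the pipeline.
first_five = list(islice(pipeline, 5))
print(first_five)
```

    The same shape carries over to data work: replace count(1) with a reader over a large file and each record flows through the stages one at a time.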

  13. Something more useful

        import codecs

        # Python 2 style: each line read from the stream is a byte string.
        class HdfsLineSentence(object):
            # self.source is assumed to be set in a constructor not shown on the slide.
            def __iter__(self):
                stream = self.source.open('r')
                for line in stream:
                    # Each line holds an id and a sentence separated by a tab.
                    cid, s = line.split('\t')
                    # Decode word by word, replacing invalid UTF-8 bytes.
                    s = u" ".join(codecs.decode(word, 'utf-8', 'replace') for word in s.split())
                    # Yield the sentence as a list of tokens.
                    s = s.split()
                    yield s
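
    Defining __iter__ on a class, rather than handing around a bare generator, makes the stream re-iterable: every loop starts a fresh pass over the source. Below is a minimal local-file sketch of the same idea for Python 3. LineSentences and the sample file are hypothetical stand-ins, not the HDFS-backed class from the talk.

```python
import os
import tempfile


class LineSentences(object):
    """Hypothetical local-file analogue of the class on the slide."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every iteration, so the object can be
        # looped over several times (a bare generator could not).
        with open(self.path, encoding='utf-8') as stream:
            for line in stream:
                cid, text = line.rstrip('\n').split('\t', 1)
                yield text.split()


# Write a tiny tab-separated sample file: "<id>\t<sentence>".
with tempfile.NamedTemporaryFile('w', suffix='.tsv', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('c1\thit me baby\n')
    tmp.write('c2\tone more time\n')

sentences = LineSentences(tmp.name)
first_pass = list(sentences)
second_pass = list(sentences)   # works again: __iter__ starts a fresh pass
os.remove(tmp.name)
```

    Only one line is held in memory at a time, which matters for consumers that need several passes over a corpus too large to load at once.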

  14. Why
    Many Python developers write code around the dict class or tuples
    You never know what to expect
    Code becomes hard to read
    From http://stackoverflow.com/questions/2970608/what-are-named-tuples-in-python:

        pt1 = (1.0, 5.0)
        pt2 = (2.5, 1.5)

        from math import sqrt
        line_length = sqrt((pt1[0]-pt2[0])**2 + (pt1[1]-pt2[1])**2)

  15. Enter NamedTuples
    Named tuples assign meaning to each position in a tuple and
    allow for more readable, self-documenting code. They can be
    used wherever regular tuples are used, and they add the ability
    to access fields by name instead of position index.

        from collections import namedtuple
        Point = namedtuple('Point', 'x y')
        pt1 = Point(1.0, 5.0)
        pt2 = Point(2.5, 1.5)

        from math import sqrt
        line_length = sqrt((pt1.x-pt2.x)**2 + (pt1.y-pt2.y)**2)

  16. NamedTuples provide cool methods
    Some of them:

    Name             Description
    _asdict()        Return a new OrderedDict (a plain dict since Python 3.8)
                     which maps field names to their values.
    _make(iterable)  Class method that makes a new instance from an existing
                     sequence or iterable.
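
    A short sketch of these methods in use, reusing the Point type from the previous slide. _replace and _fields are two further documented members of namedtuple that the slide does not list; they are included here as an addition.

```python
from collections import namedtuple

Point = namedtuple('Point', 'x y')

# _make builds an instance from any iterable, e.g. a parsed CSV row.
pt = Point._make([1.0, 5.0])

# _asdict returns a mapping from field names to values.
as_dict = pt._asdict()

# _replace returns a new tuple with some fields changed;
# the original stays untouched because tuples are immutable.
moved = pt._replace(x=2.5)

# _fields lists the field names in order.
fields = Point._fields
```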

  17. You can extend a NamedTuple

        from collections import namedtuple

        _HotelBase = namedtuple(
            'HotelDescriptor',
            ['cluster_id', 'trust_score', 'reviews_count',
             'category_scores', 'intensity_factors'],
        )

        class HotelDescriptor(_HotelBase):
            def compute_prior(self):
                # NotEnoughDataForRanking and _compute_prior are defined elsewhere (not shown on the slide).
                if not self.trust_score or not self.reviews_count:
                    raise NotEnoughDataForRanking("Cannot compute prior without tyscore and reviews")
                return _compute_prior(self.trust_score, self.reviews_count)

            (...)
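
    The slide's class depends on helpers that are not shown, so here is a self-contained sketch of the same pattern. Review, its fields, and weighted_score are hypothetical and only illustrate how a method can be attached to a named tuple; the scoring formula is invented for the example.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for the slide's hotel record.
_ReviewBase = namedtuple('Review', ['trust_score', 'reviews_count'])


class Review(_ReviewBase):
    # Empty __slots__ keeps instances as lightweight as plain tuples
    # (no per-instance __dict__).
    __slots__ = ()

    def weighted_score(self):
        """Illustrative method only: scale the score by review volume."""
        if not self.reviews_count:
            raise ValueError("Cannot compute a score without reviews")
        return self.trust_score * min(self.reviews_count, 100) / 100.0


r = Review(trust_score=80.0, reviews_count=50)
score = r.weighted_score()
```

    The subclass still behaves like a tuple (it unpacks, compares, and hashes by value), so it can be dropped into code that expected the plain record.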

  18. Conclusion
    (Aspiring) Data Scientists / Engineers should learn:
    The standard library (the collections module in particular)
    Iterables and iterators
    Object-oriented practices
    Documenting your code
    How to package your code
    Exposing your models (e.g. via an API)

  19. Created by Miguel Cabrera.
