
Python and R: Together at Last

Both Python and R boast large data science communities. Each has developed a fantastic collection of packages for everything from reading and writing data to plotting and visualization. Unfortunately, some tools are available in only one of the two languages. Python and R both provide relatively simple mechanisms for interacting with C, C++, and Fortran, and many tools take advantage of this interoperability. While not a simple matter, developing data science tools in these low-level languages and providing Python and R wrappers allows code reuse between languages, in addition to any speed benefits. In this talk we will discuss strategies and lessons learned from porting existing packages from R to Python and writing cross-language tools from scratch.

Bill Lattner

July 13, 2016

Transcript

  1. Building a Data-Driven World™
    Python and R Together at Last
    Writing Cross-Language Tools

  2. Civis Analytics
    Python vs R

  3. Python vs R
    No!

  4. Meet users where they are
    Prior Knowledge
    R is popular in some fields, Python in others. Diverse teams are often polyglot.
    Availability of Key Packages
    Important packages are often available in only one language: NLTK in Python, glmnet in R. This means a data science workflow often needs to use multiple languages.
    Tradeoffs
    Different languages optimize for different things. Python is a general-purpose language, R is optimized for statistics and manipulation of tabular data, and Go is a great fit for network services.

  5. Some tools are already cross-language

  6. How?

  7. Two Options
    Native/Compiled Extensions (C/C++)
    Pros
    • fast!
    • many languages speak C
    Cons
    • takes more code
    • difficult
    Examples
    • Stan
    • XGBoost
    RPC over TCP/HTTP or IPC
    Pros
    • every language speaks TCP/HTTP
    • easy to “wire up” host language
    Cons
    • cost of communication
    Examples
    • Spark
    • H2O

  8. Two Options
    Native/Compiled Extensions (C/C++): our focus for today.
    RPC over TCP/HTTP or IPC

  9. Why C
    • Python and R “speak” C
    • Fast!
    • Portable (mostly)
    • Simple

  10. C++: The Good Parts

  11. Modern C
    • tooling has come a long way
    • various “sanitizers”
      • address/memory sanitizer
      • undefined behavior sanitizer
      • leak sanitizer
      • thread sanitizer
    • clang gives much better error messages

  12. Alternatives: The Hourglass Interface
    Credit: Hourglass Interfaces for C++ APIs, Stefanus Du Toit
    Python, R, Julia, Ruby, Go, C, Rust
    C99
    C++

  13. Alternatives: The Hourglass Interface
    host language: Python, R, Julia, Ruby, Go, C, Rust

  14. Alternatives: The Hourglass Interface
    public api: C99

  15. Alternatives: The Hourglass Interface
    implementation language: C++
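
The narrow C waist of the hourglass is what lets a host language bind to a library without knowing its implementation language. A minimal sketch of the host side, assuming a POSIX system and using the C math library's `cos` as a stand-in for a C99 public API (the talk does not use this function; it is only an illustration):

```python
import ctypes
import ctypes.util

# Assumption: libm stands in for a library that exposes a plain C public API.
# find_library may return None; CDLL(None) then falls back to the symbols
# already loaded into the current process (POSIX only).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double cos(double)
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double


def c_cos(x):
    """Call the C function through its narrow, C-only interface."""
    return libm.cos(float(x))
```

Only the C signature is visible to Python here; whether the code behind it is written in C, C++, or something else makes no difference to the caller.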

  16. Example

  17. The Mighty Summation Function
    Note: It’s best to start development in a language like Python.
    def tally(s):
        total = 0
        for elm in s:
            total += elm
        return total

  18. Smoke Test
    In [1]: tally([1, 2, 3])
    Out[1]: 6
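
One reason to start in Python is that the prototype can later serve as a reference for the compiled version. A small sketch of that idea, checking `tally` against the built-in `sum` on random integer inputs (the helper name `check_against_builtin` is hypothetical, not from the talk):

```python
import random


def tally(s):
    total = 0
    for elm in s:
        total += elm
    return total


def check_against_builtin(trials=100, seed=0):
    """Compare tally with the built-in sum on random integer lists."""
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randint(0, 50)
        data = [rng.randint(-1000, 1000) for _ in range(size)]
        if tally(data) != sum(data):
            return False
    return True
```

The same check can be pointed at a wrapped C implementation once one exists.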

  19. C Implementation
    #include <stddef.h>

    double tally(double *s, size_t n) {
      double total = 0;
      for (size_t i = 0; i < n; i++) {
        total += s[i];
      }
      return total;
    }

  20. C Implementation
    size_t n: need to pass the length

  21. C/C++ and Python
    • Cython
    • CFFI
    • ctypes
    • C (via the Python C API)
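
Of the options above, ctypes needs no compilation step on the Python side. The sketch below shows how a Python sequence might be marshalled into a `double *` plus a length for a function with the signature `double tally(double *s, size_t n)`. Since no compiled library is available here, a Python callback wrapped with `ctypes.CFUNCTYPE` stands in for the C function; that stand-in and the name `c_tally` are assumptions for illustration only.

```python
import ctypes

# Prototype matching the C signature: double tally(double *s, size_t n)
TALLY_PROTO = ctypes.CFUNCTYPE(
    ctypes.c_double, ctypes.POINTER(ctypes.c_double), ctypes.c_size_t
)


def _tally_impl(s, n):
    # Stand-in for the compiled C function: it sees only a raw pointer,
    # so the length n is the only way to know where the data ends.
    total = 0.0
    for i in range(n):
        total += s[i]
    return total


# Assumption: in a real build this callable would be looked up in a shared
# library loaded with ctypes.CDLL, with argtypes and restype set by hand.
c_tally = TALLY_PROTO(_tally_impl)


def tally(values):
    """Copy a Python sequence into a C double array and pass its length."""
    n = len(values)
    buf = (ctypes.c_double * n)(*values)  # contiguous doubles owned by Python
    return c_tally(buf, n)
```

The array is allocated and freed by Python; the C side only borrows the pointer for the duration of the call.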

  22. The Python C API
    #include <string.h>  // strcmp
    #include "Python.h"
    #include "tally.h"

    static PyObject *tally_(PyObject *self, PyObject *args) {
      // decode/cast the args
      // call our C function tally
      // build the result
    }

    // module method table
    static PyMethodDef MethodTable[] = {
      // ...
    };

    // module def
    static struct PyModuleDef tally_module = {
      // ...
    };

    // module init
    PyMODINIT_FUNC PyInit_tally_py(void) {
      return PyModule_Create(&tally_module);
    }

  23. The Python C API: Buffer API
    static PyObject *tally_(PyObject *self, PyObject *args) {
      PyObject *buf;
      if (!PyArg_ParseTuple(args, "O", &buf)) {
        return NULL;
      }

      Py_buffer view;
      int buf_flags = PyBUF_ANY_CONTIGUOUS | PyBUF_FORMAT;
      if (PyObject_GetBuffer(buf, &view, buf_flags) == -1) {
        return NULL;
      }

      if (strcmp(view.format, "d") != 0) {
        PyErr_SetString(PyExc_TypeError, "we only take floats :(");
        PyBuffer_Release(&view);
        return NULL;
      }

      double result = tally(view.buf, view.shape[0]);
      PyBuffer_Release(&view);
      return Py_BuildValue("d", result);
    }

  24. The Python C API: Buffer API
    PyObject *buf;
    if (!PyArg_ParseTuple(args, "O", &buf)) {
      return NULL;
    }

  25. The Python C API: Buffer API
    Py_buffer view;
    int buf_flags = PyBUF_ANY_CONTIGUOUS | PyBUF_FORMAT;
    if (PyObject_GetBuffer(buf, &view, buf_flags) == -1) {
      return NULL;
    }

  26. The Python C API: Buffer API
    if (strcmp(view.format, "d") != 0) {
      PyErr_SetString(PyExc_TypeError, "we only take floats :(");
      PyBuffer_Release(&view);
      return NULL;
    }
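
The buffer checks above can be tried out from Python itself, because `memoryview` is built on the same buffer protocol that `PyObject_GetBuffer` uses. A rough sketch mirroring the C steps (acquire a view, require contiguous memory, require format `"d"`, release the view), assuming a one-dimensional input:

```python
from array import array


def tally(obj):
    """Sum a buffer of C doubles, mirroring the checks in the C wrapper."""
    view = memoryview(obj)  # raises TypeError if obj exports no buffer
    try:
        if not view.contiguous:
            raise BufferError("need contiguous memory")
        if view.format != "d":
            raise TypeError("we only take floats :(")
        total = 0.0
        for i in range(view.shape[0]):  # assumes a one-dimensional buffer
            total += view[i]
        return total
    finally:
        view.release()  # counterpart of PyBuffer_Release
```

For example, `tally(array("d", [1.0, 2.0, 3.0]))` passes the checks, while an `array("i", ...)` of C ints is rejected with a `TypeError`.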

  27. The Python C API
    double result = tally(view.buf, view.shape[0]);

  29. The Python C API: Method Table
    static PyMethodDef MethodTable[] = {
      {"tally", &tally_, METH_VARARGS, "Compute the sum of an array."},
      {NULL, NULL, 0, NULL}
    };

    static struct PyModuleDef tally_module = {
      .m_base = PyModuleDef_HEAD_INIT,
      .m_name = "tally_py",
      .m_size = -1,
      .m_methods = MethodTable
    };

    PyMODINIT_FUNC PyInit_tally_py(void) {
      return PyModule_Create(&tally_module);
    }

  30. C/C++ and R
    • Rcpp
    • C (via the R C API)

  31. The R C API
    #include <R.h>
    #include <Rinternals.h>
    #include <R_ext/Rdynload.h>
    #include "tally.h"

    SEXP tally_(SEXP x_) {
      // cast/decode the input
      // call our tally function
      // build the output
    }

    // method table
    static R_CallMethodDef callMethods[] = {
      // ...
    };

    // module/package init
    void R_init_tally_r(DllInfo *info) {
      R_registerRoutines(info, NULL, callMethods, NULL, NULL);
    }

  32. The R C API
    SEXP tally_(SEXP x_) {
      double *x = REAL(x_);
      int n = length(x_);

      SEXP out = PROTECT(allocVector(REALSXP, 1));
      REAL(out)[0] = tally(x, n);
      UNPROTECT(1);

      return out;
    }

  36. The R C API: Function Registration
    static R_CallMethodDef callMethods[] = {
      {"tally_", (DL_FUNC)&tally_, 1},
      {NULL, NULL, 0}
    };

    void R_init_tally_r(DllInfo *info) {
      R_registerRoutines(info, NULL, callMethods, NULL, NULL);
    }

  37. Miscellaneous
    Dependencies
    Don’t depend on APIs from host languages, e.g., NumPy or Rmath.
    Errors
    Use error codes to signal problems. Don’t call abort or exit, as these will quit the process running the host language.
    Memory
    It is typically best to make the host language responsible for allocation and deallocation. It’s challenging to transfer ownership across the border.
    Logging/Verbosity
    At the very least, make this optional.
    Compiler
    Trust the compiler; it’s smarter than all of us. Ensure your code compiles without warnings.
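
The advice on errors and memory can be sketched from the host side. Below, the host allocates both the input array and the output slot, the library function only reports a status code, and the wrapper turns a non-zero code into an exception. The error codes, the three-argument signature, and the `CFUNCTYPE` stand-in for the compiled function are all hypothetical; the talk's `tally` returns the sum directly.

```python
import ctypes

TALLY_OK = 0        # hypothetical status codes
TALLY_ERR_NULL = 1

# Hypothetical C signature: int tally_checked(double *s, size_t n, double *out)
PROTO = ctypes.CFUNCTYPE(
    ctypes.c_int,
    ctypes.POINTER(ctypes.c_double),
    ctypes.c_size_t,
    ctypes.POINTER(ctypes.c_double),
)


def _impl(s, n, out):
    # Stand-in for the C code: report problems through the return value
    # instead of calling abort() or exit(). NULL pointers are falsy in ctypes.
    if not out or (n > 0 and not s):
        return TALLY_ERR_NULL
    total = 0.0
    for i in range(n):
        total += s[i]
    out[0] = total  # write into memory owned by the caller
    return TALLY_OK


c_tally_checked = PROTO(_impl)


def tally(values):
    n = len(values)
    data = (ctypes.c_double * n)(*values)  # allocated and freed by Python
    result = ctypes.c_double()             # output slot, also owned by Python
    status = c_tally_checked(data, n, ctypes.byref(result))
    if status != TALLY_OK:
        raise RuntimeError(f"tally failed with error code {status}")
    return result.value
```

Because nothing is allocated on the C side, there is no ownership to hand across the boundary, and a failure surfaces as an ordinary Python exception rather than a dead interpreter.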

  38. Parting Thoughts
    1. Meet users where they are
    2. Reach a larger audience
    3. Make a bigger impact

  39. Thank You
    Bill Lattner
    twitter: @wlattner
    github: github.com/wlattner