Wrapping C++ Arrow Why and How? – Yoni Davidson

GopherCon Russia

March 28, 2020

Transcript

  1. About me!

    Married + 1 + 4. Data Architect and SW engineer at Bond. Before: Sears Israel, Eyesight Mobile, Alvarion.
  2. What is Apache Arrow?

    Apache Arrow is a cross-language development platform for in-memory data. It provides zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.
  3. Apache Arrow Python bindings

    Based on the C++ project. Built with Cython. Allows integration with the massive Python ecosystem.
  4. Possible Awesome Usages

    Sharing a table with Python in the same memory space: zero serialization.
    Sharing a table with Python between services: zero serialization.
  5. Implement spec in pure Go

    Pros:
      1. “Closed problem”.
      2. It gives you all the advantages of Arrow (to date).
      3. Go allows us to improve the implementation.
    Cons:
      1. Hard to keep up with upstream changes.
      2. Harder to maintain.
      3. Hard to contribute improvements back to the C++ project.
  6. carrow - Go bindings to Apache Arrow via C++ API

    https://github.com/353solutions/carrow
  7. Go bindings to Apache Arrow via C++

    Pros:
      1. This project benefits from all the improvements on the C++ main branch.
      2. Anything we add in the Go project can be exported back to the Python/C++ project.
    Cons:
      1. It's much harder ...
  8. Challenge 1 - Go and C++ don’t link

    C++ compilers mangle symbol names. cgo doesn’t support mangled symbols, so a C wrapper is needed.
  9. Challenge 1 - Go and C++ don’t link

        #ifndef _CARROW_H_
        #define _CARROW_H_

        #ifdef __cplusplus
        extern "C" {
        #endif

        void *table_new(void *sp, void *cp);

        #ifdef __cplusplus
        }
        #endif // extern "C"

        #endif // #ifdef _CARROW_H_
  10. Challenge 2 - Building a C++/Go project

    C++ libraries and headers are required, so the development environment is more complex than for a pure Go project. The solution is a Dockerfile that includes the native C++ libraries and the Python bindings for end-to-end tests.
  11. Challenge 2 - Building a C++/Go project

        FROM ubuntu:18.04

        # Tools
        RUN apt-get update && apt-get install -y \
            gdb \
            git \
            ...

        # Go installation
        ...
  12. Challenge 2 - Building a C++/Go project

        # Python bindings
        RUN cd /tmp && \
            wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
            bash Miniconda3-latest-Linux-x86_64.sh -b -p /miniconda && \
            rm Miniconda3-latest-Linux-x86_64.sh
        ENV PATH="/miniconda/bin:${PATH}"
        RUN conda install -y \
            Cython \
            conda-forge::compilers \
            conda-forge::pyarrow=0.14 \
            ...
        ENV LD_LIBRARY_PATH=/miniconda/lib
        WORKDIR /src/carrow
  13. Challenge 3 - Wrapper for each type

    Since this is a wrapper library, wrapping each type takes a lot of “copy pasta” code. The solution was to use Go templates to generate some of the code.
  14. Challenge 3 - Wrapper for each type

        func main() {
            arrowTypes := []string{"Bool", "Float64", "Integer64", "String", "Timestamp"}
            . . .

        // Supported data types
        var(
        {{- range $val := .ArrowTypes}}
            {{$val}}Type = DType(C.{{$val | ToUpper }}_DTYPE)
        {{- end}}
        )
  15. Challenge 4 - Logger

    Do we send all our errors upstream to the Go package for logging? We could also create a Go logger and pass it down to the C++ code for logging.
  16. Challenge 5 - Error handling

    Where are errors handled? Where is the best place to log and handle them? For now, every call returns this result_t:

        typedef struct {
            const char *err;
            void *ptr;
            int64_t i;
        } result_t;
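
On the Go side, a struct like `result_t` maps naturally onto the usual `(value, error)` return pair. The sketch below mirrors the struct in pure Go so that it runs without cgo; the `result` type, `intFromResult` and `fakeNumRows` are hypothetical names for illustration and are not carrow's API. A real binding would also have to decide who frees the C error string, which this sketch ignores.

```go
package main

import (
	"errors"
	"fmt"
)

// result is a pure-Go stand-in for the slide's result_t. In real bindings
// err would be a *C.char and ptr an unsafe.Pointer.
type result struct {
	err string // empty string means success
	ptr uintptr
	i   int64
}

// intFromResult converts a result into an idiomatic (value, error) pair.
func intFromResult(r result) (int64, error) {
	if r.err != "" {
		return 0, errors.New(r.err)
	}
	return r.i, nil
}

// fakeNumRows simulates a wrapped C call (assumption: no such function
// exists under this name in carrow).
func fakeNumRows(fail bool) result {
	if fail {
		return result{err: "table is null"}
	}
	return result{i: 42}
}

func main() {
	n, err := intFromResult(fakeNumRows(false))
	fmt.Println(n, err)

	_, err = intFromResult(fakeNumRows(true))
	fmt.Println(err)
}
```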
  17. Challenge 666 - Memory management

    2 memory managers:
      1. Go runtime - automatic memory management.
      2. C++ runtime - Apache Arrow uses std::shared_ptr extensively.
  18. Challenge 666 - Memory management

    Wrap std::shared_ptr with a struct, so we know who owns the memory.

        struct Table {
            std::shared_ptr<arrow::Table> table;
        };
  19. Challenge 666 - Memory management

    Use a finalizer to free memory.

        // NewSchema creates a new schema
        func NewSchema(fields []*Field) (*Schema, error) {
            fieldsList, err := NewFieldList()
            if err != nil {
                return nil, fmt.Errorf("can't create schema, failed creating fields list")
            }
            . . .
            schema := &Schema{ptr}
            runtime.SetFinalizer(schema, func(s *Schema) {
                C.schema_free(s.ptr)
            })
            return schema, nil
        }
  20. Challenge 7 - cgo is FFI

    FFI - Foreign function interface
    https://github.com/dyu/ffi-overhead

    Results (500M calls):
        c:   1182  1182
        c++: 1182  1183
        Go:  37975 37879
  21. Challenge 7 - cgo is FFI

    Try to reduce unneeded cgo calls - Builder pattern:

        func TestAppendInt64(t *testing.T) {
            bld := NewInteger64ArrayBuilder()
            const size = 20913
            for i := int64(0); i < size; i++ {
                err := bld.Append(i)
                require.NoErrorf(err, "append %d", i)
            }
            arr, err := bld.Finish()
        }
  22. Challenge 8 - Making the package go-gettable

    This library is linked to a specific Arrow version on a specific OS and architecture (Linux AMD64, for example). Do we precompile for each OS? Do we list in the README which packages need to be installed alongside it?
  23. Challenge 8 - Making the package go-gettable

        package carrow

        import (
            "fmt"
            "runtime"
            "time"
            "unsafe"
        )

        /*
        #cgo pkg-config: arrow plasma
        #cgo LDFLAGS: -lcarrow
        #cgo linux LDFLAGS: -L./bindings/linux-x86_64
        #cgo CXXFLAGS: -I/src/arrow/cpp/src

        #include "carrow.h"
        #include <stdlib.h>
        */
        import "C"

        //go:generate go run gen.go
        //go:generate go fmt
  24. carrow status

    Adding more features (more data types). Building good use cases, Flight! Adding our project to the main Apache Arrow repo.