
Spark serialization

wwwy3y3
November 09, 2015


Transcript

  1. What is serialization?
     • an object can be represented as a sequence of bytes
     • write to disk, or transfer over the network
     • Java: ObjectInputStream, ObjectOutputStream
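
     A minimal sketch of that round trip in Scala on top of the Java API
     (the Point class is an assumed example, not from the deck):

         import java.io._

         case class Point(x: Int, y: Int)  // case classes are Serializable

         val buf = new ByteArrayOutputStream()
         val out = new ObjectOutputStream(buf)
         out.writeObject(Point(1, 2))      // object -> sequence of bytes
         out.close()

         val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
         println(in.readObject().asInstanceOf[Point])  // bytes -> Point(1,2)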
  2. Storage levels after computation:

     Storage level         Serialized?     If it doesn't fit in memory?
     MEMORY_ONLY           no              recomputed
     MEMORY_AND_DISK       no (in memory)  stored on disk
     MEMORY_ONLY_SER       yes             recomputed
     MEMORY_AND_DISK_SER   yes             stored on disk
     DISK_ONLY             yes             stored on disk
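
     A short sketch of choosing one of these levels when caching an RDD
     (the app name and data are assumed for illustration):

         import org.apache.spark.{SparkConf, SparkContext}
         import org.apache.spark.storage.StorageLevel

         val sc = new SparkContext(
           new SparkConf().setAppName("storage-levels").setMaster("local[*]"))
         val rdd = sc.parallelize(1 to 1000000)

         // serialized in memory; partitions that don't fit are recomputed
         rdd.persist(StorageLevel.MEMORY_ONLY_SER)
         println(rdd.count())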
  3. Data Serialization
     • Java serialization: by default, Spark serializes objects using Java's ObjectOutputStream framework
     • Kryo serialization: Spark can also use the Kryo library (version 2) to serialize objects more quickly:
       conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  4. Kryo
     • register your classes (see the sketch below)
     • if a class is not registered or no serializer is specified, a serializer is chosen automatically from a list
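
     A minimal sketch of enabling Kryo and registering a class up front
     (the Event class is an assumed example); registering spares Kryo from
     writing the full class name into every serialized record:

         import org.apache.spark.SparkConf

         case class Event(id: Long, payload: String)

         val conf = new SparkConf()
           .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
           .registerKryoClasses(Array(classOf[Event]))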
  5. Memory store
     • stores blocks in memory, either as arrays of deserialized Java objects or as serialized ByteBuffers
     • uses a HashMap to store blocks
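
     An illustrative sketch of that idea (not Spark's actual MemoryStore
     code): a map from block id to an entry holding either form:

         import java.nio.ByteBuffer
         import scala.collection.mutable

         sealed trait Entry
         case class DeserializedEntry(values: Array[AnyRef]) extends Entry
         case class SerializedEntry(bytes: ByteBuffer) extends Entry

         class ToyMemoryStore {
           private val blocks = mutable.HashMap[String, Entry]()  // blockId -> entry
           def putDeserialized(id: String, values: Array[AnyRef]): Unit =
             blocks(id) = DeserializedEntry(values)
           def putSerialized(id: String, bytes: ByteBuffer): Unit =
             blocks(id) = SerializedEntry(bytes)
           def get(id: String): Option[Entry] = blocks.get(id)
         }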
  6. Serializer
     Serializer
     • def newInstance(): SerializerInstance
     SerializerInstance
     • def serialize[T: ClassTag](t: T): ByteBuffer
     • def deserialize[T: ClassTag](bytes: ByteBuffer): T
     • def serializeStream(s: OutputStream): SerializationStream
     • def deserializeStream(s: InputStream): DeserializationStream
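
     A short usage sketch of a SerializerInstance; note that SparkEnv is a
     developer API and this assumes a running Spark application:

         import org.apache.spark.SparkEnv

         val instance = SparkEnv.get.serializer.newInstance()
         val buf = instance.serialize(("a", 1))                   // T -> ByteBuffer
         val restored = instance.deserialize[(String, Int)](buf)  // ByteBuffer -> T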
  7. SerializationStream / DeserializationStream
     • def writeObject / readObject
     • def writeKey / readKey
     • def writeValue / readValue
     • def flush()
     • def close()
     • def writeAll (write side)
     • an Iterator over the stream (read side)
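
     A sketch of the stream API round trip, again assuming a running
     application so SparkEnv is available:

         import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
         import org.apache.spark.SparkEnv

         val instance = SparkEnv.get.serializer.newInstance()
         val buf = new ByteArrayOutputStream()
         instance.serializeStream(buf).writeAll(Iterator("a", "b", "c")).close()

         val in = instance.deserializeStream(new ByteArrayInputStream(buf.toByteArray))
         in.asIterator.foreach(println)  // reads back "a", "b", "c"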
  8. Use Avro with Kryo
     • Kryo tries to use its default serializer for generic Records, which includes a lot of unneeded data in each record
     • if the user registers the schemas, only the schema's fingerprint is sent with each record (see slide 13)
  9. Avro • Defining a schema

     {"namespace": "example.avro",
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["int", "null"]},
        {"name": "favorite_color", "type": ["string", "null"]}
      ]
     }
  10. Avro • Compiling the schema:

      java -jar /path/to/avro-tools-1.7.7.jar compile schema <schema file> <destination>

      • create a user:

      User user = new User("Ben", 7, "red");
  11. Avro

      DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
      DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
      dataFileWriter.create(user.getSchema(), new File("users.avro"));
      dataFileWriter.append(user);  // write the user created on slide 10
      dataFileWriter.close();
  12. Without code gen

      Schema schema = new Schema.Parser().parse(new File("user.avsc"));
      GenericRecord user = new GenericData.Record(schema);
      DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
      DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
      dataFileWriter.create(schema, new File("users.avro"));
  13. Use with Spark Kryo

      val schema: Schema = SchemaBuilder
        .record("testRecord").fields()
        .requiredString("data")
        .endRecord()

      conf.registerAvroSchemas(schema)

      val record = new Record(schema)
      record.put("data", "test data")