
spark serialization

wwwy3y3
November 09, 2015


Transcript

  1. What is serialization? • an object can be represented as

    a sequence of bytes • written to disk, or transferred over the network • Java: ObjectInputStream, ObjectOutputStream
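The two Java classes named above can be exercised in a few lines; a minimal round-trip sketch (the class name and the sample string are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        String original = "hello spark";

        // Serialize: object -> sequence of bytes
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(original);
        }
        byte[] bytes = bos.toByteArray();

        // Deserialize: bytes -> object
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            String restored = (String) in.readObject();
            System.out.println(restored); // prints "hello spark"
        }
    }
}
```

The same byte array could equally be written to a file or sent over a socket, which is exactly what Spark does when it ships objects between nodes.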
  2. after computation: storage levels

    storage level         serialized?   partition does not fit in memory?
    MEMORY_ONLY           no            recomputed
    MEMORY_AND_DISK       no (memory)   stored on disk
    MEMORY_ONLY_SER       yes           recomputed
    MEMORY_AND_DISK_SER   yes           stored on disk
    DISK_ONLY             yes           stored on disk
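These levels are chosen when persisting an RDD. A sketch against Spark's Java API, as a configuration fragment rather than a standalone program (it assumes a `JavaSparkContext` named `sc` and an input file `data.txt`):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.storage.StorageLevel;

// Assumes a JavaSparkContext `sc` is already set up
JavaRDD<String> lines = sc.textFile("data.txt");

// Keep partitions serialized in memory, spilling to disk when they do not fit
lines.persist(StorageLevel.MEMORY_AND_DISK_SER());
```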
  3. Data Serialization • Java serialization: By default, Spark serializes objects

    using Java’s ObjectOutputStream framework • Kryo serialization: Spark can also use the Kryo library (version 2) to serialize objects more quickly. conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  4. Kryo • register class • If a class is not

    registered or no serializer is specified, a serializer is chosen automatically from a list
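The registration step can be done on the SparkConf; a sketch, where `MyClass` and `MyOtherClass` stand in for whatever classes the application actually ships:

```java
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf().setAppName("kryo-demo");
// Use Kryo instead of the default Java serialization
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
// Register application classes (MyClass, MyOtherClass are placeholders) so Kryo
// can write a compact numeric class ID instead of the full class name per object
conf.registerKryoClasses(new Class<?>[]{MyClass.class, MyOtherClass.class});
```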
  5. memory store • stores blocks in memory, either as

    arrays of deserialized Java objects or as serialized ByteBuffers • uses a hash map to store blocks
  6. Serializer • def newInstance(): SerializerInstance

    SerializerInstance • def serialize[T: ClassTag](t: T): ByteBuffer • def deserialize[T: ClassTag](bytes: ByteBuffer): T • def serializeStream(s: OutputStream): SerializationStream • def deserializeStream(s: InputStream): DeserializationStream
  7. SerializationStream / DeserializationStream • def read/writeObject • def read/writeKey • def

    read/writeValue • def flush() • def close() • def writeAll (writes out an Iterator) • asIterator (reads back as an Iterator)
  8. use Avro with Kryo • Kryo tries to use its

    default serializer for generic Records, which includes a lot of unneeded data in each record • if the user registers the schemas, only the schema's fingerprint is sent with each record
  9. Avro • Defining a schema

    {"namespace": "example.avro",
     "type": "record",
     "name": "User",
     "fields": [
       {"name": "name", "type": "string"},
       {"name": "favorite_number", "type": ["int", "null"]},
       {"name": "favorite_color", "type": ["string", "null"]}
     ]
    }
  10. Avro • Compiling the schema: java -jar /path/to/avro-tools-1.7.7.jar compile schema

    <schema file> <destination> • creating a user: User user = new User("Ben", 7, "red");
  11. Avro

    DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
    DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
    dataFileWriter.create(user1.getSchema(), new File("users.avro"));
    dataFileWriter.append(user1);
    dataFileWriter.close();
  12. without code gen

    Schema schema = new Schema.Parser().parse(new File("user.avsc"));
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Ben");
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
    DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
    dataFileWriter.create(schema, file);
    dataFileWriter.append(user);
    dataFileWriter.close();
  13. use with spark kryo

    val schema: Schema = SchemaBuilder
      .record("testRecord").fields()
      .requiredString("data")
      .endRecord()
    conf.registerAvroSchemas(schema)
    val record = new Record(schema)
    record.put("data", "test data")