to the source
• Work on hot data
• Avoid sampling; summarize or hash instead
• Determine a common format, both logical and physical
• Make access to the data easy for analysis
• Let the business drive the questions
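One way to read "summarize or hash instead of sampling" is to fold every event into a small running summary and to replace bulky or sensitive keys with a fixed-size digest. The sketch below is only an illustration under assumptions: the event shape `(user_id, latency_ms)` and the names `RunningSummary`, `hash_key` and `summarize` are hypothetical and not from the slides.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class RunningSummary:
    """Constant-memory summary of a numeric stream; every value is counted."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0


def hash_key(raw: str) -> str:
    """Replace an identifier with a short, fixed-size SHA-256 prefix."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


def summarize(events):
    """Fold (user_id, latency_ms) events into one summary per hashed user."""
    summaries = {}
    for user_id, latency_ms in events:
        summaries.setdefault(hash_key(user_id), RunningSummary()).add(latency_ms)
    return summaries
```

Unlike a sample, the summary reflects every event, and memory grows with the number of distinct keys rather than with the number of events.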
• Categorizes messages into topics
• Persists messages to disk and retains them for a specified amount of time
• Can have multiple producers and consumers
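The three properties above can be sketched with a toy in-memory model. This is not any real broker's API: `TopicLog` and its methods are hypothetical, and messages are held in memory rather than on disk to keep the example short.

```python
import time
from collections import defaultdict


class TopicLog:
    """Toy topic-based message log with time-based retention."""

    def __init__(self, retention_seconds, clock=time.time):
        self.retention_seconds = retention_seconds
        self.clock = clock                     # injectable so tests can control time
        self.topics = defaultdict(list)        # topic -> [(timestamp, offset, message)]
        self.next_offset = defaultdict(int)    # topic -> offset for the next message
        self.positions = defaultdict(int)      # (consumer, topic) -> next offset to read

    def publish(self, topic, message):
        """Any number of producers may append to a topic; returns the offset."""
        offset = self.next_offset[topic]
        self.topics[topic].append((self.clock(), offset, message))
        self.next_offset[topic] += 1
        return offset

    def expire(self):
        """Drop messages older than the retention window."""
        cutoff = self.clock() - self.retention_seconds
        for topic in list(self.topics):
            self.topics[topic] = [e for e in self.topics[topic] if e[0] >= cutoff]

    def poll(self, consumer, topic):
        """Return messages this consumer has not yet seen on this topic."""
        start = self.positions[(consumer, topic)]
        batch = [(off, msg) for _, off, msg in self.topics[topic] if off >= start]
        if batch:
            self.positions[(consumer, topic)] = batch[-1][0] + 1
        return [msg for _, msg in batch]
```

Because each consumer tracks its own read position, reading does not remove messages; they disappear only when the retention window passes.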
your pipeline
Take your pick: Protocol Buffers, Thrift, Avro. Stay away from schemaless JSON.
• Compression: Snappy, LZO, …
• Storage format: Look at columnar storage formats like Parquet, which make OLAP operations easier.
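A rough, standard-library-only sketch of the three ideas: enforce a schema, store values by column, and compress the result. It does not use Protocol Buffers, Avro, Snappy or Parquet; `zlib` stands in for the compression codec, JSON is used only as a byte encoding after validation, and the `SCHEMA` fields and helper names are hypothetical.

```python
import json
import zlib

# Hypothetical schema: field name -> required Python type.
SCHEMA = (("user_id", int), ("country", str), ("amount", float))


def validate(record):
    """Reject records whose fields or types do not match SCHEMA."""
    if set(record) != {name for name, _ in SCHEMA}:
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for name, expected in SCHEMA:
        if not isinstance(record[name], expected):
            raise TypeError(f"{name} must be {expected.__name__}")
    return record


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    rows = [validate(r) for r in records]
    return {name: [row[name] for row in rows] for name, _ in SCHEMA}


def column_sum(columns, name):
    """An OLAP-style aggregate that touches a single column only."""
    return sum(columns[name])


def compress(columns):
    """Serialize the columns and compress them (zlib as a stand-in codec)."""
    return zlib.compress(json.dumps(columns).encode("utf-8"))


def decompress(blob):
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

With a columnar layout, an aggregate such as `column_sum(columns, "amount")` reads one list instead of scanning every field of every record, which is the property that makes such formats convenient for OLAP queries.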