Source Link
closeparen

Kafka deals in ordered logs of atomic messages. You can view it sort of like the pub/sub mode of message brokers, but with strict ordering and the ability to replay or seek around the stream of messages at any point in the past that's still being retained on disk (which could be forever).
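
The ordered-log model can be sketched in a few lines of plain Python. This is a toy illustration of the abstraction, not a real Kafka client: an append-only sequence of messages addressed by offset, where any reader can replay from any retained position.

```python
# Toy in-memory log illustrating Kafka's core abstraction: an ordered,
# append-only sequence of messages addressed by numeric offset.

class Log:
    def __init__(self):
        self._messages = []

    def append(self, message):
        """Commit a message to the log; return the offset it was assigned."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset):
        """Seek to any retained offset and replay everything from there."""
        return self._messages[offset:]

log = Log()
log.append("a")
log.append("b")
log.append("c")

assert log.read(0) == ["a", "b", "c"]  # replay from the beginning
assert log.read(2) == ["c"]            # or seek to an arbitrary offset
```

Note that the log itself never tracks who has read what; deciding where to start reading is entirely the reader's business.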

Kafka's flavor of streaming stands opposed to remote procedure calls like Thrift or HTTP, and to batch processing like in the Hadoop ecosystem. Unlike RPC, components communicate asynchronously: hours or days may pass between when a message is sent and when the recipient wakes up and acts on it. There could be many recipients at different points in time, or maybe no one will ever bother to consume a message. Multiple producers could produce to the same topic without knowledge of the consumers. Kafka does not know whether you are subscribed, or whether a message has been consumed. A message is simply committed to the log, where any interested party can read it.

Unlike batch processing, you're interested in single messages, not just giant collections of messages. (Though it's not uncommon to archive Kafka messages into Parquet files on HDFS and query them as Hive tables.)

Case 1: Kafka does not preserve any particular temporal relationship between producer and consumer. It's a poor fit for streaming video because Kafka is allowed to slow down, speed up, and move in fits and starts. For streaming media, we want to trade away overall throughput in exchange for low and, more importantly, stable latency (otherwise known as low jitter). Kafka also takes great pains to never lose a message; with streaming video, we typically use UDP and are content to drop a frame here and there to keep the video running. The SLA on a Kafka-backed process is typically seconds to minutes when healthy, hours to days when unhealthy. The SLA on streaming media is in the tens of milliseconds.

Netflix could use Kafka to move frames around in an internal system that transcodes terabytes of video per hour and saves it to disk, but not to ship them to your screen.

Case 2: Absolutely. We use Kafka this way at my employer.

Case 3: You can use Kafka for this kind of thing, and we do, but you are paying some unnecessary overhead to preserve ordering. Since you don't care about order, you could probably squeeze some more performance out of another system. If your company already maintains a Kafka cluster, though, probably best to reuse it rather than take on the maintenance burden of another messaging system.



Stream data is data that flows contiguously from a source to a destination over a network: sort of. Kafka deals in contiguous, ordered streams of data, identified by (topic, partition) tuples. A stream can flow from one or more producers to Kafka, and from Kafka to one or more consumers, but not necessarily at the same time. There is no way for sources and destinations to know about each other (other than message content).

Stream data is not atomic in nature, meaning any part of a flowing stream of data is meaningful and processable: the atomic unit of Kafka is a message. You would not process a partial message, or a sliding window with fragments from neighboring messages. On the other hand, in some of the infrastructure around Kafka like Samza, messages must be processable in isolation. You're not meant to require messages 1 or 3 to process message 2. A (topic, partition) tuple is a log of messages identified by sequential numeric offset. Kafka itself doesn't care, though.

Stream data can be started/stopped at any time: production and consumption can start, stop, and change hands at any time.

Consumers can attach and detach from a stream of data at will, and process just the parts of it that they want: Kafka does not have a concept of you being attached to a stream, though it's easy enough to overlay this concept on Kafka by consuming from a (topic, partition) tuple in a for loop, incrementing the offset each iteration and blocking for new messages to become available.
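
That "attach by looping over offsets" idea can be sketched in pure Python against a toy in-memory partition. The names here are hypothetical stand-ins; a real client library would do the polling and offset bookkeeping against a broker, and would block for new messages rather than stop at the end.

```python
# Toy sketch: a consumer "attaches" to a partition simply by remembering an
# offset and reading forward from it. The partition has no notion of who,
# if anyone, is consuming.

partition = ["msg-0", "msg-1", "msg-2"]  # stand-in for one (topic, partition) log

def consume_from(partition, offset):
    """Yield (offset, message) pairs starting at `offset`.

    A real consumer would block waiting for new messages here instead of
    returning when it reaches the end of the list.
    """
    while offset < len(partition):
        yield offset, partition[offset]
        offset += 1

# Two independent consumers at different positions in the same log:
late_joiner = list(consume_from(partition, 2))
replayer = list(consume_from(partition, 0))

assert late_joiner == [(2, "msg-2")]
assert replayer == [(0, "msg-0"), (1, "msg-1"), (2, "msg-2")]
```

Because each consumer owns its offset, "detaching" is just stopping the loop, and "re-attaching" is resuming from whatever offset you saved.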
