Source Link
closeparen

Kafka deals in ordered logs of atomic messages. You can view it sort of like the pub/sub mode of message brokers, but with strict ordering and the ability to replay or seek around the stream of messages at any point in the past that's still being retained on disk (which could be forever).
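
The ordered-log model can be sketched in a few lines of plain Python. This is a toy illustration of the abstraction, not a real Kafka client: an append-only sequence of messages addressed by offset, where any reader can replay from any retained position.

```python
# Toy in-memory log illustrating Kafka's core abstraction: an ordered,
# append-only sequence of messages addressed by numeric offset.

class Log:
    def __init__(self):
        self._messages = []

    def append(self, message):
        """Commit a message to the log; return the offset it was assigned."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset):
        """Seek to any retained offset and replay everything from there."""
        return self._messages[offset:]

log = Log()
log.append("a")
log.append("b")
log.append("c")

assert log.read(0) == ["a", "b", "c"]  # replay from the beginning
assert log.read(2) == ["c"]            # or seek to an arbitrary offset
```

Note that the log itself never tracks who has read what; deciding where to start reading is entirely the reader's business.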

Kafka's flavor of streaming stands opposed to remote procedure calls like Thrift or HTTP, and to batch processing like in the Hadoop ecosystem. Unlike RPC, components communicate asynchronously: hours or days may pass between when a message is sent and when the recipient wakes up and acts on it. There could be many recipients at different points in time, or maybe no one will ever bother to consume a message. Multiple producers could produce to the same topic without knowledge of the consumers. Kafka does not know whether you are subscribed, or whether a message has been consumed. A message is simply committed to the log, where any interested party can read it.

Unlike batch processing, you're interested in single messages, not just giant collections of messages. (Though it's not uncommon to archive Kafka messages into Parquet files on HDFS and query them as Hive tables.)

Case 1: Kafka does not preserve any particular temporal relationship between producer and consumer. It's a poor fit for streaming video because Kafka is allowed to slow down, speed up, and move in fits and starts. For streaming media, we want to trade away overall throughput in exchange for low and, more importantly, stable latency (otherwise known as low jitter). Kafka also takes great pains to never lose a message; with streaming video, we typically use UDP and are content to drop a frame here and there to keep the video running. The SLA on a Kafka-backed process is typically seconds to minutes when healthy, hours to days when unhealthy. The SLA on streaming media is in the tens of milliseconds.

Netflix could use Kafka to move frames around in an internal system that transcodes terabytes of video per hour and saves it to disk, but not to ship them to your screen.

Case 2: Absolutely. We use Kafka this way at my employer.

Case 3: You can use Kafka for this kind of thing, and we do, but you are paying some unnecessary overhead to preserve ordering. Since you don't care about order, you could probably squeeze some more performance out of another system. If your company already maintains a Kafka cluster, though, probably best to reuse it rather than take on the maintenance burden of another messaging system.



Stream data is data that flows contiguously from a source to a destination over a network: sort of. Kafka deals in contiguous, ordered streams of data, identified by (topic, partition) tuples. A stream can flow from one or more producers to Kafka, and from Kafka to one or more consumers, but not necessarily at the same time. There is no way for sources and destinations to know about each other (other than message content).

Stream data is not atomic in nature, meaning any part of a flowing stream of data is meaningful and processable: the atomic unit of Kafka is a message. You would not process a partial message, or a sliding window with fragments from neighboring messages. On the other hand, in some of the infrastructure around Kafka like Samza, messages must be processable in isolation. You're not meant to require messages 1 or 3 to process message 2. A (topic, partition) tuple is a log of messages identified by sequential numeric offset. Kafka itself doesn't care, though.

Stream data can be started/stopped at any time: production and consumption can start, stop, and change hands at any time.

Consumers can attach and detach from a stream of data at will, and process just the parts of it that they want: Kafka does not have a concept of you being attached to a stream, though it's easy enough to overlay this concept on Kafka by consuming from a (topic, partition) tuple in a for loop, incrementing the offset each iteration and blocking for new messages to become available.
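
That "attach by looping over offsets" idea can be sketched in pure Python against a toy in-memory partition. The names here are hypothetical stand-ins; a real client library would do the polling and offset bookkeeping against a broker, and would block for new messages rather than stop at the end.

```python
# Toy sketch: a consumer "attaches" to a partition simply by remembering an
# offset and reading forward from it. The partition has no notion of who,
# if anyone, is consuming.

partition = ["msg-0", "msg-1", "msg-2"]  # stand-in for one (topic, partition) log

def consume_from(partition, offset):
    """Yield (offset, message) pairs starting at `offset`.

    A real consumer would block waiting for new messages here instead of
    returning when it reaches the end of the list.
    """
    while offset < len(partition):
        yield offset, partition[offset]
        offset += 1

# Two independent consumers at different positions in the same log:
late_joiner = list(consume_from(partition, 2))
replayer = list(consume_from(partition, 0))

assert late_joiner == [(2, "msg-2")]
assert replayer == [(0, "msg-0"), (1, "msg-1"), (2, "msg-2")]
```

Because each consumer owns its offset, "detaching" is just stopping the loop, and "re-attaching" is resuming from whatever offset you saved.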
