Kafka™ is a distributed streaming platform. What exactly does that mean?


We think of a streaming platform as having three key capabilities:

  1. It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
  2. It lets you store streams of records in a fault-tolerant way.
  3. It lets you process streams of records as they occur.


What is Kafka good for?

It gets used for two broad classes of application:

  1. Building real-time streaming data pipelines that reliably get data between systems or applications
  2. Building real-time streaming applications that transform or react to the streams of data


To understand how Kafka does these things, let's dive in and explore Kafka's capabilities from the bottom up.

First a few concepts:

  • Kafka is run as a cluster on one or more servers.
  • The Kafka cluster stores streams of records in categories called topics.
  • Each record consists of a key, a value, and a timestamp.


Kafka has four core APIs:

  • The Producer API allows an application to publish a stream of records to one or more Kafka topics (a minimal sketch follows this list).
  • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
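
To make the Producer API concrete, here is a minimal sketch using the Java client described below; the broker address, topic name, key, and value are placeholder examples rather than anything prescribed by the text.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class MinimalProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // publish a single record to a hypothetical topic named "my-topic"
            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", "key", "value"));
            }
        }
    }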


In Kafka the communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.


Topics and Logs

Let's first dive into the core abstraction Kafka provides for a stream of records—the topic.

A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.


For each topic, the Kafka cluster maintains a partitioned log, which works as follows:


Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.

The Kafka cluster retains all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
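
To make the two-day example concrete: retention is controlled per topic through the retention.ms config, and two days is 172,800,000 ms. Below is a sketch using the AdminClient from newer Java client releases (which may postdate this text); the topic name, partition count, and replication factor are hypothetical.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class TwoDayRetentionTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                // hypothetical topic: 3 partitions, replication factor 1; records are
                // discarded two days (172,800,000 ms) after they are published
                NewTopic topic = new NewTopic("clicks", 3, (short) 1)
                        .configs(Collections.singletonMap("retention.ms", "172800000"));
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }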


In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer, it can consume records in any order it likes. For example, a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
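
A sketch of this offset control with the Java consumer, assuming a recent client; the topic, partition number, and offset are placeholders.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class OffsetControl {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("group.id", "replay-demo");             // hypothetical group
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition tp = new TopicPartition("my-topic", 0); // hypothetical topic/partition
                consumer.assign(Collections.singletonList(tp));

                // rewind to reprocess everything from the start ...
                consumer.seekToBeginning(Collections.singletonList(tp));
                // ... or jump to a specific past offset:
                // consumer.seek(tp, 42L);
                // ... or skip ahead and consume only from "now":
                // consumer.seekToEnd(Collections.singletonList(tp));

                consumer.poll(Duration.ofSeconds(1)); // fetch from wherever we positioned ourselves
            }
        }
    }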

This combination of features means that Kafka consumers are very cheap—they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumers.

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism—more on that in a bit.


Distribution

The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

Each partition has one server which acts as the "leader" and zero or more servers which act as "followers". The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.


Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
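
In the meantime, here is a sketch of both options; the topic name and key are hypothetical. With no key the default partitioner balances records across the topic's partitions, while records that share a key always land in the same partition.

    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PartitioningDemo {
        // assumes a producer configured as in the earlier sketch
        static void publish(Producer<String, String> producer) {
            // no key: the default partitioner spreads records over the topic's partitions
            producer.send(new ProducerRecord<>("orders", null, "anonymous-event"));

            // with a key: every record keyed "customer-42" goes to the same partition,
            // so events for that customer stay in order relative to each other
            producer.send(new ProducerRecord<>("orders", "customer-42", "order-created"));
        }
    }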


Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.

If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
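
A sketch of one group member with the Java client; the group id and topic are hypothetical. Every instance started with the same group.id divides up the topic's partitions, while instances started with different group ids each receive every record.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GroupMember {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("group.id", "billing");                 // hypothetical logical subscriber
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("orders")); // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }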


A two-server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.

More commonly, however, we have found that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.


Guarantees

At a high-level Kafka gives the following guarantees:

  • Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log (see the sketch after this list).
  • A consumer instance sees records in the order they are stored in the log.
  • For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.
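
The first of these can be observed directly from the RecordMetadata a send returns. A minimal sketch, with a hypothetical topic and key; the two records share a key, so they target the same partition:

    import java.util.concurrent.Future;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class OrderingCheck {
        // assumes a producer configured as in the earlier sketches
        static void check(Producer<String, String> producer) throws Exception {
            Future<RecordMetadata> m1 = producer.send(new ProducerRecord<>("events", "k", "M1"));
            Future<RecordMetadata> m2 = producer.send(new ProducerRecord<>("events", "k", "M2"));
            // same producer, same partition, M1 sent first, so M1 gets the lower offset
            assert m1.get().offset() < m2.get().offset();
        }
    }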

More details on these guarantees are given in the design section of the documentation.


Kafka as a Messaging System

How does Kafka's notion of streams compare to a traditional enterprise messaging system?

Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren't multi-subscriber—once one process reads the data it's gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.

The consumer group concept in Kafka generalizes these two concepts. As with a queue the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.


The advantage of Kafka's model is that every topic has both these properties—it can scale processing and is also multi-subscriber—there is no need to choose one or the other.

Kafka has stronger ordering guarantees than a traditional messaging system, too.

A traditional queue retains records in-order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of "exclusive consumer" that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.


Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.


Kafka as a Storage System

Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.

Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn't considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
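
A sketch of waiting on acknowledgement with the Java producer; the broker address and topic are placeholders. Setting acks=all asks the brokers to acknowledge only once the write is fully replicated, and blocking on the returned future surfaces any failure.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class DurableWrite {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder
            props.put("acks", "all"); // the write is not complete until fully replicated
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // get() blocks until the brokers acknowledge and throws if the write failed
                RecordMetadata meta = producer
                        .send(new ProducerRecord<>("payments", "key", "value")) // hypothetical topic
                        .get();
                System.out.println("committed at offset " + meta.offset());
            }
        }
    }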

The disk structures Kafka uses scale well—Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.

As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.


Kafka for Stream Processing

It isn't enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.

In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.

For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.


It is possible to do simple processing directly using the producer and consumer APIs. However, for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together.

This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.

The streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.
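
A sketch of the retail example above with the Streams API, using the newer StreamsBuilder interface (which may postdate this text); the topic names, application id, and the toy reorder rule are hypothetical.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class ReorderApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "reorder-app");       // hypothetical id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // hypothetical "sales" records: key = product, value = remaining stock
            KStream<String, String> sales = builder.stream("sales");
            sales.filter((product, stock) -> Integer.parseInt(stock) < 10) // toy low-stock rule
                 .mapValues(stock -> "reorder")                            // toy transformation
                 .to("reorders");                                          // hypothetical output topic

            new KafkaStreams(builder.build(), props).start();
        }
    }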


Putting the Pieces Together

This combination of messaging, storage, and stream processing may seem unusual but it is essential to Kafka's role as a streaming platform.

A distributed file system like HDFS allows storing static files for batch processing. Effectively a system like this allows storing and processing historical data from the past.

A traditional enterprise messaging system allows processing future messages that will arrive after you subscribe. Applications built in this way process future data as it arrives.

Kafka combines both of these capabilities, and the combination is critical both for Kafka usage as a platform for streaming applications as well as for streaming data pipelines.


By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way. That is, a single application can process historical, stored data, but rather than ending when it reaches the last record, it can keep processing as future data arrives. This is a generalized notion of stream processing that subsumes batch processing as well as message-driven applications.

Likewise for streaming data pipelines, the combination of subscription to real-time events makes it possible to use Kafka for very low-latency pipelines; but the ability to store data reliably makes it possible to use it for critical data where the delivery of data must be guaranteed, or for integration with offline systems that load data only periodically or may go down for extended periods of time for maintenance. The stream processing facilities make it possible to transform data as it arrives.

For more information on the guarantees, APIs, and capabilities Kafka provides, see the rest of the documentation.

