Kafka Source Code Analysis: the Producer Send Flow (Part 3)
1. Kafka Producer Example
public class Producer extends Thread {
private final KafkaProducer<Integer, String> producer;
private final String topic;
private final Boolean isAsync;
/**
* Constructor: initializes the producer
* @param topic
* @param isAsync
*/
public Producer(String topic, Boolean isAsync) {
Properties props = new Properties();
// Used for fetching the Kafka cluster metadata (bootstrap servers)
props.put("bootstrap.servers", "localhost:9092");
props.put("client.id", "DemoProducer");
//K, V
//set the serializer classes
//(records go to the broker in binary form)
props.put("key.serializer", "org.apache.kafka.common.serialization.IntegerSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
//consumers will need matching deserializers when they read the data
//TODO initialize the KafkaProducer
producer = new KafkaProducer<>(props);
this.topic = topic;
this.isAsync = isAsync;
}
public void run() {
int messageNo = 1;
// keep sending messages to Kafka in a loop
while (true) {
String messageStr = "Message_" + messageNo;
long startTime = System.currentTimeMillis();
//isAsync: Kafka supports two ways of sending
//1. asynchronous send
//2. synchronous send
if (isAsync) { // Send asynchronously
//Asynchronous send: keep sending and let the callback handle the response.
//This performs much better and is what is normally used in production.
producer.send(new ProducerRecord<>(topic,
messageNo,
messageStr), new DemoCallBack(startTime, messageNo, messageStr));
} else { // Send synchronously
try {
//Synchronous send:
//send one message and wait until all of its follow-up work is done before sending the next one
producer.send(new ProducerRecord<>(topic,
messageNo,
messageStr)).get();
System.out.println("Sent message: (" + messageNo + ", " + messageStr + ")");
} catch (InterruptedException | ExecutionException e) {
e.printStackTrace();
}
}
++messageNo;
}
}
}
class DemoCallBack implements Callback {
private final long startTime;
private final int key;
private final String message;
public DemoCallBack(long startTime, int key, String message) {
this.startTime = startTime;
this.key = key;
this.message = message;
}
/**
* A callback method the user can implement to provide asynchronous handling of request completion. This method will
* be called when the record sent to the server has been acknowledged. Exactly one of the arguments will be
* non-null.
*
* @param metadata The metadata for the record that was sent (i.e. the partition and offset). Null if an error
* occurred.
* @param exception The exception thrown during processing of this record. Null if no error occurred.
*/
public void onCompletion(RecordMetadata metadata, Exception exception) {
long elapsedTime = System.currentTimeMillis() - startTime;
if (exception != null)
//in production there is usually a fallback path for failed sends
System.out.println("An exception occurred while sending");
else
System.out.println("No exception: the send was successful!");
if (metadata != null) {
System.out.println(
"message(" + key + ", " + message + ") sent to partition(" + metadata.partition() +
"), " +
"offset(" + metadata.offset() + ") in " + elapsedTime + " ms");
} else {
exception.printStackTrace();
}
}
}
2. Overview of the Producer Core Flow
2.1 Recap: how the Producer sends a message
- The message is wrapped into a ProducerRecord object.
- The record is serialized. Kafka ships serializers for the common types and also supports custom ones, but the built-in ones are usually enough: what we send is either a primitive or an object, and we rarely send bare primitives because we want descriptive fields, so what is left is objects. The usual way to serialize an object is JSON, which means a plain String serializer covers almost everything (see the sketch after this list).
- The record is assigned to a partition. This is the crucial step: we want the message to reach a specific partition of a specific topic, but at this point we do not know the cluster metadata yet, so the producer first fetches the metadata and then partitions the record based on it.
- Now we know which partition of which topic the message should go to. Why not send it right away? Because that would be wasteful: with, say, one million messages we would make one million network round trips, which costs far too many resources. So Kafka does not do that; it puts the message into a buffer and sends it later, packed into batches.
- A background Sender thread keeps polling the buffered partition data; this is essentially a producer-consumer problem. The Sender packs messages into batches before sending them, which clearly improves throughput. Batches can also be compressed: compression raises throughput but adds CPU cost, so there is no single best setting, only the one that fits your workload.
- Finally the batches are sent to the Kafka cluster.
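To make the "object -> JSON -> String" idea above concrete, here is a minimal sketch (not from the original example): it serializes a small object with Jackson and sends the JSON through the plain StringSerializer. The Event class, the "events" topic and the Jackson dependency are illustrative assumptions.
import java.util.Properties;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class JsonProducerSketch {
    // a hypothetical payload type; in practice this is whatever business object we want to send
    static class Event {
        public String name;
        public long ts;
        Event(String name, long ts) { this.name = name; this.ts = ts; }
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // object -> JSON -> String: one serializer covers every payload type
            String json = new ObjectMapper().writeValueAsString(new Event("click", System.currentTimeMillis()));
            producer.send(new ProducerRecord<>("events", "click", json));
        }
    }
}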
3. KafkaProducer Initialization
private KafkaProducer(ProducerConfig config, Serializer<K> keySerializer, Serializer<V> valueSerializer) {
try {
log.trace("Starting the Kafka producer");
// apply the user-provided configuration
Map<String, Object> userProvidedConfigs = config.originals();
this.producerConfig = config;
this.time = new SystemTime();
// configure the clientId
clientId = config.getString(ProducerConfig.CLIENT_ID_CONFIG);
if (clientId.length() <= 0)
clientId = "producer-" + PRODUCER_CLIENT_ID_SEQUENCE.getAndIncrement();
Map<String, String> metricTags = new LinkedHashMap<String, String>();
metricTags.put("client-id", clientId);
//metrics setup; usually not important when reading the source
MetricConfig metricConfig = new MetricConfig().samples(config.getInt(ProducerConfig.METRICS_NUM_SAMPLES_CONFIG))
.timeWindow(config.getLong(ProducerConfig.METRICS_SAMPLE_WINDOW_MS_CONFIG), TimeUnit.MILLISECONDS)
.tags(metricTags);
List<MetricsReporter> reporters = config.getConfiguredInstances(ProducerConfig.METRIC_REPORTER_CLASSES_CONFIG,
MetricsReporter.class);
reporters.add(new JmxReporter(JMX_PREFIX));
this.metrics = new Metrics(metricConfig, reporters, time);
//TODO set the partitioner; a custom partitioner can be plugged in
this.partitioner = config.getConfiguredInstance(ProducerConfig.PARTITIONER_CLASS_CONFIG, Partitioner.class);
//TODO retry backoff
/**
 * When the Producer sends messages we normally configure retries in our own code,
 * because we run over an unreliable distributed network; HDFS has many similar retry mechanisms.
 * The default backoff between retries here is 100 ms.
 * TODO RETRY_BACKOFF_MS_CONFIG retry.backoff.ms, default 100 ms
 * public static final String RETRY_BACKOFF_MS_CONFIG = "retry.backoff.ms";
 * public static final String RETRY_BACKOFF_MS_DOC = "The amount of time to wait before attempting to retry a failed request to a given topic partition. This avoids repeatedly sending requests in a tight loop under some failure scenarios.";
 * The CONFIG/DOC pattern used here is worth borrowing in our own projects.
 */
long retryBackoffMs = config.getLong(ProducerConfig.RETRY_BACKOFF_MS_CONFIG);
//a String can carry any payload type
//object -> JSON
//TODO set the serializers
if (keySerializer == null) {
this.keySerializer = config.getConfiguredInstance(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
Serializer.class);
this.keySerializer.configure(config.originals(), true);
} else {
config.ignore(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG);
this.keySerializer = keySerializer;
}
if (valueSerializer == null) {
this.valueSerializer = config.getConfiguredInstance(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
Serializer.class);
this.valueSerializer.configure(config.originals(), false);
} else {
config.ignore(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG);
this.valueSerializer = valueSerializer;
}
// load interceptors and make sure they get clientId
userProvidedConfigs.put(ProducerConfig.CLIENT_ID_CONFIG, clientId);
List<ProducerInterceptor<K, V>> interceptorList = (List) (new ProducerConfig(userProvidedConfigs)).getConfiguredInstances(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG,
ProducerInterceptor.class);
//TODO set the interceptors
//they act like filters and are somewhat limited, but they let us drop messages before they are sent
this.interceptors = interceptorList.isEmpty() ? null : new ProducerInterceptors<>(interceptorList);
ClusterResourceListeners clusterResourceListeners = configureClusterResourceListeners(keySerializer, valueSerializer, interceptorList, reporters);
/**
 * The Kafka metadata the producer pulls from the broker side.
 * The producer fetches it over the network, with retries:
 * metadata.max.age.ms (default: 5 minutes)
 * retryBackoffMs = 100 ms
 * TODO The producer refreshes the cluster metadata every 5 minutes, because it can change,
 * for example when partitions are added.
 */
this.metadata = new Metadata(retryBackoffMs, config.getLong(ProducerConfig.METADATA_MAX_AGE_CONFIG), true, clusterResourceListeners);
/**
 * The maximum size of a request:
 * maxRequestSize = max.request.size, default 1 MB.
 * A message larger than this cannot be sent.
 * The 1 MB default is sometimes too small, e.g. when tracking events carry whole responses;
 * in that case raise it. In practice about 10 MB is enough, since larger messages are rare.
 */
this.maxRequestSize = config.getInt(ProducerConfig.MAX_REQUEST_SIZE_CONFIG);
/**
 * The size of the record buffer.
 * The default is 32 MB, which is usually enough; adjust it only for special workloads.
 */
this.totalMemorySize = config.getLong(ProducerConfig.BUFFER_MEMORY_CONFIG);
/**
 * Kafka supports compressing the data; the compression type is configured here.
 * Compression raises throughput because each request carries more messages,
 * TODO but it costs extra CPU.
 */
this.compressionType = CompressionType.forName(config.getString(ProducerConfig.COMPRESSION_TYPE_CONFIG));
/* check for user defined settings.
* If the BLOCK_ON_BUFFER_FULL is set to true,we do not honor METADATA_FETCH_TIMEOUT_CONFIG.
* This should be removed with release 0.9 when the deprecated configs are removed.
*/
if (userProvidedConfigs.containsKey(ProducerConfig.BLOCK_ON_BUFFER_FULL_CONFIG)) {
log.warn(ProducerConfig.BLOCK_ON_BUFFER_FULL_CONFIG + " config is deprecated and will be removed soon. " +
"Please use " + ProducerConfig.MAX_BLOCK_MS_CONFIG);
boolean blockOnBufferFull = config.getBoolean(ProducerConfig.BLOCK_ON_BUFFER_FULL_CONFIG);
if (blockOnBufferFull) {
this.maxBlockTimeMs = Long.MAX_VALUE;
} else if (userProvidedConfigs.containsKey(ProducerConfig.METADATA_FETCH_TIMEOUT_CONFIG)) {
log.warn(ProducerConfig.METADATA_FETCH_TIMEOUT_CONFIG + " config is deprecated and will be removed soon. " +
"Please use " + ProducerConfig.MAX_BLOCK_MS_CONFIG);
this.maxBlockTimeMs = config.getLong(ProducerConfig.METADATA_FETCH_TIMEOUT_CONFIG);
} else {
this.maxBlockTimeMs = config.getLong(ProducerConfig.MAX_BLOCK_MS_CONFIG);
}
} else if (userProvidedConfigs.containsKey(ProducerConfig.METADATA_FETCH_TIMEOUT_CONFIG)) {
log.warn(ProducerConfig.METADATA_FETCH_TIMEOUT_CONFIG + " config is deprecated and will be removed soon. " +
"Please use " + ProducerConfig.MAX_BLOCK_MS_CONFIG);
this.maxBlockTimeMs = config.getLong(ProducerConfig.METADATA_FETCH_TIMEOUT_CONFIG);
} else {
this.maxBlockTimeMs = config.getLong(ProducerConfig.MAX_BLOCK_MS_CONFIG);
}
/* check for user defined settings.
* If the TIME_OUT config is set use that for request timeout.
* This should be removed with release 0.9
*/
if (userProvidedConfigs.containsKey(ProducerConfig.TIMEOUT_CONFIG)) {
log.warn(ProducerConfig.TIMEOUT_CONFIG + " config is deprecated and will be removed soon. Please use " +
ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG);
this.requestTimeoutMs = config.getInt(ProducerConfig.TIMEOUT_CONFIG);
} else {
this.requestTimeoutMs = config.getInt(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG);
}
//TODO create a core component: the RecordAccumulator, i.e. our buffer
this.accumulator = new RecordAccumulator(config.getInt(ProducerConfig.BATCH_SIZE_CONFIG),
this.totalMemorySize,
this.compressionType,
config.getLong(ProducerConfig.LINGER_MS_CONFIG),
retryBackoffMs,
metrics,
time);
List<InetSocketAddress> addresses = ClientUtils.parseAndValidateAddresses(config.getList(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG));
/**
 * "Update" the metadata.
 * addresses comes from the bootstrap.servers broker list we pass in when writing producer code.
 * This looks like it fetches metadata from the broker, so we will verify that later.
 * TODO update() does NOT fetch any metadata from the broker during initialization.
 */
this.metadata.update(Cluster.bootstrap(addresses), time.milliseconds());
ChannelBuilder channelBuilder = ClientUtils.createChannelBuilder(config.values());
/**
 * TODO Very important settings:
 * 1. connections.max.idle.ms: default 9 minutes.
 *    How long a network connection may stay idle before it is closed.
 * 2. max.in.flight.requests.per.connection: default 5.
 *    The producer sends to brokers concurrently; this is how many requests per connection
 *    may have been sent without having received a response yet.
 *    Combined with retries this can reorder data; set it to 1 if strict ordering is required.
 * 3. send.buffer.bytes: the socket send buffer size, default 128 KB.
 * 4. receive.buffer.bytes: the socket receive buffer size, default 32 KB.
 */
NetworkClient client = new NetworkClient(
new Selector(config.getLong(ProducerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG), this.metrics, time, "producer", channelBuilder),
this.metadata,
clientId,
config.getInt(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION),
config.getLong(ProducerConfig.RECONNECT_BACKOFF_MS_CONFIG),
config.getInt(ProducerConfig.SEND_BUFFER_CONFIG),
config.getInt(ProducerConfig.RECEIVE_BUFFER_CONFIG),
this.requestTimeoutMs, time);
/**
 * In a real project we always configure retries:
 * 1. retries: the number of retries, default 0.
 * TODO
 * 2. acks:
 *    0: the producer sends the data to the broker and does not wait for any response;
 *       it does not care whether the write succeeded or failed.
 *    1: the producer gets a response once the data has been written to the leader partition.
 *   -1: the producer gets a response only after the data has been written to the leader partition
 *       and replicated to all follower partitions.
 */
//Sender is the Runnable executed by the I/O thread
this.sender = new Sender(client,
this.metadata,
this.accumulator,
config.getInt(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION) == 1,
config.getInt(ProducerConfig.MAX_REQUEST_SIZE_CONFIG),
(short) parseAcks(config.getString(ProducerConfig.ACKS_CONFIG)),
config.getInt(ProducerConfig.RETRIES_CONFIG),
this.metrics,
new SystemTime(),
clientId,
this.requestTimeoutMs);
String ioThreadName = "kafka-producer-network-thread" + (clientId.length() > 0 ? " | " + clientId : "");
//the Sender is wrapped in a KafkaThread and marked as a daemon thread
//the indirection keeps the business logic separated from the threading code
this.ioThread = new KafkaThread(ioThreadName, this.sender, true);
this.ioThread.start();
this.errors = this.metrics.sensor("errors");
config.logUnused();
AppInfoParser.registerAppInfo(JMX_PREFIX, clientId);
log.debug("Kafka producer started");
} catch (Throwable t) {
// call close methods if internal objects are already constructed
// this is to prevent resource leak. see KAFKA-2121
close(0, TimeUnit.MILLISECONDS, true);
// now propagate the exception
throw new KafkaException("Failed to construct kafka producer", t);
}
}
- Set the partitioner; it can be customized, in which case the custom partitioner is used here.
- The retry backoff defaults to 100 ms.
- Set the serializers.
- Set the interceptors (essentially filters).
- Initialize the metadata, which is basically empty at this point.
- Set the maximum message size, 1 MB by default; raising it to about 10 MB is usually enough in production (the sketch after this list shows the matching configuration keys).
- Set the buffer size, 32 MB by default.
- Set the compression type.
- Initialize the RecordAccumulator, i.e. the 32 MB buffer.
- "Update" the metadata; during initialization nothing is actually fetched.
- Create the NetworkClient.
- Create the Sender thread.
- Wrap the Sender in a KafkaThread, mark it as a daemon thread and start it.
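For reference, a small configuration fragment (not from the article) showing how the knobs discussed above map to standard producer configuration keys; the concrete values are illustrative assumptions, not recommendations.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("retry.backoff.ms", "100");       // backoff between retries, default 100 ms
props.put("metadata.max.age.ms", "300000"); // metadata refresh interval, default 5 minutes
props.put("max.request.size", "10485760");  // max request size, raised from 1 MB to 10 MB here
props.put("buffer.memory", "33554432");     // RecordAccumulator buffer, 32 MB (the default)
props.put("compression.type", "lz4");       // trade CPU for throughput
props.put("batch.size", "16384");           // batch size, 16 KB (the default)
props.put("retries", "3");                  // default is 0
props.put("acks", "-1");                    // 0, 1 or -1 (all)
props.put("max.in.flight.requests.per.connection", "1"); // 1 preserves ordering under retries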
4. Producer Metadata Management
4.1 What does the metadata contain?
Before looking at how the Producer fetches the metadata, we should know what that metadata actually is.
Metadata
/**
 * The minimum interval between two metadata refresh requests, 100 ms; this is also the retry backoff.
 * Its purpose is to reduce network pressure.
 */
private final long refreshBackoffMs;
// how often the metadata is refreshed automatically; the default is every 5 minutes
private final long metadataExpireMs;
/**
 * For the producer the metadata is versioned:
 * every metadata update bumps this version number.
 */
private int version;
/**
 * The time of the last metadata refresh attempt
 */
private long lastRefreshMs;
/**
 * The time of the last successful metadata refresh.
 * Normally, if every refresh succeeds, lastRefreshMs and lastSuccessfulRefreshMs are equal.
 */
private long lastSuccessfulRefreshMs;
/**
 * The metadata of the Kafka cluster itself
 */
private Cluster cluster;
/**
 * One of the flags indicating whether the metadata needs to be updated
 */
private boolean needUpdate;
/* Topics with expiry time */
/**
 * The topics currently known to the producer
 */
private final Map<String, Long> topics;
Cluster
/**
 * A Kafka cluster has multiple nodes; this field holds the broker (server) information
 */
private final List<Node> nodes;
/**
 * Topics that are not authorized
 */
private final Set<String> unauthorizedTopics;
private final Set<String> internalTopics;
/**
 * TODO Notice how many different data structures are kept here
 */
/**
 * Maps a partition to its PartitionInfo;
 * partitions have replicas, so this tells us which replicas each partition has
 */
private final Map<TopicPartition, PartitionInfo> partitionsByTopicPartition;
/**
 * The partitions of each topic
 */
private final Map<String, List<PartitionInfo>> partitionsByTopic;
/**
 * The available partitions of each topic
 */
private final Map<String, List<PartitionInfo>> availablePartitionsByTopic;
/**
 * The partitions hosted on each server, keyed by the broker id
 * that we configure ourselves
 */
private final Map<Integer, List<PartitionInfo>> partitionsByNode;
/**
 * The cluster nodes keyed by broker id
 */
private final Map<Integer, Node> nodesById;
private final ClusterResource clusterResource;
Node
/**
 * id: the broker id specified in the configuration file
 */
private final int id;
private final String idString;
/**
 * Host name
 */
private final String host;
/**
 * Port, 9092 by default
 */
private final int port;
/**
 * Rack
 */
private final String rack;
4.2 The Metadata.update method
/**
* Updates the cluster metadata. If topic expiry is enabled, expiry time
* is set for topics if required and expired topics are removed from the metadata.
*/
public synchronized void update(Cluster cluster, long now) {
Objects.requireNonNull(cluster, "cluster should not be null");
this.needUpdate = false;
this.lastRefreshMs = now;
this.lastSuccessfulRefreshMs = now;
//TODO bump the metadata version
this.version += 1;
if (this.needMetadataForAllTopics) {
// the listener may change the interested topics, which could cause another metadata refresh.
// If we have already fetched all topics, however, another fetch should be unnecessary.
this.needUpdate = false;
this.cluster = getClusterForCurrentTopics(cluster);
} else {
//assign the new cluster metadata
this.cluster = cluster;
}
//TODO wake up any threads waiting for the metadata
notifyAll();
log.debug("Updated cluster metadata version {} to {}", this.version, this.cluster);
}
Make sure you are clear about what this method is doing; I got confused by it later on. Remember that it mainly does two things:
- First, it increments the version number.
- Second, it assigns the cluster metadata. The key point: during initialization only the bootstrap broker addresses are stored, there is no partition information yet. Keep that in mind.
5. A First Look at the Core Send Flow of the Producer
/**
* Implementation of asynchronously send a record to a topic.
*/
private Future<RecordMetadata> doSend(ProducerRecord<K, V> record, Callback callback) {
TopicPartition tp = null;
try {
// first make sure the metadata for the topic is available
/**
 * Step 1:
 * synchronously wait for the topic metadata to be fetched,
 * blocking for at most maxBlockTimeMs
 */
ClusterAndWaitTime clusterAndWaitTime = waitOnMetadata(record.topic(), record.partition(), maxBlockTimeMs);
//waitedOnMetadataMs is how long the metadata fetch took
//maxBlockTimeMs - time already spent = time we may still block for
long remainingWaitMs = Math.max(0, maxBlockTimeMs - clusterAndWaitTime.waitedOnMetadataMs);
// the (possibly updated) cluster metadata
Cluster cluster = clusterAndWaitTime.cluster;
byte[] serializedKey;
try {
serializedKey = keySerializer.serialize(record.topic(), record.key());
} catch (ClassCastException cce) {
throw new SerializationException("Can't convert key of class " + record.key().getClass().getName() +
" to class " + producerConfig.getClass(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG).getName() +
" specified in key.serializer");
}
byte[] serializedValue;
try {
serializedValue = valueSerializer.serialize(record.topic(), record.value());
} catch (ClassCastException cce) {
throw new SerializationException("Can't convert value of class " + record.value().getClass().getName() +
" to class " + producerConfig.getClass(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG).getName() +
" specified in value.serializer");
}
/**
 * Step 3:
 * use the partitioner to choose the partition this record should be sent to.
 *
 * We already have the metadata at this point,
 * so we can now compute which partition the data should go to.
 */
int partition = partition(record, serializedKey, serializedValue, cluster);
int serializedSize = Records.LOG_OVERHEAD + Record.recordSize(serializedKey, serializedValue);
/**
 * Step 4:
 * check whether the serialized record exceeds the configured limit.
 * KafkaProducer initialization set this per-message limit; the default is 1 MB.
 */
ensureValidRecordSize(serializedSize);
/**
 * Step 5:
 * build the TopicPartition object from the metadata
 */
tp = new TopicPartition(record.topic(), partition);
long timestamp = record.timestamp() == null ? time.milliseconds() : record.timestamp();
log.trace("Sending record {} with callback {} to topic {} partition {}", record, callback, record.topic(), partition);
// producer callback will make sure to call both 'callback' and interceptor callback
/**
 * Step 6:
 * bind a callback to every record,
 * because the records are sent asynchronously
 */
Callback interceptCallback = this.interceptors == null ? callback : new InterceptorCallback<>(callback, this.interceptors, tp);
/**
 * Step 7:
 * put the record into the accumulator (a 32 MB memory buffer);
 * the accumulator then packs records into batches to be sent out
 */
RecordAccumulator.RecordAppendResult result = accumulator.append(tp, timestamp, serializedKey, serializedValue, interceptCallback, remainingWaitMs);
//if the batch is full,
//or a new batch was just created
if (result.batchIsFull || result.newBatchCreated) {
log.trace("Waking up the sender since topic {} partition {} is either full or getting a new batch", record.topic(), partition);
/**
 * Step 8:
 * wake up the sender thread; it is the thread that actually sends the data
 */
this.sender.wakeup();
}
return result.future;
// handling exceptions and record the errors;
// for API exceptions return them in the future,
// for other exceptions throw directly
// (the catch blocks are omitted from this excerpt)
}
}
6. How the Producer Loads Metadata
At this point everything is set up and we are ready to send data.
6.1 A second look at KafkaProducer's doSend method
The full method body was already shown in section 5, so it is not repeated here; notice that its very first step is the call to waitOnMetadata.
So the very first step of doSend is about the metadata.
6.2 waitOnMetadata
// add topic to metadata topic list if it is not there already and reset expiry
// store the current topic in the metadata
metadata.add(topic);
// We are still in the initialization flow, so there is no real metadata yet;
// the cluster only contains the broker addresses we configured in our code
Cluster cluster = metadata.fetch();
/**
 * Look up the partition info for this topic in the cluster metadata.
 * This is the first time we run this code, so there is no partition info yet.
 */
Integer partitionsCount = cluster.partitionCountForTopic(topic);
// Return cached metadata if we have it, and if the record's partition is either undefined
// or within the known partition range
/**
 * If the partition info is already in the metadata, return it from the cache.
 * On the first pass we will not take this branch.
 */
if (partitionsCount != null && (partition == null || partition < partitionsCount))
return new ClusterAndWaitTime(cluster, 0);
/**
 * Reaching this point means we really have no metadata for this topic yet
 */
long begin = time.milliseconds();
// remaining wait time; it starts at the maximum time we are allowed to block
long remainingWaitMs = maxWaitMs;
// how much time has been spent so far
long elapsed;
// Issue metadata requests until we have metadata for the topic or maxWaitTimeMs is exceeded.
// In case we already have cached metadata for the topic, but the requested partition is greater
// than expected, issue an update request only once. This is necessary in case the metadata
// is stale and the number of partitions for this topic has increased in the meantime.
do {
log.trace("Requesting metadata update for topic {}.", topic);
/**
 * 1. Get the current metadata version.
 *    The producer versions its metadata, and every successful update bumps the version.
 * 2. Set the needUpdate flag to true.
 */
int version = metadata.requestUpdate();
/**
 * TODO This step matters:
 * we wake up the sender thread here,
 * because the metadata fetch is actually performed by the sender thread.
 * Once the sender thread is woken up, it starts doing the real work.
 */
sender.wakeup();
try {
//TODO wait for the metadata,
//a synchronous wait,
//until the sender thread has fetched it
metadata.awaitUpdate(version, remainingWaitMs);
} catch (TimeoutException ex) {
// Rethrow with original maxWaitMs to prevent logging exception with remainingWaitMs
throw new TimeoutException("Failed to update metadata after " + maxWaitMs + " ms.");
}
//try to read the cluster metadata again
cluster = metadata.fetch();
//compute how long the metadata fetch has taken so far
elapsed = time.milliseconds() - begin;
//if we have spent more than the maximum wait time, fail with a timeout
if (elapsed >= maxWaitMs)
throw new TimeoutException("Failed to update metadata after " + maxWaitMs + " ms.");
//if we got metadata but the topic is not authorized, fail
if (cluster.unauthorizedTopics().contains(topic))
throw new TopicAuthorizationException(topic);
//compute the remaining time we may still wait
remainingWaitMs = maxWaitMs - elapsed;
//try to get the partition count for the topic we want to send to;
//if it is not null, the sender thread has already fetched the metadata
partitionsCount = cluster.partitionCountForTopic(topic);
//once the metadata is available, the loop exits
} while (partitionsCount == null);
if (partition != null && partition >= partitionsCount) {
throw new KafkaException(
String.format("Invalid partition given with record: %d is not in the range [0...%d).", partition, partitionsCount));
}
return new ClusterAndWaitTime(cluster, elapsed);
}
This makes it very clear that the actual metadata fetch is performed by the sender thread, so let's look at the sender thread's run method.
6.3 run
/**
* Run a single iteration of sending
*
* @param now
* The current POSIX time in milliseconds
*/
void run(long now) {
/**
 * Scenario-driven walkthrough:
 * 1. The first time this code runs we have not fetched any metadata yet,
 *    so this cluster object is empty. If it is empty, the rest of this method
 *    is not worth reading yet, because everything below depends on the metadata.
 */
Cluster cluster = metadata.fetch();
// get the list of partitions with data ready to send
RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);
// if there are any partitions whose leaders are not known yet, force metadata update
if (!result.unknownLeaderTopics.isEmpty()) {
// The set of topics with unknown leader contains topics with leader election pending as well as
// topics which may have expired. Add the topic again to metadata to ensure it is included
// and request metadata update, since there are messages to send to the topic.
for (String topic : result.unknownLeaderTopics)
this.metadata.add(topic);
this.metadata.requestUpdate();
}
// remove any nodes we aren't ready to send to
Iterator<Node> iter = result.readyNodes.iterator();
long notReadyTimeout = Long.MAX_VALUE;
while (iter.hasNext()) {
Node node = iter.next();
if (!this.client.ready(node, now)) {
iter.remove();
notReadyTimeout = Math.min(notReadyTimeout, this.client.connectionDelay(node, now));
}
}
// create produce requests
Map<Integer, List<RecordBatch>> batches = this.accumulator.drain(cluster,
result.readyNodes,
this.maxRequestSize,
now);
if (guaranteeMessageOrder) {
// Mute all the partitions drained
for (List<RecordBatch> batchList : batches.values()) {
for (RecordBatch batch : batchList)
this.accumulator.mutePartition(batch.topicPartition);
}
}
List<RecordBatch> expiredBatches = this.accumulator.abortExpiredBatches(this.requestTimeout, now);
// update sensors
for (RecordBatch expiredBatch : expiredBatches)
this.sensors.recordErrors(expiredBatch.topicPartition.topic(), expiredBatch.recordCount);
sensors.updateProduceRequestMetrics(batches);
List<ClientRequest> requests = createProduceRequests(batches, now);
// If we have any nodes that are ready to send + have sendable data, poll with 0 timeout so this can immediately
// loop and try sending more data. Otherwise, the timeout is determined by nodes that have partitions with data
// that isn't yet sendable (e.g. lingering, backing off). Note that this specifically does not include nodes
// with sendable data that aren't ready to send since they would cause busy looping.
long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
if (result.readyNodes.size() > 0) {
log.trace("Nodes with data ready to send: {}", result.readyNodes);
log.trace("Created {} produce requests: {}", requests.size(), requests);
pollTimeout = 0;
}
for (ClientRequest request : requests)
client.send(request, now);
// if some partitions are already ready to be sent, the select time would be 0;
// otherwise if some partition already has some data accumulated but not ready yet,
// the select time will be the time difference between now and its linger expiry time;
// otherwise the select time will be the time difference between now and the metadata expiry time;
/**
 * Step 8:
 * The component that actually performs the network operations is the NetworkClient,
 * including sending requests and receiving (and handling) responses.
 *
 * Fetching the metadata relies on this call.
 */
//our guess: this is where the connection gets established
this.client.poll(pollTimeout, now);
}
Here we see that everything depends on the metadata: without it none of the other steps can run, so the most likely place where the metadata is actually fetched is poll().
6.4 The NetworkClient poll method
/**
* Do actual reads and writes to sockets.
*
* @param timeout The maximum amount of time to wait (in ms) for responses if there are none immediately,
* must be non-negative. The actual timeout will be the minimum of timeout, request timeout and
* metadata timeout
* @param now The current time in milliseconds
* @return The list of responses received
*/
@Override
public List<ClientResponse> poll(long timeout, long now) {
//Step 1: build the metadata request (if one is due)
/**
 * This touches Kafka's networking layer, which we have not analyzed yet.
 * For now we only want to know how the metadata is obtained:
 * a (metadata) request is built here.
 */
long metadataTimeout = metadataUpdater.maybeUpdate(now);
try {
//TODO Step 2: perform the actual network I/O (NIO)
this.selector.poll(Utils.min(timeout, metadataTimeout, requestTimeoutMs));
} catch (IOException e) {
log.error("Unexpected error during I/O", e);
}
// process completed actions
long updatedNow = this.time.milliseconds();
List<ClientResponse> responses = new ArrayList<>();
handleCompletedSends(responses, updatedNow);
//Step 3: handle the responses; the response contains the metadata we need
/**
 * We ended up here while studying how the producer fetches metadata.
 * Fetching metadata follows exactly the same path as sending messages:
 * check whether the connection is ready, establish it, send the request, and handle the broker's response.
 */
handleCompletedReceives(responses, updatedNow);
handleDisconnections(responses, updatedNow);
handleConnections();
handleTimedOutRequests(responses, updatedNow);
// invoke callbacks
for (ClientResponse response : responses) {
if (response.request().hasCallback()) {
try {
response.request().callback().onComplete(response);
} catch (Exception e) {
log.error("Uncaught error in request completion:", e);
}
}
}
return responses;
}
So poll() performs three steps:
- build the metadata request and send it
- receive the response
- handle the response
6.5 maybeUpdate: building and sending the request
/**
* Add a metadata request to the list of sends if we can make one
*/
private void maybeUpdate(long now, Node node) {
if (node == null) {
log.debug("Give up sending metadata request since no node is available");
// mark the timestamp for no node available to connect
this.lastNoNodeAvailableMs = now;
return;
}
String nodeConnectionId = node.idString();
/**
 * Check whether the network connection has been established.
 * We have not covered Kafka's networking yet, so assume here that it is.
 */
if (canSendRequest(nodeConnectionId)) {
this.metadataFetchInProgress = true;
MetadataRequest metadataRequest;
if (metadata.needMetadataForAllTopics())
/**
 * Build a request for the metadata of ALL topics.
 * Normally, though, we only fetch the metadata
 * of the topics we are actually sending to.
 */
metadataRequest = MetadataRequest.allTopics();
else
//by default we take this branch, because we only fetch the metadata
//for the topics we are actually sending to
metadataRequest = new MetadataRequest(new ArrayList<>(metadata.topics()));
//here the metadata request object is created
ClientRequest clientRequest = request(now, nodeConnectionId, metadataRequest);
log.debug("Sending metadata request {} to node {}", metadataRequest, node.id());
doSend(clientRequest, now);
}
}
}
6.6 handleCompletedReceives: handling the response
/**
* Handle any completed receives and update the response list with the responses received.
*
* @param responses The list of responses to update
* @param now The current time
*/
private void handleCompletedReceives(List<ClientResponse> responses, long now) {
for (NetworkReceive receive : this.selector.completedReceives()) {
//the broker id the response came from
String source = receive.source();
/**
 * Kafka allows each connection to have up to 5 requests in flight
 * (sent but not yet acknowledged)
 */
ClientRequest req = inFlightRequests.completeNext(source);
Struct body = parseResponse(receive.payload(), req.request().header());
//TODO the metadata response
if (!metadataUpdater.maybeHandleCompletedReceive(req, now, body))
responses.add(new ClientResponse(req, now, false, body));
}
}
@Override
public boolean maybeHandleCompletedReceive(ClientRequest req, long now, Struct body) {
short apiKey = req.request().header().apiKey();
if (apiKey == ApiKeys.METADATA.id && req.isInitiatedByNetworkClient()) {
//TODO handle the response
handleResponse(req.request().header(), body, now);
return true;
}
return false;
}
private void handleResponse(RequestHeader header, Struct body, long now) {
this.metadataFetchInProgress = false;
/**
 * The broker returns a binary structure,
 * so the producer has to parse it
 * and wrap it into a MetadataResponse object
 */
MetadataResponse response = new MetadataResponse(body);
/**
 * The response carries the metadata:
 * this is the cluster metadata fetched from the broker
 */
Cluster cluster = response.cluster();
// check if any topics metadata failed to get updated
Map<String, Errors> errors = response.errors();
if (!errors.isEmpty())
log.warn("Error while fetching metadata with correlation id {} : {}", header.correlationId(), errors);
// don't update the cluster if there are no valid nodes...the topic we want may still be in the process of being
// created which means we will get errors and no nodes until it exists
/**
 * If the metadata was fetched successfully
 */
if (cluster.nodes().size() > 0) {
//update the metadata
this.metadata.update(cluster, now);
} else {
log.trace("Ignoring empty metadata response with correlation id {}.", header.correlationId());
this.metadata.failedUpdate(now);
}
}
6.7 Updating the metadata
/**
* Updates the cluster metadata. If topic expiry is enabled, expiry time
* is set for topics if required and expired topics are removed from the metadata.
*/
public synchronized void update(Cluster cluster, long now) {
Objects.requireNonNull(cluster, "cluster should not be null");
this.needUpdate = false;
this.lastRefreshMs = now;
this.lastSuccessfulRefreshMs = now;
//TODO bump the metadata version
this.version += 1;
//this defaults to true
if (topicExpiryEnabled) {
// Handle expiry of topics from the metadata refresh set.
for (Iterator<Map.Entry<String, Long>> it = topics.entrySet().iterator(); it.hasNext(); ) {
Map.Entry<String, Long> entry = it.next();
long expireMs = entry.getValue();
if (expireMs == TOPIC_EXPIRY_NEEDS_UPDATE)
entry.setValue(now + TOPIC_EXPIRY_MS);
else if (expireMs <= now) {
it.remove();
log.debug("Removing unused topic {} from the metadata list, expiryMs {} now {}", entry.getKey(), expireMs, now);
}
}
}
for (Listener listener: listeners)
listener.onMetadataUpdate(cluster);
String previousClusterId = cluster.clusterResource().clusterId();
if (this.needMetadataForAllTopics) {
// the listener may change the interested topics, which could cause another metadata refresh.
// If we have already fetched all topics, however, another fetch should be unnecessary.
this.needUpdate = false;
this.cluster = getClusterForCurrentTopics(cluster);
} else {
//assign the new cluster metadata
this.cluster = cluster;
}
// The bootstrap cluster is guaranteed not to have any useful information
if (!cluster.isBootstrapConfigured()) {
String clusterId = cluster.clusterResource().clusterId();
if (clusterId == null ? previousClusterId != null : !clusterId.equals(previousClusterId))
log.info("Cluster ID: {}", cluster.clusterResource().clusterId());
clusterResourceListeners.onUpdate(cluster.clusterResource());
}
//TODO wake up the main (waiting) thread
notifyAll();
log.debug("Updated cluster metadata version {} to {}", this.version, this.cluster);
}
6.8 Summary
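Putting the pieces together: constructing the KafkaProducer does not fetch any metadata; update() only stores the bootstrap addresses and bumps the version. The first send() blocks in waitOnMetadata, marks the metadata as needing an update and wakes the Sender thread; the Sender's poll() builds a MetadataRequest, sends it, and handleResponse() calls Metadata.update() with the real cluster information, whose notifyAll() finally unblocks the waiting send() call.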
7. How the Partition Is Chosen When Sending a Message
/**
* Compute the partition for the given record.
*
* @param topic The topic name
* @param key The key to partition on (or null if no key)
* @param keyBytes serialized key to partition on (or null if no key)
* @param value The value to partition on or null
* @param valueBytes serialized value to partition on or null
* @param cluster The current cluster metadata
*/
public int partition(String topic, Object key, byte[] keyBytes, Object value, byte[] valueBytes, Cluster cluster) {
//first get the partition info of the topic we are sending to
List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
//and compute the total number of partitions
int numPartitions = partitions.size();
/**
 * When the producer sends data, a record is either
 *   message            (no key)
 *   key, message       (with a key)
 */
//Strategy 1: no key was specified
if (keyBytes == null) {
//there is a counter here:
//an AtomicInteger that is incremented on every call and initialized to a random value
int nextValue = counter.getAndIncrement();
List<PartitionInfo> availablePartitions = cluster.availablePartitionsForTopic(topic);
if (availablePartitions.size() > 0) {
//compute which partition to send to:
//a number mod 10 always falls between 0 and 9, e.g.
//11 % 9
//12 % 9
//13 % 9
//Arguably a bitwise AND would be faster than the modulo here: Utils.toPositive(nextValue) & (availablePartitions.size() - 1);
//toPositive is needed because the initial counter value may be negative
int part = Utils.toPositive(nextValue) % availablePartitions.size();
return availablePartitions.get(part).partition();
} else {
// no partitions are available, give a non-available partition
return Utils.toPositive(nextValue) % numPartitions;
}
}
//Strategy 2: a key was specified
else {
// hash the keyBytes to choose a partition
//Strategy 2: this is the path taken when a key is given
/**
 * hash the keyBytes to choose a partition:
 * take a hash of the key modulo the total number of partitions.
 * The same key always maps to the same partition,
 * so if we want messages to land in one particular partition we must set a key; this is very important.
 */
return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
}
}
- If a key is specified: hash the key and take it modulo the partition count (a sketch of a custom partitioner implementing both strategies follows below).
- If no key is specified: round-robin over the partitions.
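As a small illustration (not part of the Kafka code base), a custom Partitioner reproducing the same two strategies could look roughly like this; the class name is made up.
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

public class KeyHashOrRoundRobinPartitioner implements Partitioner {
    // round-robin counter, initialized to a random value just like the default partitioner
    private final AtomicInteger counter = new AtomicInteger(ThreadLocalRandom.current().nextInt());

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        int numPartitions = partitions.size();
        if (keyBytes == null) {
            // no key: spread records round-robin over the partitions
            return Utils.toPositive(counter.getAndIncrement()) % numPartitions;
        }
        // key present: same key always maps to the same partition
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
It would be plugged in with props.put("partitioner.class", KeyHashOrRoundRobinPartitioner.class.getName()).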
8. A First Look at How the Accumulator Packs Records
8.1 The accumulator append method
// We keep track of the number of appending thread to make sure we do not miss batches in
// abortIncompleteBatches().
appendsInProgress.incrementAndGet();
try {
// check if we have an in-progress batch
/**
 * Step 1:
 * get the deque for the record's partition
 */
Deque<RecordBatch> dq = getOrCreateDeque(tp); // the deque for this partition
/**
 * Suppose threads 1, 2 and 3 all call append concurrently
 */
synchronized (dq) {
if (closed)
throw new IllegalStateException("Cannot send after the producer is closed.");
/**
 * Step 2:
 * Try to append the record to an existing batch in the deque.
 * The first attempt fails: we only have the deque so far,
 * the record has to live in a batch object, that batch needs memory,
 * and no memory has been allocated yet, so in the scenario-driven walkthrough
 * the first pass through this code does not succeed.
 */
RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
//when thread 1 gets here
//for the first time, appendResult is null
if (appendResult != null)
return appendResult;
}//the lock is released here
// we don't have an in-progress record batch try to allocate a new batch
/**
 * Step 3: compute the size of the batch.
 * Take the maximum of the record size and the configured batch size and use it as this batch's size,
 * since a single record may be larger than the configured batch size.
 *
 * The default batch size is 16 KB.
 *
 * Reading this code suggests an optimization:
 * if our messages are regularly larger than 16 KB, every message becomes its own batch,
 * which means messages are effectively sent one by one and batching loses its purpose.
 * So the batch size should be tuned to the actual message sizes of the workload.
 */
int size = Math.max(this.batchSize, Records.LOG_OVERHEAD + Record.recordSize(key, value));
log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());
/**
 * Step 4:
 * allocate memory according to the batch size
 */
ByteBuffer buffer = free.allocate(size, maxTimeToBlock);
synchronized (dq) {
// Need to check if producer is closed again after grabbing the dequeue lock.
if (closed)
throw new IllegalStateException("Cannot send after the producer is closed.");
/**
 * Step 5:
 * Try to append the record to a batch again.
 * On the first pass this still fails (appendResult == null):
 * the memory has been allocated, but the batch itself has not been created yet,
 * so there is nothing to write into.
 */
RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
if (appendResult != null) {
// Somebody else found us a batch, return the one we waited for! Hopefully this doesn't happen often...
free.deallocate(buffer);
return appendResult;
}
MemoryRecords records = MemoryRecords.emptyRecords(buffer, compression, this.batchSize);
RecordBatch batch = new RecordBatch(tp, records, time.milliseconds());
FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, callback, time.milliseconds()));
dq.addLast(batch);
incomplete.add(batch);
return new RecordAppendResult(future, dq.size() > 1 || batch.records.isFull(), true);
}
} finally {
appendsInProgress.decrementAndGet();
}
In short: the record's TopicPartition is determined first, then the corresponding queue (a Deque) is looked up; if there is no usable batch, memory is allocated, a Batch is created and appended to the tail of the queue.
So we roughly know how the data gets packed. Every incoming record looks up the queue for its partition, so how is this lookup made so efficient?
We also noticed some code that looks redundant; is it really redundant?
9. batches Is Implemented with CopyOnWriteMap
package org.apache.kafka.common.utils;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentMap;
/**
* A simple read-optimized map implementation that synchronizes only writes and does a full copy on each modification
*/
public class CopyOnWriteMap<K, V> implements ConcurrentMap<K, V> {
/**
 * The core field is this map.
 * It is declared volatile, so when the reference changes
 * other threads see the new value (memory visibility).
 */
private volatile Map<K, V> map;
public CopyOnWriteMap() {
this.map = Collections.emptyMap();
}
/**
 * Reads take no lock, so they perform very well even under heavy concurrency,
 * and they are still thread-safe
 * because of the copy-on-write (read/write separation) design.
 * @param k
 * @return
 */
@Override
public V get(Object k) {
return map.get(k);
}
@Override
public synchronized V put(K k, V v) {
//the write is synchronized
//a new map (new memory) is created
//read/write separation:
//the new entry is inserted into the copy,
//while concurrent reads still see the old map
Map<K, V> copy = new HashMap<K, V>(this.map);
V prev = copy.put(k, v);
//publish the copy as the current map; other threads now see the change
this.map = Collections.unmodifiableMap(copy);
return prev;
}
//...
}
Here we see that batches uses a copy-on-write (read/write separation) design, and the access pattern is clearly read-heavy. batches is essentially a Map<TopicPartition, Deque<RecordBatch>>: for every incoming record we look up its TopicPartition, and only when the deque does not exist yet do we create one and add it to batches. With 10,000 records and 3 partitions that is 10,000 reads but only 3 writes; this attention to detail is one of the reasons Kafka performs so well (see the sketch below).
This design only fits read-heavy, write-rarely scenarios.
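To illustrate the read-heavy lookup pattern (not Kafka's actual code), here is a tiny self-contained sketch; it uses a ConcurrentHashMap instead of CopyOnWriteMap (whose synchronized putIfAbsent is elided above as "//..."), but the access pattern is the same. Strings and Integers stand in for TopicPartition and RecordBatch, and the single-threaded demo ignores the per-deque locking Kafka adds around the deque.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class DequePerKeyDemo {
    // stands in for Map<TopicPartition, Deque<RecordBatch>>
    private final ConcurrentMap<String, Deque<Integer>> batches = new ConcurrentHashMap<>();

    Deque<Integer> getOrCreateDeque(String tp) {
        Deque<Integer> d = batches.get(tp);                   // lock-free read: the common case
        if (d != null)
            return d;
        d = new ArrayDeque<>();
        Deque<Integer> previous = batches.putIfAbsent(tp, d); // rare write: only for a new partition
        return previous != null ? previous : d;
    }

    public static void main(String[] args) {
        DequePerKeyDemo demo = new DequePerKeyDemo();
        for (int i = 0; i < 10_000; i++)
            demo.getOrCreateDeque("topic-" + (i % 3)).addLast(i); // 10,000 reads, only 3 writes
        System.out.println(demo.batches.size()); // prints 3
    }
}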
10. Writing the Data into Its Batch (Segmented Locking)
The code has this shape:
append(){
// the batches map is copy-on-write, so this lookup is thread-safe
Deque<RecordBatch> dq = getOrCreateDeque(tp); // the deque for this partition
synchronized (dq) {
}
//....
synchronized (dq) {
}
}
This is exactly the classic segmented (fine-grained) locking pattern; the same structure appears in HDFS, and it is worth studying for our own concurrent programming.
Let's look at the append code again carefully, assuming this time that all messages are sent to a single partition of a single topic.
10.1 Thread 1
// check if we have an in-progress batch
/**
 * Step 1:
 * get the deque for the record's partition
 */
Deque<RecordBatch> dq = getOrCreateDeque(tp); // the deque for this partition
/**
 * Suppose threads 1, 2 and 3 all call append concurrently
 */
synchronized (dq) {
if (closed)
throw new IllegalStateException("Cannot send after the producer is closed.");
/**
 * Step 2:
 * Try to append the record to an existing batch.
 * The first attempt fails: we only have the deque so far,
 * the record has to live in a batch object, that batch needs memory,
 * and no memory has been allocated yet, so the first pass through
 * this code does not succeed.
 * Thread 1 therefore gets nothing here.
 */
RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
//when thread 1 gets here
//for the first time, appendResult is null
if (appendResult != null)
return appendResult;
}//the lock is released here
At this point thread 1 got nothing back, i.e. appendResult is null. Once this block finishes, remember that the lock is released here, so thread 2 can enter.
// we don't have an in-progress record batch try to allocate a new batch
/**
 * Step 3: compute the size of the batch.
 * Take the maximum of the record size and the configured batch size and use it as this batch's size,
 * since a single record may be larger than the configured batch size.
 *
 * The default batch size is 16 KB.
 *
 * So if our messages regularly exceed 16 KB, every message becomes its own batch
 * and is effectively sent one by one, which defeats the purpose of batching;
 * this is an optimization point: tune the batch size to the actual message sizes.
 */
int size = Math.max(this.batchSize, Records.LOG_OVERHEAD + Record.recordSize(key, value));
log.trace("Allocating a new {} byte message buffer for topic {} partition {}", size, tp.topic(), tp.partition());
/**
 * Step 4:
 * allocate memory according to the batch size.
 *
 * Threads 1, 2 and 3 all reach this point and each allocates memory,
 * say 16 KB per thread:
 * thread 1: 16 KB
 * thread 2: 16 KB
 */
ByteBuffer buffer = free.allocate(size, maxTimeToBlock);
Thread 1 has now allocated a 16 KB buffer.
synchronized (dq) {
// Need to check if producer is closed again after grabbing the dequeue lock.
if (closed)
throw new IllegalStateException("Cannot send after the producer is closed.");
/**
 * Step 5:
 * Try to append the record to a batch again.
 * For thread 1 this still fails (appendResult == null):
 * the memory is allocated, but the batch has not been created yet,
 * so there is nothing to write into.
 */
RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
if (appendResult != null) {
// Somebody else found us a batch, return the one we waited for! Hopefully this doesn't happen often...
// release the memory
/**
 * When thread 2 gets here its record has already been written into an existing batch,
 * so its own buffer is no longer needed and is returned to the pool
 */
free.deallocate(buffer);
//
return appendResult;
}
/**
 * Step 6: build the RecordBatch.
 * Thread 1 wraps its buffer into a batch here
 */
MemoryRecords records = MemoryRecords.emptyRecords(buffer, compression, this.batchSize);
RecordBatch batch = new RecordBatch(tp, records, time.milliseconds());
FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, callback, time.milliseconds()));
/**
 * Step 7:
 * append the batch to the tail of the deque
 * (thread 1 does this here)
 */
dq.addLast(batch);
incomplete.add(batch);
return new RecordAppendResult(future, dq.size() > 1 || batch.records.isFull(), true);
}
Thread 1's second tryAppend attempt still fails: it only has the memory, there is no RecordBatch yet. So it creates the RecordBatch and appends it to the tail of the deque.
10.2 Thread 2
By now it should be clear where thread 2 is:
synchronized (dq) {
// Need to check if producer is closed again after grabbing the dequeue lock.
if (closed)
throw new IllegalStateException("Cannot send after the producer is closed.");
/**
 * Step 5:
 * Try to append the record to a batch.
 * On the very first pass this failed (appendResult == null):
 * memory was allocated but no batch existed yet,
 * so there was nothing to write into.
 */
RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
if (appendResult != null) {
// Somebody else found us a batch, return the one we waited for! Hopefully this doesn't happen often...
// release the memory
/**
 * When thread 2 gets here its record has already been written into an existing batch,
 * so its own buffer is no longer needed and is returned to the pool
 */
free.deallocate(buffer);
//
return appendResult;
}
/**
 * Step 6: build the RecordBatch
 * (this is the path thread 1 took: it wrapped its buffer into a batch here)
 */
MemoryRecords records = MemoryRecords.emptyRecords(buffer, compression, this.batchSize);
RecordBatch batch = new RecordBatch(tp, records, time.milliseconds());
FutureRecordMetadata future = Utils.notNull(batch.tryAppend(timestamp, key, value, callback, time.milliseconds()));
/**
 * Step 7:
 * append the batch to the tail of the deque
 * (thread 1 did this here)
 */
dq.addLast(batch);
incomplete.add(batch);
return new RecordAppendResult(future, dq.size() > 1 || batch.records.isFull(), true);
}
Thread 2 also got this far: because of the segmented locking, thread 1 had already released the lock, so thread 2 also allocated a 16 KB buffer. It then enters tryAppend:
/**
* If `RecordBatch.tryAppend` fails (i.e. the record batch is full), close its memory records to release temporary
* resources (like compression streams buffers).
*/
private RecordAppendResult tryAppend(long timestamp, byte[] key, byte[] value, Callback callback, Deque<RecordBatch> deque) {
//first take the last batch from the deque
RecordBatch last = deque.peekLast();
//on the very first pass there is no batch yet, so last is null;
//when thread 2 arrives, last is not null
if (last != null) {
//thread 2 simply appends its record; if the batch is full, future will be null
FutureRecordMetadata future = last.tryAppend(timestamp, key, value, callback, time.milliseconds());
if (future == null)
last.records.close();
else
//so the return value is non-null
return new RecordAppendResult(future, deque.size() > 1 || last.records.isFull(), false);
}
//on the very first pass there is nothing to return
return null;
}
This time last is not null and the batch is not full, so the record is appended to that batch and the result is returned directly. The append succeeded, but thread 2 had also allocated a 16 KB buffer, so it releases that buffer and returns.
10.3 Thread 3
When thread 3 enters, a batch already exists, so it simply appends to it:
/**
 * Suppose threads 1, 2 and 3 all call append concurrently
 */
synchronized (dq) {
if (closed)
throw new IllegalStateException("Cannot send after the producer is closed.");
/**
 * Step 2:
 * Try to append the record to an existing batch.
 * On the very first pass this failed (only the deque existed, no memory, no batch),
 * and thread 1 got nothing here;
 * for thread 3, however, a batch is already in the deque, so tryAppend succeeds.
 */
RecordAppendResult appendResult = tryAppend(timestamp, key, value, callback, dq);
//when thread 1 first came in, appendResult was null;
//for thread 3 it is non-null and is returned directly
if (appendResult != null)
return appendResult;
}//the lock is released here
10.4 Summary
This is the most refined part of the Producer. The seemingly redundant code is not redundant at all: it squeezes out every bit of performance. If we wrote it ourselves we would probably just put a synchronized keyword on the whole method, but Kafka does not do that; it uses segmented locking together with a double-check mechanism, and that is worth learning from. A sketch of the pattern follows.
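A minimal sketch of the pattern (not Kafka's code; the class name and the ByteBuffer "batch" are made up): lock briefly to try the fast path, release the lock for the expensive allocation, then lock again and re-check whether another thread has already done the work.
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

public class SegmentedLockSketch {
    private final Deque<ByteBuffer> deque = new ArrayDeque<>();

    ByteBuffer append(int size) {
        synchronized (deque) {                       // critical section 1: fast path
            ByteBuffer last = deque.peekLast();
            if (last != null && last.remaining() >= size)
                return last;                         // an existing "batch" still has room
        }                                            // lock released: other threads may proceed
        ByteBuffer fresh = ByteBuffer.allocate(Math.max(size, 16 * 1024)); // slow work, no lock held
        synchronized (deque) {                       // critical section 2: double-check
            ByteBuffer last = deque.peekLast();
            if (last != null && last.remaining() >= size)
                return last;                         // someone else created a batch meanwhile; drop ours
            deque.addLast(fresh);
            return fresh;
        }
    }
}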
11. The Producer Memory Pool Design
11.1 Allocating memory
/**
* Allocate a buffer of the given size. This method blocks if there is not enough memory and the buffer pool
* is configured with blocking mode.
*
* @param size The buffer size to allocate in bytes
* @param maxTimeToBlockMs The maximum time in milliseconds to block for buffer memory to be available
* @return The buffer
* @throws InterruptedException If the thread is interrupted while blocked
* @throws IllegalArgumentException if size is larger than the total memory controlled by the pool (and hence we would block
* forever)
*/
public ByteBuffer allocate(int size, long maxTimeToBlockMs) throws InterruptedException {
//if the requested size exceeds the total pool size (32 MB by default), fail immediately
if (size > this.totalMemory)
throw new IllegalArgumentException("Attempt to allocate " + size
+ " bytes, but there is a hard limit of "
+ this.totalMemory
+ " on memory allocations.");
//take the lock
this.lock.lock();
try {
// check if we have a free buffer of the right size pooled
/**
 * Scenario-driven: on the first call the pool holds no pooled buffers,
 * so nothing can be taken from it here.
 * poolableSize is the batch size by default;
 * if the request is exactly one batch and the free list is not empty, hand out the first buffer.
 */
if (size == poolableSize && !this.free.isEmpty())
return this.free.pollFirst();
// now check if the request is immediately satisfiable with the
// memory on hand or if we need to block
/**
 * number of pooled buffers * batch size = size of the free list,
 * i.e. the pooled part of the memory pool
 */
int freeListSize = this.free.size() * this.poolableSize;
/**
 * this.availableMemory + freeListSize is the total memory currently available;
 * check whether it covers the requested size
 */
if (this.availableMemory + freeListSize >= size) {
// we have enough unallocated or pooled memory
// to satisfy the request right away
freeUp(size);
this.availableMemory -= size;
lock.unlock();
//allocate a plain NIO ByteBuffer of size Math.max(record size, configured/default batch size, 16 KB by default)
return ByteBuffer.allocate(size);
} else {
/**
 * This branch means the pool does not have enough memory. For example, with a 32 KB message
 * we need a 32 KB buffer (32 KB > 16 KB); if the pool cannot provide 32 KB right now we end up here.
 */
// we are out of memory and will have to block
int accumulated = 0;
ByteBuffer buffer = null;
// precise wake-up via a java.util.concurrent Condition
Condition moreMemory = this.lock.newCondition();
long remainingTimeToBlockNs = TimeUnit.MILLISECONDS.toNanos(maxTimeToBlockMs);
//wait for someone to release memory;
//the Condition is appended to a queue so that, with many waiting threads, memory is granted first come, first served
this.waiters.addLast(moreMemory);
// loop over and over until we have a buffer or have reserved
// enough memory to allocate one
/**
 * Overall idea: we may not get the whole amount at once,
 * so we accumulate returned memory bit by bit
 * until accumulated (16 KB + 16 KB + ...) reaches the requested size
 */
while (accumulated < size) {
long startWaitNs = time.nanoseconds();
long timeNs;
boolean waitingTimeElapsed;
/**
 * Wait here for someone to release memory.
 * Since this is an await on the Condition, we can guess that whoever returns memory
 * to the pool signals it; once memory is returned, the code continues.
 */
try {
/**
 * We continue when either
 * 1. the timeout expires, or
 * 2. we are signalled
 */
waitingTimeElapsed = !moreMemory.await(remainingTimeToBlockNs, TimeUnit.NANOSECONDS);
} catch (InterruptedException e) {
this.waiters.remove(moreMemory);
throw e;
} finally {
long endWaitNs = time.nanoseconds();
timeNs = Math.max(0L, endWaitNs - startWaitNs);
this.waitTime.record(timeNs, time.milliseconds());
}
if (waitingTimeElapsed) {
this.waiters.remove(moreMemory);
throw new TimeoutException("Failed to allocate memory within the configured max blocking time " + maxTimeToBlockMs + " ms.");
}
remainingTimeToBlockNs -= timeNs;
// check if we can satisfy this request from the free list,
// otherwise allocate memory
/**
 * Once woken up, check the pool again:
 * if we want exactly one poolable buffer (16 KB) and the free list is not empty,
 * take a buffer straight from it
 */
if (accumulated == 0 && size == this.poolableSize && !this.free.isEmpty()) {
// just grab a buffer from the free list
// here we get the memory directly
buffer = this.free.pollFirst();
accumulated = size;
} else {
// we'll need to allocate memory, but we may only get
// part of what we need on this iteration
freeUp(size - accumulated);
//otherwise keep accumulating whatever memory is available until the requested size is reached
int got = (int) Math.min(size - accumulated, this.availableMemory);
this.availableMemory -= got;
accumulated += got;
}
}
}
11.2 Returning memory
/**
* Return buffers to the pool. If they are of the poolable size add them to the free list, otherwise just mark the
* memory as free.
*
* @param buffer The buffer to return
* @param size The size of the buffer to mark as deallocated, note that this maybe smaller than buffer.capacity
* since the buffer may re-allocate itself during in-place compression
*/
public void deallocate(ByteBuffer buffer, int size) {
lock.lock();
try {
//if the returned buffer is exactly one batch in size, put it back on the free list
if (size == this.poolableSize && size == buffer.capacity()) {
buffer.clear();
this.free.add(buffer);
} else {
//otherwise drop the buffer (it will be garbage collected) and just count its size as available again
this.availableMemory += size;
}
Condition moreMem = this.waiters.peekFirst();
if (moreMem != null)
moreMem.signal();
} finally {
lock.unlock();
}
}
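To summarize the allocate/deallocate contract, here is a heavily simplified sketch (not the real BufferPool): a free list of fixed-size buffers guarded by a ReentrantLock, with a Condition that blocks allocators until deallocate() returns memory. Timeouts, FIFO fairness between waiters and the bit-by-bit accumulation logic are deliberately omitted.
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class SimpleBufferPool {
    private final int poolableSize;
    private long availableMemory;                               // unpooled, still-unallocated bytes
    private final Deque<ByteBuffer> free = new ArrayDeque<>();  // pooled buffers of exactly poolableSize
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition moreMemory = lock.newCondition();

    public SimpleBufferPool(long totalMemory, int poolableSize) {
        this.availableMemory = totalMemory;
        this.poolableSize = poolableSize;
    }

    public ByteBuffer allocate(int size) throws InterruptedException {
        lock.lock();
        try {
            if (size == poolableSize && !free.isEmpty())
                return free.pollFirst();                         // reuse a pooled buffer, no allocation
            while (availableMemory + (long) free.size() * poolableSize < size)
                moreMemory.await();                              // block until deallocate() signals us
            while (availableMemory < size && !free.isEmpty())
                availableMemory += free.pollFirst().capacity();  // "free up" pooled buffers if needed
            availableMemory -= size;
            return ByteBuffer.allocate(size);
        } finally {
            lock.unlock();
        }
    }

    public void deallocate(ByteBuffer buffer) {
        lock.lock();
        try {
            if (buffer.capacity() == poolableSize) {             // poolable: keep it for reuse
                buffer.clear();
                free.addLast(buffer);
            } else {                                             // odd size: let GC reclaim it
                availableMemory += buffer.capacity();
            }
            moreMemory.signal();                                 // wake one waiting allocator
        } finally {
            lock.unlock();
        }
    }
}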
11.3 Summary
12. A First Look at the Sender Thread
/**
* Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE
* file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file
* to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the
* License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
* an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
* specific language governing permissions and limitations under the License.
*/
package org.apache.kafka.clients.producer.internals;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.ClientRequest;
import org.apache.kafka.clients.ClientResponse;
import org.apache.kafka.clients.KafkaClient;
import org.apache.kafka.clients.Metadata;
import org.apache.kafka.clients.RequestCompletionHandler;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.Node;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.InvalidMetadataException;
import org.apache.kafka.common.errors.RetriableException;
import org.apache.kafka.common.errors.TopicAuthorizationException;
import org.apache.kafka.common.errors.UnknownTopicOrPartitionException;
import org.apache.kafka.common.metrics.Measurable;
import org.apache.kafka.common.metrics.MetricConfig;
import org.apache.kafka.common.MetricName;
import org.apache.kafka.common.metrics.Metrics;
import org.apache.kafka.common.metrics.Sensor;
import org.apache.kafka.common.metrics.stats.Avg;
import org.apache.kafka.common.metrics.stats.Max;
import org.apache.kafka.common.metrics.stats.Rate;
import org.apache.kafka.common.protocol.ApiKeys;
import org.apache.kafka.common.protocol.Errors;
import org.apache.kafka.common.record.Record;
import org.apache.kafka.common.requests.ProduceRequest;
import org.apache.kafka.common.requests.ProduceResponse;
import org.apache.kafka.common.requests.RequestSend;
import org.apache.kafka.common.utils.Time;
import org.apache.kafka.common.utils.Utils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* The background thread that handles the sending of produce requests to the Kafka cluster. This thread makes metadata
* requests to renew its view of the cluster and then sends produce requests to the appropriate nodes.
*/
public class Sender implements Runnable {
private static final Logger log = LoggerFactory.getLogger(Sender.class);
/* the state of each nodes connection */
private final KafkaClient client;
/* the record accumulator that batches records */
private final RecordAccumulator accumulator;
/* the metadata for the client */
private final Metadata metadata;
/* the flag indicating whether the producer should guarantee the message order on the broker or not. */
private final boolean guaranteeMessageOrder;
/* the maximum request size to attempt to send to the server */
private final int maxRequestSize;
/* the number of acknowledgements to request from the server */
private final short acks;
/* the number of times to retry a failed request before giving up */
private final int retries;
/* the clock instance used for getting the time */
private final Time time;
/* true while the sender thread is still running */
private volatile boolean running;
/* true when the caller wants to ignore all unsent/inflight messages and force close. */
private volatile boolean forceClose;
/* metrics */
private final SenderMetrics sensors;
/* param clientId of the client */
private String clientId;
/* the max time to wait for the server to respond to the request*/
private final int requestTimeout;
public Sender(KafkaClient client,
Metadata metadata,
RecordAccumulator accumulator,
boolean guaranteeMessageOrder,
int maxRequestSize,
short acks,
int retries,
Metrics metrics,
Time time,
String clientId,
int requestTimeout) {
this.client = client;
this.accumulator = accumulator;
this.metadata = metadata;
this.guaranteeMessageOrder = guaranteeMessageOrder;
this.maxRequestSize = maxRequestSize;
this.running = true;
this.acks = acks;
this.retries = retries;
this.time = time;
this.clientId = clientId;
this.sensors = new SenderMetrics(metrics);
this.requestTimeout = requestTimeout;
}
/**
* The main run loop for the sender thread
*/
public void run() {
log.debug("Starting Kafka producer I/O thread.");
// main loop, runs until close is called
//this is essentially an infinite loop that keeps running,
//so once the sender thread is started it never stops (until close is called)
while (running) {
try {
//TODO the core logic
run(time.milliseconds());
} catch (Exception e) {
log.error("Uncaught error in kafka producer I/O thread: ", e);
}
}
log.debug("Beginning shutdown of Kafka producer I/O thread, sending remaining records.");
// okay we stopped accepting requests but there may still be
// requests in the accumulator or waiting for acknowledgment,
// wait until these are completed.
while (!forceClose && (this.accumulator.hasUnsent() || this.client.inFlightRequestCount() > 0)) {
try {
run(time.milliseconds());
} catch (Exception e) {
log.error("Uncaught error in kafka producer I/O thread: ", e);
}
}
if (forceClose) {
// We need to fail all the incomplete batches and wake up the threads waiting on
// the futures.
this.accumulator.abortIncompleteBatches();
}
try {
this.client.close();
} catch (Exception e) {
log.error("Failed to close network client", e);
}
log.debug("Shutdown of Kafka producer I/O thread has completed.");
}
/**
* Run a single iteration of sending
*
* @param now
* The current POSIX time in milliseconds
*/
void run(long now) {
/**
 * Scenario-driven walkthrough:
 * 1. The first time this code runs we have not fetched any metadata yet,
 *    so this cluster object is empty; everything below depends on the metadata,
 *    so without it the rest of this method effectively does nothing.
 *    Fetch the metadata.
 */
Cluster cluster = metadata.fetch();
// get the list of partitions with data ready to send
/**
 * 2. First determine which partitions have data ready to send
 *    (i.e. which batches meet the conditions for sending),
 *    and find the broker hosting each partition's leader.
 */
RecordAccumulator.ReadyCheckResult result = this.accumulator.ready(cluster, now);
// if there are any partitions whose leaders are not known yet, force metadata update
/**
 * 3. Mark the topics whose metadata could not be fetched (unknown leaders)
 */
if (!result.unknownLeaderTopics.isEmpty()) {
// The set of topics with unknown leader contains topics with leader election pending as well as
// topics which may have expired. Add the topic again to metadata to ensure it is included
// and request metadata update, since there are messages to send to the topic.
for (String topic : result.unknownLeaderTopics)
this.metadata.add(topic);
this.metadata.requestUpdate();
}
// remove any nodes we aren't ready to send to
/**
 * 4. Check whether the connection to each broker we want to send to has been established
 */
Iterator<Node> iter = result.readyNodes.iterator();
long notReadyTimeout = Long.MAX_VALUE;
while (iter.hasNext()) {
Node node = iter.next();
if (!this.client.ready(node, now)) {
//ready() returned false, so this node cannot be sent to yet;
//remove it from the ready set. On the very first pass no connections
//exist at all, so every node ends up being removed here.
iter.remove();
notReadyTimeout = Math.min(notReadyTimeout, this.client.connectionDelay(node, now));
}
}
// create produce requests
/**
 * 5. There may be many partitions with data to send, but if we have few machines
 *    and many partitions, several partition leaders will live on the same broker.
 *    Batches for the same broker are therefore grouped per node and sent in one
 *    request, which saves scarce network resources.
 *
 *    If no connection has been established yet, readyNodes is empty and the
 *    code below effectively does nothing.
 */
Map<Integer, List<RecordBatch>> batches = this.accumulator.drain(cluster,
result.readyNodes,
this.maxRequestSize,
now);
if (guaranteeMessageOrder) {
// Mute all the partitions drained
//if batches is empty, this loop body is never executed
for (List<RecordBatch> batchList : batches.values()) {
for (RecordBatch batch : batchList)
this.accumulator.mutePartition(batch.topicPartition);
}
}
/**
 * 6. Abort batches that have been sitting in the accumulator longer than the request timeout.
 */
List<RecordBatch> expiredBatches = this.accumulator.abortExpiredBatches(this.requestTimeout, now);
// update sensors
for (RecordBatch expiredBatch : expiredBatches)
this.sensors.recordErrors(expiredBatch.topicPartition.topic(), expiredBatch.recordCount);
sensors.updateProduceRequestMetrics(batches);
/**
 * 7. Build one produce request per target broker from the drained batches.
 */
List<ClientRequest> requests = createProduceRequests(batches, now);
// If we have any nodes that are ready to send + have sendable data, poll with 0 timeout so this can immediately
// loop and try sending more data. Otherwise, the timeout is determined by nodes that have partitions with data
// that isn't yet sendable (e.g. lingering, backing off). Note that this specifically does not include nodes
// with sendable data that aren't ready to send since they would cause busy looping.
long pollTimeout = Math.min(result.nextReadyCheckDelayMs, notReadyTimeout);
if (result.readyNodes.size() > 0) {
log.trace("Nodes with data ready to send: {}", result.readyNodes);
log.trace("Created {} produce requests: {}", requests.size(), requests);
pollTimeout = 0;
}
for (ClientRequest request : requests)
client.send(request, now);
// if some partitions are already ready to be sent, the select time would be 0;
// otherwise if some partition already has some data accumulated but not ready yet,
// the select time will be the time difference between now and its linger expiry time;
// otherwise the select time will be the time difference between now and the metadata expiry time;
/**
 * 8. The component that actually performs network I/O is the NetworkClient:
 *    sending requests, receiving and handling responses.
 *
 *    Fetching metadata also relies on this call.
 */
//this is also where pending connections are actually established
this.client.poll(pollTimeout, now);
}
/**
* Start closing the sender (won't actually complete until all data is sent out)
*/
public void initiateClose() {
// Ensure accumulator is closed first to guarantee that no more appends are accepted after
// breaking from the sender loop. Otherwise, we may miss some callbacks when shutting down.
this.accumulator.close();
this.running = false;
this.wakeup();
}
/**
* Closes the sender without sending out any pending messages.
*/
public void forceClose() {
this.forceClose = true;
initiateClose();
}
/**
* Handle a produce response
*/
private void handleProduceResponse(ClientResponse response, Map<TopicPartition, RecordBatch> batches, long now) {
int correlationId = response.request().request().header().correlationId();
if (response.wasDisconnected()) {
log.trace("Cancelled request {} due to node {} being disconnected", response, response.request()
.request()
.destination());
for (RecordBatch batch : batches.values())
completeBatch(batch, Errors.NETWORK_EXCEPTION, -1L, Record.NO_TIMESTAMP, correlationId, now);
} else {
log.trace("Received produce response from node {} with correlation id {}",
response.request().request().destination(),
correlationId);
// if we have a response, parse it
if (response.hasResponse()) {
ProduceResponse produceResponse = new ProduceResponse(response.responseBody());
for (Map.Entry<TopicPartition, ProduceResponse.PartitionResponse> entry : produceResponse.responses().entrySet()) {
TopicPartition tp = entry.getKey();
ProduceResponse.PartitionResponse partResp = entry.getValue();
Errors error = Errors.forCode(partResp.errorCode);
RecordBatch batch = batches.get(tp);
completeBatch(batch, error, partResp.baseOffset, partResp.timestamp, correlationId, now);
}
this.sensors.recordLatency(response.request().request().destination(), response.requestLatencyMs());
this.sensors.recordThrottleTime(response.request().request().destination(),
produceResponse.getThrottleTime());
} else {
// this is the acks = 0 case, just complete all requests
for (RecordBatch batch : batches.values())
completeBatch(batch, Errors.NONE, -1L, Record.NO_TIMESTAMP, correlationId, now);
}
}
}
/**
* Complete or retry the given batch of records.
*
* @param batch The record batch
* @param error The error (or null if none)
* @param baseOffset The base offset assigned to the records if successful
* @param timestamp The timestamp returned by the broker for this batch
* @param correlationId The correlation id for the request
* @param now The current POSIX time stamp in milliseconds
*/
private void completeBatch(RecordBatch batch, Errors error, long baseOffset, long timestamp, long correlationId, long now) {
if (error != Errors.NONE && canRetry(batch, error)) {
// retry
log.warn("Got error produce response with correlation id {} on topic-partition {}, retrying ({} attempts left). Error: {}",
correlationId,
batch.topicPartition,
this.retries - batch.attempts - 1,
error);
this.accumulator.reenqueue(batch, now);
this.sensors.recordRetries(batch.topicPartition.topic(), batch.recordCount);
} else {
RuntimeException exception;
if (error == Errors.TOPIC_AUTHORIZATION_FAILED)
exception = new TopicAuthorizationException(batch.topicPartition.topic());
else
exception = error.exception();
// tell the user the result of their request
batch.done(baseOffset, timestamp, exception);
this.accumulator.deallocate(batch);
if (error != Errors.NONE)
this.sensors.recordErrors(batch.topicPartition.topic(), batch.recordCount);
}
if (error.exception() instanceof InvalidMetadataException) {
if (error.exception() instanceof UnknownTopicOrPartitionException)
log.warn("Received unknown topic or partition error in produce request on partition {}. The " +
"topic/partition may not exist or the user may not have Describe access to it", batch.topicPartition);
metadata.requestUpdate();
}
// Unmute the completed partition.
if (guaranteeMessageOrder)
this.accumulator.unmutePartition(batch.topicPartition);
}
/**
* We can retry a send if the error is transient and the number of attempts taken is fewer than the maximum allowed
*/
private boolean canRetry(RecordBatch batch, Errors error) {
return batch.attempts < this.retries && error.exception() instanceof RetriableException;
}
/**
* Transfer the record batches into a list of produce requests on a per-node basis
*/
private List<ClientRequest> createProduceRequests(Map<Integer, List<RecordBatch>> collated, long now) {
List<ClientRequest> requests = new ArrayList<ClientRequest>(collated.size());
for (Map.Entry<Integer, List<RecordBatch>> entry : collated.entrySet())
requests.add(produceRequest(now, entry.getKey(), acks, requestTimeout, entry.getValue()));
return requests;
}
/**
* Create a produce request from the given record batches
*/
private ClientRequest produceRequest(long now, int destination, short acks, int timeout, List<RecordBatch> batches) {
Map<TopicPartition, ByteBuffer> produceRecordsByPartition = new HashMap<TopicPartition, ByteBuffer>(batches.size());
final Map<TopicPartition, RecordBatch> recordsByPartition = new HashMap<TopicPartition, RecordBatch>(batches.size());
for (RecordBatch batch : batches) {
TopicPartition tp = batch.topicPartition;
produceRecordsByPartition.put(tp, batch.records.buffer());
recordsByPartition.put(tp, batch);
}
ProduceRequest request = new ProduceRequest(acks, timeout, produceRecordsByPartition);
RequestSend send = new RequestSend(Integer.toString(destination),
this.client.nextRequestHeader(ApiKeys.PRODUCE),
request.toStruct());
RequestCompletionHandler callback = new RequestCompletionHandler() {
public void onComplete(ClientResponse response) {
handleProduceResponse(response, recordsByPartition, time.milliseconds());
}
};
return new ClientRequest(now, acks != 0, send, callback);
}
/**
* Wake up the selector associated with this send thread
*/
public void wakeup() {
this.client.wakeup();
}
/**
* A collection of sensors for the sender
*/
private class SenderMetrics {
private final Metrics metrics;
public final Sensor retrySensor;
public final Sensor errorSensor;
public final Sensor queueTimeSensor;
public final Sensor requestTimeSensor;
public final Sensor recordsPerRequestSensor;
public final Sensor batchSizeSensor;
public final Sensor compressionRateSensor;
public final Sensor maxRecordSizeSensor;
public final Sensor produceThrottleTimeSensor;
public SenderMetrics(Metrics metrics) {
this.metrics = metrics;
String metricGrpName = "producer-metrics";
this.batchSizeSensor = metrics.sensor("batch-size");
MetricName m = metrics.metricName("batch-size-avg", metricGrpName, "The average number of bytes sent per partition per-request.");
this.batchSizeSensor.add(m, new Avg());
m = metrics.metricName("batch-size-max", metricGrpName, "The max number of bytes sent per partition per-request.");
this.batchSizeSensor.add(m, new Max());
this.compressionRateSensor = metrics.sensor("compression-rate");
m = metrics.metricName("compression-rate-avg", metricGrpName, "The average compression rate of record batches.");
this.compressionRateSensor.add(m, new Avg());
this.queueTimeSensor = metrics.sensor("queue-time");
m = metrics.metricName("record-queue-time-avg", metricGrpName, "The average time in ms record batches spent in the record accumulator.");
this.queueTimeSensor.add(m, new Avg());
m = metrics.metricName("record-queue-time-max", metricGrpName, "The maximum time in ms record batches spent in the record accumulator.");
this.queueTimeSensor.add(m, new Max());
this.requestTimeSensor = metrics.sensor("request-time");
m = metrics.metricName("request-latency-avg", metricGrpName, "The average request latency in ms");
this.requestTimeSensor.add(m, new Avg());
m = metrics.metricName("request-latency-max", metricGrpName, "The maximum request latency in ms");
this.requestTimeSensor.add(m, new Max());
this.produceThrottleTimeSensor = metrics.sensor("produce-throttle-time");
m = metrics.metricName("produce-throttle-time-avg", metricGrpName, "The average throttle time in ms");
this.produceThrottleTimeSensor.add(m, new Avg());
m = metrics.metricName("produce-throttle-time-max", metricGrpName, "The maximum throttle time in ms");
this.produceThrottleTimeSensor.add(m, new Max());
this.recordsPerRequestSensor = metrics.sensor("records-per-request");
m = metrics.metricName("record-send-rate", metricGrpName, "The average number of records sent per second.");
this.recordsPerRequestSensor.add(m, new Rate());
m = metrics.metricName("records-per-request-avg", metricGrpName, "The average number of records per request.");
this.recordsPerRequestSensor.add(m, new Avg());
this.retrySensor = metrics.sensor("record-retries");
m = metrics.metricName("record-retry-rate", metricGrpName, "The average per-second number of retried record sends");
this.retrySensor.add(m, new Rate());
this.errorSensor = metrics.sensor("errors");
m = metrics.metricName("record-error-rate", metricGrpName, "The average per-second number of record sends that resulted in errors");
this.errorSensor.add(m, new Rate());
this.maxRecordSizeSensor = metrics.sensor("record-size-max");
m = metrics.metricName("record-size-max", metricGrpName, "The maximum record size");
this.maxRecordSizeSensor.add(m, new Max());
m = metrics.metricName("record-size-avg", metricGrpName, "The average record size");
this.maxRecordSizeSensor.add(m, new Avg());
m = metrics.metricName("requests-in-flight", metricGrpName, "The current number of in-flight requests awaiting a response.");
this.metrics.addMetric(m, new Measurable() {
public double measure(MetricConfig config, long now) {
return client.inFlightRequestCount();
}
});
m = metrics.metricName("metadata-age", metricGrpName, "The age in seconds of the current producer metadata being used.");
metrics.addMetric(m, new Measurable() {
public double measure(MetricConfig config, long now) {
return (now - metadata.lastSuccessfulUpdate()) / 1000.0;
}
});
}
private void maybeRegisterTopicMetrics(String topic) {
// if one sensor of the metrics has been registered for the topic,
// then all other sensors should have been registered; and vice versa
String topicRecordsCountName = "topic." + topic + ".records-per-batch";
Sensor topicRecordCount = this.metrics.getSensor(topicRecordsCountName);
if (topicRecordCount == null) {
Map<String, String> metricTags = Collections.singletonMap("topic", topic);
String metricGrpName = "producer-topic-metrics";
topicRecordCount = this.metrics.sensor(topicRecordsCountName);
MetricName m = this.metrics.metricName("record-send-rate", metricGrpName, metricTags);
topicRecordCount.add(m, new Rate());
String topicByteRateName = "topic." + topic + ".bytes";
Sensor topicByteRate = this.metrics.sensor(topicByteRateName);
m = this.metrics.metricName("byte-rate", metricGrpName, metricTags);
topicByteRate.add(m, new Rate());
String topicCompressionRateName = "topic." + topic + ".compression-rate";
Sensor topicCompressionRate = this.metrics.sensor(topicCompressionRateName);
m = this.metrics.metricName("compression-rate", metricGrpName, metricTags);
topicCompressionRate.add(m, new Avg());
String topicRetryName = "topic." + topic + ".record-retries";
Sensor topicRetrySensor = this.metrics.sensor(topicRetryName);
m = this.metrics.metricName("record-retry-rate", metricGrpName, metricTags);
topicRetrySensor.add(m, new Rate());
String topicErrorName = "topic." + topic + ".record-errors";
Sensor topicErrorSensor = this.metrics.sensor(topicErrorName);
m = this.metrics.metricName("record-error-rate", metricGrpName, metricTags);
topicErrorSensor.add(m, new Rate());
}
}
public void updateProduceRequestMetrics(Map<Integer, List<RecordBatch>> batches) {
long now = time.milliseconds();
for (List<RecordBatch> nodeBatch : batches.values()) {
int records = 0;
for (RecordBatch batch : nodeBatch) {
// register all per-topic metrics at once
String topic = batch.topicPartition.topic();
maybeRegisterTopicMetrics(topic);
// per-topic record send rate
String topicRecordsCountName = "topic." + topic + ".records-per-batch";
Sensor topicRecordCount = Utils.notNull(this.metrics.getSensor(topicRecordsCountName));
topicRecordCount.record(batch.recordCount);
// per-topic bytes send rate
String topicByteRateName = "topic." + topic + ".bytes";
Sensor topicByteRate = Utils.notNull(this.metrics.getSensor(topicByteRateName));
topicByteRate.record(batch.records.sizeInBytes());
// per-topic compression rate
String topicCompressionRateName = "topic." + topic + ".compression-rate";
Sensor topicCompressionRate = Utils.notNull(this.metrics.getSensor(topicCompressionRateName));
topicCompressionRate.record(batch.records.compressionRate());
// global metrics
this.batchSizeSensor.record(batch.records.sizeInBytes(), now);
this.queueTimeSensor.record(batch.drainedMs - batch.createdMs, now);
this.compressionRateSensor.record(batch.records.compressionRate());
this.maxRecordSizeSensor.record(batch.maxRecordSize, now);
records += batch.recordCount;
}
this.recordsPerRequestSensor.record(records, now);
}
}
public void recordRetries(String topic, int count) {
long now = time.milliseconds();
this.retrySensor.record(count, now);
String topicRetryName = "topic." + topic + ".record-retries";
Sensor topicRetrySensor = this.metrics.getSensor(topicRetryName);
if (topicRetrySensor != null)
topicRetrySensor.record(count, now);
}
public void recordErrors(String topic, int count) {
long now = time.milliseconds();
this.errorSensor.record(count, now);
String topicErrorName = "topic." + topic + ".record-errors";
Sensor topicErrorSensor = this.metrics.getSensor(topicErrorName);
if (topicErrorSensor != null)
topicErrorSensor.record(count, now);
}
public void recordLatency(String node, long latency) {
long now = time.milliseconds();
this.requestTimeSensor.record(latency, now);
if (!node.isEmpty()) {
String nodeTimeName = "node-" + node + ".latency";
Sensor nodeRequestTime = this.metrics.getSensor(nodeTimeName);
if (nodeRequestTime != null)
nodeRequestTime.record(latency, now);
}
}
public void recordThrottleTime(String node, long throttleTimeMs) {
this.produceThrottleTimeSensor.record(throttleTimeMs, time.milliseconds());
}
}
}
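To tie the Sender back to the producer as a whole: KafkaProducer creates one Sender and runs it on a single background daemon I/O thread (named kafka-producer-network-thread). The snippet below is a simplified, self-contained sketch of that wiring using a plain Thread and a runnable stand-in for the Sender; the real constructor uses an internal KafkaThread helper and passes the full argument list shown above.
// simplified sketch: the producer runs its Sender on one daemon I/O thread
public class SenderThreadSketch {
    public static void main(String[] args) throws InterruptedException {
        Runnable sender = () -> {                  // stand-in for the Sender's run() loop
            while (!Thread.currentThread().isInterrupted()) {
                // one iteration: ready() -> drain() -> createProduceRequests() -> poll()
                try {
                    Thread.sleep(10);              // the real loop blocks inside client.poll() instead
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        };
        Thread ioThread = new Thread(sender, "kafka-producer-network-thread");
        ioThread.setDaemon(true);                  // the I/O thread must not keep the JVM alive on its own
        ioThread.start();
        // ... application threads call producer.send(...) while this thread drains the accumulator ...
        Thread.sleep(100);
        ioThread.interrupt();                      // initiateClose()/forceClose() play this role in Kafka
    }
}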
12. Checking which batches are ready to send
/**
* Get a list of nodes whose partitions are ready to be sent, and the earliest time at which any non-sendable
* partition will be ready; Also return the flag for whether there are any unknown leaders for the accumulated
* partition batches.
* <p>
* A destination node is ready to send data if:
* <ol>
* <li>There is at least one partition that is not backing off its send
* <li><b>and</b> those partitions are not muted (to prevent reordering if
* {@value org.apache.kafka.clients.producer.ProducerConfig#MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION}
* is set to one)</li>
* <li><b>and <i>any</i></b> of the following are true</li>
* <ul>
* <li>The record set is full</li>
* <li>The record set has sat in the accumulator for at least lingerMs milliseconds</li>
* <li>The accumulator is out of memory and threads are blocking waiting for data (in this case all partitions
* are immediately considered ready).</li>
* <li>The accumulator has been closed</li>
* </ul>
* </ol>
*/
public ReadyCheckResult ready(Cluster cluster, long nowMs) {
Set<Node> readyNodes = new HashSet<>();
long nextReadyCheckDelayMs = Long.MAX_VALUE;
Set<String> unknownLeaderTopics = new HashSet<>();
/**
 * free.queued() returns this.waiters.size();
 * if there are waiters, threads are blocked waiting for memory to build batches,
 * so exhausted == true means the buffer pool has run out of memory
 */
boolean exhausted = this.free.queued() > 0;
//iterate over every partition that currently has a batch queue
for (Map.Entry<TopicPartition, Deque<RecordBatch>> entry : this.batches.entrySet()) {
//the partition
TopicPartition part = entry.getKey();
//the deque of batches for this partition
Deque<RecordBatch> deque = entry.getValue();
//look up which broker hosts the leader partition of this partition
Node leader = cluster.leaderFor(part);
synchronized (deque) {
//if the leader broker is unknown but there is data queued, remember the topic (unknownLeaderTopics.add)
if (leader == null && !deque.isEmpty()) {
// This is a partition for which leader is not known, but messages are available to send.
// Note that entries are currently not removed from batches when deque is empty.
unknownLeaderTopics.add(part.topic());
} else if (!readyNodes.contains(leader) && !muted.contains(part)) {
//peek at the first (oldest) batch in the queue
RecordBatch batch = deque.peekFirst();
//if there is a batch, decide whether it can be sent now
if (batch != null) {
/**
 * Is this batch backing off after a failed attempt?
 * (attempts > 0 and the retry backoff has not elapsed yet)
 */
boolean backingOff = batch.attempts > 0 && batch.lastAttemptMs + retryBackoffMs > nowMs;
/**
 * how long this batch has already been waiting
 */
long waitedTimeMs = nowMs - batch.lastAttemptMs;
/**
 * how long it may wait at most: retryBackoffMs while backing off,
 * otherwise lingerMs (full or not, a batch becomes sendable once lingerMs has passed)
 */
long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs;
/**
 * timeToWaitMs  maximum time to wait
 * waitedTimeMs  time already waited
 * timeLeftMs    time still left to wait
 */
long timeLeftMs = Math.max(timeToWaitMs - waitedTimeMs, 0);
/**
 * If the deque holds more than one batch, the first batch must already be full,
 * because a new batch is only created once the previous one fills up.
 * If there is only a single batch, check whether that batch itself is full.
 * 'full' therefore neatly covers both the full and the not-yet-full case.
 */
boolean full = deque.size() > 1 || batch.records.isFull();
/**
 * expired: the batch has already waited at least as long as it is allowed to,
 * so it is time to send it
 */
boolean expired = waitedTimeMs >= timeToWaitMs;
// a batch is sendable if it is full, has expired, the buffer pool is exhausted,
// the accumulator is closed (flush what is still in memory before shutting down),
// or a flush is in progress
boolean sendable = full || expired || exhausted || closed || flushInProgress();
if (sendable && !backingOff) {
// add the broker hosting this partition's leader to readyNodes
readyNodes.add(leader);
} else {
// Note that this results in a conservative estimate since an un-sendable partition may have
// a leader that will later be found to have sendable data. However, this is good enough
// since we'll just wake up and then sleep again for the remaining time.
nextReadyCheckDelayMs = Math.min(timeLeftMs, nextReadyCheckDelayMs);
}
}
}
}
}
//TODO wrap the ready nodes, the next check delay and the unknown-leader topics into the result
return new ReadyCheckResult(readyNodes, nextReadyCheckDelayMs, unknownLeaderTopics);
}
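The readiness predicate above is easy to lose in the surrounding code, so here is a standalone sketch of just the timing arithmetic for a single batch. All names are local to this example; lingerMs and retryBackoffMs play the same roles as the producer's linger.ms and retry.backoff.ms settings. Also note that when max.in.flight.requests.per.connection is set to 1, drained partitions are muted (mutePartition/unmutePartition above) so that retries cannot reorder batches.
// ReadinessSketch.java - a minimal, self-contained sketch of the per-batch readiness check
public class ReadinessSketch {
    static boolean isBatchSendable(long nowMs, long lastAttemptMs, int attempts,
                                   boolean batchFull, boolean poolExhausted, boolean closed,
                                   long lingerMs, long retryBackoffMs) {
        boolean backingOff = attempts > 0 && lastAttemptMs + retryBackoffMs > nowMs;
        long waitedTimeMs = nowMs - lastAttemptMs;                  // time already waited
        long timeToWaitMs = backingOff ? retryBackoffMs : lingerMs; // maximum time to wait
        boolean expired = waitedTimeMs >= timeToWaitMs;             // linger/backoff time is up
        boolean sendable = batchFull || expired || poolExhausted || closed;
        return sendable && !backingOff;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // a half-full batch that has already waited 120 ms with linger.ms = 100 -> sendable
        System.out.println(isBatchSendable(now, now - 120, 0, false, false, false, 100, 100));
        // a full batch that is still backing off after a failed attempt -> not sendable
        System.out.println(isBatchSendable(now, now - 10, 1, true, false, false, 100, 100));
    }
}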
13. Selecting brokers that are ready to receive requests
13.1 ready
/**
 * Begin connecting to the given node if necessary; return true if we are
 * already connected to it and ready to send.
 *
 * @param node The node to check
 * @param now The current timestamp
 * @return True if we are ready to send to the given node
 */
@Override
public boolean ready(Node node, long now) {
// an empty (placeholder) node is illegal
if (node.isEmpty())
throw new IllegalArgumentException("Cannot connect to empty node " + node);
// on the first pass no connection exists yet, so this returns false
if (isReady(node, now))
return true;
//check whether we are allowed to attempt a connection right now
if (connectionStates.canConnect(node.idString(), now))
// if we are interested in sending to a node and we don't have a connection to it, initiate one
//initiate the connection
//TODO this registers interest in the OP_CONNECT event
initiateConnect(node, now);
return false;
}
13.2 isReady
/**
 * Check whether the node with the given id is ready to accept more requests.
 *
 * @param node The node
 * @param now The current time in ms
 * @return true if the node is ready
 */
@Override
public boolean isReady(Node node, long now) {
// if we need to update our metadata now declare all requests unready to make metadata requests first
// priority
//we must not send produce requests while a metadata update is due
return !metadataUpdater.isUpdateDue(now)
//and the connection to this node must be able to accept another request
&& canSendRequest(node.idString());
}
13.3 canSendRequest
/**
 * Are we connected and able to send more requests on the given connection?
 *
 * @param node The node
 */
private boolean canSendRequest(String node) {
/**
 * connectionStates.isConnected(node):
 *   the producer caches one connection per broker;
 *   check whether the connection to this broker has already been established.
 *
 * selector.isChannelReady(node):
 *   Kafka's Selector wraps java NIO; it manages multiple KafkaChannels
 *   (each wrapping a SocketChannel), and one KafkaChannel is one connection.
 *
 * inFlightRequests.canSendMore(node):
 *   each connection tolerates at most 5 requests (by default) that have been
 *   sent but not yet acknowledged.
 */
return connectionStates.isConnected(node) &&
selector.isChannelReady(node) &&
inFlightRequests.canSendMore(node);
}
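The third condition, inFlightRequests.canSendMore(node), deserves a small illustration. The class below is not Kafka's InFlightRequests, just a minimal sketch of the idea: per broker, track requests that were sent but not yet acknowledged, and refuse new sends once a configurable limit (5 by default, i.e. max.in.flight.requests.per.connection) is reached. The real class additionally checks that the head request has finished being written; this sketch skips that detail.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// a toy per-node in-flight request limiter (sketch only, not Kafka's InFlightRequests)
public class InFlightLimiterSketch {
    private final int maxInFlightPerNode;
    private final Map<String, Deque<String>> inFlight = new HashMap<>();

    public InFlightLimiterSketch(int maxInFlightPerNode) {
        this.maxInFlightPerNode = maxInFlightPerNode; // 5 by default in the producer
    }

    public boolean canSendMore(String node) {
        Deque<String> queue = inFlight.get(node);
        return queue == null || queue.size() < maxInFlightPerNode;
    }

    public void add(String node, String requestId) {   // called when a request is sent
        inFlight.computeIfAbsent(node, n -> new ArrayDeque<>()).addFirst(requestId);
    }

    public void completeOldest(String node) {          // called when a response arrives
        Deque<String> queue = inFlight.get(node);
        if (queue != null && !queue.isEmpty())
            queue.pollLast();
    }
}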
14. Network architecture of the producer client
14.1 Key fields of Kafka's own Selector
/**
 * mapping from broker id to KafkaChannel (which wraps a SocketChannel);
 * for now you can think of a KafkaChannel as the SocketChannel itself,
 * i.e. one network connection per broker
 */
private final Map<String, KafkaChannel> channels;
/**
 * sends that have been completely written out
 */
private final List<Send> completedSends;
/**
 * responses that have been completely received and are ready to be processed
 */
private final List<NetworkReceive> completedReceives;
/**
 * responses that have been received but not yet handed over to completedReceives;
 * one deque of staged receives per connection
 */
private final Map<KafkaChannel, Deque<NetworkReceive>> stagedReceives;
private final Set<SelectionKey> immediatelyConnectedKeys;
/**
 * node ids whose connection has been disconnected or failed to establish
 */
private final List<String> disconnected;
14.2 Key fields of KafkaChannel
/**
 * one KafkaChannel per broker; this id is the broker id
 */
private final String id;
/**
 * we expect this to wrap the underlying SocketChannel (confirmed in 14.3 below)
 */
private final TransportLayer transportLayer;
private final Authenticator authenticator;
private final int maxReceiveSize;
/**
 * the response currently being received
 */
private NetworkReceive receive;
/**
 * the request (Send) currently being written out
 */
private Send send;
14.3 TransportLayer
/**
 * returns the underlying java NIO SocketChannel
 */
SocketChannel socketChannel();
14.4 How these pieces relate
The key point is that a single client maintains multiple connections, one per broker.
Node node = iter.next();
//TODO check whether the connection to this broker has been established
if (!this.client.ready(node, now)) {
//ready() returned false, so this node is not ready yet;
//remove it from the ready set so no data is sent to it this round
iter.remove();
notReadyTimeout = Math.min(notReadyTimeout, this.client.connectionDelay(node, now));
}
If the connection to a node has not been established, the node is removed from the ready set and none of its messages are sent in this iteration.
14.5 initiateConnect
/**
* Begin connecting to the given address and add the connection to this nioSelector associated with the given id
* number.
* <p>
* Note that this call only initiates the connection, which will be completed on a future {@link #poll(long)}
* call. Check {@link #connected()} to see which (if any) connections have completed after a given poll call.
* @param id The id for the new connection
* @param address The address to connect to
* @param sendBufferSize The send buffer for the new connection
* @param receiveBufferSize The receive buffer for the new connection
* @throws IllegalStateException if there is already a connection for that id
* @throws IOException if DNS resolution fails on the hostname or if the broker is down
*/
@Override
public void connect(String id, InetSocketAddress address, int sendBufferSize, int receiveBufferSize) throws IOException {
if (this.channels.containsKey(id))
throw new IllegalStateException("There is already a connection for id " + id);
//this is the same pattern as standard java NIO client programming
//open a SocketChannel
SocketChannel socketChannel = SocketChannel.open();
//switch it to non-blocking mode
socketChannel.configureBlocking(false);
Socket socket = socketChannel.socket();
socket.setKeepAlive(true);
//apply the configured socket buffer sizes, if any
if (sendBufferSize != Selectable.USE_DEFAULT_BUFFER_SIZE)
socket.setSendBufferSize(sendBufferSize);
if (receiveBufferSize != Selectable.USE_DEFAULT_BUFFER_SIZE)
socket.setReceiveBufferSize(receiveBufferSize);
/**
 * TCP_NODELAY defaults to false, which means Nagle's algorithm is enabled:
 * it collects small packets and combines them into a larger one before sending,
 * on the theory that many tiny packets on the wire contribute to congestion.
 * Kafka sets TCP_NODELAY to true (disabling Nagle) because some of its packets
 * are legitimately small and must not be held back waiting to be coalesced.
 */
socket.setTcpNoDelay(true);
boolean connected;
try {
/**
 * try to connect to the broker; because the channel is non-blocking,
 * connect() may succeed immediately and return true,
 * or the connection may only complete later, in which case it returns false
 */
connected = socketChannel.connect(address);
} catch (UnresolvedAddressException e) {
socketChannel.close();
throw new IOException("Can't resolve address: " + address, e);
} catch (IOException e) {
socketChannel.close();
throw e;
}
/**
 * register interest in OP_CONNECT with the selector
 */
SelectionKey key = socketChannel.register(nioSelector, SelectionKey.OP_CONNECT);
/**
 * wrap the SocketChannel in a KafkaChannel
 */
KafkaChannel channel = channelBuilder.buildChannel(id, key, maxReceiveSize);
/**
 * attach the KafkaChannel to the SelectionKey so that later we can
 * navigate from the key to the channel and back again
 */
key.attach(channel);
//cache the channel under the broker id
this.channels.put(id, channel);
if (connected) {
// OP_CONNECT won't trigger for immediately connected channels
log.debug("Immediately connected to node {}", channel.id());
immediatelyConnectedKeys.add(key);
key.interestOps(0);
}
}
14.6 poll
/**
* Do whatever I/O can be done on each connection without blocking. This includes completing connections, completing
* disconnections, initiating new sends, or making progress on in-progress sends or receives.
*
* When this call is completed the user can check for completed sends, receives, connections or disconnects using
* {@link #completedSends()}, {@link #completedReceives()}, {@link #connected()}, {@link #disconnected()}. These
* lists will be cleared at the beginning of each `poll` call and repopulated by the call if there is
* any completed I/O.
*
* In the "Plaintext" setting, we are using socketChannel to read & write to the network. But for the "SSL" setting,
* we encrypt the data before we use socketChannel to write data to the network, and decrypt before we return the responses.
* This requires additional buffers to be maintained as we are reading from network, since the data on the wire is encrypted
* we won't be able to read exact no.of bytes as kafka protocol requires. We read as many bytes as we can, up to SSLEngine's
* application buffer size. This means we might be reading additional bytes than the requested size.
* If there is no further data to read from socketChannel selector won't invoke that channel and we've have additional bytes
* in the buffer. To overcome this issue we added "stagedReceives" map which contains per-channel deque. When we are
* reading a channel we read as many responses as we can and store them into "stagedReceives" and pop one response during
* the poll to add the completedReceives. If there are any active channels in the "stagedReceives" we set "timeout" to 0
* and pop response and add to the completedReceives.
*
* @param timeout The amount of time to wait, in milliseconds, which must be non-negative
* @throws IllegalArgumentException If `timeout` is negative
* @throws IllegalStateException If a send is given for which we have no existing connection or for which there is
* already an in-progress send
*/
@Override
public void poll(long timeout) throws IOException {
if (timeout < 0)
throw new IllegalArgumentException("timeout should be >= 0");
clear();
if (hasStagedReceives() || !immediatelyConnectedKeys.isEmpty())
timeout = 0;
/* check ready keys */
long startSelect = time.nanoseconds();
//select on the keys we registered
int readyKeys = select(timeout);
long endSelect = time.nanoseconds();
this.sensors.selectTime.record(endSelect - startSelect, time.milliseconds());
//we did register a key earlier (OP_CONNECT), so there may be ready keys to handle
if (readyKeys > 0 || !immediatelyConnectedKeys.isEmpty()) {
//process every selected key
pollSelectionKeys(this.nioSelector.selectedKeys(), false, endSelect);
pollSelectionKeys(immediatelyConnectedKeys, true, endSelect);
}
//TODO move staged receives over to completedReceives
addToCompletedReceives();
long endIo = time.nanoseconds();
this.sensors.ioTime.record(endIo - endSelect, time.milliseconds());
// we use the time at the end of select to ensure that we don't close any connections that
// have just been processed in pollSelectionKeys
maybeCloseOldestConnection(endSelect);
}
14.7 pollSelectionKeys
//get all the selected keys
Iterator<SelectionKey> iterator = selectionKeys.iterator();
//iterate over them
while (iterator.hasNext()) {
SelectionKey key = iterator.next();
iterator.remove();
//look up the KafkaChannel attached to this key
KafkaChannel channel = channel(key);
// register all per-connection metrics at once
sensors.maybeRegisterConnectionMetrics(channel.id());
if (idleExpiryManager != null)
idleExpiryManager.update(channel.id(), currentTimeNanos);
/* complete any connections that have finished their handshake (either normally or immediately) */
/**
 * on the first pass we take this branch, because we registered OP_CONNECT earlier
 */
if (isImmediatelyConnected || key.isConnectable()) {
//TODO core code: finish establishing the connection
//and switch the interest set to OP_READ
if (channel.finishConnect()) {
this.connected.add(channel.id());
this.sensors.connectionCreated.record();
SocketChannel socketChannel = (SocketChannel) key.channel();
log.debug("Created socket with SO_RCVBUF = {}, SO_SNDBUF = {}, SO_TIMEOUT = {} to node {}",
socketChannel.socket().getReceiveBufferSize(),
socketChannel.socket().getSendBufferSize(),
socketChannel.socket().getSoTimeout(),
channel.id());
} else
continue;
}
14.8 finishConnect
@Override
public boolean finishConnect() throws IOException {
//complete the final step of establishing the connection
boolean connected = socketChannel.finishConnect();
//once the connection has been established
if (connected)
/**
 * remove the OP_CONNECT event from the interest set
 * and add OP_READ so responses can be read
 */
key.interestOps(key.interestOps() & ~SelectionKey.OP_CONNECT | SelectionKey.OP_READ);
return connected;
}
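Stripped of Kafka's wrapper classes, the connect lifecycle used here is plain java NIO: open a non-blocking channel, call connect(), register OP_CONNECT, and when the selector reports the key as connectable, call finishConnect() and switch interest to OP_READ. The self-contained sketch below illustrates exactly that sequence; the host and port are placeholders, not a claim about any running broker.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// minimal non-blocking connect lifecycle, mirroring initiateConnect/finishConnect above
public class NioConnectSketch {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);
        channel.socket().setTcpNoDelay(true);                       // disable Nagle, as Kafka does
        channel.connect(new InetSocketAddress("localhost", 9092));  // placeholder broker address
        channel.register(selector, SelectionKey.OP_CONNECT);

        while (true) {
            selector.select(1000);                                  // wait for the OP_CONNECT event
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isConnectable() && channel.finishConnect()) {
                    // connected: stop watching OP_CONNECT, start watching OP_READ
                    key.interestOps((key.interestOps() & ~SelectionKey.OP_CONNECT) | SelectionKey.OP_READ);
                    System.out.println("connected to " + channel.getRemoteAddress());
                    return;                                         // reads/writes would be handled from here on
                }
            }
        }
    }
}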
14.9 createProduceRequests
/**
* Transfer the record batches into a list of produce requests on a per-node basis
*/
private List<ClientRequest> createProduceRequests(Map<Integer, List<RecordBatch>> collated, long now) {
List<ClientRequest> requests = new ArrayList<ClientRequest>(collated.size());
for (Map.Entry<Integer, List<RecordBatch>> entry : collated.entrySet())
//TODO 封装请求
requests.add(produceRequest(now, entry.getKey(), acks, requestTimeout, entry.getValue()));
return requests;
}
14.10 client.send(request, now);
private void doSend(ClientRequest request, long now) {
request.setSendTimeMs(now);
//add the current request to inFlightRequests
/**
 * inFlightRequests stores requests that have been sent but not yet acknowledged;
 * by default at most 5 such requests are allowed per connection.
 * Once the response for a request is received, the request is removed again.
 */
this.inFlightRequests.add(request);
//TODO
selector.send(request.request());
}
/**
* Queue the given request for sending in the subsequent {@link #poll(long)} calls
* @param send The request to send
*/
public void send(Send send) {
//获取到一个KafkaChannel
KafkaChannel channel = channelOrFail(send.destination());
try {
//TODO
channel.setSend(send);
} catch (CancelledKeyException e) {
this.failedSends.add(send.destination());
close(channel);
}
}
public void setSend(Send send) {
if (this.send != null)
throw new IllegalStateException("Attempt to begin a send operation with prior send operation still in progress.");
this.send = send;
//attach the outgoing request (Send) to this KafkaChannel
/**
 * the key step: register interest in OP_WRITE;
 * once the channel becomes writable, the request is written out to the broker
 */
this.transportLayer.addInterestOps(SelectionKey.OP_WRITE);
}
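The complementary half of the lifecycle sits on the write side: a channel only subscribes to OP_WRITE while it has a pending Send, and unsubscribes once that send has been fully written (which is what channel.write() in the next snippet reports by returning the completed Send). The class below is a minimal sketch of that gating with invented names, not Kafka's KafkaChannel.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.SocketChannel;

// sketch of gating OP_WRITE on a pending send (names are illustrative, not Kafka's)
class WriteGateSketch {
    private final SocketChannel channel;
    private final SelectionKey key;
    private ByteBuffer pendingSend;              // at most one in-progress send per channel

    WriteGateSketch(SocketChannel channel, SelectionKey key) {
        this.channel = channel;
        this.key = key;
    }

    void setSend(ByteBuffer send) {
        if (pendingSend != null)
            throw new IllegalStateException("previous send still in progress");
        pendingSend = send;
        key.interestOps(key.interestOps() | SelectionKey.OP_WRITE);      // start caring about writability
    }

    // called when the selector reports the key as writable
    void onWritable() throws IOException {
        channel.write(pendingSend);
        if (!pendingSend.hasRemaining()) {                               // send fully written out
            pendingSend = null;
            key.interestOps(key.interestOps() & ~SelectionKey.OP_WRITE); // stop asking for OP_WRITE
        }
    }
}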
14.11 poll
/* if channel is ready write to any sockets that have space in their buffer and for which we have data */
if (channel.ready() && key.isWritable()) {
//TODO OP_WRITE
Send send = channel.write();
if (send != null) {
this.completedSends.add(send);
this.sensors.recordBytesSent(channel.id(), send.size());
}
}
14.12 Summary
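Putting section 14 together: the Sender delegates all network work to a NetworkClient; the NetworkClient owns Kafka's Selector; the Selector keeps one KafkaChannel per broker id; and each KafkaChannel wraps a TransportLayer around a plain java NIO SocketChannel, so a single selector thread drives many broker connections. The skeleton below is only a memory aid for these containment relationships, not real Kafka code.
// containment sketch of the producer's network stack (memory aid only, not real Kafka classes)
class SenderSkeleton         { NetworkClientSkeleton client; }     // decides what to send each iteration
class NetworkClientSkeleton  { SelectorSkeleton selector; }        // performs the actual network I/O
class SelectorSkeleton       { java.util.Map<String, KafkaChannelSkeleton> channels; } // one channel per broker id
class KafkaChannelSkeleton   { TransportLayerSkeleton transportLayer; }                // one connection
class TransportLayerSkeleton { java.nio.channels.SocketChannel socketChannel; }        // plain java NIO underneath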
15. How the producer handles TCP sticky and split packets when reading responses
// Need a method to read from ReadableByteChannel because BlockingChannel requires read with timeout
// See: http://stackoverflow.com/questions/2866557/timeout-for-socketchannel-doesnt-work
// This can go away after we get rid of BlockingChannel
@Deprecated
public long readFromReadableChannel(ReadableByteChannel channel) throws IOException {
int read = 0;
//size is a 4-byte buffer that holds the length prefix of the response
//if size still has remaining space, the length prefix is not complete yet
if (size.hasRemaining()) {
//read (up to) 4 bytes; this int encodes the size of the message body that follows
int bytesRead = channel.read(size);
if (bytesRead < 0)
throw new EOFException();
read += bytesRead;
/**
 * once size has no remaining space, the complete 4-byte int
 * (the length of the message body) has been read
 */
if (!size.hasRemaining()) {
size.rewind();
//e.g. the 4 bytes might decode to 10, meaning a 10-byte body follows
int receiveSize = size.getInt();
if (receiveSize < 0)
throw new InvalidReceiveException("Invalid receive (size = " + receiveSize + ")");
if (maxSize != UNLIMITED && receiveSize > maxSize)
throw new InvalidReceiveException("Invalid receive (size = " + receiveSize + " larger than " + maxSize + ")");
/**
 * allocate a buffer whose capacity is exactly the body length
 * that was just decoded from the 4-byte length prefix
 */
this.buffer = ByteBuffer.allocate(receiveSize);
}
}
if (buffer != null) {
//read the message body into the buffer (this may take several reads)
int bytesRead = channel.read(buffer);
if (bytesRead < 0)
throw new EOFException();
read += bytesRead;
}
return read;
}
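In other words, the protocol sidesteps sticky/split packets with a simple length-prefixed frame: every response starts with a 4-byte size, and the reader keeps reading until first the 4-byte prefix and then exactly that many payload bytes have arrived. The self-contained sketch below demonstrates the same framing idea with plain ByteBuffers; it illustrates the principle and is not Kafka's NetworkReceive class.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// length-prefixed framing: the reader first learns the size, then reads exactly that many bytes
public class FramingSketch {
    // encode one message as [4-byte length][payload]
    static ByteBuffer frame(String payload) {
        byte[] body = payload.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + body.length);
        buf.putInt(body.length).put(body).flip();
        return buf;
    }

    // decode messages from a stream of bytes, no matter how they were split or merged on the wire
    static void decode(ByteBuffer wire) {
        while (wire.remaining() >= 4) {
            wire.mark();
            int size = wire.getInt();
            if (wire.remaining() < size) {   // partial frame: wait for more bytes to arrive
                wire.reset();
                return;
            }
            byte[] body = new byte[size];
            wire.get(body);
            System.out.println("got message: " + new String(body, StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) {
        // two logical messages arriving "stuck together" in one buffer are still split correctly
        ByteBuffer a = frame("hello");
        ByteBuffer b = frame("kafka");
        ByteBuffer wire = ByteBuffer.allocate(a.remaining() + b.remaining());
        wire.put(a).put(b).flip();
        decode(wire);
    }
}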