1. Pitfalls when adding a new MiddleManager node
    Problem background: Our consumer ingests data from Kafka and then produces it into Druid via the embedded Tranquility library.
    Problem description: After adding MiddleManager nodes to the production Druid cluster, we noticed that some tasks finished with SUCCESS, yet their data was never saved to HDFS.
    Inspecting the peon logs turned up the following:
......(truncated)......
2017-09-04T00:56:03,285 INFO [task-runner-0-priority-0] io.druid.segment.realtime.plumber.RealtimePlumber - Starting merge and push.
2017-09-04T00:56:03,286 INFO [task-runner-0-priority-0] io.druid.segment.realtime.plumber.RealtimePlumber - Found [0] segments. Attempting to hand off segments that start before [1970-01-01T00:00:00.000Z].
2017-09-04T00:56:03,286 INFO [task-runner-0-priority-0] io.druid.segment.realtime.plumber.RealtimePlumber - Found [0] sinks to persist and merge
......(truncated)......
2017-09-04T02:09:03,305 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2017-09-04T02:09:03.305Z","service":"druid/middleManager","host":"AtsbDruidMid1:8101","version":"0.10.0","metric":"ingest/handoff/count","value":0,"dataSource":"mobileTopic-1m","taskId":["index_realtime_mobileTopic-1m_2017-09-04T01:00:00.000Z_1_0"]}]
2017-09-04T02:09:03,305 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2017-09-04T02:09:03.305Z","service":"druid/middleManager","host":"AtsbDruidMid1:8101","version":"0.10.0","metric":"ingest/sink/count","value":0,"dataSource":"mobileTopic-1m","taskId":["index_realtime_mobileTopic-1m_2017-09-04T01:00:00.000Z_1_0"]}]
2017-09-04T02:09:03,305 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2017-09-04T02:09:03.305Z","service":"druid/middleManager","host":"AtsbDruidMid1:8101","version":"0.10.0","metric":"ingest/events/messageGap","value":1504490943305,"dataSource":"mobileTopic-1m","taskId":["index_realtime_mobileTopic-1m_2017-09-04T01:00:00.000Z_1_0"]}]
2017-09-04T02:10:00,003 INFO [mobileTopic-1m-overseer-1] io.druid.segment.realtime.plumber.RealtimePlumber - Starting merge and push.
2017-09-04T02:10:00,004 INFO [mobileTopic-1m-overseer-1] io.druid.segment.realtime.plumber.RealtimePlumber - Found [0] segments. Attempting to hand off segments that start before [1970-01-01T00:00:00.000Z].
2017-09-04T02:10:00,004 INFO [mobileTopic-1m-overseer-1] io.druid.segment.realtime.plumber.RealtimePlumber - Found [0] sinks to persist and merge
2017-09-04T02:10:02,864 INFO [MonitorScheduler-0] com.metamx.emitter.core.LoggingEmitter - Event [{"feed":"metrics","timestamp":"2017-09-04T02:10:02.864Z","service":"druid/middleManager","host":"AtsbDruidMid1:8101","version":"0.10.0","metric":"segment/scan/pending","value":0}]
2017-09-04T02:15:00,000 INFO [timed-shutoff-firehose-0] io.druid.data.input.FirehoseFactory - Closing delegate firehose.
2017-09-04T02:15:00,000 INFO [timed-shutoff-firehose-0] io.druid.segment.realtime.firehose.EventReceiverFirehoseFactory - Firehose closing.
2017-09-04T02:15:00,001 INFO [timed-shutoff-firehose-0] io.druid.server.metrics.EventReceiverFirehoseRegister - Unregistering EventReceiverFirehoseMetric for service [druid:firehose:mobileTopic-1m-01-0001-0000]
2017-09-04T02:15:00,001 INFO [timed-shutoff-firehose-0] io.druid.segment.realtime.firehose.ServiceAnnouncingChatHandlerProvider - Unregistering chat handler[druid:firehose:mobileTopic-1m-01-0001-0000]
2017-09-04T02:15:00,002 INFO [timed-shutoff-firehose-0] io.druid.curator.discovery.CuratorServiceAnnouncer - Unannouncing service[DruidNode{serviceName='druid:firehose:mobileTopic-1m-01-0001-0000', host='AtsbDruidMid1', port=8101}]
2017-09-04T02:15:00,373 INFO [task-runner-0-priority-0] io.druid.indexing.common.task.RealtimeIndexTask - Persisting remaining data.
2017-09-04T02:15:00,374 INFO [task-runner-0-priority-0] io.druid.segment.realtime.plumber.RealtimePlumber - Submitting persist runnable for dataSource[mobileTopic-1m]
2017-09-04T02:15:00,376 INFO [task-runner-0-priority-0] io.druid.indexing.common.task.RealtimeIndexTask - Finishing the job.
2017-09-04T02:15:00,376 INFO [task-runner-0-priority-0] io.druid.segment.realtime.plumber.RealtimePlumber - Shutting down...
2017-09-04T02:15:00,377 INFO [task-runner-0-priority-0] io.druid.indexing.common.task.RealtimeIndexTask - Job done!
2017-09-04T02:15:00,377 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_realtime_mobileTopic-1m_2017-09-04T01:00:00.000Z_1_0] status changed to [SUCCESS].
2017-09-04T02:15:00,381 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_realtime_mobileTopic-1m_2017-09-04T01:00:00.000Z_1_0",
  "status" : "SUCCESS",
  "duration" : 4737152
}
2017-09-04T02:15:00,388 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.server.coordination.AbstractDataSegmentAnnouncer.stop()] on object[io.druid.server.coordination.BatchDataSegmentAnnouncer@615439f7].
2017-09-04T02:15:00,388 INFO [main] io.druid.server.coordination.AbstractDataSegmentAnnouncer - Stopping class io.druid.server.coordination.BatchDataSegmentAnnouncer with config[io.druid.server.initialization.ZkPathsConfig@22e2266d]
2017-09-04T02:15:00,388 INFO [main] io.druid.curator.announcement.Announcer - unannouncing [/druid/announcements/AtsbDruidMid1:8101]
2017-09-04T02:15:00,403 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.server.listener.announcer.ListenerResourceAnnouncer.stop()] on object[io.druid.query.lookup.LookupResourceListenerAnnouncer@45bbc52f].
2017-09-04T02:15:00,403 INFO [main] io.druid.curator.announcement.Announcer - unannouncing [/druid/listeners/lookups/__default/AtsbDruidMid1:8101]
2017-09-04T02:15:00,405 INFO [main] io.druid.server.listener.announcer.ListenerResourceAnnouncer - Unannouncing start time on [/druid/listeners/lookups/__default/AtsbDruidMid1:8101]
2017-09-04T02:15:00,406 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.query.lookup.LookupReferencesManager.stop()] on object[io.druid.query.lookup.LookupReferencesManager@4245bf68].
2017-09-04T02:15:00,406 INFO [main] io.druid.query.lookup.LookupReferencesManager - Stopping lookup factory references manager
2017-09-04T02:15:00,416 INFO [main] org.eclipse.jetty.server.AbstractConnector - Stopped ServerConnector@4c635edc{HTTP/1.1,[http/1.1]}{0.0.0.0:8101}
2017-09-04T02:15:00,417 INFO [main] org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@47fb7ec9{/,null,UNAVAILABLE}
2017-09-04T02:15:00,424 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.indexing.worker.executor.ExecutorLifecycle.stop() throws java.lang.Exception] on object[io.druid.indexing.worker.executor.ExecutorLifecycle@6010d0d1].
2017-09-04T02:15:00,425 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.indexing.overlord.ThreadPoolTaskRunner.stop()] on object[io.druid.indexing.overlord.ThreadPoolTaskRunner@283ab206].
2017-09-04T02:15:00,425 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@741ac284].
2017-09-04T02:15:00,429 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.announcement.Announcer.stop()] on object[io.druid.curator.announcement.Announcer@733e6df7].
2017-09-04T02:15:00,431 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.curator.discovery.ServerDiscoverySelector.stop() throws java.io.IOException] on object[io.druid.curator.discovery.ServerDiscoverySelector@2d7637e6].
2017-09-04T02:15:00,431 INFO [main] com.metamx.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.http.client.NettyHttpClient.stop()] on object[com.metamx.http.client.NettyHttpClient@69a024a0].
2017-09-04T02:15:00,466 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.storage.hdfs.HdfsStorageAuthentication.stop()] on object[io.druid.storage.hdfs.HdfsStorageAuthentication@2a4f8009].
2017-09-04T02:15:00,467 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.metrics.MonitorScheduler.stop()] on object[com.metamx.metrics.MonitorScheduler@5fdfe8cf].
2017-09-04T02:15:00,476 INFO [main] io.druid.curator.CuratorModule - Stopping Curator
2017-09-04T02:15:00,477 INFO [Curator-Framework-0] org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2017-09-04T02:15:00,484 INFO [main] org.apache.zookeeper.ZooKeeper - Session: 0x354bcd7b507b357 closed
2017-09-04T02:15:00,484 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void com.metamx.emitter.service.ServiceEmitter.close() throws java.io.IOException] on object[com.metamx.emitter.service.ServiceEmitter@697d6ded].
2017-09-04T02:15:00,484 INFO [main] com.metamx.emitter.core.LoggingEmitter - Close: started [false]
2017-09-04T02:15:00,484 INFO [main] io.druid.java.util.common.lifecycle.Lifecycle$AnnotationBasedHandler - Invoking stop method[public void io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner.stop()] on object[io.druid.initialization.Log4jShutterDownerModule$Log4jShutterDowner@65c5bf0e].
2017-09-04T02:15:00,484 INFO [main-EventThread] org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x354bcd7b507b357
2017-09-04 02:15:00,519 Thread-2 ERROR Unable to register shutdown hook because JVM is shutting down. java.lang.IllegalStateException: Not started
    at io.druid.common.config.Log4jShutdown.addShutdownCallback(Log4jShutdown.java:45)
    at org.apache.logging.log4j.core.impl.Log4jContextFactory.addShutdownCallback(Log4jContextFactory.java:273)
    at org.apache.logging.log4j.core.LoggerContext.setUpShutdownHook(LoggerContext.java:256)
    at org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:216)
    at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:145)
    at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:41)
    at org.apache.logging.log4j.LogManager.getContext(LogManager.java:182)
    at org.apache.logging.log4j.spi.AbstractLoggerAdapter.getContext(AbstractLoggerAdapter.java:103)
    at org.apache.logging.slf4j.Log4jLoggerFactory.getContext(Log4jLoggerFactory.java:43)
    at org.apache.logging.log4j.spi.AbstractLoggerAdapter.getLogger(AbstractLoggerAdapter.java:42)
    at org.apache.logging.slf4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:29)
    at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:253)
    at org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:155)
    at org.apache.commons.logging.impl.SLF4JLogFactory.getInstance(SLF4JLogFactory.java:132)
    at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:273)
    at org.apache.hadoop.hdfs.LeaseRenewer.<clinit>(LeaseRenewer.java:72)
    at org.apache.hadoop.hdfs.DFSClient.getLeaseRenewer(DFSClient.java:699)
    at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:859)
    at org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:853)
    at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:2407)
    at org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer.run(FileSystem.java:2424)
    at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

Solution:
First, determine whether all of the affected tasks ran on the same MiddleManager. Operations confirmed that every problematic task had indeed been assigned to the newly added MiddleManager.
Next, a careful re-read of the logs surfaced the key lines:

2017-09-04T00:56:03,285 INFO [task-runner-0-priority-0] io.druid.segment.realtime.plumber.RealtimePlumber - Starting merge and push.
2017-09-04T00:56:03,286 INFO [task-runner-0-priority-0] io.druid.segment.realtime.plumber.RealtimePlumber - Found [0] segments. Attempting to hand off segments that start before [1970-01-01T00:00:00.000Z].
2017-09-04T00:56:03,286 INFO [task-runner-0-priority-0] io.druid.segment.realtime.plumber.RealtimePlumber - Found [0] sinks to persist and merge
......(truncated)......
2017-09-04T02:10:00,003 INFO [mobileTopic-1m-overseer-1] io.druid.segment.realtime.plumber.RealtimePlumber - Starting merge and push.
2017-09-04T02:10:00,004 INFO [mobileTopic-1m-overseer-1] io.druid.segment.realtime.plumber.RealtimePlumber - Found [0] segments. Attempting to hand off segments that start before [1970-01-01T00:00:00.000Z].
2017-09-04T02:10:00,004 INFO [mobileTopic-1m-overseer-1] io.druid.segment.realtime.plumber.RealtimePlumber - Found [0] sinks to persist and merge

So what happened? The peon never received any data: "Found [0] sinks to persist and merge" means zero events arrived for the task interval, which is why the hand-off cutoff defaults to the epoch (1970-01-01).
Root cause found: the machine running the consumer had no hosts entry for the newly added MiddleManager. Tasks were therefore created and completed normally, but nothing was written to HDFS,
because the consumer did not know where to send the data...
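A cheap way to catch this class of problem before adding a node to the pool is to verify, from the consumer host, that every MiddleManager hostname actually resolves. A minimal sketch (the host list is hypothetical; adapt it to your cluster):

```python
import socket

def check_resolvable(hosts):
    """Map each hostname to True if it resolves via DNS or /etc/hosts, else False."""
    result = {}
    for host in hosts:
        try:
            socket.gethostbyname(host)
            result[host] = True
        except socket.gaierror:
            result[host] = False
    return result

# Hypothetical list: include every MiddleManager, old and new.
middle_managers = ["AtsbDruidMid1", "AtsbDruidMid2"]
for host, ok in check_resolvable(middle_managers).items():
    print(f"{host}: {'OK' if ok else 'UNRESOLVABLE -- add it to /etc/hosts or DNS'}")
```

Running this on the consumer machine right after a node is added would have flagged the missing hosts entry immediately, instead of letting tasks silently succeed with zero events.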

2. A common misconception
Our consumer uses Tranquility to send a task template to the overlord. The overlord, following its task-assignment strategy, hands the task to one of the MiddleManagers, and that MiddleManager
launches a Java process: the peon. A peon can be thought of as an HTTP server. The key point: the consumer then sends data directly to that HTTP server, no longer going through the overlord or the
MiddleManager. In effect, the embedded Tranquility opens a connection to each peon.
The misconception is believing that the consumer sends data to the overlord, which then forwards it to the peons. That is not how it works.
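To make the direct data path concrete: the peon registers its firehose as a chat handler on its own HTTP port, and Tranquility-style clients push events to that handler. A sketch of how such a client addresses a peon, using the service name and host/port seen in the logs above (treat the exact URL path as an assumption to verify against your Druid version):

```python
from urllib.parse import quote

def peon_push_url(host, port, service_name):
    """Build the URL a Tranquility-style client pushes events to on a peon.

    The peon exposes its EventReceiverFirehose as a chat handler on the peon's
    own HTTP port -- the overlord and MiddleManager are not on this data path.
    """
    return (f"http://{host}:{port}/druid/worker/v1/chat/"
            f"{quote(service_name, safe='')}/push-events")

# Service name and host/port taken from the peon log lines above.
url = peon_push_url("AtsbDruidMid1", 8101,
                    "druid:firehose:mobileTopic-1m-01-0001-0000")
print(url)
```

Since the client talks to `AtsbDruidMid1:8101` directly, a hostname it cannot resolve means the events are simply never delivered, while the task itself (created via the overlord) still runs and exits with SUCCESS.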
