Hands-On: Integrating ClickHouse with Offline Hive and Real-Time Kafka Data
ClickHouse Integration with Hive
I. Official documentation:
1. Official docs for ClickHouse's Hive (via the HDFS engine) and Kafka integrations:
https://clickhouse.tech/docs/en/engines/table-engines/integrations/hdfs/
https://clickhouse.tech/docs/en/engines/table-engines/integrations/kafka/
2. Data types supported by ClickHouse:
https://clickhouse.tech/docs/en/interfaces/formats/#data_types-matching
II. Creating a ClickHouse table over the Hive data
1. Using ClickHouse's HDFS engine
Table creation statement:
-- {active_namenode} is a placeholder for the address of the active NameNode (see section III.4).
-- The glob in the path matches every statis_date / statis_hour partition (see section III.3).
CREATE TABLE company_employees.employees (
    name  String,
    age   String,
    email String,
    addr  String
) ENGINE = HDFS('hdfs://{active_namenode}:8020/ods/statis_date=202{0..9}{0..9}{0..9}{0..9}{0..9}/statis_hour={0,1,2}{0..9}/*', 'ORC')
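Once the table is created, queries read the matched ORC files from HDFS at query time; no data is copied into ClickHouse. A quick sanity check might look like this (hypothetical queries):
-- Count rows across all matched partition files.
SELECT count() FROM company_employees.employees;
-- Spot-check a few rows.
SELECT name, age, email
FROM company_employees.employees
LIMIT 10;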
Example from the official docs:
Suppose we have several files in TSV format with the following URIs on HDFS:
'hdfs://hdfs1:9000/some_dir/some_file_1'
'hdfs://hdfs1:9000/some_dir/some_file_2'
'hdfs://hdfs1:9000/some_dir/some_file_3'
'hdfs://hdfs1:9000/another_dir/some_file_1'
'hdfs://hdfs1:9000/another_dir/some_file_2'
'hdfs://hdfs1:9000/another_dir/some_file_3'
There are several ways to make a table consisting of all six files:
CREATE TABLE table_with_range (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/some_file_{1..3}', 'TSV')
Another way:
CREATE TABLE table_with_question_mark (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/some_file_?', 'TSV')
Table consists of all the files in both directories (all files should satisfy format and schema described in query):
CREATE TABLE table_with_asterisk (name String, value UInt32) ENGINE = HDFS('hdfs://hdfs1:9000/{some,another}_dir/*', 'TSV')
III. Points to note when creating the Hive-backed table (detailed notes)
1. The schema must match the Hive table
- If the Hive table is partitioned, the partition columns must not be written into the CREATE TABLE statement (see the hypothetical Hive DDL after this list)
- All other columns in the CREATE TABLE statement must match the Hive table's columns exactly
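To make the mapping concrete, the Hive table behind the ClickHouse DDL above might have been created roughly like this (a hypothetical reconstruction; only the partition column names statis_date and statis_hour are known from the HDFS path):
-- Hypothetical Hive DDL. statis_date and statis_hour are partition
-- columns, which is why they are omitted from the ClickHouse CREATE TABLE.
CREATE TABLE company_employees.employees (
    name  STRING,
    age   STRING,
    email STRING,
    addr  STRING
)
PARTITIONED BY (statis_date STRING, statis_hour STRING)
STORED AS ORC;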
2. Format requirement
- The format named in the CREATE TABLE statement must match the Hive table's storage format (ORC here)
3. Path resolution
Most Hive tables in practice are partitioned, so the path in the ClickHouse DDL has to cover every partition directory.
My Hive table is partitioned by statis_date (an 8-digit date) and statis_hour (a 2-digit hour); the original screenshot of the partition listing is omitted here. The corresponding path pattern is:
/ods/statis_date=202{0..9}{0..9}{0..9}{0..9}{0..9}/statis_hour={0,1,2}{0..9}/*
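The {N..M} ranges and {a,b} alternations expand exactly as in the official example above, so the pattern matches paths such as (illustrative dates, hours, and file names):
/ods/statis_date=20210615/statis_hour=08/part-00000
/ods/statis_date=20211231/statis_hour=23/part-00017
Note that statis_hour={0,1,2}{0..9} also matches 24 through 29; those directories simply never exist, so the over-match is harmless.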
4. HDFS path
Write the path as hdfs://{active_namenode}:8020. If you use the HA cluster nameservice or point at a standby NameNode, you will get no data back. There may be a way to make the nameservice work in cluster mode, e.g. by making core-site.xml available to ClickHouse, but I have not tried it. (To see which NameNode is currently active, you can run hdfs haadmin -getServiceState <serviceId>.)
IV. On incremental data sync: my pipeline keeps appending incremental data to this Hive table, and in my tests the ClickHouse table picked up the new Hive data correctly. This makes sense because the HDFS engine stores nothing locally; every SELECT reads whatever files currently match the glob, so newly written partitions become visible automatically.
V. Creating the ClickHouse tables for the Kafka data
1. Creating the source (Kafka engine) table
CREATE TABLE service_flow.service_flow_increment_resource (
    mac        String,
    rxTraffic  Float64,  -- ClickHouse's double-precision type is Float64
    updateTime String
) ENGINE = Kafka SETTINGS
    kafka_broker_list = 'node1:9092',
    kafka_topic_list = 'service-flow-increment',
    kafka_group_name = 'kafka_flow_test',
    kafka_format = 'JSONEachRow',
    kafka_num_consumers = 1;
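With kafka_format = 'JSONEachRow', each Kafka message carries one JSON object per row whose keys match the column names. A message on the service-flow-increment topic would look something like this (illustrative values):
{"mac": "00:1A:2B:3C:4D:5E", "rxTraffic": 1024.5, "updateTime": "2021-06-15 10:00:00"}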
2. Creating the target (MergeTree) table
CREATE TABLE service_flow.service_flow_increment (
    mac        String,
    rxTraffic  Float64,
    updateTime String
) ENGINE = MergeTree()
ORDER BY updateTime
SETTINGS index_granularity = 8192;
3. Creating the materialized view
The view's SELECT can apply a whole series of transformations, so the source data is reshaped to fit your needs before being written into the actual table.
CREATE MATERIALIZED VIEW service_flow.service_flow_increment_consumer
TO service_flow.service_flow_increment
AS SELECT
    mac,
    rxTraffic,
    updateTime
FROM service_flow.service_flow_increment_resource;
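As an example of the transformations mentioned above, the SELECT could filter or convert fields on the way through. A hypothetical variant that drops zero-traffic rows before they reach the target table:
-- Hypothetical variant of the consumer view: the WHERE clause filters
-- the Kafka stream before rows are written to the target table.
CREATE MATERIALIZED VIEW service_flow.service_flow_increment_consumer
TO service_flow.service_flow_increment
AS SELECT
    mac,
    rxTraffic,
    updateTime
FROM service_flow.service_flow_increment_resource
WHERE rxTraffic > 0;
Once the view exists, rows consumed from Kafka land in service_flow.service_flow_increment and can be queried there like any MergeTree table.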