It's been a while since my last post, so here's an update. This post covers the two-stream (dual-stream) join in Flink.

Before getting into the dual-stream join, first get familiar with how windows are defined (see the blog post recommended in the original), and then with how watermarks are defined (likewise a recommended post).

After reading the second post, I switched the data source to Kafka, but the window never fired and no window information was printed. I then found the answer on the Flink developer mailing list, and it makes the cause obvious: one of the source parallelisms doesn't have data.

The topic I was consuming had 10 partitions, and Flink reads those partitions in parallel. Records from every partition flow into the window, and the watermark that drives the window is the minimum watermark across all partitions. The problem was that during testing I only ever sent messages to one partition, so the overall watermark never advanced and the window never fired. Changing the Kafka topic to 1 partition solved the problem (see the official explanation).
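The min-watermark behavior can be illustrated without Flink at all. The sketch below (plain Java, with hypothetical class name and values) mimics how an operator combines per-partition watermarks: the operator's watermark is the minimum, so a partition that never receives data stays at Long.MIN_VALUE and holds everything back.

```java
import java.util.Arrays;

public class MinWatermarkDemo {
    // The operator watermark is the minimum over all input partitions' watermarks.
    static long operatorWatermark(long[] partitionWatermarks) {
        return Arrays.stream(partitionWatermarks).min().orElse(Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        // Only partition 0 received data; the other 9 stay at Long.MIN_VALUE,
        // so the combined watermark never advances and no window can close.
        long[] tenPartitions = new long[10];
        Arrays.fill(tenPartitions, Long.MIN_VALUE);
        tenPartitions[0] = 20_000L;
        System.out.println(operatorWatermark(tenPartitions) == Long.MIN_VALUE); // prints true

        // With a single partition, the watermark advances and windows can fire.
        System.out.println(operatorWatermark(new long[]{20_000L})); // prints 20000
    }
}
```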

Test code:

// Imports assume a Flink 1.x classpath matching FlinkKafkaConsumer011.
import java.text.SimpleDateFormat;
import java.util.Properties;

import javax.annotation.Nullable;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class WatermarkTest {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", Constants.KAFKA_BROKER);
        properties.setProperty("group.id", "crm_stream_window");
        properties.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        DataStream<String> stream =
                env.addSource(new FlinkKafkaConsumer011<>("test", new SimpleStringSchema(), properties));
        DataStream<Tuple2<String, Long>> inputMap = stream.map(new MapFunction<String, Tuple2<String, Long>>() {
            private static final long serialVersionUID = -8812094804806854937L;

            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                return new Tuple2<>(value.split("\\W+")[0], Long.valueOf(value.split("\\W+")[1]));
            }
        });
        DataStream<Tuple2<String, Long>> watermark =
                inputMap.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple2<String, Long>>() {

                    private static final long serialVersionUID = 8252616297345284790L;
                    Long currentMaxTimestamp = 0L;
                    Long maxOutOfOrderness = 10000L; // maximum allowed out-of-orderness: 10s
                    // Start from Long.MIN_VALUE so the first extractTimestamp() call
                    // cannot hit a null watermark in the log line below.
                    Watermark watermark = new Watermark(Long.MIN_VALUE);
                    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");

                    @Nullable
                    @Override
                    public Watermark getCurrentWatermark() {
                        watermark = new Watermark(currentMaxTimestamp - maxOutOfOrderness);
                        return watermark;
                    }

                    @Override
                    public long extractTimestamp(Tuple2<String, Long> element, long previousElementTimestamp) {
                        Long timestamp = element.f1;
                        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
                        System.out.println("timestamp : " + element.f1 + "|" + format.format(element.f1)
                                + " currentMaxTimestamp : " + currentMaxTimestamp + "|" + format.format(currentMaxTimestamp)
                                + ", watermark : " + watermark.getTimestamp() + "|" + format.format(watermark.getTimestamp()));
                        return timestamp;
                    }
                });
        watermark.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(3)))
                .apply(new WindowFunction<Tuple2<String, Long>, String, Tuple, TimeWindow>() {
                    private static final long serialVersionUID = 7813420265419629362L;

                    @Override
                    public void apply(Tuple tuple, TimeWindow window, Iterable<Tuple2<String, Long>> input, Collector<String> out) throws Exception {
                        System.out.println(tuple);
                        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
                        out.collect("window  " + format.format(window.getStart()) + "   window  " + format.format(window.getEnd()));
                    }
                }).print();
        env.execute("window test");
    }
}
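To sanity-check which window an element lands in and when it fires, it helps to write the arithmetic out. For a zero-offset tumbling window, the start is timestamp - (timestamp % size), and an event-time window [start, start + size) fires once the watermark reaches start + size - 1. The stand-alone sketch below (my own re-derivation with hypothetical values, not Flink code) mirrors the 3-second windows and 10s out-of-orderness used above:

```java
public class TumblingWindowMath {
    // Window start for a zero-offset tumbling window and non-negative timestamps.
    static long windowStart(long timestamp, long sizeMillis) {
        return timestamp - (timestamp % sizeMillis);
    }

    // An event-time window [start, start + size) fires when the watermark
    // reaches the window's max timestamp, i.e. start + size - 1.
    static boolean fires(long watermark, long windowStartMillis, long sizeMillis) {
        return watermark >= windowStartMillis + sizeMillis - 1;
    }

    public static void main(String[] args) {
        long size = 3_000L;                      // 3-second tumbling windows, as above
        long start = windowStart(4_500L, size);  // element at 4.5s -> window [3000, 6000)
        System.out.println(start);               // prints 3000
        // With maxOutOfOrderness = 10s, the watermark is currentMaxTimestamp - 10_000,
        // so an element at 16_000 yields watermark 6_000 and closes window [3000, 6000).
        System.out.println(fires(16_000L - 10_000L, start, size)); // prints true
    }
}
```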

With the single-stream problem out of the way, on to the dual-stream join. The event-time window of a dual-stream join works much like the single-stream case: assign event time to both streams, then specify the allowed out-of-orderness. The thing to watch out for is that the watermark driving the window is the slower of the two streams' watermarks; if stream A's watermark is at 10 and stream B's is at 20, the joined window uses 10 by default (see the recommended answer). The dual-stream join here is a left join: take all the records out of the window, then apply the matching logic and send the needed records downstream.
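Stripped of the Flink types, the left-join semantics of the coGroup below boil down to a loop over the left side of each window: every left record is emitted exactly once, enriched when a matching right record exists and passed through unchanged otherwise. A minimal sketch with hypothetical Long/String records:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LeftJoinSketch {
    // Emit every left record once; tag it with the right-side value when one matches.
    static List<String> leftJoin(List<Long> left, Map<Long, String> right) {
        List<String> out = new ArrayList<>();
        for (Long id : left) {
            String match = right.get(id); // null when the right stream has no record
            out.add(id + ":" + (match != null ? match : "unmatched"));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> joined = leftJoin(List.of(1L, 2L), Map.of(1L, "couponA"));
        System.out.println(joined); // prints [1:couponA, 2:unmatched]
    }
}
```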

Dual-stream code:

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tableEnv = StreamTableEnvironment.getTableEnvironment(env);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    Kafka011JsonTableSource orderSource = new OrderStreamKafkaReader("test").getTableSource("crm_stream");
    tableEnv.registerTableSource("orderDetail", orderSource);
    Table orderDetailTable = tableEnv.sqlQuery("SELECT * FROM orderDetail ");
    // getItemTable(): the JSON data read from Kafka, wrapped as a Table
    Table itemIdTable = new StreamJoinTest().getItemTable(orderDetailTable);
    DataStream<OrderDetailInfo4Stat> dataStream = tableEnv.toAppendStream(itemIdTable, OrderDetailInfo4Stat.class)
            .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<OrderDetailInfo4Stat>(Time.seconds(5)) {
                private static final long serialVersionUID = -4075075569327086395L;

                @Override
                public long extractTimestamp(OrderDetailInfo4Stat element) {
                    return element.getOrderLastModifyTime().getTime();
                }
            });
    dataStream.print();
    Kafka011JsonTableSource couponCodeSource = new CouponCodeStreamKafkaReader(Constants.COUPON_CODE_INFO).getTableSource("crm_stream");
    tableEnv.registerTableSource("CouponCodeTable", couponCodeSource);
    DataStream<CouponCodeInfo4Stat> couponCodeStream = tableEnv.toAppendStream(tableEnv.sqlQuery("select * from CouponCodeTable"), CouponCodeInfo4Stat.class).
            assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<CouponCodeInfo4Stat>(Time.seconds(5)) {
                private static final long serialVersionUID = 8867904924628932955L;

                @Override
                public long extractTimestamp(CouponCodeInfo4Stat element) {
                    return element.getLastModifyTime().getTime();
                }
            });
    couponCodeStream.print();
    DataStream<OrderDetailInfo4Stat> realDataStream = dataStream.coGroup(couponCodeStream)
            .where((KeySelector<OrderDetailInfo4Stat, Long>) OrderDetailInfo4Stat::getOrderBatchId)
            .equalTo((KeySelector<CouponCodeInfo4Stat, Long>) CouponCodeInfo4Stat::getOrderBatchId)
            .window(TumblingEventTimeWindows.of(Time.seconds(5)))
            .apply(new CoGroupFunction<OrderDetailInfo4Stat, CouponCodeInfo4Stat, OrderDetailInfo4Stat>() {
                private static final long serialVersionUID = -1240462675125711867L;

                @Override
                public void coGroup(Iterable<OrderDetailInfo4Stat> first, Iterable<CouponCodeInfo4Stat> second, Collector<OrderDetailInfo4Stat> out) throws Exception {
                    for (OrderDetailInfo4Stat orderDetailInfo4Stat : first) {
                        for (CouponCodeInfo4Stat couponCodeInfo4Stat : second) {
                            if (orderDetailInfo4Stat.getOrderBatchId().equals(couponCodeInfo4Stat.getOrderBatchId())) {
                                orderDetailInfo4Stat.setOrderSalesmanId(couponCodeInfo4Stat.getReceiveUserId());
                            }
                        }
                        out.collect(orderDetailInfo4Stat);
                    }
                }
            });
    realDataStream.print();
    env.execute("stream join test");
}

That's it for this post.

Keep at it, Pikachu.
