This article walks through the examples in Flink's flink-examples-streaming module, to help you get more familiar with the DataStream API and with the kinds of problem scenarios Flink streaming can address.
The first example, WordCount, counts words in real time: each incoming element updates the count and emits a new result.
public class WordCount {
// *************************************************************************
// PROGRAM
// *************************************************************************
public static void main(String[] args) throws Exception {
final ParameterTool params = ParameterTool.fromArgs(args);
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setGlobalJobParameters(params);
DataStream<String> text;
if (params.has("input")) {
// read the text file from given input path
text = env.readTextFile(params.get("input"));
} else {
// get default test text data
text = env.fromElements(new String[] {
"miao,She is a programmer",
"wu,He is a programmer",
"zhao,She is a programmer"
});
}
DataStream<Tuple2<String, Integer>> counts =
// split up the lines in pairs (2-tuples) containing: (word,1)
text.flatMap(new Tokenizer())
// group by the tuple field "0" and sum up tuple field "1"
.keyBy(0).sum(1);
// emit result
if (params.has("output")) {
counts.writeAsText(params.get("output"));
} else {
System.out.println("Printing result to stdout. Use --output to specify output path.");
counts.print();
}
// execute program
env.execute("Streaming WordCount");
}
// *************************************************************************
// USER FUNCTIONS
// *************************************************************************
public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 0) {
out.collect(new Tuple2<>(token, 1));
}
}
}
}
}
Output (the number before > is the index of the parallel subtask that printed the record):
8> (wu,1)
6> (a,1)
8> (is,1)
4> (programmer,1)
2> (he,1)
5> (miao,1)
4> (programmer,2)
6> (a,2)
3> (she,1)
8> (is,2)
3> (she,2)
8> (is,3)
6> (zhao,1)
4> (programmer,3)
6> (a,3)
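Note that keyBy(0) and sum(1) address tuple fields by position; newer Flink versions deprecate the index-based keyBy in favor of a KeySelector. A minimal sketch of the equivalent, assuming the same Tokenizer:

DataStream<Tuple2<String, Integer>> counts =
        text.flatMap(new Tokenizer())
                // key by the word itself instead of by tuple index
                .keyBy(value -> value.f0)
                .sum(1);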
The second example listens on a socket port and counts the words typed into it.
public class SocketWindowWordCount {
public static void main(String[] args) throws Exception {
// the host and the port to connect to
final String hostname;
final int port;
try {
final ParameterTool params = ParameterTool.fromArgs(args);
hostname = params.has("hostname") ? params.get("hostname") : "localhost";
port = params.getInt("port", 9999);
} catch (Exception e) {
System.err.println("Usage: SocketWindowWordCount --hostname <hostname> --port <port>");
return;
}
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// get input data by connecting to the socket
// the data source reads from the socket; elements are separated by the given delimiter
DataStream<String> text = env.socketTextStream(hostname, port, "\n");
// parse the data, group it, window it, and aggregate the counts
DataStream<WordWithCount> windowCounts = text
.flatMap(new FlatMapFunction<String, WordWithCount>() {
@Override
public void flatMap(String value, Collector<WordWithCount> out) {
for (String word : value.split("\\s")) {
out.collect(new WordWithCount(word, 1L));
}
}
})
.keyBy("word")
.timeWindow(Time.seconds(10))
.reduce(new ReduceFunction<WordWithCount>() {
// count the words
// reduce emits a single result value and creates a new value for every element it processes; common aggregations such as average, sum, min, max, and count can all be implemented with reduce
@Override
public WordWithCount reduce(WordWithCount a, WordWithCount b) {
return new WordWithCount(a.word, a.count + b.count);
}
});
// print the results with a single thread, rather than in parallel
windowCounts.print().setParallelism(1);
env.execute("Socket Window WordCount");
}
// ------------------------------------------------------------------------
/**
* Data type for words with count.
*/
public static class WordWithCount {
public String word;
public long count;
public WordWithCount() {
}
public WordWithCount(String word, long count) {
this.word = word;
this.count = count;
}
@Override
public String toString() {
return word + " : " + count;
}
}
}
Start a listener on a local port:
nc -l 9999
Type the following lines into the socket:
miao she is a programmer
wu he is a programmer
zhao she is a programmer
Output:
she : 2
programmer : 3
he : 1
a : 3
zhao : 1
miao : 1
wu : 1
is : 3
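The comment in the reduce function above notes that average, sum, min, max, and count can all be built on reduce. Since reduce must return the same type it consumes, an average needs the running sum and count carried along with the key; a minimal sketch (the input stream of (word, value) pairs is a hypothetical example, not part of this program):

// hypothetical input: per-key long values, e.g. ("john", 4L)
DataStream<Tuple3<String, Long, Long>> sumsAndCounts = input
        .map(new MapFunction<Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
            @Override
            public Tuple3<String, Long, Long> map(Tuple2<String, Long> value) {
                // attach a count of 1 to every element
                return new Tuple3<>(value.f0, value.f1, 1L);
            }
        })
        .keyBy(0)
        .reduce(new ReduceFunction<Tuple3<String, Long, Long>>() {
            @Override
            public Tuple3<String, Long, Long> reduce(Tuple3<String, Long, Long> a, Tuple3<String, Long, Long> b) {
                // keep a running sum (f1) and a running count (f2)
                return new Tuple3<>(a.f0, a.f1 + b.f1, a.f2 + b.f2);
            }
        });
// average = running sum / running count, re-emitted on every element
DataStream<Tuple2<String, Double>> averages = sumsAndCounts
        .map(new MapFunction<Tuple3<String, Long, Long>, Tuple2<String, Double>>() {
            @Override
            public Tuple2<String, Double> map(Tuple3<String, Long, Long> value) {
                return new Tuple2<>(value.f0, value.f1 / (double) value.f2);
            }
        });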
The AsyncIOExample below mainly shows how an AsyncFunction is applied to a DataStream. I did not run it against test data.
public class AsyncIOExample {
private static final Logger LOG = LoggerFactory.getLogger(AsyncIOExample.class);
private static final String EXACTLY_ONCE_MODE = "exactly_once";
private static final String EVENT_TIME = "EventTime";
private static final String INGESTION_TIME = "IngestionTime";
private static final String ORDERED = "ordered";
public static void main(String[] args) throws Exception {
// obtain execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// parse parameters
final ParameterTool params = ParameterTool.fromArgs(args);
// path for the state backend
final String statePath;
// checkpoint mode
final String cpMode;
// maximum value produced by the source
final int maxCount;
// sleep factor for the worker threads in the RichAsyncFunction
final long sleepFactor;
// probability of a simulated failure in the RichAsyncFunction
final float failRatio;
// whether the RichAsyncFunction emits its results ordered or unordered
final String mode;
// parallelism of the async wait operator
final int taskNum;
// which Flink time characteristic to use
final String timeType;
// milliseconds to wait for a graceful shutdown of the thread pool in the RichAsyncFunction
final long shutdownWaitTS;
// timeout for the asynchronous operations in the RichAsyncFunction
final long timeout;
try {
// check the configuration for the job
statePath = params.get("fsStatePath", null);
cpMode = params.get("checkpointMode", "exactly_once");
maxCount = params.getInt("maxCount", 100000);
sleepFactor = params.getLong("sleepFactor", 100);
failRatio = params.getFloat("failRatio", 0.001f);
mode = params.get("waitMode", "ordered");
taskNum = params.getInt("waitOperatorParallelism", 1);
timeType = params.get("eventType", "EventTime");
shutdownWaitTS = params.getLong("shutdownWaitTS", 20000);
timeout = params.getLong("timeout", 10000L);
} catch (Exception e) {
printUsage();
throw e;
}
StringBuilder configStringBuilder = new StringBuilder();
final String lineSeparator = System.getProperty("line.separator");
configStringBuilder
.append("Job configuration").append(lineSeparator)
.append("FS state path=").append(statePath).append(lineSeparator)
.append("Checkpoint mode=").append(cpMode).append(lineSeparator)
.append("Max count of input from source=").append(maxCount).append(lineSeparator)
.append("Sleep factor=").append(sleepFactor).append(lineSeparator)
.append("Fail ratio=").append(failRatio).append(lineSeparator)
.append("Waiting mode=").append(mode).append(lineSeparator)
.append("Parallelism for async wait operator=").append(taskNum).append(lineSeparator)
.append("Event type=").append(timeType).append(lineSeparator)
.append("Shutdown wait timestamp=").append(shutdownWaitTS);
LOG.info(configStringBuilder.toString());
if (statePath != null) {
// setup state and checkpoint mode
env.setStateBackend(new FsStateBackend(statePath));
}
if (EXACTLY_ONCE_MODE.equals(cpMode)) {
// checkpoints are triggered at a 1s interval here
env.enableCheckpointing(1000L, CheckpointingMode.EXACTLY_ONCE);
} else {
env.enableCheckpointing(1000L, CheckpointingMode.AT_LEAST_ONCE);
}
// enable watermark or not
if (EVENT_TIME.equals(timeType)) {
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
} else if (INGESTION_TIME.equals(timeType)) {
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
}
// create the input stream of a single integer
DataStream<Integer> inputStream = env.addSource(new SimpleSource(maxCount));
// create the async function, which will *wait* for a while to simulate the process of async i/o
AsyncFunction<Integer, String> function =
new SampleAsyncFunction(sleepFactor, failRatio, shutdownWaitTS);
// add async operator to streaming job
DataStream<String> result;
if (ORDERED.equals(mode)) {
result = AsyncDataStream.orderedWait(
inputStream,
function,
timeout,
TimeUnit.MILLISECONDS,
20).setParallelism(taskNum);
} else {
result = AsyncDataStream.unorderedWait(
inputStream,
function,
timeout,
TimeUnit.MILLISECONDS,
20).setParallelism(taskNum);
}
// add a sum to count the occurrences of each key
// the key is the source value modulo 10, so this counts how many values fall into each remainder class
result.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
private static final long serialVersionUID = -938116068682344455L;
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
out.collect(new Tuple2<>(value, 1));
}
}).keyBy(0).sum(1).print();
// execute the program
env.execute("Async IO Example");
}
/**
* A checkpointed source.
* It keeps emitting an integer counting up from 0.
*/
private static class SimpleSource implements SourceFunction<Integer>, ListCheckpointed<Integer> {
private static final long serialVersionUID = 1L;
private volatile boolean isRunning = true;
/**
 * Upper bound on the values to emit; -1 means unbounded (set from maxNum in the constructor).
 */
private int counter = 0;
/**
 * The current value to emit; restored from checkpointed state after a failure.
 */
private int start = 0;
/**
 * Snapshots the state: on each successful checkpoint, the current value of start is stored.
 */
@Override
public List<Integer> snapshotState(long checkpointId, long timestamp) throws Exception {
return Collections.singletonList(start);
}
/**
 * When the job fails, Flink calls this method to restore the state to the last successful checkpoint.
 */
@Override
public void restoreState(List<Integer> state) throws Exception {
// restore the value of start from the latest successful checkpoint
for (Integer i : state) {
this.start = i;
}
}
public SimpleSource(int maxNum) {
this.counter = maxNum;
}
@Override
public void run(SourceContext<Integer> ctx) throws Exception {
while ((start < counter || counter == -1) && isRunning) {
synchronized (ctx.getCheckpointLock()) {
ctx.collect(start);
++start;
// loop back to 0
if (start == Integer.MAX_VALUE) {
start = 0;
}
}
Thread.sleep(10L);
}
}
@Override
public void cancel() {
isRunning = false;
}
}
/**
* An example async function that uses a thread pool to simulate multiple asynchronous operations.
* It processes the streaming job's elements asynchronously.
*/
private static class SampleAsyncFunction extends RichAsyncFunction<Integer, String> {
private static final long serialVersionUID = 2098635244857937717L;
private transient ExecutorService executorService;
/**
* The result of multiplying sleepFactor with a random float is used to pause
* the working thread in the thread pool, simulating a time consuming async operation.
* In other words, it pretends the async operation is expensive, taking up to sleepFactor ms.
*/
private final long sleepFactor;
/**
* The ratio to generate an exception to simulate an async error. For example, the error
* may be a TimeoutException while visiting HBase.
* In other words, it pretends the stream's async operation failed, surfacing as "Exception: wahahahaha...".
*/
private final float failRatio;
private final long shutdownWaitTS;
SampleAsyncFunction(long sleepFactor, float failRatio, long shutdownWaitTS) {
this.sleepFactor = sleepFactor;
this.failRatio = failRatio;
this.shutdownWaitTS = shutdownWaitTS;
}
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
executorService = Executors.newFixedThreadPool(30);
}
@Override
public void close() throws Exception {
super.close();
ExecutorUtils.gracefulShutdown(shutdownWaitTS, TimeUnit.MILLISECONDS, executorService);
}
/**
* The method that performs the actual asynchronous I/O.
* Here a thread pool simulates the asynchronous calls against an external system.
*/
@Override
public void asyncInvoke(final Integer input, final ResultFuture<String> resultFuture) {
executorService.submit(() -> {
// wait for a while to simulate the async operation here
// i.e. sleep for as long as this element's interaction with the external system would take
long sleep = (long) (ThreadLocalRandom.current().nextFloat() * sleepFactor);
try {
Thread.sleep(sleep);
if (ThreadLocalRandom.current().nextFloat() < failRatio) {
// simulate a failure while interacting with the external system; check the logs here
// to observe how Flink recovers from the checkpoint
// restart-strategy.fixed-delay.attempts defaults to 3, so the job is restarted 3 times
resultFuture.completeExceptionally(new Exception("wahahahaha..."));
} else {
// derive the key from input modulo 10
resultFuture.complete(
Collections.singletonList("key-" + (input % 10)));
}
} catch (InterruptedException e) {
resultFuture.complete(new ArrayList<>(0));
}
});
}
}
}
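The example drives the asynchronous call through its own ExecutorService. In practice the client of the external system often exposes an asynchronous API itself; a minimal sketch of asyncInvoke built on a CompletableFuture, where queryExternalSystem is a hypothetical blocking client call:

@Override
public void asyncInvoke(final Integer input, final ResultFuture<String> resultFuture) {
    // issue the request off the operator thread; supplyAsync uses the common pool here
    CompletableFuture.supplyAsync(() -> queryExternalSystem(input))
            .whenComplete((result, error) -> {
                if (error != null) {
                    // hand the failure to Flink, which applies the configured restart strategy
                    resultFuture.completeExceptionally(error);
                } else {
                    resultFuture.complete(Collections.singletonList(result));
                }
            });
}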
The IterateExample takes pairs of int values as input and iteratively computes how many steps the Fibonacci recurrence needs before its value exceeds 100.
The Fibonacci recurrence:
F(n) = F(n-1) + F(n-2)
Given the initial pair (34, 11), the generated Fibonacci sequence is: 34, 11, 45, 56, 101.
Running the code on this pair yields ((34,11),3): it takes 3 steps, i.e. 3 additions, before the sequence value 101 exceeds 100 and the tuple is emitted to the output stream.
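As a trace, here is how the example's Tuple5 state (initial pair, current pair, counter) evolves through Step() for the input (34, 11):

(34, 11, 34, 11, 0)   // after InputMap
(34, 11, 11, 45, 1)   // 34 + 11 = 45 < 100, fed back via "iterate"
(34, 11, 45, 56, 2)   // 11 + 45 = 56 < 100, fed back via "iterate"
(34, 11, 56, 101, 3)  // 45 + 56 = 101 > 100, routed to "output" -> ((34,11),3)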
public class IterateExample {
private static final int BOUND = 100;
// *************************************************************************
// PROGRAM
// *************************************************************************
public static void main(String[] args) throws Exception {
final ParameterTool params = ParameterTool.fromArgs(args);
// obtain execution environment and set setBufferTimeout to 1 to enable
// continuous flushing of the output buffers (lowest latency)
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment()
.setBufferTimeout(1);
// make parameters available in the web interface
env.getConfig().setGlobalJobParameters(params);
// create input stream of integer pairs
DataStream<Tuple2<Integer, Integer>> inputStream;
if (params.has("input")) {
inputStream = env.readTextFile(params.get("input")).map(new FibonacciInputMap());
} else {
inputStream = env.addSource(new RandomFibonacciSource());
}
// create an iterative data stream from the input with 5 second timeout
// InputMap() converts each Tuple2 into a Tuple5
// 5000 is the maximum time to wait for a feedback element; if none arrives within that interval, the iteration terminates
IterativeStream<Tuple5<Integer, Integer, Integer, Integer, Integer>> it = inputStream.map(new InputMap())
.iterate(5000);
// apply the step function to get the next Fibonacci number
// increment the counter and split the output with the output selector
// the Step() map function computes the next Fibonacci number
// and the output selector splits the stream
SplitStream<Tuple5<Integer, Integer, Integer, Integer, Integer>> step = it.map(new Step())
.split(new MySelector());
// close the iteration by selecting the tuples that were directed to the
// 'iterate' channel in the output selector
it.closeWith(step.select("iterate"));
// to produce the final output select the tuples directed to the
// 'output' channel then get the input pairs that have the greatest iteration counter
// on a 1 second sliding window
DataStream<Tuple2<Tuple2<Integer, Integer>, Integer>> numbers = step.select("output")
.map(new OutputMap());
// emit results
if (params.has("output")) {
numbers.writeAsText(params.get("output"));
} else {
System.out.println("Printing result to stdout. Use --output to specify output path.");
numbers.print();
}
long startTime = System.currentTimeMillis();
System.out.println("Start time: " + startTime);
// execute the program
env.execute("Streaming Iteration Example");
System.out.println("Spent time: " + (System.currentTimeMillis() - startTime));
}
// *************************************************************************
// USER FUNCTIONS
// *************************************************************************
/**
* Generate BOUND number of random integer pairs from the range from 1 to BOUND/2.
* A source that generates random Integer pairs.
*/
private static class RandomFibonacciSource implements SourceFunction<Tuple2<Integer, Integer>> {
private static final long serialVersionUID = 1L;
private Random rnd = new Random();
private volatile boolean isRunning = true;
private int counter = 0;
@Override
public void run(SourceContext<Tuple2<Integer, Integer>> ctx) throws Exception {
while (isRunning && counter < BOUND) {
int first = rnd.nextInt(BOUND / 2 - 1) + 1;
int second = rnd.nextInt(BOUND / 2 - 1) + 1;
ctx.collect(new Tuple2<>(first, second));
counter++;
Thread.sleep(50L);
}
}
@Override
public void cancel() {
isRunning = false;
}
}
/**
* Parses input lines of the form (x,y) into integer pairs.
*/
private static class FibonacciInputMap implements MapFunction<String, Tuple2<Integer, Integer>> {
private static final long serialVersionUID = 1L;
@Override
public Tuple2<Integer, Integer> map(String value) throws Exception {
String record = value.substring(1, value.length() - 1);
String[] splitted = record.split(",");
return new Tuple2<>(Integer.parseInt(splitted[0]), Integer.parseInt(splitted[1]));
}
}
/**
* Map the inputs so that the next Fibonacci numbers can be calculated while preserving the original input tuple.
* A counter is attached to the tuple and incremented in every iteration step.
*/
public static class InputMap implements MapFunction<Tuple2<Integer, Integer>, Tuple5<Integer, Integer, Integer,
Integer, Integer>> {
private static final long serialVersionUID = 1L;
@Override
public Tuple5<Integer, Integer, Integer, Integer, Integer> map(Tuple2<Integer, Integer> value) throws
Exception {
// the map preserves the input pair and appends the working pair plus a counter: Tuple2 -> Tuple5
return new Tuple5<>(value.f0, value.f1, value.f0, value.f1, 0);
}
}
/**
* Iteration step function that calculates the next Fibonacci number.
*/
public static class Step implements
MapFunction<Tuple5<Integer, Integer, Integer, Integer, Integer>, Tuple5<Integer, Integer, Integer,
Integer, Integer>> {
private static final long serialVersionUID = 1L;
@Override
public Tuple5<Integer, Integer, Integer, Integer, Integer> map(Tuple5<Integer, Integer, Integer, Integer,
Integer> value) throws Exception {
return new Tuple5<>(value.f0, value.f1, value.f3, value.f2 + value.f3, ++value.f4);
}
}
/**
* OutputSelector testing which tuple needs to be iterated again.
*/
public static class MySelector implements OutputSelector<Tuple5<Integer, Integer, Integer, Integer, Integer>> {
private static final long serialVersionUID = 1L;
@Override
public Iterable<String> select(Tuple5<Integer, Integer, Integer, Integer, Integer> value) {
List<String> output = new ArrayList<>();
if (value.f2 < BOUND && value.f3 < BOUND) {
// route this part of the stream back to the iteration head as feedback
// an element arriving back at the iteration head marks the completion of one full pass of the iteration logic
output.add("iterate");
} else {
// route the other part of the stream downstream
output.add("output");
}
return output;
}
}
/**
* Giving back the input pair and the counter.
*/
public static class OutputMap implements MapFunction<Tuple5<Integer, Integer, Integer, Integer, Integer>,
Tuple2<Tuple2<Integer, Integer>, Integer>> {
private static final long serialVersionUID = 1L;
@Override
public Tuple2<Tuple2<Integer, Integer>, Integer> map(Tuple5<Integer, Integer, Integer, Integer, Integer>
value) throws
Exception {
return new Tuple2<>(new Tuple2<>(value.f0, value.f1), value.f4);
}
}
}
Partial output:
6> ((43,24),3)
7> ((23,25),3)
8> ((4,46),3)
1> ((42,49),2)
2> ((29,22),3)
3> ((36,3),4)
4> ((30,36),2)
5> ((18,24),3)
6> ((38,9),3)
7> ((19,44),2)
8> ((42,40),2)
1> ((16,9),5)
2> ((2,7),6)
The WindowJoin example shows how to join two streams with the DataStream API.
public class WindowJoin {
// *************************************************************************
// PROGRAM
// *************************************************************************
public static void main(String[] args) throws Exception {
// parse the parameters
final ParameterTool params = ParameterTool.fromArgs(args);
final long windowSize = params.getLong("windowSize", 2000);
final long rate = params.getLong("rate", 3L);
System.out.println("Using windowSize=" + windowSize + ", data rate=" + rate);
System.out.println("To customize example, use: WindowJoin [--windowSize ] [--rate ]" );
// obtain execution environment, run this example in "ingestion time"
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
// make parameters available in the web interface
env.getConfig().setGlobalJobParameters(params);
// create the data sources for both grades and salaries
// john,4
DataStream<Tuple2<String, Integer>> grades = GradeSource.getSource(env, rate);
// john,18000
DataStream<Tuple2<String, Integer>> salaries = SalarySource.getSource(env, rate);
// run the actual window join program
// for testability, this functionality is in a separate method.
DataStream<Tuple3<String, Integer, Integer>> joinedStream = runWindowJoin(grades, salaries, windowSize);
// print the results with a single thread, rather than in parallel
// prints e.g. (john,4,18000)
joinedStream.print().setParallelism(1);
// execute program
env.execute("Windowed Join Example");
}
public static DataStream<Tuple3<String, Integer, Integer>> runWindowJoin(
DataStream<Tuple2<String, Integer>> grades,
DataStream<Tuple2<String, Integer>> salaries,
long windowSize) {
return grades.join(salaries)
.where(new NameKeySelector())
.equalTo(new NameKeySelector())
.window(TumblingEventTimeWindows.of(Time.milliseconds(windowSize)))
.apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple3<String, Integer, Integer>>() {
@Override
public Tuple3<String, Integer, Integer> join(
Tuple2<String, Integer> first,
Tuple2<String, Integer> second) {
return new Tuple3<String, Integer, Integer>(first.f0, first.f1, second.f1);
}
});
}
private static class NameKeySelector implements KeySelector<Tuple2<String, Integer>, String> {
@Override
public String getKey(Tuple2<String, Integer> value) {
return value.f0;
}
}
}
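GradeSource and SalarySource live in the example's utils package and are not shown above. A minimal sketch of what GradeSource could look like; the names and the grade range here are assumptions, not the example's actual data:

private static class GradeSource implements SourceFunction<Tuple2<String, Integer>> {
    private static final String[] NAMES = {"tom", "jerry", "alice", "bob", "john", "grace"};
    private final long rate; // elements per second
    private final Random rnd = new Random();
    private volatile boolean running = true;

    GradeSource(long rate) {
        this.rate = rate;
    }

    @Override
    public void run(SourceContext<Tuple2<String, Integer>> ctx) throws Exception {
        while (running) {
            // emit e.g. ("john", 4)
            ctx.collect(new Tuple2<>(NAMES[rnd.nextInt(NAMES.length)], rnd.nextInt(5) + 1));
            Thread.sleep(1000 / rate);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}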
When you want to split a data stream, you would normally have to duplicate it; with side outputs you can filter out the unwanted data directly. In this example, words with length > 5 are diverted to a side output.
public class SideOutputExample {
/**
* The OutputTag identifies a side output stream.
*/
private static final OutputTag<String> rejectedWordsTag = new OutputTag<String>("rejected") {};
public static void main(String[] args) throws Exception {
final ParameterTool params = ParameterTool.fromArgs(args);
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
// make parameters available in the web interface
env.getConfig().setGlobalJobParameters(params);
// get input data
DataStream<String> text;
if (params.has("input")) {
// read the text file from given input path
text = env.readTextFile(params.get("input"));
} else {
// get default test text data
text = env.fromElements(WordCountData.WORDS);
}
SingleOutputStreamOperator<Tuple2<String, Integer>> tokenized = text
.keyBy(new KeySelector<String, Integer>() {
private static final long serialVersionUID = 1L;
@Override
public Integer getKey(String value) throws Exception {
return 0;
}
})
.process(new Tokenizer());
// the side output stream
DataStream<String> rejectedWords = tokenized
// obtain the side output from the operator via its OutputTag
.getSideOutput(rejectedWordsTag)
.map(new MapFunction<String, String>() {
private static final long serialVersionUID = 1L;
@Override
public String map(String value) throws Exception {
return "rejected: " + value;
}
});
// the main output stream: word counts over 5-second tumbling windows on the regular data
DataStream<Tuple2<String, Integer>> counts = tokenized
.keyBy(0)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
// group by the tuple field "0" and sum up tuple field "1"
.sum(1);
// emit result
if (params.has("output")) {
counts.writeAsText(params.get("output"));
rejectedWords.writeAsText(params.get("rejected-words-output"));
} else {
System.out.println("Printing result to stdout. Use --output to specify output path.");
counts.print();
rejectedWords.print();
}
// execute program
env.execute("Streaming WordCount SideOutput");
}
// *************************************************************************
// USER FUNCTIONS
// *************************************************************************
public static final class Tokenizer extends ProcessFunction<String, Tuple2<String, Integer>> {
private static final long serialVersionUID = 1L;
@Override
public void processElement(
String value,
Context ctx,
Collector<Tuple2<String, Integer>> out) throws Exception {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 5) {
// words longer than 5 characters go to the side output; ctx.output sends a record to the side output identified by the tag
ctx.output(rejectedWordsTag, token);
} else if (token.length() > 0) {
// emit to the regular output
out.collect(new Tuple2<>(token, 1));
}
}
}
}
}
Output:
6> rejected: question
6> rejected: whether
6> rejected: nobler
6> rejected: suffer
......
1> (them,1)
4> (not,2)
2> (would,2)
1> (weary,1)
......
The SessionWindowing example demonstrates event-time session windows, and analyzes when watermarks are generated and when window computation fires. Data that the window will drop, such as ("a", 1L, 2), could be captured via a side output.
public class SessionWindowing {
@SuppressWarnings("serial")
public static void main(String[] args) throws Exception {
final ParameterTool params = ParameterTool.fromArgs(args);
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setGlobalJobParameters(params);
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
final boolean fileOutput = params.has("output");
final List<Tuple3<String, Long, Integer>> input = new ArrayList<>();
// key, event-time timestamp, occurrence count of the key
input.add(new Tuple3<>("a", 1L, 1));
input.add(new Tuple3<>("b", 1L, 1));
input.add(new Tuple3<>("b", 3L, 1));
input.add(new Tuple3<>("b", 5L, 1));
input.add(new Tuple3<>("c", 6L, 1));
// this element is about to be dropped by the window (it arrives behind the watermark)
input.add(new Tuple3<>("a", 1L, 2));
// We expect to detect the session "a" earlier than this point (the old
// functionality can only detect here when the next starts)
input.add(new Tuple3<>("a", 10L, 1));
// We expect to detect session "b" and "c" at this point as well
input.add(new Tuple3<>("c", 11L, 1));
DataStream<Tuple3<String, Long, Integer>> source = env
.addSource(new SourceFunction<Tuple3<String, Long, Integer>>() {
private static final long serialVersionUID = 1L;
@Override
public void run(SourceContext<Tuple3<String, Long, Integer>> ctx) throws Exception {
for (Tuple3<String, Long, Integer> value : input) {
ctx.collectWithTimestamp(value, value.f1);
// emit a watermark
ctx.emitWatermark(new Watermark(value.f1 - 1));
}
// once the input is exhausted, emit one final large watermark to make sure the last windows fire
// for an unbounded stream this is the terminating watermark: a watermark past the window's end time is needed to trigger the window computation
ctx.emitWatermark(new Watermark(Long.MAX_VALUE));
}
@Override
public void cancel() {
}
});
// We create sessions for each id with max timeout of 3 time units
DataStream<Tuple3<String, Long, Integer>> aggregated = source
.keyBy(0)
.window(EventTimeSessionWindows.withGap(Time.milliseconds(3L)))
.sum(2);
if (fileOutput) {
aggregated.writeAsText(params.get("output"));
} else {
System.out.println("Printing result to stdout. Use --output to specify output path.");
aggregated.print();
}
env.execute();
}
}
Output:
(a,1,1)
(b,1,3)
(c,6,1)
(a,10,1)
(c,11,1)
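As noted above, the late element ("a", 1L, 2) is silently dropped by the session window. A minimal sketch, using sideOutputLateData and a hypothetical lateTag, of capturing it instead:

final OutputTag<Tuple3<String, Long, Integer>> lateTag =
        new OutputTag<Tuple3<String, Long, Integer>>("late-data") {};

SingleOutputStreamOperator<Tuple3<String, Long, Integer>> aggregated = source
        .keyBy(0)
        .window(EventTimeSessionWindows.withGap(Time.milliseconds(3L)))
        // elements arriving behind the watermark go to the side output instead of being dropped
        .sideOutputLateData(lateTag)
        .sum(2);

// ("a",1,2) would show up here
aggregated.getSideOutput(lateTag).print();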
To see exactly when these windows fire, add print statements for the advancing watermark and the fired timers in InternalTimerServiceImpl#advanceWatermark():
public class InternalTimerServiceImpl<K, N> {
public void advanceWatermark(long time) throws Exception {
// print the advancing watermark
System.out.println(String.format("Advanced watermark %s", time));
currentWatermark = time;
InternalTimer<K, N> timer;
while ((timer = eventTimeTimersQueue.peek()) != null && timer.getTimestamp() <= time) {
// print the timer being fired
System.out.println(timer);
eventTimeTimersQueue.poll();
keyContext.setCurrentKey(timer.getKey());
triggerTarget.onEventTime(timer);
}
}
}
The watermark generation and window triggering can be broken down as follows:
Advanced watermark 0
Advanced watermark 2
Advanced watermark 4
# watermark 4 has reached the end time 4 of window a, so window a fires and emits
Timer{timestamp=3, key=(a), namespace=TimeWindow{start=1, end=4}}
(a,1,1)
Advanced watermark 5
Advanced watermark 9
# (b,1,1), (b,3,1), (b,5,1): the gaps between them (3-1=2, 5-3=2) are within the session gap of 3, so they merge into one window ending at 5+3=8; watermark 9 has passed it, so window b fires and emits
Timer{timestamp=7, key=(b), namespace=TimeWindow{start=1, end=8}}
(b,1,3)
# watermark 9 has reached the end time 9 of window c, so window c fires and emits
Timer{timestamp=8, key=(c), namespace=TimeWindow{start=6, end=9}}
(c,6,1)
Advanced watermark 10
Advanced watermark 9223372036854775807
# watermark 9223372036854775807 > end time 13 of window a, so window a fires and emits
Timer{timestamp=12, key=(a), namespace=TimeWindow{start=10, end=13}}
(a,10,1)
# watermark 9223372036854775807 > end time 14 of window c, so window c fires and emits
Timer{timestamp=13, key=(c), namespace=TimeWindow{start=11, end=14}}
(c,11,1)
From this we can also see when Flink triggers a window:
1. the watermark passes window_end_time (this holds for out-of-order as well as in-order data)
2. there is data in the interval [window_start_time, window_end_time)
(Figure: an example of the session window merging process.)
The following two examples are my own; they demonstrate a sliding count window and a tumbling count window, respectively.
public class SlideCountWindowExample {
public static void main(String[] args) throws Exception {
final ParameterTool params = ParameterTool.fromArgs(args);
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setGlobalJobParameters(params);
env.setParallelism(1);
final int windowSize = params.getInt("window", 3);
final int slideSize = params.getInt("slide", 2);
Tuple2[] elements = new Tuple2[]{
Tuple2.of("a", "1"),
Tuple2.of("a", "2"),
Tuple2.of("a", "3"),
Tuple2.of("a", "4"),
Tuple2.of("a", "5"),
Tuple2.of("a", "6"),
Tuple2.of("b", "7"),
Tuple2.of("b", "8"),
Tuple2.of("b", "9"),
Tuple2.of("b", "0")
};
// read source data
DataStreamSource<Tuple2<String, String>> inStream = env.fromElements(elements);
// calculate
DataStream<Tuple2<String, String>> outStream = inStream
.keyBy(0)
// sliding count window with a size of 3 elements, triggering every 2 elements
.countWindow(windowSize, slideSize)
.reduce(
new ReduceFunction<Tuple2<String, String>>() {
@Override
public Tuple2<String, String> reduce(Tuple2<String, String> value1, Tuple2<String, String> value2) throws Exception {
return Tuple2.of(value1.f0, value1.f1 + "" + value2.f1);
}
}
);
outStream.print();
env.execute("WindowWordCount");
}
}
Output:
(a,12)
(a,234)
(a,456)
(b,78)
(b,890)
In countWindow(3, 2), 3 is the window size and 2 is the slide. Every 2 incoming elements, the latest 3 elements are aggregated and a result is emitted. This matches the output: when the first 2 elements of key a arrive, the trigger fires while only (1, 2) are in the window, giving (a,12); two elements later the latest 3 are (2, 3, 4), giving (a,234); then (4, 5, 6) gives (a,456).
Looking at the internal source code:
public class KeyedStream<T, KEY> extends DataStream<T> {
public WindowedStream<T, KEY, GlobalWindow> countWindow(long size, long slide) {
return window(GlobalWindows.create())
// the evictor keeps only the latest `size` elements when the window function fires
.evictor(CountEvictor.of(size))
// the trigger fires every `slide` elements
.trigger(CountTrigger.of(slide));
}
}
public class TumbleCountWindowExample {
public static void main(String[] args) throws Exception {
final ParameterTool params = ParameterTool.fromArgs(args);
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setGlobalJobParameters(params);
env.setParallelism(1);
final int windowSize = params.getInt("window", 3);
Tuple2[] elements = new Tuple2[]{
Tuple2.of("a", "1"),
Tuple2.of("a", "2"),
Tuple2.of("a", "3"),
Tuple2.of("a", "4"),
Tuple2.of("a", "5"),
Tuple2.of("a", "6"),
Tuple2.of("b", "7"),
Tuple2.of("b", "8"),
Tuple2.of("b", "9"),
Tuple2.of("b", "0")
};
// read source data
DataStreamSource<Tuple2<String, String>> inStream = env.fromElements(elements);
// calculate
DataStream<Tuple2<String, String>> outStream = inStream
.keyBy(0)
// tumbling count window with a size of 3 elements
.countWindow(windowSize)
.reduce(
new ReduceFunction<Tuple2<String, String>>() {
@Override
public Tuple2<String, String> reduce(Tuple2<String, String> value1, Tuple2<String, String> value2) throws Exception {
return Tuple2.of(value1.f0, value1.f1 + "" + value2.f1);
}
}
);
outStream.print();
env.execute("WindowWordCount");
}
}
Output:
(a,123)
(a,456)
(b,789)
The last element, Tuple2.of("b", "0"), is dropped: on its own it can no longer trigger a window computation.
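For comparison, the tumbling variant countWindow(size) has no evictor and uses a purging trigger; in KeyedStream it looks roughly like this:

public WindowedStream<T, KEY, GlobalWindow> countWindow(long size) {
    return window(GlobalWindows.create())
            // fire once `size` elements have arrived, then purge the window contents
            .trigger(PurgingTrigger.of(CountTrigger.of(size)));
}

Since the trailing (b, 0) only brings its window back to one element, CountTrigger never fires again, and the element is simply discarded when the bounded job finishes.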