For stateful data processing in Spark Streaming, the framework offers two APIs, updateStateByKey and mapWithState. Both maintain a per-key state value in memory that can be compared against and aggregated over time. Their differences and commonalities are roughly as follows:
updateStateByKey: maintains and updates per-key state in memory (internally the state RDD is stored with persist(MEMORY_ONLY_SER), i.e. serialized in-memory storage). Under the hood it co-groups the new batch with the entire existing state, so all data, including every previously seen key, passes through the user-defined mapFunc on each batch. Performance is therefore relatively low and keeps degrading as the maintained state grows; checkpoint snapshots of that state may also consume considerable storage (to be verified).
mapWithState: maintains and updates per-key state in memory (internally the state is stored with persist(StorageLevel.MEMORY_ONLY)). Through partitioning and related strategies, only the data in the current batch is run through the user-defined mapFunc, i.e. the computation is incremental; the official claim is roughly a 10x performance improvement over updateStateByKey.
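For reference, the relevant entry points in the Java API look roughly like this (signatures abridged from the Spark 2.x JavaPairDStream API; the generic parameter names here are mine):
// updateStateByKey: updateFunc is invoked for every key that has state, every batch
JavaPairDStream<K, S> updateStateByKey(Function2<List<V>, Optional<S>, Optional<S>> updateFunc)
// mapWithState: the mapping function configured via StateSpec is invoked only for keys
// that appear in the current batch
JavaMapWithStateDStream<K, V, S, M> mapWithState(StateSpec<K, V, S, M> spec)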
Because the state is stored entirely in memory, Spark's checkpoint feature is used to back up the Spark computation context, so that after a server crash, a service upgrade, or a restart the maintained state can be recovered and the computation can carry on.
Note on recovering from a checkpoint:
Drawback: the job logic must not change. Any change to the Spark job context (verified to include even changing the checkpoint interval in seconds) causes the recovery to fail and an exception to be thrown.
Workaround: delete the files under checkPointDir and give up the previously maintained state.
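A minimal sketch of that workaround, in case you want to clear the checkpoint directory programmatically before starting a job whose logic has changed (this helper is my own illustration, not part of the original post; it uses the Hadoop FileSystem API so the same code works for a local directory or HDFS):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class CheckpointCleaner
{
    /**
     * Recursively delete an existing checkpoint directory so that a job whose
     * logic or batch interval has changed can start fresh instead of failing
     * while restoring the old context.
     */
    public static void clearCheckpoint(String checkpointDir) throws Exception
    {
        Path path = new Path(checkpointDir);
        FileSystem fs = path.getFileSystem(new Configuration());
        if (fs.exists(path))
        {
            fs.delete(path, true); // true = recursive delete
        }
    }
}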
1. Use Spark's local run mode for local verification.
2. Use the netcat network tool to simulate a socket server (nc -lk 9999).
3. Install the Hadoop dependency for Windows: download the winutils Windows build from https://github.com/srccodes/hadoop-common-2.2.0-bin, unpack it, add <unpacked dir>\bin to the PATH environment variable, then restart the machine.
4. Create a new maven project with the dependencies below. (Note: if you explicitly pull in jackson jars, use a 2.6.x version; a newer Jackson release can conflict with the version spark-core itself depends on. See the snippet after the dependency list.)
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.1.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.0</version>
    <scope>provided</scope>
</dependency>
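As a concrete illustration of the Jackson note in step 4 (this exact pin is my suggestion, not part of the original post), explicitly pinning jackson-databind to the 2.6.x line keeps a newer, incompatible version from being pulled in transitively; 2.6.5 matches what Spark 2.1.0 itself depends on:
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.6.5</version>
</dependency>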
Taking the common word count scenario (counting how many times each word occurs) as an example, the same computation is implemented below first with updateStateByKey and then with mapWithState.
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function0;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;
public class TestNewSparkTaskUpdateByKey
{
private static final String SOCKET_SERVER_IP = "xxx";
private static final int SOCKET_SERVER_PORT = 9999;
private static final String CHECK_POINT_DIR = "D:\\spark\\checkpoint\\updatestatebykey";
private static final int CHECK_POINT_DURATION_SECONDS = 30;
private static JavaStreamingContext getJavaStreamingContext()
{
SparkConf conf = new SparkConf().setAppName("SparkUpdateStateByKey").setMaster("local[2]");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(CHECK_POINT_DURATION_SECONDS));
jssc.checkpoint(CHECK_POINT_DIR);
JavaReceiverInputDStream<String> messages = jssc.socketTextStream(SOCKET_SERVER_IP, SOCKET_SERVER_PORT);
JavaDStream<String> words = messages.flatMap(new FlatMapFunction<String, String>() {
/**
* serialVersionUID
*/
private static final long serialVersionUID = -8511938709723688992L;
@Override
public Iterator<String> call(String t) throws Exception
{
return Arrays.asList(t.split(" ")).iterator();
}
});
JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
/**
* serialVersionUID
*/
private static final long serialVersionUID = 7494315448364736838L;
public Tuple2<String, Integer> call(String word) throws Exception
{
return new Tuple2<>(word, 1);
}
});
// Count words globally across all batches, not just within a single batch
JavaPairDStream<String, Integer> wordcounts = pairs.updateStateByKey(
new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>()
{
/**
* serialVersionUID
*/
private static final long serialVersionUID = -7837221857493546768L;
// valueList: the new values for this key in the current batch; there may be several,
// e.g. (hadoop,1)(hadoop,1) arrives here as (1,1)
// oldState: the previous state of this key
public Optional<Integer> call(List<Integer> valueList,
Optional<Integer> oldState)
throws Exception
{
Integer newState = 0;
// If oldState is present, this key has been counted before; otherwise the key appears for the first time
if (oldState.isPresent())
{
newState = oldState.get();
}
// Update the state
for (Integer value : valueList)
{
newState += value;
}
return Optional.of(newState);
}
});
wordcounts.print();
return jssc;
}
private static void testSpark()
{
JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(CHECK_POINT_DIR, new Function0<JavaStreamingContext>() {
/**
* serialVersionUID
*/
private static final long serialVersionUID = -6070032440759098908L;
@Override
public JavaStreamingContext call() throws Exception
{
return getJavaStreamingContext();
}
});
jssc.start();
try
{
jssc.awaitTermination();
}
catch (InterruptedException e)
{
e.printStackTrace();
}
jssc.close();
}
public static void main(String[] args)
{
testSpark();
}
}
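To try it out, start nc -lk 9999 on the host configured as SOCKET_SERVER_IP, type a few words, and watch the printed counts: because they come from updateStateByKey, they keep accumulating across batches instead of resetting every 30 seconds. Note also that the Optional used in the update function is org.apache.spark.api.java.Optional (see the imports), which is the type the Spark 2.x Java API expects here, not java.util.Optional.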
import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function0;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;
public class TestNewSparkTaskMulWordCountMapWithState
{
private static final String SOCKET_SERVER_IP = "xxx";
private static final int SOCKET_SERVER_PORT_BMP = 5555;
private static final int SOCKET_SERVER_PORT_PREFIX = 9999;
private static final String CHECK_POINT_DIR = "D:\\spark\\checkpoint\\mapwithstate\\wordcount";
private static final int CHECK_POINT_DURATION_SECONDS = 10;
// Create the JavaStreamingContext
private static JavaStreamingContext getJavaStreamingContext()
{
/**
* Spark has three local run modes:
* (1) local: run with a single local thread;
* (2) local[k]: run with k local threads;
* (3) local[*]: run with as many local threads as possible.
*/
SparkConf conf = new SparkConf().setAppName("SparkMapWithState").setMaster("local[*]");
// 1. Set the batch interval
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(CHECK_POINT_DURATION_SECONDS));
// 2. Set the checkpoint directory
jssc.checkpoint(CHECK_POINT_DIR);
// 3. Open two socket connections to simulate two inputs; on the remote Linux host use nc -lk [port] to start each server
JavaReceiverInputDStream<String> word1 = jssc.socketTextStream(SOCKET_SERVER_IP, SOCKET_SERVER_PORT_BMP);
JavaReceiverInputDStream<String> word2 = jssc.socketTextStream(SOCKET_SERVER_IP, SOCKET_SERVER_PORT_PREFIX);
// 4. Process the received data
FlatMapFunction<String, String> flatMapFunction = new FlatMapFunction<String, String>() {
/**
* serialVersionUID
*/
private static final long serialVersionUID = -8511938709723688992L;
@Override
public Iterator<String> call(String t) throws Exception
{
return Arrays.asList(t.split(" ")).iterator();
}
};
JavaDStream<String> word1Stream = word1.flatMap(flatMapFunction);
JavaDStream<String> word2Stream = word2.flatMap(flatMapFunction);
// 5. Union the two streams
JavaDStream<String> unionData = word1Stream.union(word2Stream);
// 6. Convert to a JavaPairDStream
JavaPairDStream<String, Integer> pairs = unionData.mapToPair(new PairFunction<String, String, Integer>() {
/**
* serialVersionUID
*/
private static final long serialVersionUID = 7494315448364736838L;
public Tuple2<String, Integer> call(String word) throws Exception
{
return new Tuple2<>(word, 1);
}
});
// 7. Count words globally across all batches, not just within a single batch
Function3<String, Optional<Integer>, State<Integer>, String> mappingFunction = new Function3<String, Optional<Integer>, State<Integer>, String>() {
/**
* serialVersionUID
*/
private static final long serialVersionUID = -4105602513005256270L;
// curState is the current state for this key
@Override
public String call(String key, Optional<Integer> value,
State<Integer> curState) throws Exception
{
if (value.isPresent())
{
Integer curValue = value.get();
System.out.println("value ------------->" + curValue);
if(curState.exists())
{
curState.update(curState.get() + curValue);
}
else
{
curState.update(curValue);
}
}
System.out.println("key ------------->" + key);
System.out.println("curState ------------->" + curState);
return key;
}
};
JavaMapWithStateDStream<String, Integer, Integer, String> wordcounts = pairs.mapWithState(StateSpec.function(mappingFunction));
// 8. Output
wordcounts.print();
return jssc;
}
private static void testSpark()
{
// 1. Recover from the checkpoint, or create a new JavaStreamingContext if no checkpoint exists
JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(CHECK_POINT_DIR, new Function0<JavaStreamingContext>() {
/**
* serialVersionUID
*/
private static final long serialVersionUID = -6070032440759098908L;
@Override
public JavaStreamingContext call() throws Exception
{
return getJavaStreamingContext();
}
});
// 2. Start the job
jssc.start();
try
{
jssc.awaitTermination();
}
catch (InterruptedException e)
{
e.printStackTrace();
}
jssc.close();
}
public static void main(String[] args)
{
testSpark();
}
}
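One thing the mapWithState example does not show: wordcounts.print() only prints the mapped value (here, the key) for records in the current batch. To inspect the full accumulated state every batch, JavaMapWithStateDStream exposes stateSnapshots(), which returns a pair DStream of all (key, state) entries. A minimal addition (my own, not part of the original code) inside getJavaStreamingContext(), right after the mapWithState call, would be:
// Print the complete (word, count) state maintained by mapWithState,
// not just the keys that were seen in the current batch
JavaPairDStream<String, Integer> snapshots = wordcounts.stateSnapshots();
snapshots.print();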
1. When using a checkpoint, create the context with getOrCreate(): it restores from the checkpoint if one exists, otherwise it builds a new JavaStreamingContext.
2. With a single input, i.e. one socketTextStream, local mode can run on two threads, setMaster("local[2]"). With multiple inputs, i.e. several socketTextStream receivers plus the Spark processing itself, two threads are no longer enough, because each receiver permanently occupies one thread; use setMaster("local[*]") to run on as many threads as possible.
1. With updateStateByKey, the previously maintained state is always read back promptly after a restart.
2. With mapWithState, the latest state is not read back after a restart; state can be lost. Initial experiments show that at most 9 checkpoint intervals of data are lost: with a 10 s checkpoint interval, the state maintained over the following ~90 s is lost after a service restart because it was never written to the checkpoint files.
The verification results and analysis are as follows:
mapWithState checkpoint recovery lag:
10 s checkpoint interval: about 90 s of state lost
30 s checkpoint interval: about 210 s of state lost
Cause analysis:
Key log line. The following log indicates that an InternalMapWithStateDStream checkpoint has been taken; when it appears, the latest state has been backed up, and the data and state up to that point can be recovered after a restart:
18/11/14 15:02:00 INFO InternalMapWithStateDStream: Marking RDD 54 for time 1542178920000 ms for checkpointing
Trigger condition (tied to the checkpoint interval):
(validTime - zeroTime).milliseconds % checkpointDuration.milliseconds == 0
Location of the log in the code:
DStream -> getOrCompute() (compute-and-cache the RDD corresponding to the given time)
Called from:
InternalMapWithStateDStream -> compute() (the method that generates an RDD for the given time)
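One further detail that explains the observed lag: by default, InternalMapWithStateDStream sets its own checkpointDuration to 10 times the batch interval, so with a 10 s batch interval the state is only checkpointed roughly every 100 s, and up to ~9 batches of state can be missing from the latest checkpoint. If that loss window is unacceptable, the checkpoint interval of the state stream can be shortened explicitly via the DStream checkpoint(interval) method. A minimal sketch (my addition, not in the original code; wordcounts is the stream from the mapWithState example above, and the call must be made before the context is started):
// Checkpoint the mapWithState stream every 2 batches (20 s with a 10 s batch interval)
// instead of the default 10 batches, trading extra checkpoint I/O for a smaller
// window of state that can be lost on restart
wordcounts.checkpoint(Durations.seconds(2 * CHECK_POINT_DURATION_SECONDS));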