Spark Streaming Stateful Streams: updateStateByKey & mapWithState in Practice (Java), with Checkpoint Usage

Background:

For stateful Spark Streaming processing, Spark officially provides two approaches, updateStateByKey and mapWithState. Both maintain per-key state in memory so that incoming data can be compared against or aggregated into it. Their differences and similarities are roughly as follows:

1. updateStateByKey

Maintains and updates per-key state in memory (the source stores it with persist(MEMORY_ONLY_SER), i.e. serialized in-memory storage).

Under the hood it co-groups the new batch with the full state, so every key passes through mapFunc (the user-defined function) on every batch. Performance is therefore relatively low and keeps dropping as the maintained state grows; checkpointing snapshots of that state may also take up considerable storage (still to be verified).

2. mapWithState (marked experimental from 1.6 through 2.1, but no problems were found in our testing so far)

Maintains and updates per-key state in memory (the source stores it with persist(StorageLevel.MEMORY_ONLY), i.e. deserialized in-memory storage).

Under the hood it uses partitioning and related strategies so that mapFunc (the user-defined function) only runs on the keys that actually receive data in a batch, i.e. the computation is incremental; it is officially claimed to be up to 10x faster than updateStateByKey.

 

Because all of this state lives in memory, Spark's checkpoint feature is needed to back up the streaming context, so that the maintained state survives a server crash, a service upgrade, or a restart, and the computation can carry on from where it left off.

Notes on recovering from a checkpoint:

Drawback: the job logic must not change. Any change to the Spark job context (verified to include even changing the checkpoint interval in seconds) makes the recovery fail with an exception.

Workaround: delete the files under checkPointDir and give up the previously maintained state; a minimal cleanup sketch follows.
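A minimal cleanup sketch for local runs (an assumption, not part of the original post: it removes the same local checkpoint directory used in the updateStateByKey example below; in a cluster deployment the directory would live on HDFS and be removed with hdfs dfs -rm -r instead):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;

public class CheckpointCleaner
{
    public static void main(String[] args) throws IOException
    {
        // Same directory as CHECK_POINT_DIR in the updateStateByKey example below
        Path checkpointDir = Paths.get("D:\\spark\\checkpoint\\updatestatebykey");
        if (Files.exists(checkpointDir))
        {
            // Delete children before parents so the directory itself can be removed
            Files.walk(checkpointDir)
                .sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        }
    }
}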

 

Environment setup:

1. Use Spark's local run mode for local verification.

2. Use the netcat tool to simulate a socket server (nc -lk 9999).

3. Install the Hadoop dependency for Windows: download the winutils build from https://github.com/srccodes/hadoop-common-2.2.0-bin, unpack it, add <unpacked dir>\bin to the PATH environment variable and reboot (an alternative that sets hadoop.home.dir from code is sketched after the dependency list).

4. Create a new Maven project with the dependencies below. (Note: if you explicitly pull in jackson jars, use a 2.6.x version; newer jackson releases may conflict with the version spark-core depends on.)

           

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.1.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.1.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.0</version>
    <scope>provided</scope>
</dependency>
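As an alternative to editing PATH and rebooting in step 3, the winutils location can also be set from code before any Spark objects are created. This is a hedged sketch, not something the original setup used; the directory is illustrative and must be the folder that contains bin\winutils.exe:

// Assumption: pointing hadoop.home.dir at the unpacked winutils folder replaces the PATH change.
// Put this at the very top of main(), before the SparkConf is constructed.
System.setProperty("hadoop.home.dir", "D:\\hadoop-common-2.2.0-bin");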

           

 

Code in practice (Java):

Taking the common word-count scenario (counting how many times each word occurs) as the example, the computation is implemented below first with updateStateByKey and then with mapWithState.

updateStateByKey

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function0;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class TestNewSparkTaskUpdateByKey
{

    private static final String SOCKET_SERVER_IP = "xxx";
    
    private static final int SOCKET_SERVER_PORT = 9999;
    
    private static final String CHECK_POINT_DIR = "D:\\spark\\checkpoint\\updatestatebykey";
    
    private static final int CHECK_POINT_DURATION_SECONDS = 30;
    
    private static JavaStreamingContext getJavaStreamingContext()
    {
        
        SparkConf conf = new SparkConf().setAppName("SparkUpdateStateByKey").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(CHECK_POINT_DURATION_SECONDS));
        
        jssc.checkpoint(CHECK_POINT_DIR);
        
        JavaReceiverInputDStream<String> messages = jssc.socketTextStream(SOCKET_SERVER_IP, SOCKET_SERVER_PORT);
        
        JavaDStream<String> words = messages.flatMap(new FlatMapFunction<String, String>() {

            /**
             * serialVersionUID
             */
            private static final long serialVersionUID = -8511938709723688992L;

            @Override
            public Iterator<String> call(String t) throws Exception
            {
                return Arrays.asList(t.split(" ")).iterator();
            }
        });
        
        JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {

            /**
             * serialVersionUID
             */
            private static final long serialVersionUID = 7494315448364736838L;

            public Tuple2<String, Integer> call(String word) throws Exception
            {
                return new Tuple2<String, Integer>(word, 1);
            }
        });
        
        // Count the word totals globally across all batches, not just within a single batch
        JavaPairDStream<String, Integer> wordcounts = pairs.updateStateByKey(
            new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>()
            {
                /**
                 * serialVersionUID
                 */
                private static final long serialVersionUID = -7837221857493546768L;

                // valueList: the new values for this key in this batch; there may be several,
                // e.g. (hadoop,1)(hadoop,1) arrives here as (1,1)
                // oldState: the previous state held for this key
                public Optional<Integer> call(List<Integer> valueList,
                                              Optional<Integer> oldState)
                                                  throws Exception
                {
                    Integer newState = 0;
                    // If oldState is present, this key has been counted before; otherwise this is the first time the key appears
                    if (oldState.isPresent())
                    {
                        newState = oldState.get();
                    }

                    // Add up the new values to update the state
                    for (Integer value : valueList)
                    {
                        newState += value;
                    }
                    return Optional.of(newState);
                }
            });

        wordcounts.print();
        return jssc;
    }
    
    private static void testSpark()
    {
        JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(CHECK_POINT_DIR, new Function0<JavaStreamingContext>() {

            /**
             * serialVersionUID
             */
            private static final long serialVersionUID = -6070032440759098908L;

            @Override
            public JavaStreamingContext call() throws Exception
            {
                return getJavaStreamingContext();
            }
            
        });

        
        jssc.start();
        try
        {
            jssc.awaitTermination();
        }
        catch (InterruptedException e)
        {
            e.printStackTrace();
        }
        jssc.close();
    }
    
    public static void main(String[] args)
    {
        testSpark();
    }

}
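As a quick sanity check of the cumulative behaviour: if one batch receives the line "hadoop spark hadoop" over the socket, the job prints roughly the following (the timestamp is illustrative), and the counts keep growing across batches instead of resetting:

-------------------------------------------
Time: 1542178920000 ms
-------------------------------------------
(hadoop,2)
(spark,1)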

mapWithState

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function0;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class TestNewSparkTaskMulWordCountMapWithState
{
    private static final String SOCKET_SERVER_IP = "xxx";
    
    private static final int SOCKET_SERVER_PORT_BMP = 5555;
    
    private static final int SOCKET_SERVER_PORT_PREFIX = 9999;
    
    private static final String CHECK_POINT_DIR = "D:\\spark\\checkpoint\\mapwithstate\\wordcount";
    
    private static final int CHECK_POINT_DURATION_SECONDS = 10;
    
    // Create a new JavaStreamingContext
    private static JavaStreamingContext getJavaStreamingContext()
    {
        /**
         * Spark supports three local run modes:
         * (1) local: run with a single local thread;
         * (2) local[k]: run with k local threads;
         * (3) local[*]: run with as many local threads as possible.
         */
        SparkConf conf = new SparkConf().setAppName("SparkMapWithState").setMaster("local[*]");
        // 1. Set the batch interval
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(CHECK_POINT_DURATION_SECONDS));
        
        // 2. Set the checkpoint directory
        jssc.checkpoint(CHECK_POINT_DIR);
        
        // 3. Open two socket connections to simulate two inputs; on the remote Linux host, start each server with nc -lk [port]
        JavaReceiverInputDStream<String> word1 = jssc.socketTextStream(SOCKET_SERVER_IP, SOCKET_SERVER_PORT_BMP);
        JavaReceiverInputDStream<String> word2 = jssc.socketTextStream(SOCKET_SERVER_IP, SOCKET_SERVER_PORT_PREFIX);
        
        // 4. Process the received data
        FlatMapFunction<String, String> flatMapFunction = new FlatMapFunction<String, String>() {

            /**
             * serialVersionUID
             */
            private static final long serialVersionUID = -8511938709723688992L;

            @Override
            public Iterator<String> call(String t) throws Exception
            {
                return Arrays.asList(t.split(" ")).iterator();
            }
        };
        
        JavaDStream<String> word1Stream = word1.flatMap(flatMapFunction);
        JavaDStream<String> word2Stream = word2.flatMap(flatMapFunction);
        
        // 5. Union the two streams into one
        JavaDStream<String> unionData = word1Stream.union(word2Stream);
        
        // 6. Convert to a JavaPairDStream of (word, 1)
        JavaPairDStream<String, Integer> pairs = unionData.mapToPair(new PairFunction<String, String, Integer>() {

            /**
             * serialVersionUID
             */
            private static final long serialVersionUID = 7494315448364736838L;

            public Tuple2<String, Integer> call(String word) throws Exception
            {
                return new Tuple2<String, Integer>(word, 1);
            }
        });
        
        // 7. Count the word totals globally across all batches, not just within a single batch
        Function3<String, Optional<Integer>, State<Integer>, String> mappingFunction = new Function3<String, Optional<Integer>, State<Integer>, String>() {

            /**
             * serialVersionUID
             */
            private static final long serialVersionUID = -4105602513005256270L;

            // curState is the state currently held for this key
            @Override
            public String call(String key, Optional<Integer> value,
                               State<Integer> curState) throws Exception
            {
                if (value.isPresent())
                {
                    Integer curValue = value.get();
                    System.out.println("value ------------->" + curValue);
                    if(curState.exists())
                    {
                        curState.update(curState.get() + curValue);
                    }
                    else
                    {
                        curState.update(curValue);
                    }
                }
                System.out.println("key ------------->" + key);
                System.out.println("curState ------------->" + curState);
                return key;

            }
            
        };
        JavaMapWithStateDStream<String, Integer, Integer, String> wordcounts = pairs.mapWithState(StateSpec.function(mappingFunction));

        // 8. Print the mapped output
        wordcounts.print();
        return jssc;
    }
    
    private static void testSpark()
    {
        // 1. Recover from the checkpoint; if no checkpoint exists, create a new JavaStreamingContext
        JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(CHECK_POINT_DIR, new Function0<JavaStreamingContext>() {

            /**
             * serialVersionUID
             */
            private static final long serialVersionUID = -6070032440759098908L;

            @Override
            public JavaStreamingContext call() throws Exception
            {
                return getJavaStreamingContext();
            }
            
        });

        // 2. Start the job
        jssc.start();
        try
        {
            jssc.awaitTermination();
        }
        catch (InterruptedException e)
        {
            e.printStackTrace();
        }
        jssc.close();
    }
    
    public static void main(String[] args)
    {
        testSpark();
    }
}
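The mapping function above returns only the key, so wordcounts.print() shows just the words seen in each batch. To also print the full (word, count) state once per batch, mapWithState's stateSnapshots() can be added right before return jssc; in step 8; a minimal sketch:

        // Optional addition before "return jssc;":
        // stateSnapshots() emits every (key, state) pair currently held in state, once per batch.
        JavaPairDStream<String, Integer> snapshots = wordcounts.stateSnapshots();
        snapshots.print();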

 

Code analysis

1. When using checkpointing, create the context with getOrCreate(): it recovers from the checkpoint directory if one exists, and otherwise builds a new JavaStreamingContext.

2. With a single input (one socketTextStream), local mode runs fine with two threads, setMaster("local[2]"). With multiple inputs (several socketTextStream receivers) plus the Spark processing itself, two threads are no longer enough, so use setMaster("local[*]") to run with as many threads as possible (see the sketch below).
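The rule of thumb behind item 2: every socketTextStream receiver permanently occupies one thread, so local mode needs at least (number of receivers + 1) threads before any batch processing can happen at all. A small sketch that makes this explicit (receiverCount is an illustrative variable, not part of the listings above):

// Two socket inputs in the mapWithState example -> at least 3 local threads
int receiverCount = 2;
SparkConf conf = new SparkConf()
        .setAppName("SparkMapWithState")
        .setMaster("local[" + (receiverCount + 1) + "]");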

Analysis of state recovery from checkpoint:

1. With updateStateByKey, the previously maintained state is always read back promptly after a restart.

2. With mapWithState, the latest state cannot always be read back after a restart: part of the state is lost. Preliminary experiments show the loss is at most 9 checkpoint periods, i.e. with a 10 s checkpoint period, the state maintained during the last 90 s before the restart is lost because it was never written to the checkpoint files.

mapWithState checkpoint recovery lag: measurements and a brief analysis

Measured results:

mapWithState checkpoint recovery lag
- checkpoint interval set to 10 s: roughly 90 s of state lost
- checkpoint interval set to 30 s: roughly 210 s of state lost


Root-cause analysis:

Key log message:
This line shows that InternalMapWithStateDStream performed a state checkpoint; once it appears, data and state up to that point in time can be recovered after a service restart.
18/11/14 15:02:00 INFO InternalMapWithStateDStream: Marking RDD 54 for time 1542178920000 ms for checkpointing

Trigger condition (tied to the checkpoint duration):
(validTime - zeroTime).milliseconds % checkpointDuration.milliseconds == 0

Where the log is emitted:
DStream.getOrCompute()            /** computes and caches the RDD corresponding to the given time */
Called from:
InternalMapWithStateDStream.compute()   /** Method that generates an RDD for the given time */
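For what it's worth, the measured lag is consistent with InternalMapWithStateDStream defaulting its checkpoint interval to about 10x the batch interval (an assumption based on reading the Spark 2.1 source, not an official guarantee). A small sketch of the arithmetic and of the trigger condition above (zeroTime is illustrative):

public class MapWithStateCheckpointEstimate
{
    public static void main(String[] args)
    {
        long batchIntervalMs = 10_000L;                       // CHECK_POINT_DURATION_SECONDS = 10
        long internalCheckpointMs = 10 * batchIntervalMs;     // assumed 10x multiplier
        // Worst case: the service dies just before the next internal checkpoint,
        // so up to ~9 batch intervals of state were never written out.
        long maxLossMs = internalCheckpointMs - batchIntervalMs;

        // The "Marking RDD ... for checkpointing" log fires when this condition holds
        long zeroTimeMs = 1542178820000L;                     // illustrative job start time
        long validTimeMs = 1542178920000L;                    // batch time from the log line above
        boolean triggersCheckpoint = (validTimeMs - zeroTimeMs) % internalCheckpointMs == 0;

        System.out.println("internal checkpoint interval ~ " + internalCheckpointMs / 1000 + " s, "
            + "worst-case state loss ~ " + maxLossMs / 1000 + " s, "
            + "this batch checkpoints: " + triggersCheckpoint);
    }
}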

 

