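Running OLAP in TinkerPop with Spark on YARN

In TinkerPop, SparkGraphComputer can be combined with HadoopGraph to run OLAP over a graph in a distributed fashion, using the resources of a big-data cluster. The official documentation only covers running OLAP from the Gremlin Console, but in a production environment you usually need to write a program and package it as a jar. This article shows how to launch such a program with java -cp so that it runs as Spark on YARN.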

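Client program

The program below is the official "Using CloneVertexProgram" sample rewritten as a Java program. The logic is simple: load the raw data file tinkerpop-modern.json (provided by the TinkerPop project) from HDFS to build a graph, copy that graph with CloneVertexProgram, and write the copy back to HDFS in JSON format.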

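package com.woople.tinkerpop.gremlin;

import java.io.File;
import org.apache.commons.configuration.FileConfiguration;
import org.apache.commons.configuration.PropertiesConfiguration;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph;
import org.apache.tinkerpop.gremlin.process.computer.clone.CloneVertexProgram;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;

public class HadoopGraphSparkComputerDemo {
    public static void main(String[] args) throws Exception {
        // args[0] is the path to hadoop-graphson.properties
        FileConfiguration configuration = new PropertiesConfiguration();
        configuration.load(new File(args[0]));

        // Load core-site.xml from the classpath; with new Configuration(false)
        // no resources are read, so the Kerberos check below could never match.
        final Configuration hadoopConfig = new Configuration();

        if ("kerberos".equalsIgnoreCase(hadoopConfig.get("hadoop.security.authentication"))) { // 1
            UserGroupInformation.setConfiguration(hadoopConfig);
            try {
                UserGroupInformation userGroupInformation =
                        UserGroupInformation.loginUserFromKeytabAndReturnUGI(
                                configuration.getString("user.principal"),
                                configuration.getString("user.keytab"));
                UserGroupInformation.setLoginUser(userGroupInformation);

                System.out.println("Login successfully!");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        HadoopGraph graph = HadoopGraph.open(configuration);
        // 2: run CloneVertexProgram on Spark and wait for the job to finish
        graph.compute(SparkGraphComputer.class).program(CloneVertexProgram.build().create()).submit().get();
    }
}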
  1. Logs in from the keytab when the cluster is secured with Kerberos
  2. The last line is the one that actually copies the graph

Configuration file

Create a configuration file named hadoop-graphson.properties with the following content:

# the graph class
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
# both input and output use the GraphSON (JSON) format
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
# path of the input data
gremlin.hadoop.inputLocation=/tmp/tinkerpop-modern.json
# path of the output
gremlin.hadoop.outputLocation=/tmp/output
# if the job jars are not on the classpath of every hadoop node, then they must be provided to the distributed cache at runtime
gremlin.hadoop.jarsInDistributedCache=true

####################################
# SparkGraphComputer Configuration #
####################################
spark.master=yarn
# SparkGraphComputer supports client deploy mode only
spark.submit.deployMode=client
# path of the jars that Spark on YARN depends on at runtime
spark.yarn.jars=/tmp/graph-jars/*.jar
spark.driver.extraJavaOptions=-Dhdp.version=2.6.0.3-8
spark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.0.3-8

spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
[email protected]
user.keytab=/tmp/hdfs.headless.keytab
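
A typo in this file tends to surface only after the job has been submitted to YARN, so it can help to check the file locally first. The following is a minimal sketch of such a check using only the JDK; the class GraphJobPreflight and its validate method are hypothetical helpers, not part of TinkerPop or Spark, and the list of required keys is simply taken from the file above.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical helper (not a TinkerPop API): sanity-checks the job properties before submission. */
final class GraphJobPreflight {

    // Keys copied from hadoop-graphson.properties; adjust the list for other setups.
    private static final String[] REQUIRED_KEYS = {
            "gremlin.graph",
            "gremlin.hadoop.graphReader",
            "gremlin.hadoop.graphWriter",
            "gremlin.hadoop.inputLocation",
            "gremlin.hadoop.outputLocation",
            "spark.master",
            "spark.yarn.jars"
    };

    /** Returns human-readable problems; an empty list means every check passed. */
    static List<String> validate(Properties props) {
        List<String> problems = new ArrayList<>();
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                problems.add("missing required key: " + key);
            }
        }
        // The article states that SparkGraphComputer only supports client mode;
        // an absent key is accepted because Spark defaults to client.
        String deployMode = props.getProperty("spark.submit.deployMode", "client").trim();
        if (!"client".equals(deployMode)) {
            problems.add("spark.submit.deployMode must be client, found: " + deployMode);
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        List<String> problems = validate(props);
        problems.forEach(System.err::println);
        if (!problems.isEmpty()) {
            System.exit(1);
        }
    }
}
```

Run it with the path to the properties file as the only argument; it prints one line per problem and exits with a non-zero status if any check fails.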

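Building the program

Declare all required dependencies in the pom file, then use maven-dependency-plugin so that the dependency jars are copied into the lib directory at build time.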



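<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.woople</groupId>
    <artifactId>graph-tutorials</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>gremlin-core</artifactId>
            <version>3.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>tinkergraph-gremlin</artifactId>
            <version>3.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>spark-gremlin</artifactId>
            <version>3.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>hadoop-gremlin</artifactId>
            <version>3.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-yarn_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-reflect</artifactId>
            <version>2.11.8</version>
        </dependency>
        <dependency>
            <groupId>com.sun.jersey</groupId>
            <artifactId>jersey-core</artifactId>
            <version>1.9</version>
        </dependency>
        <dependency>
            <groupId>com.sun.jersey</groupId>
            <artifactId>jersey-client</artifactId>
            <version>1.9</version>
        </dependency>

        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>

        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.28</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>1.2.3</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-core</artifactId>
            <version>1.2.3</version>
        </dependency>
    </dependencies>
    <build>
        <defaultGoal>package</defaultGoal>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <configuration>
                    <encoding>UTF-8</encoding>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>copy-resources</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <id>eclipse-add-source</id>
                        <goals>
                            <goal>add-source</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile-first</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>attach-scaladocs</id>
                        <phase>verify</phase>
                        <goals>
                            <goal>doc-jar</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>2.11.8</scalaVersion>
                    <recompileMode>incremental</recompileMode>
                    <useZincServer>true</useZincServer>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <execution>
                        <id>copy-dependencies</id>
                        <phase>prepare-package</phase>
                        <goals>
                            <goal>copy-dependencies</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>${project.build.directory}/lib</outputDirectory>
                            <overWriteReleases>false</overWriteReleases>
                            <overWriteSnapshots>false</overWriteSnapshots>
                            <overWriteIfNewer>true</overWriteIfNewer>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>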

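Deployment and execution

  1. Upload all the dependency jars that the build placed under lib to the HDFS path specified by spark.yarn.jars
  2. Copy the generated lib folder and graph-tutorials-1.0-SNAPSHOT.jar to the runtime environment, for example /opt/graph-tutorials
  3. Put hadoop-graphson.properties, core-site.xml, hdfs-site.xml and yarn-site.xml in a configuration directory, for example /opt/graph-tutorials/conf
  4. cd /opt/graph-tutorials, then run java -cp "lib/*:conf:graph-tutorials-1.0-SNAPSHOT.jar" com.woople.tinkerpop.gremlin.HadoopGraphSparkComputerDemo conf/hadoop-graphson.properties (the program opens the properties file as a plain file path, so the path is given relative to the working directory)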
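
Before launching, the layout described in steps 2 and 3 can be verified with a few file checks. This is only a sketch: DeployLayoutCheck and missingEntries are hypothetical names, the file names are the ones used in this article, and it checks the local directory only, not the jars uploaded to HDFS in step 1.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: checks the local deployment directory described in the steps above. */
final class DeployLayoutCheck {

    private static final String APP_JAR = "graph-tutorials-1.0-SNAPSHOT.jar";

    private static final String[] CONF_FILES = {
            "hadoop-graphson.properties",
            "core-site.xml",
            "hdfs-site.xml",
            "yarn-site.xml"
    };

    /** Returns the entries missing under the given home directory, relative to it. */
    static List<String> missingEntries(Path home) {
        List<String> missing = new ArrayList<>();
        if (!Files.isDirectory(home.resolve("lib"))) {
            missing.add("lib/");
        }
        if (!Files.isRegularFile(home.resolve(APP_JAR))) {
            missing.add(APP_JAR);
        }
        for (String name : CONF_FILES) {
            if (!Files.isRegularFile(home.resolve("conf").resolve(name))) {
                missing.add("conf/" + name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Defaults to the example directory used in this article.
        Path home = Paths.get(args.length > 0 ? args[0] : "/opt/graph-tutorials");
        List<String> missing = missingEntries(home);
        for (String entry : missing) {
            System.err.println("missing: " + home.resolve(entry));
        }
        if (!missing.isEmpty()) {
            System.exit(1);
        }
    }
}
```

A missing core-site.xml, hdfs-site.xml or yarn-site.xml under conf is worth catching early, because conf is the classpath entry that makes the Hadoop settings visible to the program.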

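Summary

The problems I ran into while debugging were mainly Hadoop configuration files that could not be loaded, jar conflicts, and missing classes. Note that the Spark version has to be chosen to match the TinkerPop version. For the complete example, see the graph-tutorials project.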
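
For the first and last kinds of problem, it often helps to ask the JVM directly what it can see on the classpath. The sketch below uses only standard JDK calls; ClasspathDiagnostics and its two methods are hypothetical names chosen for illustration, and it should be run with the same -cp value as the real program so that it sees the same classpath.

```java
import java.net.URL;
import java.security.CodeSource;

/** Hypothetical helper: reports where resources and classes are loaded from. */
final class ClasspathDiagnostics {

    /** Returns the classpath URL of a resource, or null if it is not visible. */
    static URL locateResource(String name) {
        return ClasspathDiagnostics.class.getClassLoader().getResource(name);
    }

    /** Returns the jar or directory a class comes from, or a marker string. */
    static String locateClass(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class<?> type = Class.forName(className, false, ClasspathDiagnostics.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            if (source == null || source.getLocation() == null) {
                return "<JDK runtime>";
            }
            return source.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "<not on classpath>";
        }
    }

    public static void main(String[] args) {
        // Hadoop site files the program expects to find through the conf directory.
        for (String resource : new String[] {"core-site.xml", "hdfs-site.xml", "yarn-site.xml"}) {
            System.out.println(resource + " -> " + locateResource(resource));
        }
        // Class names to look up are passed as arguments.
        for (String className : args) {
            System.out.println(className + " -> " + locateClass(className));
        }
    }
}
```

Passing a class name that appears in a ClassNotFoundException or NoSuchMethodError shows whether the class is present at all and, if so, which jar supplies it, which is usually the first clue when tracking down a conflict.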