In TinkerPop, SparkGraphComputer combined with HadoopGraph lets you run distributed OLAP over a graph using big-data cluster resources. The official documentation only shows how to run OLAP from the Gremlin Console, but in a real production environment you usually need to package the program as a jar and execute it. This article describes how to run such a program with `java -cp` to achieve Spark on YARN.
Client program
The example below adapts the official sample Using CloneVertexProgram into a Java program. The business logic is simple: load the raw data file tinkerpop-modern.json (provided by TinkerPop) from HDFS to build a graph, copy that graph with CloneVertexProgram, and write the copy back to HDFS in GraphSON format.
```java
import java.io.File;

import org.apache.commons.configuration.FileConfiguration;
import org.apache.commons.configuration.PropertiesConfiguration;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph;
import org.apache.tinkerpop.gremlin.process.computer.clone.CloneVertexProgram;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;

public class HadoopGraphSparkComputerDemo {
    public static void main(String[] args) throws Exception {
        // Load the graph configuration (hadoop-graphson.properties) passed as the first argument
        FileConfiguration configuration = new PropertiesConfiguration();
        configuration.load(new File(args[0]));

        // Load core-site.xml/hdfs-site.xml from the classpath so the
        // authentication mode can actually be detected
        final Configuration hadoopConfig = new Configuration();
        if ("kerberos".equalsIgnoreCase(hadoopConfig.get("hadoop.security.authentication"))) { // 1
            UserGroupInformation.setConfiguration(hadoopConfig);
            try {
                UserGroupInformation userGroupInformation = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
                        configuration.getString("user.principal"), configuration.getString("user.keytab"));
                UserGroupInformation.setLoginUser(userGroupInformation);
                System.out.println("Login successfully!");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        HadoopGraph graph = HadoopGraph.open(configuration);
        // Run CloneVertexProgram on Spark: read the input graph, copy it, write the result
        graph.compute(SparkGraphComputer.class).program(CloneVertexProgram.build().create()).submit().get();
    }
}
```
- The block marked `// 1` adapts the program to Kerberos-secured clusters.
- The last line performs the actual graph-copy computation.
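One small detail in the Kerberos check above: putting the string literal first, `"kerberos".equalsIgnoreCase(...)`, keeps the comparison null-safe when `hadoop.security.authentication` is not set (for example when core-site.xml is missing from the classpath). A minimal standalone sketch (class name is illustrative only):

```java
public class NullSafeCheckDemo {
    public static void main(String[] args) {
        // Simulates hadoopConfig.get("hadoop.security.authentication") returning null
        String auth = null;
        // Literal-first comparison: equalsIgnoreCase(null) returns false instead of throwing an NPE
        System.out.println("kerberos".equalsIgnoreCase(auth));   // false
        // Case-insensitive match when the property is set
        System.out.println("kerberos".equalsIgnoreCase("KERBEROS")); // true
    }
}
```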
Configuration file
Create a configuration file named hadoop-graphson.properties with the following content:
```properties
# the graph class
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
# both input and output use the GraphSON (json) format
gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
# path of the input data on HDFS
gremlin.hadoop.inputLocation=/tmp/tinkerpop-modern.json
# path for the output
gremlin.hadoop.outputLocation=/tmp/output
# if the job jars are not on the classpath of every hadoop node, then they must be provided to the distributed cache at runtime
gremlin.hadoop.jarsInDistributedCache=true

####################################
# SparkGraphComputer Configuration #
####################################
spark.master=yarn
# SparkGraphComputer only supports client deploy mode
spark.submit.deployMode=client
# HDFS path holding the jars that spark on yarn depends on at runtime
spark.yarn.jars=/tmp/graph-jars/*.jar
spark.driver.extraJavaOptions=-Dhdp.version=2.6.0.3-8
spark.yarn.am.extraJavaOptions=-Dhdp.version=2.6.0.3-8
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
# Kerberos principal and keytab read by the client program
[email protected]
user.keytab=/tmp/hdfs.headless.keytab
```
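The client loads this file through Commons Configuration's PropertiesConfiguration, but the format is ordinary Java properties. As a quick illustration of what the program ends up reading, here is a stdlib-only sketch using java.util.Properties in place of Commons Configuration (the inlined keys mirror the file above; the class name is hypothetical):

```java
import java.io.StringReader;
import java.util.Properties;

public class GraphConfigSketch {
    public static void main(String[] args) throws Exception {
        // A few of the keys from hadoop-graphson.properties, inlined for the sketch
        String content = String.join("\n",
                "gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph",
                "gremlin.hadoop.inputLocation=/tmp/tinkerpop-modern.json",
                "spark.master=yarn");
        Properties props = new Properties();
        props.load(new StringReader(content));
        // HadoopGraph.open(...) consults gremlin.graph to decide which Graph class to instantiate
        System.out.println(props.getProperty("gremlin.graph"));
        System.out.println(props.getProperty("spark.master"));
    }
}
```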
Building the program
All required dependencies are declared in the pom file; during the build, the maven-dependency-plugin copies them into the lib directory:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.woople</groupId>
    <artifactId>graph-tutorials</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>gremlin-core</artifactId>
            <version>3.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>tinkergraph-gremlin</artifactId>
            <version>3.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>spark-gremlin</artifactId>
            <version>3.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.tinkerpop</groupId>
            <artifactId>hadoop-gremlin</artifactId>
            <version>3.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-yarn_2.11</artifactId>
            <version>2.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-reflect</artifactId>
            <version>2.11.8</version>
        </dependency>
        <dependency>
            <groupId>com.sun.jersey</groupId>
            <artifactId>jersey-core</artifactId>
            <version>1.9</version>
        </dependency>
        <dependency>
            <groupId>com.sun.jersey</groupId>
            <artifactId>jersey-client</artifactId>
            <version>1.9</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.28</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>1.2.3</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-core</artifactId>
            <version>1.2.3</version>
        </dependency>
    </dependencies>

    <build>
        <defaultGoal>package</defaultGoal>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <configuration>
                    <encoding>UTF-8</encoding>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>copy-resources</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <id>eclipse-add-source</id>
                        <goals>
                            <goal>add-source</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile-first</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>attach-scaladocs</id>
                        <phase>verify</phase>
                        <goals>
                            <goal>doc-jar</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>2.11.8</scalaVersion>
                    <recompileMode>incremental</recompileMode>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-dependency-plugin</artifactId>
                <version>3.1.1</version>
                <executions>
                    <execution>
                        <id>copy-dependencies</id>
                        <phase>prepare-package</phase>
                        <goals>
                            <goal>copy-dependencies</goal>
                        </goals>
                        <configuration>
                            <outputDirectory>${project.build.directory}/lib</outputDirectory>
                            <overWriteReleases>false</overWriteReleases>
                            <overWriteSnapshots>false</overWriteSnapshots>
                            <overWriteIfNewer>true</overWriteIfNewer>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
```
Deploy and run
- Upload all the dependency jars generated under lib in the previous step to the HDFS path specified by `spark.yarn.jars`.
- Copy the generated `lib` folder and `graph-tutorials-1.0-SNAPSHOT.jar` to the runtime environment, e.g. `/opt/graph-tutorials`.
- Put `hadoop-graphson.properties`, `core-site.xml`, `hdfs-site.xml` and `yarn-site.xml` into a designated directory, e.g. `/opt/graph-tutorials/conf`.
- `cd /opt/graph-tutorials` and run:

```shell
java -cp lib/*:conf:graph-tutorials-1.0-SNAPSHOT.jar com.woople.tinkerpop.gremlin.HadoopGraphSparkComputerDemo hadoop-graphson.properties
```
Summary
The problems encountered while debugging were mainly Hadoop configuration files not being picked up, jar conflicts, and missing classes. Note in particular that the Spark version must be chosen to match the TinkerPop version. For the complete example, see graph-tutorials.