Prerequisite: the Hadoop cluster has already been deployed.
Hadoop cluster: Hadoop 2.6.5, Spark 2.3.0
*.*.*.1 hadoop1
*.*.*.2 hadoop2
*.*.*.3 hadoop3
*.*.*.4 hadoop4
Windows 10 environment preparation
Install Java and set the Java environment variables; avoid spaces in the Java installation path!
Environment: PyCharm 2019, Python 3
Install the pyspark package locally with pip:
pip install pyspark==2.3.0
Note: the pyspark version must match the Spark version on the server (2.3 here); otherwise you will get errors (see section 5.3).
1. Server environment deployment
1.1 Make sure the pyspark environment is deployed on the server
Typing pyspark on the server command line should open the Spark shell normally:
1.2 Deploy the jar packages pyspark needs to access HBase
On every node, create a folder named hbase under the jars folder of the Spark installation directory:
/usr/local/spark-2.3.0/jars
Then copy into that hbase folder all jars starting with hbase from the lib folder of the HBase installation directory, plus guava-12.0.1.jar, protobuf-java-2.5.0.jar and htrace-core-3.1.0-incubating.jar:
Note: Spark 2.x no longer ships the jar that converts HBase data into a form Python can read, so it has to be downloaded separately.
Download spark-example-1.6.0.jar (the Spark examples jar that contains the Python converters) and place it in: /usr/local/spark-2.3.0/jars/hbase
Then open spark-env.sh with vim and tell Spark where to find the HBase-related jar files by adding the following line at the top of the file (adjust the paths to your installation; the last entry must point to the hbase folder created above):
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath):$(/usr/local/hbase/bin/hbase classpath):/usr/local/spark-2.3.0/jars/hbase/*
You can now test whether the server-side environment is complete:
Run the pyspark command to enter the Spark shell and type the following code line by line:
```
from pyspark.sql import SparkSession

host = 'xxx.xxx.x.1'
table = 'demotable'
spark_host = "spark://xxx.xxx.x.1:7077"
# In the pyspark shell a SparkSession already exists; getOrCreate() reuses it
spark = SparkSession.builder.master(spark_host).appName("ps").getOrCreate()
sc = spark.sparkContext
# HBase connection settings: ZooKeeper quorum and the table to scan
conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
hbase_rdd = sc.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat",
                               "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                               "org.apache.hadoop.hbase.client.Result",
                               keyConverter=keyConv, valueConverter=valueConv, conf=conf)
hbase_rdd.cache()
count = hbase_rdd.count()
output = hbase_rdd.collect()
for (k, v) in output:
    print(k, v)
```
Check for errors; if everything is working, the rows of demotable should be printed.
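Each element returned is a (row key, value string) pair. With HBaseResultToStringConverter from the spark-examples jar, the value string normally contains one JSON document per cell, joined with newlines. A minimal parsing sketch under that assumption (parse_row is a hypothetical helper, and output is the list collected in the block above):
```
import json

# Hypothetical helper: split one row's value string into per-cell dicts.
# HBaseResultToStringConverter emits one JSON document per cell, joined with newlines.
def parse_row(value_string):
    return [json.loads(cell) for cell in value_string.split("\n")]

for row_key, value_string in output:   # `output` comes from the block above
    for cell in parse_row(value_string):
        print(row_key, cell["columnFamily"], cell["qualifier"], cell["value"])
```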
2. Windows 10 local environment deployment
Edit the Windows hosts file (C:\Windows\System32\drivers\etc\hosts) and append the following entries.
The IP addresses and hostnames must match the configuration of the server cluster.
XXX.XXX.x.1 hadoop1
XXX.XXX.x.2 hadoop2
XXX.XXX.x.3 hadoop3
XXX.XXX.x.4 hadoop4
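A quick way to check that the entries are picked up (a small sketch, assuming the four hostnames above) is to resolve them from Python:
```
import socket

# Each name should resolve to the cluster IP configured in the hosts file;
# a socket.gaierror means the entry is missing or misspelled.
for name in ["hadoop1", "hadoop2", "hadoop3", "hadoop4"]:
    print(name, socket.gethostbyname(name))
```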
2.1 Download the Spark distribution from the official archive
https://archive.apache.org/dist/spark/spark-2.3.0/
Extract it to a local directory whose path contains no spaces, e.g. D:\ProgramData\spark-2.3.0-bin-hadoop2.6
Copy from the lib folder of the server's HBase installation all jars starting with hbase, plus guava-12.0.1.jar, protobuf-java-2.5.0.jar and htrace-core-3.1.0-incubating.jar, into %SPARK_HOME%\jars, together with the downloaded spark-examples jar:
Create the environment variable SPARK_HOME pointing to the extracted directory and add %SPARK_HOME%\bin to Path:
2.2 Local Hadoop environment deployment
Download hadoop-2.6.5.tar.gz from the official site (or copy it from the server) and extract it locally, giving a directory such as:
D:\ProgramData\hadoop-2.6.5
Download the Hadoop native library hadoop-native-64-2.6.0.tar, extract it, and put the contents into D:\ProgramData\hadoop-2.6.5\lib:
Download hadoop-common-2.2.0-bin-master from the following link:
https://github.com/srccodes/hadoop-common-2.2.0-bin
Extract it and copy the contents of its bin folder into D:\ProgramData\hadoop-2.6.5\bin.
Create the environment variable HADOOP_HOME pointing to D:\ProgramData\hadoop-2.6.5:
Add %HADOOP_HOME%\bin to Path:
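Before moving on to PyCharm, a small sanity check can save time; the sketch below only verifies that the variables from sections 2.1 and 2.2 are visible and that winutils.exe is in place (see also section 5.1):
```
import os

# Print the environment variables sections 2.1 and 2.2 rely on
for var in ("SPARK_HOME", "HADOOP_HOME"):
    print(var, "=", os.environ.get(var))

# winutils.exe must sit in %HADOOP_HOME%\bin
winutils = os.path.join(os.environ.get("HADOOP_HOME", ""), "bin", "winutils.exe")
print("winutils.exe found:", os.path.isfile(winutils))
```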
3. PyCharm project setup
Create a new project and a Python file with the following code; it reads the HBase table 'demotable' and should print the table's rows.
```
from pyspark.sql import SparkSession

spark_local = "local[*]"
host = 'XXX.XXX.X.X'
# table name
table = 'demotable'
# create the Spark session
spark = SparkSession.builder.master(spark_local).appName("test").getOrCreate()
hbaseconf = {"hbase.zookeeper.quorum": host,
             "hbase.mapreduce.inputtable": table,
             # optional scan range (start/stop row keys)
             # "hbase.mapreduce.scan.row.start": row,
             # "hbase.mapreduce.scan.row.stop": row1
             }
keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
# build the RDD from the HBase table
hbase_rdd = spark.sparkContext.newAPIHadoopRDD("org.apache.hadoop.hbase.mapreduce.TableInputFormat",
                                               "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                                               "org.apache.hadoop.hbase.client.Result",
                                               keyConverter=keyConv, valueConverter=valueConv, conf=hbaseconf)
hbase_rdd.cache()
count = hbase_rdd.count()
output = hbase_rdd.collect()
for (k, v) in output:
    print(k, v)
```
Run the code; the rows of the table should be printed.
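The commented-out hbase.mapreduce.scan.row.start / hbase.mapreduce.scan.row.stop keys restrict the scan to a row-key range. A minimal sketch of that, reusing spark, host, table, keyConv and valueConv from the code above; the row keys '1' and '5' are made-up examples:
```
# Example row keys; adjust to your data. The stop row is exclusive.
range_conf = {"hbase.zookeeper.quorum": host,
              "hbase.mapreduce.inputtable": table,
              "hbase.mapreduce.scan.row.start": "1",
              "hbase.mapreduce.scan.row.stop": "5"}
range_rdd = spark.sparkContext.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv, valueConverter=valueConv, conf=range_conf)
print(range_rdd.collect())
```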
4. Writing data to HBase
```
# Writing to HBase: the table studentSparkdemo with column family info has already been created
rawData = ['5,info,name,Xinqing', '5,info,gender,女']
# each element becomes ( row key , [ row key , column family , column name , value ] )
keyConv_w = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv_w = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
conf = {"hbase.zookeeper.quorum": "XXX.XXX.X.X",
        "hbase.mapred.outputtable": "studentSparkdemo",
        "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
spark.sparkContext.parallelize(rawData)\
    .map(lambda x: (x[0], x.split(',')))\
    .saveAsNewAPIHadoopDataset(
        conf=conf,
        keyConverter=keyConv_w,
        valueConverter=valueConv_w)
```
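To check that the puts landed, the same read pattern as in section 3 can be pointed at studentSparkdemo; a sketch reusing spark, keyConv and valueConv from the section 3 code:
```
# Read the rows back from studentSparkdemo using the section 3 converters
verify_conf = {"hbase.zookeeper.quorum": "XXX.XXX.X.X",
               "hbase.mapreduce.inputtable": "studentSparkdemo"}
verify_rdd = spark.sparkContext.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConv, valueConverter=valueConv, conf=verify_conf)
for (k, v) in verify_rdd.collect():
    print(k, v)
```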
5. Problems encountered
5.1 winutils
Spark on Windows needs winutils.exe: make sure HADOOP_HOME is set and that the bin folder copied in section 2.2 (which contains winutils.exe) is on the Path.
5.2 A jar is repeatedly reported as missing:
java.lang.ClassNotFoundException: com.yammer.metrics.core.Gauge
The error occurs because metrics-core-2.2.0.jar cannot be found, while D:\ProgramData\spark-2.3.0-bin-hadoop2.6\jars only contains a different version, metrics-core-3.1.5.jar.
Fix: locate metrics-core-2.2.0.jar in the server's HBase installation (/usr/local/hadoop/hbase/lib/metrics-core-2.2.0.jar), copy it into D:\ProgramData\spark-2.3.0-bin-hadoop2.6\jars, and restart PyCharm; the problem goes away.
5.3 Exception: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
Cause: the local pyspark version does not match the Spark version; reinstall the matching version (pip install pyspark==2.3.0).
```
2019-11-13 16:19:18 ERROR SparkUncaughtExceptionHandler:91 - Uncaught exception in thread Thread[main,5,main]
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    ...............
Traceback (most recent call last):
  File "D:/virtualenvs/enssiEnv1/EnssiclinicPath/HbaseOperate/views.py", line 60, in
    spark = SparkSession.builder.master("local[*]").appName("patient").getOrCreate()
  File "D:\virtualenvs\enssiEnv1\lib\site-packages\pyspark\sql\session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "D:\virtualenvs\enssiEnv1\lib\site-packages\pyspark\context.py", line 367, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "D:\virtualenvs\enssiEnv1\lib\site-packages\pyspark\context.py", line 133, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "D:\virtualenvs\enssiEnv1\lib\site-packages\pyspark\context.py", line 316, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "D:\virtualenvs\enssiEnv1\lib\site-packages\pyspark\java_gateway.py", line 46, in launch_gateway
    return _launch_gateway(conf)
  File "D:\virtualenvs\enssiEnv1\lib\site-packages\pyspark\java_gateway.py", line 108, in _launch_gateway
    raise Exception("Java gateway process exited before sending its port number")
Exception: Java gateway process exited before sending its port number
Process finished with exit code 1
```
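A quick way to catch the mismatch before creating the session is to compare the installed package version with the server's Spark version (2.3.0 here):
```
import pyspark

# The local package must match the server's Spark version (2.3.x here);
# if it does not, reinstall with: pip install pyspark==2.3.0
print(pyspark.__version__)
assert pyspark.__version__.startswith("2.3"), "pyspark/Spark version mismatch"
```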