1. Starting Spark fails with "JAVA_HOME not set"
(1) Download jdk-8u291-linux-x64.tar.gz
(2) Extract it to /usr/local/java
(3) Add the following to ~/.bashrc:
export JAVA_HOME="/usr/local/java/jdk1.8.0_291"
export PATH=$JAVA_HOME/bin:$PATH
(4) Run source ~/.bashrc
(5) Verify the installation (see also the Python-side fallback sketched after the output):
(py3_spark) [root@100-020-gpuserver controller]# java -version
java version "1.8.0_291"
Java(TM) SE Runtime Environment (build 1.8.0_291-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.291-b10, mixed mode)
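If the script is launched from an environment that never sources ~/.bashrc (cron jobs, some IDE runners), JAVA_HOME can also be exported from the Python side before the SparkContext starts. A minimal sketch, assuming the JDK path installed above:

import os

# Fallback for environments that skip ~/.bashrc: point PySpark at the JDK
# before any Spark object is created. Adjust the path if yours differs.
os.environ.setdefault("JAVA_HOME", "/usr/local/java/jdk1.8.0_291")
os.environ["PATH"] = os.path.join(os.environ["JAVA_HOME"], "bin") + os.pathsep + os.environ["PATH"]

from pyspark.sql import SparkSession

# The gateway subprocess inherits the environment set above.
spark = SparkSession.builder.appName("ManagerEval").master("local").getOrCreate()
print(spark.version)
spark.stop()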
2. Exception: Java gateway process exited before sending the driver its port number
(py3_spark) [root@100-020-gpuserver controller]# python conn_spark.py
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.SparkConf$.&lt;init&gt;(SparkConf.scala:668)
at org.apache.spark.SparkConf$.&lt;clinit&gt;(SparkConf.scala)
at org.apache.spark.SparkConf.set(SparkConf.scala:94)
at org.apache.spark.SparkConf.set(SparkConf.scala:83)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:367)
at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:367)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:367)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:170)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.UnknownHostException: 100-020-gpuserver: 100-020-gpuserver: Name or service not known
at java.net.InetAddress.getLocalHost(InetAddress.java:1506)
at org.apache.spark.util.Utils$.findLocalInetAddress(Utils.scala:911)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress$lzycompute(Utils.scala:904)
at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress(Utils.scala:904)
at org.apache.spark.util.Utils$$anonfun$localCanonicalHostName$1.apply(Utils.scala:961)
at org.apache.spark.util.Utils$$anonfun$localCanonicalHostName$1.apply(Utils.scala:961)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.localCanonicalHostName(Utils.scala:961)
at org.apache.spark.internal.config.package$.&lt;init&gt;(package.scala:282)
at org.apache.spark.internal.config.package$.&lt;clinit&gt;(package.scala)
... 15 more
Caused by: java.net.UnknownHostException: 100-020-gpuserver: Name or service not known
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
at java.net.InetAddress.getLocalHost(InetAddress.java:1501)
... 24 more
Traceback (most recent call last):
File "conn_spark.py", line 26, in
spark = SparkSession.builder.appName('ManagerEval').master('local').getOrCreate()
File "/root/anaconda3/lib/python3.7/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/root/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 331, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/root/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 115, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/root/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 280, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/root/anaconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 95, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
Analysis: java.net.UnknownHostException: 100-020-gpuserver: 100-020-gpuserver: Name or service not known — the JVM cannot resolve the machine's own hostname.
Solution: add 100-020-gpuserver as a hostname for the 127.0.0.1 entry in /etc/hosts:
127.0.0.1 100-020-gpuserver localhost.localdomain localhost4 localhost4.localdomain4
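To confirm the fix, the lookup Spark performs at startup (InetAddress.getLocalHost on the JVM side) can be approximated from Python with the standard socket module; a minimal check:

import socket

# Spark resolves the local hostname when it starts. If this lookup still
# raises socket.gaierror ("Name or service not known"), the /etc/hosts
# entry above has not taken effect.
hostname = socket.gethostname()          # should print 100-020-gpuserver
print(hostname, "->", socket.gethostbyname(hostname))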
3. py4j.protocol.Py4JJavaError: An error occurred while calling o31.jdbc.
(py3_spark) [root@100-020-gpuserver controller]# python conn_spark.py
2021-10-09 15:47:10 WARN Utils:66 - Your hostname, 100-020-gpuserver resolves to a loopback address: 127.0.0.1; using 172.19.100.20 instead (on interface eth0)
2021-10-09 15:47:10 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2021-10-09 15:47:10 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "conn_spark.py", line 30, in
piggery_data = spark.read.jdbc(url=url, table=sow_piggery_informations_tb, properties=prop)
File "/data/.virtualenvs/py3_spark/lib/python3.7/site-packages/pyspark/sql/readwriter.py", line 525, in jdbc
return self._df(self._jreader.jdbc(url, table, jprop))
File "/data/.virtualenvs/py3_spark/lib/python3.7/site-packages/py4j/java_gateway.py", line 1160, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/data/.virtualenvs/py3_spark/lib/python3.7/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/data/.virtualenvs/py3_spark/lib/python3.7/site-packages/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o31.jdbc.
: java.lang.ClassNotFoundException: org.postgresql.Driver
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:45)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:79)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:79)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.&lt;init&gt;(JDBCOptions.scala:79)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.&lt;init&gt;(JDBCOptions.scala:35)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:34)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:254)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Analysis: java.lang.ClassNotFoundException: org.postgresql.Driver — the PostgreSQL JDBC driver jar is missing from Spark's classpath.
Solution: (1) Download the matching driver jar from the official PostgreSQL site, e.g. postgresql-9.4.1212.jar
(2) Move it into the jars directory of the pyspark package, e.g. /data/.virtualenvs/py3_spark/lib/python3.7/site-packages/pyspark/jars/ (an alternative that avoids copying the jar is sketched below)
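Instead of copying the jar into pyspark/jars/, the driver can also be handed to Spark when the session is built, via the spark.jars config and a "driver" entry in the JDBC properties. A minimal sketch — the jar path, JDBC URL and credentials are placeholders; only the table name comes from the traceback above:

from pyspark.sql import SparkSession

# Point Spark at the PostgreSQL driver jar explicitly (placeholder path);
# spark.jars must be set before the session (and its JVM) is created.
spark = (SparkSession.builder
         .appName("ManagerEval")
         .master("local")
         .config("spark.jars", "/path/to/postgresql-9.4.1212.jar")
         .getOrCreate())

url = "jdbc:postgresql://db-host:5432/mydb"   # placeholder host/database
prop = {"user": "postgres", "password": "secret", "driver": "org.postgresql.Driver"}

# Naming the driver class in properties avoids relying on auto-detection.
piggery_data = spark.read.jdbc(url=url, table="sow_piggery_informations_tb", properties=prop)
piggery_data.show(5)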