PySpark Error Troubleshooting Notes

1. Starting Spark reports "JAVA_HOME not set"

(1) Download jdk-8u291-linux-x64.tar.gz

(2) Extract it to /usr/local/java

(3) Add the following to ~/.bashrc:

export JAVA_HOME="/usr/local/java/jdk1.8.0_291"
export PATH=$JAVA_HOME/bin:$PATH

(4) Run source ~/.bashrc to apply the changes

(5) Verify the installation:

(py3_spark) [root@100-020-gpuserver controller]# java -version
java version "1.8.0_291"
Java(TM) SE Runtime Environment (build 1.8.0_291-b10)
Java HotSpot(TM) 64-Bit Server VM (build 25.291-b10, mixed mode)
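
If editing ~/.bashrc is not an option (for example, when the script runs under a different account), JAVA_HOME can also be set from the Python script itself before the SparkSession is created, since PySpark's gateway launcher inherits the process environment. A minimal sketch, assuming the JDK path from step (2); the app name and master setting are illustrative:

import os

# Must run before any SparkSession/SparkContext is created:
# the spark-submit launcher reads JAVA_HOME from the inherited environment.
os.environ["JAVA_HOME"] = "/usr/local/java/jdk1.8.0_291"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ.get("PATH", "")

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ManagerEval").master("local").getOrCreate()
print(spark.version)
spark.stop()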

2. Exception: Java gateway process exited before sending the driver its port number

(py3_spark) [root@100-020-gpuserver controller]# python conn_spark.py
Exception in thread "main" java.lang.ExceptionInInitializerError
        at org.apache.spark.SparkConf$.<init>(SparkConf.scala:668)
        at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala)
        at org.apache.spark.SparkConf.set(SparkConf.scala:94)
        at org.apache.spark.SparkConf.set(SparkConf.scala:83)
        at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:367)
        at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$1.apply(SparkSubmit.scala:367)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
        at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:367)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:170)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.UnknownHostException: 100-020-gpuserver: 100-020-gpuserver: Name or service not known
        at java.net.InetAddress.getLocalHost(InetAddress.java:1506)
        at org.apache.spark.util.Utils$.findLocalInetAddress(Utils.scala:911)
        at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress$lzycompute(Utils.scala:904)
        at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress(Utils.scala:904)
        at org.apache.spark.util.Utils$$anonfun$localCanonicalHostName$1.apply(Utils.scala:961)
        at org.apache.spark.util.Utils$$anonfun$localCanonicalHostName$1.apply(Utils.scala:961)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.util.Utils$.localCanonicalHostName(Utils.scala:961)
        at org.apache.spark.internal.config.package$.<init>(package.scala:282)
        at org.apache.spark.internal.config.package$.<clinit>(package.scala)
        ... 15 more
Caused by: java.net.UnknownHostException: 100-020-gpuserver: Name or service not known
        at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
        at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
        at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
        at java.net.InetAddress.getLocalHost(InetAddress.java:1501)
        ... 24 more
Traceback (most recent call last):
  File "conn_spark.py", line 26, in 
    spark = SparkSession.builder.appName('ManagerEval').master('local').getOrCreate()
  File "/root/anaconda3/lib/python3.7/site-packages/pyspark/sql/session.py", line 173, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/root/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 331, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/root/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 115, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/root/anaconda3/lib/python3.7/site-packages/pyspark/context.py", line 280, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/root/anaconda3/lib/python3.7/site-packages/pyspark/java_gateway.py", line 95, in launch_gateway
    raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number

Analysis: the root cause is java.net.UnknownHostException: 100-020-gpuserver: 100-020-gpuserver: Name or service not known. The JVM cannot resolve the machine's own hostname, so the Java gateway exits before it can send the driver its port number.

Solution: map the hostname 100-020-gpuserver to 127.0.0.1 in /etc/hosts:

127.0.0.1   100-020-gpuserver localhost.localdomain localhost4 localhost4.localdomain4
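
After editing /etc/hosts, it is worth confirming that the hostname resolves before launching Spark again, since Spark performs the same lookup on startup (Utils.findLocalInetAddress calls InetAddress.getLocalHost, as the traceback shows). A minimal standard-library sketch:

import socket

# Exercises the same hostname lookup that failed in the traceback above.
hostname = socket.gethostname()
try:
    ip = socket.gethostbyname(hostname)
    print(f"{hostname} resolves to {ip}")
except socket.gaierror as err:
    print(f"{hostname} does not resolve ({err}); check /etc/hosts")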

3. py4j.protocol.Py4JJavaError: An error occurred while calling o31.jdbc

(py3_spark) [root@100-020-gpuserver controller]# python conn_spark.py
2021-10-09 15:47:10 WARN  Utils:66 - Your hostname, 100-020-gpuserver resolves to a loopback address: 127.0.0.1; using 172.19.100.20 instead (on interface eth0)
2021-10-09 15:47:10 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2021-10-09 15:47:10 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "conn_spark.py", line 30, in 
    piggery_data = spark.read.jdbc(url=url, table=sow_piggery_informations_tb, properties=prop)
  File "/data/.virtualenvs/py3_spark/lib/python3.7/site-packages/pyspark/sql/readwriter.py", line 525, in jdbc
    return self._df(self._jreader.jdbc(url, table, jprop))
  File "/data/.virtualenvs/py3_spark/lib/python3.7/site-packages/py4j/java_gateway.py", line 1160, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/data/.virtualenvs/py3_spark/lib/python3.7/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/data/.virtualenvs/py3_spark/lib/python3.7/site-packages/py4j/protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o31.jdbc.
: java.lang.ClassNotFoundException: org.postgresql.Driver
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
        at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:45)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:79)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$6.apply(JDBCOptions.scala:79)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:79)
        at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:35)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:34)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
        at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:254)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)

Analysis: java.lang.ClassNotFoundException: org.postgresql.Driver means the PostgreSQL JDBC driver jar is missing from Spark's classpath.

Solution:

(1) Download the matching driver jar from the PostgreSQL website, e.g. postgresql-9.4.1212.jar

(2) Copy it into the jars directory of the pyspark Python package, e.g. /data/.virtualenvs/py3_spark/lib/python3.7/site-packages/pyspark/jars/
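
Alternatively, the jar can be handed to Spark at startup without copying it into site-packages, via the spark.jars option. A minimal sketch; the jar location, JDBC URL, table name, and credentials below are placeholders to adapt:

from pyspark.sql import SparkSession

# spark.jars takes a comma-separated list of local jar paths and must be
# set before the session is created.
spark = (SparkSession.builder
         .appName("ManagerEval")
         .master("local")
         .config("spark.jars", "/opt/jars/postgresql-9.4.1212.jar")
         .getOrCreate())

# Mirrors the spark.read.jdbc call from the traceback above;
# host, database, table, and credentials are placeholders.
url = "jdbc:postgresql://127.0.0.1:5432/mydb"
prop = {"user": "postgres", "password": "secret", "driver": "org.postgresql.Driver"}
df = spark.read.jdbc(url=url, table="sow_piggery_informations", properties=prop)
df.show(5)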
