Compiling Spark against Hadoop CDH

Spark 2.4.2 download:

Official address: https://archive.apache.org/dist/spark/spark-2.4.2/spark-2.4.2.tgz
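If the tarball is not already on the machine, it can be fetched straight from the URL above, for example:

[hadoop@hadoop001 software]$ wget https://archive.apache.org/dist/spark/spark-2.4.2/spark-2.4.2.tgz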

Documentation for building Spark from source (see the official docs):

http://spark.apache.org/docs/latest/building-spark.html

Prerequisites for building Spark from source

Software and versions used:

Hadoop: 2.6.0-cdh5.7.0
Scala: 2.11.12
Maven: 3.6.1
JDK: jdk1.8.0_45
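Before starting, it is worth confirming the toolchain on the build machine (a quick check, assuming all four tools are already installed and on the PATH):

[hadoop@hadoop001 ~]$ java -version      # should report 1.8.x
[hadoop@hadoop001 ~]$ mvn -v             # should report 3.x
[hadoop@hadoop001 ~]$ scala -version     # should report 2.11.x
[hadoop@hadoop001 ~]$ hadoop version     # should report 2.6.0-cdh5.7.0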

Build and configuration:

1. Extract the Spark source:


[hadoop@hadoop001 software]$ ll spark-2.4.2.tgz

 

-rw-r--r--. 1 hadoop hadoop 16165557 4月  28 04:41 spark-2.4.2.tgz

 

[hadoop@hadoop001 software]$ tar -zxvf spark-2.4.2.tgz

 

[hadoop@hadoop001 software]$ cd spark-2.4.2

2. Hard-code the version numbers in dev/make-distribution.sh, so the build does not have to spend time looking them up through Maven (that lookup step is fairly slow):

GitHub location of the make-distribution.sh script:

https://github.com/apache/spark/blob/master/dev/make-distribution.sh


[hadoop@hadoop001 spark-2.4.2]$ vim dev/make-distribution.sh

# Replace this block:

VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null\
    | grep -v "INFO"\
    | grep -v "WARNING"\
    | tail -n 1)
SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 2>/dev/null\
    | grep -v "INFO"\
    | grep -v "WARNING"\
    | tail -n 1)
SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null\
    | grep -v "INFO"\
    | grep -v "WARNING"\
    | tail -n 1)
SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
    | grep -v "INFO"\
    | grep -v "WARNING"\
    | fgrep --count "hive";\
    # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing\
    # because we use "set -o pipefail"
    echo -n)

 

# with these hard-coded values:

VERSION=2.4.2

SCALA_VERSION=2.11

SPARK_HADOOP_VERSION=2.6.0-cdh5.7.0

SPARK_HIVE=1

3. Modify the pom.xml file

To build against CDH, the Cloudera repository must be added (inside the existing <repositories> element):


[hadoop@hadoop614 spark-2.4.2]$ vim pom.xml

<repository>
  <id>cloudera</id>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
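As an optional sanity check before kicking off the long build, you can confirm that the Cloudera repository is reachable and actually serves the CDH Hadoop artifacts (the hadoop-client coordinates below are only an illustrative choice):

[hadoop@hadoop001 spark-2.4.2]$ mvn -q dependency:get \
    -Dartifact=org.apache.hadoop:hadoop-client:2.6.0-cdh5.7.0 \
    -DremoteRepositories=https://repository.cloudera.com/artifactory/cloudera-repos/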

4. Build command

Looking at pom.xml, you can see that if the Hadoop and YARN versions are not specified explicitly at build time, the build falls back to the default hadoop.version and yarn.version defined there.

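The defaults can be inspected directly from the source tree; the values shown in the comments below are only what the Spark 2.4 pom.xml is expected to contain and may differ slightly between releases:

[hadoop@hadoop001 spark-2.4.2]$ grep -E '<(hadoop|yarn)\.version>' pom.xml
# typically prints something like:
#   <hadoop.version>2.6.5</hadoop.version>
#   <yarn.version>${hadoop.version}</yarn.version>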

 


[hadoop@hadoop001 spark-2.4.2]$ pwd

/home/hadoop/software/spark-2.4.2

[hadoop@hadoop614 spark-2.4.2]$ ./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 -Phive -Phive-thriftserver -Pyarn -Pkubernetes

 

 

--name: adds 2.6.0-cdh5.7.0 to the name of the generated package, so it is obvious which Hadoop version it was built for.

-Phadoop-2.6: selects the hadoop-2.6 build profile (-P activates a Maven profile).

-Dhadoop.version=2.6.0-cdh5.7.0: -D sets a Maven property value, here pinning the exact Hadoop version to build against; if omitted, the default version from pom.xml is used.

-Phive: enables Hive support.

-Phive-thriftserver: enables the Thrift JDBC/ODBC server, so the build supports JDBC connections.

-Pyarn: enables YARN support; the YARN version follows the Hadoop version, so changing -Dhadoop.version changes it as well (it can also be overridden separately with -Dyarn.version).
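If the build finishes successfully, the distribution tarball is written to the top of the source tree; the file name follows from the Spark version plus the --name value:

[hadoop@hadoop001 spark-2.4.2]$ ls spark-*-bin-*.tgz
spark-2.4.2-bin-2.6.0-cdh5.7.0.tgz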

In addition, MAVEN_OPTS must be set before compiling, otherwise the compile will fail with out-of-memory errors. The official documentation explains this as follows (a combined example is given after the quoted note):

Building Apache Spark

Apache Maven

The Maven-based build is the build of reference for Apache Spark. Building Spark using Maven requires Maven 3.5.4 and Java 8. Note that support for Java 7 was removed as of Spark 2.2.0.

Setting up Maven’s Memory Usage

You’ll need to configure Maven to use more memory than usual by setting MAVEN_OPTS:

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

(The ReservedCodeCacheSize setting is optional but recommended.) If you don’t add these parameters to MAVEN_OPTS, you may see errors and warnings like the following:

[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.12/classes...
[ERROR] Java heap space -> [Help 1]

You can fix these problems by setting the MAVEN_OPTS variable as discussed before.

Note:

  • If using build/mvn with no MAVEN_OPTS set, the script will automatically add the above options to the MAVEN_OPTS environment variable.
  • The test phase of the Spark build will automatically add these options to MAVEN_OPTS, even when not using build/mvn.
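Putting the two pieces together, a typical invocation on this setup looks roughly like this:

[hadoop@hadoop001 spark-2.4.2]$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
[hadoop@hadoop001 spark-2.4.2]$ ./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz \
    -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 -Phive -Phive-thriftserver -Pyarn -Pkubernetes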

 

Extract and deploy

1. Extract


[hadoop@hadoop001 spark-2.4.2]$ ll spark-2.4.2-bin-2.6.0-cdh5.7.0.tgz

-rw-rw-r--. 1 hadoop hadoop 231193116 4月  28 06:32 spark-2.4.2-bin-2.6.0-cdh5.7.0.tgz

[hadoop@hadoop001 spark-2.4.2]$ pwd

/home/hadoop/software/spark-2.4.2

[hadoop@hadoop001 spark-2.4.2]$ tar -zxvf spark-2.4.2-bin-2.6.0-cdh5.7.0.tgz -C ~/app

[hadoop@hadoop001 spark-2.4.2]$ cd ~/app

[hadoop@hadoop001 app]$ ls -ld spark-2.4.2-bin-2.6.0-cdh5.7.0/

drwxrwxr-x. 11 hadoop hadoop 4096 4月  28 06:31 spark-2.4.2-bin-2.6.0-cdh5.7.0/

2. Configure environment variables


[hadoop@hadoop001 app]$ vim ~/.bash_profile

 

export SPARK_HOME=/home/hadoop/app/spark-2.4.2-bin-2.6.0-cdh5.7.0

export PATH=${SPARK_HOME}/bin:$PATH

 

[hadoop@hadoop001 app]$ source ~/.bash_profile
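A quick way to confirm that the new installation is the one picked up from PATH (the expected output is simply the bin directory under SPARK_HOME):

[hadoop@hadoop001 app]$ which spark-shell
/home/hadoop/app/spark-2.4.2-bin-2.6.0-cdh5.7.0/bin/spark-shell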

Start Spark


[hadoop@hadoop001 spark-2.4.2]$ ./spark-shell

19/04/28 06:44:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties

Setting default log level to "WARN".

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Spark context Web UI available at http://hadoop614:4040

Spark context available as 'sc' (master = local[*], app id = local-1556405067469).

Spark session available as 'spark'.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.2
      /_/

 

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45)

Type in expressions to have them evaluated.

Type :help for more information.

 

scala>

master: the run mode of this Spark session

local[*]: means Spark is running locally, using as many worker threads as there are cores on the machine.
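To run against the CDH cluster instead of local mode, the shell can also be started with a YARN master. This is only a sketch: it relies on the YARN support built in with -Pyarn and assumes HADOOP_CONF_DIR points at the cluster's Hadoop configuration directory (the path below is a placeholder, not taken from this setup):

# point Spark at the cluster's Hadoop/YARN configuration (example path only)
[hadoop@hadoop001 ~]$ export HADOOP_CONF_DIR=/path/to/hadoop/etc/hadoop
[hadoop@hadoop001 ~]$ spark-shell --master yarn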

 
