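CarbonData is an indexed columnar data format for fast analytics on big-data platforms such as Hadoop and Spark. Put simply: it is a data format.
Because CarbonData was open-sourced only recently, the official documentation is still sparse and inconsistent.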
OS: CentOS (or another Unix-like OS)
Apache Maven (3.3 or later recommended)
Oracle Java 7 or 8
Apache Thrift 0.9.3
All of the above are required.
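Before building, it can help to confirm that the required tools are actually on the PATH. The following is a minimal sketch; the function name check_tools is hypothetical and not part of CarbonData, and it only checks that each command exists, not that its version matches the list above.

```shell
#!/bin/sh
# check_tools <tool>...
# Prints one "missing: <tool>" line for every tool not found on PATH
# and returns non-zero if at least one tool is missing.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# Typical use before a build (versions still have to be compared by eye):
#   check_tools git mvn java thrift && mvn -v && java -version && thrift -version
```

If the function prints nothing, every listed tool was found; the version commands in the comment then show whether Maven, Java and Thrift match the requirements.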
Get the CarbonData source with git,
or download a released version from the official site.
URL: https://dist.apache.org/repos/dist/release/carbondata/1.1.0/
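Change into the carbondata directory and run the build command.
Build without running the tests. By default, CarbonData is built against Spark 1.6.2:
mvn -DskipTests clean package (default)
CarbonData can also be built against other Spark versions (the currently supported ones are listed below).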
mvn -DskipTests -Pspark-1.5 -Dspark.version=1.5.1 clean package
mvn -DskipTests -Pspark-1.5 -Dspark.version=1.5.2 clean package
mvn -DskipTests -Pspark-1.6 -Dspark.version=1.6.1 clean package
mvn -DskipTests -Pspark-1.6 -Dspark.version=1.6.2 clean package
mvn -DskipTests -Pspark-1.6 -Dspark.version=1.6.3 clean package
mvn -DskipTests -Pspark-2.1 -Dspark.version=2.1.0 clean package
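The profile and version pairs above follow a simple pattern, which can be sketched as a small helper. The function name carbon_build_cmd is hypothetical; it only knows the combinations listed above and prints the command instead of running it.

```shell
#!/bin/sh
# carbon_build_cmd <spark-version>
# Prints the Maven command for one of the Spark versions listed above.
# Returns 1 (with a message on stderr) for any other version.
carbon_build_cmd() {
    spark_version="$1"
    case "$spark_version" in
        1.5.1|1.5.2)       profile="spark-1.5" ;;
        1.6.1|1.6.2|1.6.3) profile="spark-1.6" ;;
        2.1.0)             profile="spark-2.1" ;;
        *)
            echo "unsupported Spark version: $spark_version" >&2
            return 1
            ;;
    esac
    echo "mvn -DskipTests -P$profile -Dspark.version=$spark_version clean package"
}

# Example: print the command used in the transcript that follows.
#   carbon_build_cmd 2.1.0
```

Printing rather than executing keeps the helper safe to try; the output can be reviewed and then pasted into the shell.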
lcc@lcc carbondata-parent-1.1.0$ pwd
/Users/lcc/soft/carbondata/carbondata-parent-1.1.0
lcc@lcc carbondata-parent-1.1.0$ mvn -DskipTests -Pspark-2.1 -Dspark.version=2.1.0 clean package
...
[INFO] Reactor Summary:
[INFO]
[INFO] Apache CarbonData :: Parent ........................ SUCCESS [ 6.080 s]
[INFO] Apache CarbonData :: Common ........................ SUCCESS [ 13.184 s]
[INFO] Apache CarbonData :: Core .......................... SUCCESS [ 29.356 s]
[INFO] Apache CarbonData :: Processing .................... SUCCESS [ 10.520 s]
[INFO] Apache CarbonData :: Hadoop ........................ SUCCESS [ 7.743 s]
[INFO] Apache CarbonData :: Spark Common .................. SUCCESS [01:45 min]
[INFO] Apache CarbonData :: Spark2 ........................ SUCCESS [01:58 min]
[INFO] Apache CarbonData :: Spark Common Test ............. SUCCESS [ 46.705 s]
[INFO] Apache CarbonData :: Assembly ...................... SUCCESS [ 6.791 s]
[INFO] Apache CarbonData :: Spark2 Examples ............... SUCCESS [ 19.808 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 06:05 min
[INFO] Finished at: 2018-09-26T16:08:08+08:00
[INFO] Final Memory: 92M/810M
[INFO] ------------------------------------------------------------------------
lcc@lcc carbondata-parent-1.1.0$ ll assembly/target/scala-2.11/
total 19512
drwxr-xr-x 3 lcc staff 96 9 26 16:07 ./
drwxr-xr-x 5 lcc staff 160 9 26 16:07 ../
-rw-r--r-- 1 lcc staff 9986219 9 26 16:07 carbondata_2.11-1.1.0-shade-hadoop2.2.0.jar
The build succeeded on the first attempt.
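Build the CarbonData project, take the assembly jar from ./assembly/target/scala-2.1x/carbondata_xxx.jar and copy it to the $SPARK_HOME/carbonlib folder. Note: if the carbonlib folder does not exist under $SPARK_HOME, create it.
Add the carbonlib folder path to the Spark classpath. (Edit the $SPARK_HOME/conf/spark-env.sh file and append $SPARK_HOME/carbonlib/* to the existing value of SPARK_CLASSPATH.)
Copy the ./conf/carbon.properties.template file from the CarbonData repository to the $SPARK_HOME/conf/ folder and rename it to carbon.properties.
In the $SPARK_HOME/conf/spark-defaults.conf file, configure the properties listed below.
spark.driver.extraJavaOptions
-Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
spark.executor.extraJavaOptions
-Dcarbon.properties.filepath=$SPARK_HOME/conf/carbon.properties
Extra JVM options passed to the executors, for example GC settings or additional logging. Note: multiple values can be given, separated by spaces.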
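The copy steps described here can be sketched as one small shell function. The name install_carbon and its two arguments are hypothetical; the paths it uses are the ones named in the steps, and it deliberately leaves spark-env.sh and spark-defaults.conf alone, since those still need to be edited by hand.

```shell
#!/bin/sh
# install_carbon <carbondata-source-dir> <spark-home>
# Copies the assembly jar(s) into <spark-home>/carbonlib and the properties
# template into <spark-home>/conf/carbon.properties.
install_carbon() {
    carbon_src="$1"
    spark_home="$2"

    # Create carbonlib (and conf) if they do not exist yet.
    mkdir -p "$spark_home/carbonlib" "$spark_home/conf" || return 1

    # The scala-2.1x directory name depends on the Scala version of the build.
    found=0
    for jar in "$carbon_src"/assembly/target/scala-2.1*/carbondata_*.jar; do
        [ -f "$jar" ] || continue
        cp "$jar" "$spark_home/carbonlib/" || return 1
        found=1
    done
    if [ "$found" -eq 0 ]; then
        echo "no assembly jar found under $carbon_src/assembly/target" >&2
        return 1
    fi

    # Copy the template and rename it in one step.
    cp "$carbon_src/conf/carbon.properties.template" \
       "$spark_home/conf/carbon.properties"
}

# Example (paths are placeholders):
#   install_carbon /path/to/carbondata-parent-1.1.0 "$SPARK_HOME"
```

Using mkdir -p makes the function safe to run again, which a plain mkdir is not once the folder exists.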
lcc@lcc carbondata-parent-1.1.0$ mkdir /Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/carbonlib
lcc@lcc carbondata-parent-1.1.0$ cp assembly/target/scala-2.11/carbondata_2.11-1.1.0-shade-hadoop2.2.0.jar /Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/carbonlib
lcc@lcc carbondata-parent-1.1.0$
lcc@lcc carbondata-parent-1.1.0$ cp conf/carbon.properties.template /Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/conf/carbon.properties
lcc@lcc spark-2.0.1-bin-hadoop2.7$ vim conf/spark-env.sh
JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_131.jdk/Contents/Home
HADOOP_CONF_DIR=/Users/lcc/soft/hadoop/hadoop/etc/hadoop
SCALA_HOME=/Users/lcc/soft/scala/scala-2.12.6
SPARK_MASTER_HOST=lcc
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_CORES=1
SPARK_WORKER_PORT=7078
SPARK_WORKER_WEBUI_PORT=8081
SPARK_WORKER_INSTANCES=1
SPARK_WORKER_MEMORY=512M
#spark.executor.extraClassPath
#spark.driver.extraClassPath=/Users/lcc/IdeaProjects/spark-authorizer/spark-auth/target/*
SPARK_CLASSPATH=$SPARK_HOME/carbonlib/*
lcc@lcc spark-2.0.1-bin-hadoop2.7$ vim conf/spark-defaults.conf
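# No spaces around "=" in the -D option; the path is written out in full because spark-defaults.conf does not expand $SPARK_HOME
spark.driver.extraJavaOptions -Dcarbon.properties.filepath=/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/conf/carbon.properties
spark.executor.extraJavaOptions -Dcarbon.properties.filepath=/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/conf/carbon.properties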
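Two details are easy to get wrong in these entries: a JVM -D option must not contain spaces around "=", and, as far as I know, spark-defaults.conf is read as a plain properties file, so a literal $SPARK_HOME in it is not expanded. A minimal sketch that lets the shell expand the path and prints ready-to-append lines; the function name carbon_defaults_lines is hypothetical.

```shell
#!/bin/sh
# carbon_defaults_lines <spark-home>
# Prints the two spark-defaults.conf entries with the path already expanded.
carbon_defaults_lines() {
    spark_home="$1"
    opt="-Dcarbon.properties.filepath=$spark_home/conf/carbon.properties"
    printf 'spark.driver.extraJavaOptions %s\n' "$opt"
    printf 'spark.executor.extraJavaOptions %s\n' "$opt"
}

# Example: append the entries to the real file.
#   carbon_defaults_lines "$SPARK_HOME" >> "$SPARK_HOME/conf/spark-defaults.conf"
```

The key and value are separated by whitespace here, which is the form used in Spark's own spark-defaults.conf template.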
lcc@lcc spark-2.0.1-bin-hadoop2.7$ vi conf/carbon.properties
carbon.storelocation=hdfs://lcc:9000/Opt/CarbonStore
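To double-check which store location ended up in carbon.properties after editing, a small lookup helper can be used. This is only a sketch: get_property is a hypothetical name, and it assumes plain key=value lines with no spaces around "=".

```shell
#!/bin/sh
# get_property <file> <key>
# Prints the value of the last uncommented "key=value" line in <file>.
# Returns non-zero if the key is not set.
get_property() {
    file="$1"
    key="$2"
    awk -v k="$key" '
        # Match only lines that start with "key="; commented lines start with "#".
        index($0, k "=") == 1 { val = substr($0, length(k) + 2); found = 1 }
        END { if (found) print val; else exit 1 }
    ' "$file"
}

# Example:
#   get_property "$SPARK_HOME/conf/carbon.properties" carbon.storelocation
```

The value printed should be the same path that is later passed to getOrCreateCarbonSession in spark-shell.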
Then start spark-shell directly:
lcc@lcc spark-2.0.1-bin-hadoop2.7$ spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/carbonlib/carbondata_2.11-1.1.0-shade-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
18/09/26 17:34:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/09/26 17:34:44 WARN SparkConf:
SPARK_CLASSPATH was detected (set to '/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/carbonlib/*').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --driver-class-path to augment the driver classpath
- spark.executor.extraClassPath to augment the executor classpath
18/09/26 17:34:44 WARN SparkConf: Setting 'spark.executor.extraClassPath' to '/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/carbonlib/*' as a work-around.
18/09/26 17:34:44 WARN SparkConf: Setting 'spark.driver.extraClassPath' to '/Users/lcc/soft/spark/spark-2.0.1-bin-hadoop2.7/carbonlib/*' as a work-around.
18/09/26 17:34:44 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '1').
This is deprecated in Spark 1.0+.
Please instead use:
- ./spark-submit with --num-executors to specify the number of executors
- Or set SPARK_EXECUTOR_INSTANCES
- spark.executor.instances to configure the number of instances in the spark config.
18/09/26 17:34:44 WARN Utils: Your hostname, lcc resolves to a loopback address: 127.0.0.1; using 192.168.1.184 instead (on interface en0)
18/09/26 17:34:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/09/26 17:34:45 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
18/09/26 17:34:46 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://192.168.1.184:4041
Spark context available as 'sc' (master = local[*], app id = local-1537954485827).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> import org.apache.spark.sql.CarbonSession._
import org.apache.spark.sql.CarbonSession._
scala> val carbon = SparkSession.builder().config(sc.getConf).getOrCreateCarbonSession("hdfs://lcc:9000/Opt/CarbonStore")
18/09/26 17:36:04 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
carbon: org.apache.spark.sql.SparkSession = org.apache.spark.sql.CarbonSession@708dfe10
scala>
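The key point is that the output shown above appears (it was highlighted in red in the original post); if it does, the setup works. I had started neither Hadoop nor Hive, so I did not test any further.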
Reference:
https://blog.csdn.net/u013181284/article/details/73331170
That post says the build requires a connection; in practice it does not.