For more code, see: https://github.com/xubo245/SparkLearning
1. Data preparation:
1.1 Download the data file and upload it to the HDFS user directory
wget http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv
hadoop fs -put flights.csv ./
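Before submitting the job it can help to confirm that the download is intact. The following is a minimal sketch, assuming the file is plain comma-separated text with a single header line, no quoted commas in the header, and a trailing newline. The helper names `csv_columns` and `csv_rows` are hypothetical and not part of the original post. The header comment of the script states that the data set has 227,496 rows and 14 columns, which can serve as reference values.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the original post) for sanity-checking a CSV.

# Print the number of comma-separated fields in the header line.
csv_columns() {
  head -n 1 "$1" | awk -F',' '{ print NF }'
}

# Print the number of data rows, i.e. all lines except the header.
csv_rows() {
  tail -n +2 "$1" | wc -l | tr -d ' '
}

# Example, after the wget step above:
#   csv_columns flights.csv
#   csv_rows flights.csv
```

If the two numbers differ from the reference values, the download is probably truncated and should be repeated before uploading to HDFS.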
2.1 Local run (default):
spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 --master local data-manipulation.R flights.csv
Run log:
hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 --master local data-manipulation.R flights.csv
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/home/hadoop/cloud/spark-1.5.2/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.10;1.4.0 in central
found org.apache.commons#commons-csv;1.1 in central
found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 364ms :: artifacts dl 11ms
:: modules in use:
com.databricks#spark-csv_2.10;1.4.0 from central in [default]
com.univocity#univocity-parsers;1.5.1 from central in [default]
org.apache.commons#commons-csv;1.1 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 3 | 0 | 0 | 0 || 3 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 3 already retrieved (0kB/8ms)
WARNING: ignoring environment value of R_HOME
Loading required package: methods
Attaching package: ‘SparkR’
The following objects are masked from ‘package:stats’:
filter, na.omit
The following objects are masked from ‘package:base’:
intersect, rbind, sample, subset, summary, table, transform
root
|-- date: string (nullable = true)
|-- hour: string (nullable = true)
|-- minute: string (nullable = true)
|-- dep: string (nullable = true)
|-- arr: string (nullable = true)
|-- dep_delay: string (nullable = true)
|-- arr_delay: string (nullable = true)
|-- carrier: string (nullable = true)
|-- flight: string (nullable = true)
|-- dest: string (nullable = true)
|-- plane: string (nullable = true)
|-- cancelled: string (nullable = true)
|-- time: string (nullable = true)
|-- dist: string (nullable = true)
DataFrame[date:string, hour:string, minute:string, dep:string, arr:string, dep_delay:string, arr_delay:string, carrier:string, flight:string, dest:string, plane:string, cancelled:string, time:string, dist:string]
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
| date|hour|minute| dep| arr|dep_delay|arr_delay|carrier|flight|dest| plane|cancelled|time|dist|
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
|2011-01-01 12:00:00| 14| 0|1400|1500| 0| -10| AA| 428| DFW|N576AA| 0| 40| 224|
|2011-01-02 12:00:00| 14| 1|1401|1501| 1| -9| AA| 428| DFW|N557AA| 0| 45| 224|
|2011-01-03 12:00:00| 13| 52|1352|1502| -8| -8| AA| 428| DFW|N541AA| 0| 48| 224|
|2011-01-04 12:00:00| 14| 3|1403|1513| 3| 3| AA| 428| DFW|N403AA| 0| 39| 224|
|2011-01-05 12:00:00| 14| 5|1405|1507| 5| -3| AA| 428| DFW|N492AA| 0| 44| 224|
|2011-01-06 12:00:00| 13| 59|1359|1503| -1| -7| AA| 428| DFW|N262AA| 0| 45| 224|
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
only showing top 6 rows
date hour minute dep arr dep_delay arr_delay carrier flight
1 2011-01-01 12:00:00 14 0 1400 1500 0 -10 AA 428
2 2011-01-02 12:00:00 14 1 1401 1501 1 -9 AA 428
3 2011-01-03 12:00:00 13 52 1352 1502 -8 -8 AA 428
4 2011-01-04 12:00:00 14 3 1403 1513 3 3 AA 428
5 2011-01-05 12:00:00 14 5 1405 1507 5 -3 AA 428
6 2011-01-06 12:00:00 13 59 1359 1503 -1 -7 AA 428
dest plane cancelled time dist
1 DFW N576AA 0 40 224
2 DFW N557AA 0 45 224
3 DFW N541AA 0 48 224
4 DFW N403AA 0 39 224
5 DFW N492AA 0 44 224
6 DFW N262AA 0 45 224
[1] "date" "hour" "minute" "dep" "arr" "dep_delay"
[7] "arr_delay" "carrier" "flight" "dest" "plane" "cancelled"
[13] "time" "dist"
[1] 227496
dest cancelled
1 DFW 0
2 DFW 0
3 DFW 0
4 DFW 0
5 DFW 0
6 DFW 0
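2.2 Cluster run:
Command:
hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 --master spark://MasterIP:7077 data-manipulation.R flights.csv
Replace MasterIP with the actual IP address of the master node.
In this test the cluster run was much faster than the default local run.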
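The commands in 2.1 and 2.2 differ only in the value passed to --master. A small wrapper can make that explicit. This is a sketch only: the function name `submit_cmd` and the variable `SPARK_CSV_PKG` are hypothetical, and the function prints the command line instead of running it so that it can be inspected first.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not from the original post): assemble the spark-submit
# command line for a given master URL and print it instead of running it.

SPARK_CSV_PKG="com.databricks:spark-csv_2.10:1.4.0"   # coordinate used in the post

# Usage: submit_cmd <master> <script.R> <csv-path>
submit_cmd() {
  local master="$1" script="$2" csv="$3"
  echo "spark-submit --packages ${SPARK_CSV_PKG} --master ${master} ${script} ${csv}"
}

# Local run (section 2.1):
submit_cmd local data-manipulation.R flights.csv
# Cluster run (section 2.2); MasterIP is a placeholder for the real address:
submit_cmd spark://MasterIP:7077 data-manipulation.R flights.csv
```

Once the printed line looks right, it can be pasted into the shell on the machine where spark-submit is installed.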
Run log:
hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 --master spark://MasterIP:7077 data-manipulation.R flights.csv
[... output identical to the local run in 2.1, apart from the Ivy timing line: ":: resolution report :: resolve 342ms :: artifacts dl 12ms" ...]
Source of data-manipulation.R (shipped with Spark under examples/src/main/r/):
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# For this example, we shall use the "flights" dataset
# The dataset consists of every flight departing Houston in 2011.
# The data set is made up of 227,496 rows x 14 columns.
# To run this example use
# ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
# examples/src/main/r/data-manipulation.R
# Load SparkR library into your R session
library(SparkR)
args <- commandArgs(trailing = TRUE)
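if (length(args) != 1) {
  print("Usage: data-manipulation.R <path-to-flights.csv>")
  q("no")
}
# [...]
if ("magrittr" %in% rownames(installed.packages())) {
  library(magrittr)
  groupBy(flightsDF, flightsDF$date) %>%
    summarize(avg(flightsDF$dep_delay), avg(flightsDF$arr_delay)) -> dailyDelayDF
  # Print the computed data frame
  head(dailyDelayDF)
}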
# Stop the SparkContext now
sparkR.stop()
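3.1 The path is correct but the file cannot be read (not understood at first) => Fix: put the file in the HDFS user directory.
The two logs below show why: read.csv opens the path on the local filesystem, while read.df resolves the same argument against HDFS (here hdfs://Master:9000/user/hadoop/), so the file has to exist under the same path in both places.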
hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 data-manipulation.R /xubo/spark/data/r/input/flights.csv
[... Ivy dependency resolution and SparkR package-loading messages, same as in 2.1 ...]
Error in file(file, "rt") : cannot open the connection
Calls: read.csv -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file '/xubo/spark/data/r/input/flights.csv': No such file or directory
Execution halted
hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 data-manipulation.R flights.csv
[... Ivy dependency resolution and SparkR package-loading messages, same as in 2.1 ...]
16/04/20 12:41:53 ERROR RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://Master:9000/user/hadoop/flights.csv
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RD
Calls: read.df -> callJStatic -> invokeJava
Execution halted
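Both failures above come down to the input file being absent in one of the two places the script looks. A pre-flight check can catch this before spark-submit is started. The sketch below is hypothetical (the function name `check_input` is not from the original post) and assumes the `hadoop` command is on the PATH; it relies on `hadoop fs -test -e`, which exits with status 0 when the path exists.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check (not from the original post).
# Reports whether the same path is visible on the local filesystem and in HDFS.

# Usage: check_input <path>
# Returns 0 only if both copies exist.
check_input() {
  local path="$1" ok=0
  if [ ! -f "$path" ]; then
    echo "missing locally: $path"
    ok=1
  fi
  # A relative path is resolved against the HDFS home directory of the
  # current user, e.g. /user/hadoop in the logs above.
  if ! hadoop fs -test -e "$path"; then
    echo "missing in HDFS: $path"
    ok=1
  fi
  return "$ok"
}

# Example:
#   check_input flights.csv && spark-submit ... data-manipulation.R flights.csv
```

With the file only on the local disk, the check would print the "missing in HDFS" line, which matches the second failure above; running `hadoop fs -put flights.csv ./` as in 1.1 resolves it.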
Run log without --packages (the com.databricks.spark.csv data source cannot be loaded):
hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit data-manipulation.R flights.csv
WARNING: ignoring environment value of R_HOME
Loading required package: methods
Attaching package: ‘SparkR’
The following objects are masked from ‘package:stats’:
filter, na.omit
The following objects are masked from ‘package:base’:
intersect, rbind, sample, subset, summary, table, transform
16/04/20 12:28:18 ERROR RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.csv.
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:67)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:87)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at org.apache.spark.sql.api.r.SQLUtils$.loadDF(SQLUtils.scala:156)
at org.apache.spark.sql.api.r.SQLUtils.loadDF(SQLUtils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:132)
at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:79)
at org.apache.
Calls: read.df -> callJStatic -> invokeJava
Execution halted
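The ClassNotFoundException above appears when the spark-csv package is not passed to spark-submit. A simple guard can check the argument list before launching. This is a sketch under the assumption that the package is always supplied as `--packages <coordinates>` on the command line; the function name `has_spark_csv` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not from the original post): succeed only if the given
# spark-submit arguments contain --packages followed by a spark-csv coordinate.

# Usage: has_spark_csv <spark-submit args...>
has_spark_csv() {
  while [ "$#" -gt 0 ]; do
    if [ "$1" = "--packages" ]; then
      case "${2:-}" in
        *com.databricks:spark-csv_*) return 0 ;;
      esac
    fi
    shift
  done
  return 1
}

# Example:
#   has_spark_csv "$@" || echo "add --packages com.databricks:spark-csv_2.10:1.4.0"
```

The pattern match also accepts a comma-separated list of coordinates, as long as one of them is a spark-csv artifact.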
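3.4 Without --master the run is very slow (the progress bar at the end of the log stays at Stage 4 with 1 of 2 tasks finished):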
hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 data-manipulation.R flights.csv
[... same output as in 2.1, up to and including the column names ...]
[Stage 4:=============================> (1 + 0) / 2]
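The post compares the speed of the three variants (local, cluster, no --master) only qualitatively. To put numbers on the comparison, each run could be wrapped in a small timing helper. The sketch below is hypothetical (the function name `timed` is not from the original post) and measures wall-clock time in whole seconds with `date +%s`.

```shell
#!/usr/bin/env bash
# Hypothetical timing helper (not from the original post): run a command and
# print how many whole seconds it took on standard error.

# Usage: timed <command> [args...]
timed() {
  local start end status=0
  start=$(date +%s)
  "$@" || status=$?
  end=$(date +%s)
  echo "elapsed: $((end - start))s (exit status ${status})" >&2
  return "$status"
}

# Example:
#   timed spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 --master local data-manipulation.R flights.csv
#   timed spark-submit --packages com.databricks:spark-csv_2.10:1.4.0 data-manipulation.R flights.csv
```

The helper passes through the exit status of the wrapped command, so a failed run is still reported as a failure.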