SparkR (a Spark component) study notes 3: submitting the R script data-manipulation.R to a cluster with spark-submit

More code: https://github.com/xubo245/SparkLearning


1. Data preparation:

1.1 Download the data file

wget http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv

1.2 Upload it to HDFS:

 hadoop fs -put flights.csv ./
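
The upload step can be made safe to repeat. The following is a minimal sketch, assuming the `hadoop` CLI is on the PATH; `upload_if_missing` is a hypothetical helper name introduced only for this example. A relative HDFS path resolves against the user's HDFS home directory, which is where the job later looks for flights.csv.

```shell
# Sketch: upload a local file to the HDFS home directory only if it is not there yet.
# upload_if_missing is a hypothetical helper, not part of Hadoop.
upload_if_missing() {
  local file="$1"
  # `hadoop fs -test -e` exits with 0 when the path exists in HDFS.
  if hadoop fs -test -e "$file"; then
    echo "already in HDFS: $file"
  else
    hadoop fs -put "$file" ./ && echo "uploaded: $file"
  fi
}
```

Calling `upload_if_missing flights.csv` from the directory that holds the downloaded file has the same effect as the command above, but it does not fail when the file was already uploaded.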

2. Running

2.1 Local run (default):

<pre name="code" class="plain">spark-submit --packages com.databricks:spark-csv_2.10:1.4.0  --master local  data-manipulation.R  flights.csv
 
 

Run log:

hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0  --master local  data-manipulation.R  flights.csv
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/home/hadoop/cloud/spark-1.5.2/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found com.databricks#spark-csv_2.10;1.4.0 in central
	found org.apache.commons#commons-csv;1.1 in central
	found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 364ms :: artifacts dl 11ms
	:: modules in use:
	com.databricks#spark-csv_2.10;1.4.0 from central in [default]
	com.univocity#univocity-parsers;1.5.1 from central in [default]
	org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 3 already retrieved (0kB/8ms)
WARNING: ignoring environment value of R_HOME
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    filter, na.omit

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, subset, summary, table, transform

root
 |-- date: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)
 |-- dep: string (nullable = true)
 |-- arr: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- plane: string (nullable = true)
 |-- cancelled: string (nullable = true)
 |-- time: string (nullable = true)
 |-- dist: string (nullable = true)
DataFrame[date:string, hour:string, minute:string, dep:string, arr:string, dep_delay:string, arr_delay:string, carrier:string, flight:string, dest:string, plane:string, cancelled:string, time:string, dist:string]
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
|               date|hour|minute| dep| arr|dep_delay|arr_delay|carrier|flight|dest| plane|cancelled|time|dist|
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
|2011-01-01 12:00:00|  14|     0|1400|1500|        0|      -10|     AA|   428| DFW|N576AA|        0|  40| 224|
|2011-01-02 12:00:00|  14|     1|1401|1501|        1|       -9|     AA|   428| DFW|N557AA|        0|  45| 224|
|2011-01-03 12:00:00|  13|    52|1352|1502|       -8|       -8|     AA|   428| DFW|N541AA|        0|  48| 224|
|2011-01-04 12:00:00|  14|     3|1403|1513|        3|        3|     AA|   428| DFW|N403AA|        0|  39| 224|
|2011-01-05 12:00:00|  14|     5|1405|1507|        5|       -3|     AA|   428| DFW|N492AA|        0|  44| 224|
|2011-01-06 12:00:00|  13|    59|1359|1503|       -1|       -7|     AA|   428| DFW|N262AA|        0|  45| 224|
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
only showing top 6 rows
                 date hour minute  dep  arr dep_delay arr_delay carrier flight
1 2011-01-01 12:00:00   14      0 1400 1500         0       -10      AA    428
2 2011-01-02 12:00:00   14      1 1401 1501         1        -9      AA    428
3 2011-01-03 12:00:00   13     52 1352 1502        -8        -8      AA    428
4 2011-01-04 12:00:00   14      3 1403 1513         3         3      AA    428
5 2011-01-05 12:00:00   14      5 1405 1507         5        -3      AA    428
6 2011-01-06 12:00:00   13     59 1359 1503        -1        -7      AA    428
  dest  plane cancelled time dist
1  DFW N576AA         0   40  224
2  DFW N557AA         0   45  224
3  DFW N541AA         0   48  224
4  DFW N403AA         0   39  224
5  DFW N492AA         0   44  224
6  DFW N262AA         0   45  224
 [1] "date"      "hour"      "minute"    "dep"       "arr"       "dep_delay"
 [7] "arr_delay" "carrier"   "flight"    "dest"      "plane"     "cancelled"
[13] "time"      "dist"     
[1] 227496
  dest cancelled
1  DFW         0
2  DFW         0
3  DFW         0
4  DFW         0
5  DFW         0
6  DFW         0



2.2 Cluster run:

Command:

hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0  --master spark://MasterIP:7077  data-manipulation.R  flights.csv

Replace MasterIP with the actual IP address of the Spark master.
Here the cluster run was much faster than the default local run in 2.1.

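
The speed difference between the two modes can be measured instead of estimated. This is a minimal sketch, not the author's method: `run_with_master` and the `SPARK_SUBMIT` override variable are hypothetical names, and the timing uses whole seconds from `date`.

```shell
# Sketch: run the same job against a given master and report wall-clock seconds.
# run_with_master and SPARK_SUBMIT are hypothetical names for this example.
run_with_master() {
  local master="$1" csv="$2"
  # Allow the launcher to be overridden, e.g. with a full path to spark-submit.
  local submit="${SPARK_SUBMIT:-spark-submit}"
  local start end status
  start=$(date +%s)
  "$submit" --packages com.databricks:spark-csv_2.10:1.4.0 \
    --master "$master" data-manipulation.R "$csv"
  status=$?
  end=$(date +%s)
  echo "master=$master exit=$status seconds=$((end - start))"
  return "$status"
}
```

For example, `run_with_master local flights.csv` followed by `run_with_master spark://MasterIP:7077 flights.csv` (with MasterIP replaced by the real address) prints one summary line per run that can be compared directly.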
Run log:

hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0  --master spark://MasterIP:7077  data-manipulation.R  flights.csv
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/home/hadoop/cloud/spark-1.5.2/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found com.databricks#spark-csv_2.10;1.4.0 in central
	found org.apache.commons#commons-csv;1.1 in central
	found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 342ms :: artifacts dl 12ms
	:: modules in use:
	com.databricks#spark-csv_2.10;1.4.0 from central in [default]
	com.univocity#univocity-parsers;1.5.1 from central in [default]
	org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 3 already retrieved (0kB/8ms)
WARNING: ignoring environment value of R_HOME
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    filter, na.omit

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, subset, summary, table, transform

root
 |-- date: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)
 |-- dep: string (nullable = true)
 |-- arr: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- plane: string (nullable = true)
 |-- cancelled: string (nullable = true)
 |-- time: string (nullable = true)
 |-- dist: string (nullable = true)
DataFrame[date:string, hour:string, minute:string, dep:string, arr:string, dep_delay:string, arr_delay:string, carrier:string, flight:string, dest:string, plane:string, cancelled:string, time:string, dist:string]
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
|               date|hour|minute| dep| arr|dep_delay|arr_delay|carrier|flight|dest| plane|cancelled|time|dist|
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
|2011-01-01 12:00:00|  14|     0|1400|1500|        0|      -10|     AA|   428| DFW|N576AA|        0|  40| 224|
|2011-01-02 12:00:00|  14|     1|1401|1501|        1|       -9|     AA|   428| DFW|N557AA|        0|  45| 224|
|2011-01-03 12:00:00|  13|    52|1352|1502|       -8|       -8|     AA|   428| DFW|N541AA|        0|  48| 224|
|2011-01-04 12:00:00|  14|     3|1403|1513|        3|        3|     AA|   428| DFW|N403AA|        0|  39| 224|
|2011-01-05 12:00:00|  14|     5|1405|1507|        5|       -3|     AA|   428| DFW|N492AA|        0|  44| 224|
|2011-01-06 12:00:00|  13|    59|1359|1503|       -1|       -7|     AA|   428| DFW|N262AA|        0|  45| 224|
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
only showing top 6 rows
                 date hour minute  dep  arr dep_delay arr_delay carrier flight
1 2011-01-01 12:00:00   14      0 1400 1500         0       -10      AA    428
2 2011-01-02 12:00:00   14      1 1401 1501         1        -9      AA    428
3 2011-01-03 12:00:00   13     52 1352 1502        -8        -8      AA    428
4 2011-01-04 12:00:00   14      3 1403 1513         3         3      AA    428
5 2011-01-05 12:00:00   14      5 1405 1507         5        -3      AA    428
6 2011-01-06 12:00:00   13     59 1359 1503        -1        -7      AA    428
  dest  plane cancelled time dist
1  DFW N576AA         0   40  224
2  DFW N557AA         0   45  224
3  DFW N541AA         0   48  224
4  DFW N403AA         0   39  224
5  DFW N492AA         0   44  224
6  DFW N262AA         0   45  224
 [1] "date"      "hour"      "minute"    "dep"       "arr"       "dep_delay"
 [7] "arr_delay" "carrier"   "flight"    "dest"      "plane"     "cancelled"
[13] "time"      "dist"     
[1] 227496                                                                      
  dest cancelled                                                                
1  DFW         0
2  DFW         0
3  DFW         0
4  DFW         0
5  DFW         0
6  DFW         0

2.3 Source file:

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# For this example, we shall use the "flights" dataset
# The dataset consists of every flight departing Houston in 2011.
# The data set is made up of 227,496 rows x 14 columns. 

# To run this example use
# ./bin/sparkR --packages com.databricks:spark-csv_2.10:1.0.3
#     examples/src/main/r/data-manipulation.R <path_to_csv>

# Load SparkR library into your R session
library(SparkR)

args <- commandArgs(trailing = TRUE)

if (length(args) != 1) {
  print("Usage: data-manipulation.R <path-to-flights.csv>")
  print("The data can be downloaded from: http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv ")
  q("no")
}

## Initialize SparkContext
sc <- sparkR.init(appName = "SparkR-data-manipulation-example")

## Initialize SQLContext
sqlContext <- sparkRSQL.init(sc)

flightsCsvPath <- args[[1]]

# Create a local R dataframe
flights_df <- read.csv(flightsCsvPath, header = TRUE)
flights_df$date <- as.Date(flights_df$date)

## Filter flights whose destination is San Francisco and write to a local data frame
SFO_df <- flights_df[flights_df$dest == "SFO", ] 

# Convert the local data frame into a SparkR DataFrame
SFO_DF <- createDataFrame(sqlContext, SFO_df)

#  Directly create a SparkR DataFrame from the source data
flightsDF <- read.df(sqlContext, flightsCsvPath, source = "com.databricks.spark.csv", header = "true")

# Print the schema of this Spark DataFrame
printSchema(flightsDF)

# Cache the DataFrame
cache(flightsDF)

# Print the first 6 rows of the DataFrame
showDF(flightsDF, numRows = 6) ## Or
head(flightsDF)

# Show the column names in the DataFrame
columns(flightsDF)

# Show the number of rows in the DataFrame
count(flightsDF)

# Select specific columns
destDF <- select(flightsDF, "dest", "cancelled")

# Using SQL to select columns of data
# First, register the flights DataFrame as a table
registerTempTable(flightsDF, "flightsTable")
destDF <- sql(sqlContext, "SELECT dest, cancelled FROM flightsTable")

# Use collect to create a local R data frame
local_df <- collect(destDF)

# Print the newly created local data frame
head(local_df)

# Filter flights whose destination is JFK
jfkDF <- filter(flightsDF, "dest = \"JFK\"") ##OR
jfkDF <- filter(flightsDF, flightsDF$dest == "JFK")

# If the magrittr library is available, we can use it to
# chain data frame operations
if("magrittr" %in% rownames(installed.packages())) {
  library(magrittr)

  # Group the flights by date and then find the average daily delay
  # Write the result into a DataFrame
  groupBy(flightsDF, flightsDF$date) %>%
    summarize(avg(flightsDF$dep_delay), avg(flightsDF$arr_delay)) -> dailyDelayDF

  # Print the computed data frame
  head(dailyDelayDF)
}

# Stop the SparkContext now
sparkR.stop()




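3. Error log:

3.1 The path looks correct but the file cannot be read => Fix: put the file under the user directory so that the relative name flights.csv resolves.
The traceback below shows that the failure comes from R's read.csv, which reads from the local filesystem, so a path that exists only in HDFS cannot be opened.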

hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0  data-manipulation.R  /xubo/spark/data/r/input/flights.csv
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/home/hadoop/cloud/spark-1.5.2/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found com.databricks#spark-csv_2.10;1.4.0 in central
	found org.apache.commons#commons-csv;1.1 in central
	found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 357ms :: artifacts dl 11ms
	:: modules in use:
	com.databricks#spark-csv_2.10;1.4.0 from central in [default]
	com.univocity#univocity-parsers;1.5.1 from central in [default]
	org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 3 already retrieved (0kB/9ms)
WARNING: ignoring environment value of R_HOME
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    filter, na.omit

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, subset, summary, table, transform

Error in file(file, "rt") : cannot open the connection
Calls: read.csv -> read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '/xubo/spark/data/r/input/flights.csv': No such file or directory
Execution halted

3.2 File does not exist in HDFS => Fix: upload the file to HDFS (see 1.2).

hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0  data-manipulation.R  flights.csv
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/home/hadoop/cloud/spark-1.5.2/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found com.databricks#spark-csv_2.10;1.4.0 in central
	found org.apache.commons#commons-csv;1.1 in central
	found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 371ms :: artifacts dl 12ms
	:: modules in use:
	com.databricks#spark-csv_2.10;1.4.0 from central in [default]
	com.univocity#univocity-parsers;1.5.1 from central in [default]
	org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 3 already retrieved (0kB/8ms)
WARNING: ignoring environment value of R_HOME
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    filter, na.omit

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, subset, summary, table, transform

16/04/20 12:41:53 ERROR RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
  org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://Master:9000/user/hadoop/flights.csv
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RD
Calls: read.df -> callJStatic -> invokeJava
Execution halted
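
Errors 3.1 and 3.2 come from the same property of the script: it reads the path twice, once with R's read.csv (local filesystem) and once with read.df (the default Hadoop filesystem, which is HDFS in the log above). The path therefore has to exist in both places. A minimal pre-flight sketch, assuming the `hadoop` CLI is on the PATH; `preflight` is a hypothetical helper name:

```shell
# Sketch: check that a path exists both locally and in HDFS before submitting.
# preflight is a hypothetical helper, not part of Spark or Hadoop.
preflight() {
  local path="$1" missing=0
  # read.csv in the R script opens the path on the local filesystem.
  if [ -f "$path" ]; then
    echo "local: found $path"
  else
    echo "local: missing $path"
    missing=1
  fi
  # read.df resolves the same path against the default filesystem (HDFS here).
  if hadoop fs -test -e "$path"; then
    echo "hdfs: found $path"
  else
    echo "hdfs: missing $path"
    missing=1
  fi
  return "$missing"
}
```

Running `preflight flights.csv && spark-submit ...` stops before the job starts if either copy is missing, which is quicker than waiting for the R traceback.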

3.3 The com.databricks.spark.csv data source was not found => Fix: add the spark-csv package with --packages: spark-submit --packages com.databricks:spark-csv_2.10:1.4.0  data-manipulation.R  flights.csv

Run log:

hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit data-manipulation.R  flights.csv
WARNING: ignoring environment value of R_HOME
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    filter, na.omit

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, subset, summary, table, transform

16/04/20 12:28:18 ERROR RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
  java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.csv.
	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:67)
	at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:87)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
	at org.apache.spark.sql.api.r.SQLUtils$.loadDF(SQLUtils.scala:156)
	at org.apache.spark.sql.api.r.SQLUtils.loadDF(SQLUtils.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:132)
	at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:79)
	at org.apache.
Calls: read.df -> callJStatic -> invokeJava
Execution halted
</pre><pre code_snippet_id="1654209" snippet_file_name="blog_20160420_13_9087207" name="code" class="plain">

3.4 Without an explicit --master the run is very slow:

hadoop@Master:~/cloud/testByXubo/spark/R$ spark-submit --packages com.databricks:spark-csv_2.10:1.4.0  data-manipulation.R  flights.csv
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/home/hadoop/cloud/spark-1.5.2/lib/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
	confs: [default]
	found com.databricks#spark-csv_2.10;1.4.0 in central
	found org.apache.commons#commons-csv;1.1 in central
	found com.univocity#univocity-parsers;1.5.1 in central
:: resolution report :: resolve 342ms :: artifacts dl 25ms
	:: modules in use:
	com.databricks#spark-csv_2.10;1.4.0 from central in [default]
	com.univocity#univocity-parsers;1.5.1 from central in [default]
	org.apache.commons#commons-csv;1.1 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
	confs: [default]
	0 artifacts copied, 3 already retrieved (0kB/8ms)
WARNING: ignoring environment value of R_HOME
Loading required package: methods

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    filter, na.omit

The following objects are masked from ‘package:base’:

    intersect, rbind, sample, subset, summary, table, transform

root
 |-- date: string (nullable = true)
 |-- hour: string (nullable = true)
 |-- minute: string (nullable = true)
 |-- dep: string (nullable = true)
 |-- arr: string (nullable = true)
 |-- dep_delay: string (nullable = true)
 |-- arr_delay: string (nullable = true)
 |-- carrier: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- dest: string (nullable = true)
 |-- plane: string (nullable = true)
 |-- cancelled: string (nullable = true)
 |-- time: string (nullable = true)
 |-- dist: string (nullable = true)
DataFrame[date:string, hour:string, minute:string, dep:string, arr:string, dep_delay:string, arr_delay:string, carrier:string, flight:string, dest:string, plane:string, cancelled:string, time:string, dist:string]
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
|               date|hour|minute| dep| arr|dep_delay|arr_delay|carrier|flight|dest| plane|cancelled|time|dist|
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
|2011-01-01 12:00:00|  14|     0|1400|1500|        0|      -10|     AA|   428| DFW|N576AA|        0|  40| 224|
|2011-01-02 12:00:00|  14|     1|1401|1501|        1|       -9|     AA|   428| DFW|N557AA|        0|  45| 224|
|2011-01-03 12:00:00|  13|    52|1352|1502|       -8|       -8|     AA|   428| DFW|N541AA|        0|  48| 224|
|2011-01-04 12:00:00|  14|     3|1403|1513|        3|        3|     AA|   428| DFW|N403AA|        0|  39| 224|
|2011-01-05 12:00:00|  14|     5|1405|1507|        5|       -3|     AA|   428| DFW|N492AA|        0|  44| 224|
|2011-01-06 12:00:00|  13|    59|1359|1503|       -1|       -7|     AA|   428| DFW|N262AA|        0|  45| 224|
+-------------------+----+------+----+----+---------+---------+-------+------+----+------+---------+----+----+
only showing top 6 rows
                 date hour minute  dep  arr dep_delay arr_delay carrier flight
1 2011-01-01 12:00:00   14      0 1400 1500         0       -10      AA    428
2 2011-01-02 12:00:00   14      1 1401 1501         1        -9      AA    428
3 2011-01-03 12:00:00   13     52 1352 1502        -8        -8      AA    428
4 2011-01-04 12:00:00   14      3 1403 1513         3         3      AA    428
5 2011-01-05 12:00:00   14      5 1405 1507         5        -3      AA    428
6 2011-01-06 12:00:00   13     59 1359 1503        -1        -7      AA    428
  dest  plane cancelled time dist
1  DFW N576AA         0   40  224
2  DFW N557AA         0   45  224
3  DFW N541AA         0   48  224
4  DFW N403AA         0   39  224
5  DFW N492AA         0   44  224
6  DFW N262AA         0   45  224
 [1] "date"      "hour"      "minute"    "dep"       "arr"       "dep_delay"
 [7] "arr_delay" "carrier"   "flight"    "dest"      "plane"     "cancelled"
[13] "time"      "dist"     
[Stage 4:=============================>                             (1 + 0) / 2]
[Stage 4:=============================>                             (1 + 0) / 2]
[Stage 4:=============================>                             (1 + 0) / 2]
[Stage 4:=============================>                             (1 + 0) / 2]
[Stage 4:=============================>                             (1 + 0) / 2]

[Stage 4:=============================>                             (1 + 0) / 2]
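
The log does not show which master was used when --master was omitted. In the progress line, stage 4 has one of its two tasks finished and none running. One thing worth checking is whether conf/spark-defaults.conf sets spark.master, because spark-submit reads its defaults from that file. The sketch below is an assumption-laden helper, not a diagnosis: `configured_master` is a hypothetical name, and it only understands the whitespace-separated `key value` form of the file.

```shell
# Sketch: print the spark.master value configured in spark-defaults.conf, if any.
# configured_master is a hypothetical helper; it ignores the key=value form.
configured_master() {
  local conf="${1:-${SPARK_HOME:-.}/conf/spark-defaults.conf}"
  local value=""
  if [ -f "$conf" ]; then
    # Commented lines start with '#', so their first field never equals the key.
    value=$(awk '$1 == "spark.master" { print $2 }' "$conf" | tail -n 1)
  fi
  if [ -n "$value" ]; then
    echo "$value"
  else
    echo "(not set in $conf)"
  fi
}
```

If the function prints a spark:// URL, the job in 3.4 was probably sent to that master rather than run locally; passing --master explicitly, as in 2.1 and 2.2, removes the ambiguity.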




