Processing HDFS Files with Spark

This part follows the official guide almost entirely: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html

1. Pseudo-distributed mode

HP-Pavilion-g4-Notebook-PC:/usr/lib/hadoop-2.8.0$ sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/lib/hadoop-2.8.0/logs/hadoop-yangxiaohuan-namenode-yangxiaohuan-HP-Pavilion-g4-Notebook-PC.out
localhost: starting datanode, logging to /usr/lib/hadoop-2.8.0/logs/hadoop-yangxiaohuan-datanode-yangxiaohuan-HP-Pavilion-g4-Notebook-PC.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is 7f:77:9e:35:fe:21:22:6f:dd:4c:20:27:16:d1:43:37.
Are you sure you want to continue connecting (yes/no)?

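How to resolve "The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established."
Option 1:
Following reference 1, disable the firewall with ufw disable. I did not use this approach.
Option 2:
Following reference 2:

HP-Pavilion-g4-Notebook-PC:/usr/lib/hadoop-2.8.0$ ssh -o StrictHostKeyChecking=no 0.0.0.0

Then leave the ssh session by typing exit.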

HP-Pavilion-g4-Notebook-PC:/usr/lib/hadoop-2.8.0$ sbin/start-dfs.sh
Starting namenodes on [localhost]
localhost: namenode running as process 6162. Stop it first.
localhost: datanode running as process 6319. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/lib/hadoop-2.8.0/logs/hadoop-username-secondarynamenode-username-HP-Pavilion-g4-Notebook-PC.out

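The message "localhost: namenode running as process 6162. Stop it first." means the NameNode and DataNode from the earlier, interrupted start are still running. Stop all Hadoop services before starting again:

sbin/stop-all.sh

Then start HDFS again and open http://localhost:50070/ to see the web page.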
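Before restarting, it can help to see which HDFS daemons are still alive. The sketch below filters `jps` output (the JDK tool that lists Java processes as "pid MainClass" lines). The function name `list_hdfs_daemons` is my own and not part of Hadoop.

```shell
# Print the HDFS daemon names found in jps-style output read from stdin.
# Each input line is expected to look like "<pid> <MainClassName>".
list_hdfs_daemons() {
  awk '$2 == "NameNode" || $2 == "DataNode" || $2 == "SecondaryNameNode" { print $2 }'
}

# Typical use, assuming jps is on the PATH:
#   jps | list_hdfs_daemons
```

If this prints anything, those daemons need to be stopped before sbin/start-dfs.sh is run again.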

Create the HDFS directories

$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>

Copy files into the distributed filesystem

$ bin/hdfs dfs -mkdir input
$ bin/hdfs dfs -put etc/hadoop/*.xml input

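When I created the input directory, the command failed with an error saying the directory did not exist. A relative path such as input resolves to /user/<username>/input in HDFS, so the parent directory has to exist first. Running bin/hdfs dfs -mkdir -p input, which also creates missing parent directories, worked for me.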
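To make the path resolution concrete, here is a small sketch of how a relative HDFS path maps to an absolute one. `resolve_hdfs_path` is a hypothetical helper written for illustration; it only mimics the rule and does not talk to HDFS.

```shell
# Mimic how HDFS shell commands resolve a path:
# absolute paths are kept, relative paths go under /user/<username>.
# $1 = path, $2 = user name (defaults to the current user).
resolve_hdfs_path() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;
    *)  printf '/user/%s/%s\n' "${2:-$(whoami)}" "$1" ;;
  esac
}
```

For example, `resolve_hdfs_path input yangxiaohuan` prints /user/yangxiaohuan/input, which is the directory the Spark examples later read from.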
Run the example

HP-Pavilion-g4-Notebook-PC:/usr/lib/hadoop-2.8.0$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar grep input output 'dfs[a-z.]+'

Then copy the output files from the distributed filesystem to the local filesystem and examine them:

$ bin/hdfs dfs -get output output
$ cat output/*

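Open http://localhost:50070/ in a browser.
Once the pseudo-distributed Hadoop installation is complete, the page looks like this:
[Screenshot 1]
You can also browse the files:
[Screenshot 2]

PS: an open question.
I don't know why, of the three URLs below, I can only open one:
http://localhost:50030/ - Hadoop administration interface
http://localhost:50060/ - Hadoop Task Tracker status
http://localhost:50070/ - Hadoop DFS status
(Ports 50030 and 50060 belonged to the JobTracker and TaskTracker of Hadoop 1.x. Hadoop 2.x replaced those daemons with YARN, which is probably why only 50070 responds here.)

At this point the pseudo-distributed Hadoop installation is complete.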

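2. Spark and HDFS

Now Spark can be used to process files stored in HDFS.
Start the spark-shell:

~$ spark-shell

Read an HDFS file

val s = sc.textFile("hdfs://localhost:9000/user/yangxiaohuan/input/capacity-scheduler.xml")
s.count

[Screenshot 3]
The screenshot shows the result of a successful run. If the file cannot be found, an error is raised instead.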
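To avoid the "file not found" error, the path can be checked from the shell before it is handed to Spark. This is a minimal sketch: `hdfs dfs -test -e <path>` exits with status 0 when the path exists. `HDFS_BIN` and `hdfs_path_exists` are names I made up; point `HDFS_BIN` at your own hdfs launcher.

```shell
# HDFS_BIN is an assumed variable naming the hdfs launcher of this install.
HDFS_BIN="${HDFS_BIN:-bin/hdfs}"

# Succeed (exit status 0) when the given HDFS path exists.
hdfs_path_exists() {
  "$HDFS_BIN" dfs -test -e "$1"
}

# Typical use, run from the Hadoop installation directory:
#   hdfs_path_exists /user/yangxiaohuan/input/capacity-scheduler.xml && echo found
```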

3. Running in IntelliJ

Create a new Scala project, then create a new Scala script named test.
Contents of test.scala:

import org.apache.spark.{SparkConf, SparkContext}

object test {
  def main(args: Array[String]): Unit = {
    // Configure a local master and the application name
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("fileOperate")
    val sc = new SparkContext(conf)
    // Read a file from HDFS and print its line count
    val textFileRdd = sc.textFile("hdfs://localhost:9000/user/yangxiaohuan/input/all_abstract_jian.txt")
    println(textFileRdd.count())
    // Release Spark resources before exiting
    sc.stop()
  }
}

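Add this to build.sbt:
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.0"
Then package the project as a jar:
File > Project Structure > Artifacts > "+" > JAR > From modules with dependencies…
Build > Build Artifacts generates the jar file.
To run it with Spark, change to the Spark installation directory and submit the jar:

./bin/spark-submit --name "nlp run on spark" --master spark://localhost:7077 --executor-memory 2G --class test /home/username/IntelliJ_IDEA_workspace/fileOperate/out/artifacts/fileoperate_jar/fileoperate.jar

Note that a standalone master listens on port 7077 by default, while 8080 is its web UI. Because test.scala sets spark.master to local in the SparkConf, that setting takes precedence over --master.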

PS: running the program directly inside IntelliJ fails with the error below, and I have not yet found a fix.

Exception in thread "main" java.lang.NoClassDefFoundError: scala/Product$class
    at org.apache.spark.SparkConf$DeprecatedConfig.<init>(SparkConf.scala:781)
    at org.apache.spark.SparkConf$.<init>(SparkConf.scala:632)
    at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala)
    at org.apache.spark.SparkConf.set(SparkConf.scala:92)
    at org.apache.spark.SparkConf$$anonfun$loadFromSystemProperties$3.apply(SparkConf.scala:74)
    at org.apache.spark.SparkConf$$anonfun$loadFromSystemProperties$3.apply(SparkConf.scala:73)
    at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:789)
    at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:225)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:432)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:432)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:788)
    at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
    at org.apache.spark.SparkConf.<init>(SparkConf.scala:68)
    at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
    at test$.main(test.scala:8)
    at test.main(test.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 21 more
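
A missing scala/Product$class usually points to a Scala binary version mismatch: spark-core_2.11 is built for Scala 2.11, and frames such as $anonfun$foreach$1 in the trace suggest the project itself is running on Scala 2.12. I have not verified this against the original project, so treat it as a likely cause. The sketch below compares the suffix of an artifact name with a Scala version. All three function names are hypothetical helpers.

```shell
# Scala binary version of a full version string, e.g. 2.12.4 -> 2.12
scala_binary_version() {
  printf '%s\n' "$1" | cut -d. -f1,2
}

# Scala suffix of an artifact name, e.g. spark-core_2.11 -> 2.11
artifact_scala_suffix() {
  printf '%s\n' "${1##*_}"
}

# Succeed when the artifact ($1) was built for the given Scala version ($2).
scala_versions_match() {
  [ "$(artifact_scala_suffix "$1")" = "$(scala_binary_version "$2")" ]
}
```

If the check fails for the Scala SDK configured in IntelliJ, setting scalaVersion in build.sbt to a 2.11.x release (or choosing a Spark artifact built for the project's Scala version) is the usual remedy.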

When finished, shut down Hadoop:

$ sbin/stop-dfs.sh

References
1. http://blog.csdn.net/lglglgl/article/details/46867787
2. http://www.cnblogs.com/huanghongbo/p/6254400.html
3. Interactive wordcount analysis on HDFS in the Spark shell: http://www.cnblogs.com/allanli/p/running_spark_shell_on_hdfs.html
4. Official command reference: http://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/FileSystemShell.html
