Netdisk download address
Link: https://pan.baidu.com/s/19qWnP6LQ-cHVrvT0o1jTMg Password: 44hs
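Hadoop pseudo-distributed configuration
Hadoop can run on a single node in pseudo-distributed mode: each Hadoop daemon runs as a separate Java process, the node acts as both NameNode and DataNode, and files are read from HDFS.
Hadoop's configuration files are located in /usr/local/hadoop/etc/hadoop/. Pseudo-distributed mode requires changing two of them, core-site.xml and hdfs-site.xml. The configuration files are in XML format.
Edit the configuration file core-site.xml:
Editing with gedit is convenient: gedit ./etc/hadoop/core-site.xml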
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Edit the configuration file hdfs-site.xml:
gedit ./etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>
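A quick way to double-check a *-site.xml file after editing is to parse it and list the properties. The sketch below is only an illustration: read_site_properties is a hypothetical helper name, and the sample string mirrors two of the hdfs-site.xml settings from these notes rather than reading the real file.

```python
import xml.etree.ElementTree as ET


def read_site_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml string."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Sample text mirroring the hdfs-site.xml settings in these notes (not read from disk).
SAMPLE_HDFS_SITE = """
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/name</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    for key, val in sorted(read_site_properties(SAMPLE_HDFS_SITE).items()):
        print("%s = %s" % (key, val))
```

A malformed file (for example a closing tag missing its "</") makes ET.fromstring raise ParseError, which is usually enough to spot a copy-paste mistake before starting the daemons.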
After the configuration is done, format the NameNode:
./bin/hdfs namenode -format
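On success, the output contains "successfully formatted" and "Exitting with status 0".
Hadoop's run mode is determined by its configuration files (they are read whenever Hadoop runs), so to switch from pseudo-distributed mode back to non-distributed mode, delete the configuration properties from core-site.xml.
Run a MapReduce job in pseudo-distributed mode: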
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'
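To see roughly what the grep example computes, here is a small in-memory sketch: it extracts every substring matching the regular expression and counts how often each one occurs. grep_count is a hypothetical name, the sample lines are made up, and the tie-breaking order among equal counts is an assumption rather than the job's documented behavior.

```python
import re
from collections import Counter


def grep_count(lines, pattern):
    """Count each distinct regex match across all lines, most frequent first."""
    regex = re.compile(pattern)
    counts = Counter()
    for line in lines:
        for match in regex.finditer(line):
            counts[match.group(0)] += 1
    # Descending count; ties broken by match text (an assumption for stable output).
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    sample = [
        "<name>dfs.replication</name>",
        "dfs.replication and dfs.namenode.name.dir",
        "no match here",
    ]
    for text, n in grep_count(sample, r"dfs[a-z.]+"):
        print("%d\t%s" % (n, text))
```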
---------------------------
HBase pseudo-distributed configuration
1. Configure /usr/local/hbase/conf/hbase-env.sh. The command is:
gedit /usr/local/hbase/conf/hbase-env.sh
Set JAVA_HOME, HBASE_CLASSPATH and HBASE_MANAGES_ZK. HBASE_CLASSPATH is set to the conf directory under the local Hadoop installation (i.e. /usr/local/hadoop/conf).
export JAVA_HOME=/usr/lib/jvm/default-java
export HBASE_CLASSPATH=/usr/local/hadoop/conf
export HBASE_MANAGES_ZK=true
2. Configure /usr/local/hbase/conf/hbase-site.xml
Open and edit hbase-site.xml with gedit. The command is:
gedit /usr/local/hbase/conf/hbase-site.xml
------------------------------------------------------
Python - MapReduce - WordCount
1.1 Map phase: mapper.py
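#!/usr/bin/env python
import sys

# Read lines from standard input and emit one "word<TAB>1" pair per word.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        # print(...) with a single formatted string works on Python 2 and 3.
        print("%s\t%s" % (word, 1))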
1.2 Reduce phase: reducer.py
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Input arrives sorted by word, so equal words are adjacent.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not a number.
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

# Flush the last word group (skipped when the input was empty).
if current_word is not None:
    print("%s\t%s" % (current_word, current_count))
1.3 Local test (cat data | map | sort | reduce)
$ echo "foo foo quux labs foo bar quux" | ./mapper.py
$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
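The same map | sort | reduce pipeline can also be simulated inside one Python process, which is handy when the scripts are not yet executable. This is a sketch only: map_words, reduce_counts and word_count are hypothetical names, and itertools.groupby stands in for the manual current_word bookkeeping in reducer.py.

```python
from itertools import groupby


def map_words(lines):
    """Map step: emit (word, 1) for every whitespace-separated word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reduce step: sum the counts of adjacent pairs that share a word."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, sort and reduce in memory and return (word, total) pairs."""
    return list(reduce_counts(sorted(map_words(lines))))


if __name__ == "__main__":
    for word, total in word_count(["foo foo quux labs foo bar quux"]):
        print("%s\t%s" % (word, total))
```

For the sample line used above this yields bar 1, foo 3, labs 1, quux 2.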
1.4 Run the Python code on Hadoop
~/.bashrc
export STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar
run.sh
hadoop jar $STREAM \
-file /home/hadoop/wc/mapper.py \
-mapper /home/hadoop/wc/mapper.py \
-file /home/hadoop/wc/reducer.py \
-reducer /home/hadoop/wc/reducer.py \
-input /user/hadoop/input/*.txt \
-output /user/hadoop/wcoutput
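If the job is launched from Python instead of run.sh, the same argument list can be assembled programmatically. The sketch below only builds the command; find_streaming_jar and build_streaming_cmd are hypothetical helper names, and the jar location is assumed to follow the STREAM setting above.

```python
import glob
import os


def find_streaming_jar(hadoop_home):
    """Return the first hadoop-streaming-*.jar under tools/lib, or None if absent."""
    pattern = os.path.join(
        hadoop_home, "share", "hadoop", "tools", "lib", "hadoop-streaming-*.jar"
    )
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None


def build_streaming_cmd(streaming_jar, mapper, reducer, input_path, output_path):
    """Build the argument list equivalent to the run.sh command."""
    return [
        "hadoop", "jar", streaming_jar,
        "-file", mapper, "-mapper", mapper,
        "-file", reducer, "-reducer", reducer,
        "-input", input_path,
        "-output", output_path,
    ]


if __name__ == "__main__":
    # Fall back to the unexpanded pattern when the jar is not found locally.
    jar = find_streaming_jar(os.environ.get("HADOOP_HOME", "")) or "hadoop-streaming-*.jar"
    cmd = build_streaming_cmd(
        jar,
        "/home/hadoop/wc/mapper.py",
        "/home/hadoop/wc/reducer.py",
        "/user/hadoop/input/*.txt",
        "/user/hadoop/wcoutput",
    )
    print(" ".join(cmd))
```

The resulting list can be handed to subprocess.check_call(cmd) on a machine where the hadoop command is on the PATH.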
--------------------------------------------
Hive configuration
3. Edit hive-site.xml under /usr/local/hive/conf
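<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
        <description>username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>hive</value>
        <description>password to use against metastore database</description>
    </property>
</configuration>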
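Because hand-typed XML is easy to break, one option is to generate the property block from name/value pairs. A minimal sketch, assuming the four metastore settings listed in these notes; build_site_xml is a hypothetical helper name, and the descriptions are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Metastore settings taken from the notes above.
METASTORE_PROPS = [
    ("javax.jdo.option.ConnectionURL",
     "jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true"),
    ("javax.jdo.option.ConnectionDriverName", "com.mysql.jdbc.Driver"),
    ("javax.jdo.option.ConnectionUserName", "hive"),
    ("javax.jdo.option.ConnectionPassword", "hive"),
]


def build_site_xml(properties):
    """Serialize (name, value) pairs as a <configuration> document string."""
    root = ET.Element("configuration")
    for name, value in properties:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_site_xml(METASTORE_PROPS))
```

A side benefit is that the serializer escapes special characters, so a JDBC URL with several parameters joined by "&" is written as "&amp;" and stays well-formed.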
-------------------------------------------
Hive user
pre_deal.sh
#!/bin/bash
infile=$1
outfile=$2
awk -F "," 'BEGIN{
        srand();
        id=0;
        Province[0]="山东";Province[1]="山西";Province[2]="河南";Province[3]="河北";Province[4]="陕西";Province[5]="内蒙古";Province[6]="上海市";
        Province[7]="北京市";Province[8]="重庆市";Province[9]="天津市";Province[10]="福建";Province[11]="广东";Province[12]="广西";Province[13]="云南";
        Province[14]="浙江";Province[15]="贵州";Province[16]="新疆";Province[17]="西藏";Province[18]="江西";Province[19]="湖南";Province[20]="湖北";
        Province[21]="黑龙江";Province[22]="吉林";Province[23]="辽宁";
        Province[24]="江苏";Province[25]="甘肃";Province[26]="青海";Province[27]="四川";
        Province[28]="安徽";
        Province[29]="宁夏";Province[30]="海南";Province[31]="香港";Province[32]="澳门";Province[33]="台湾";
    }
    {
        id=id+1;
        value=int(rand()*34);
        print id"\t"$1"\t"$2"\t"$3"\t"$5"\t"substr($6,1,10)"\t"Province[value]
    }' $infile > $outfile
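For readers more comfortable with Python than awk, the transformation in pre_deal.sh can be sketched as follows: prepend a running id, keep fields 1, 2, 3 and 5, truncate field 6 to its first 10 characters, and append a randomly chosen province. pre_deal is a hypothetical function name, the sample row is made up, the province list is shortened to three entries, and every row is assumed to have at least six comma-separated fields.

```python
import random


def pre_deal(lines, provinces, rng=None):
    """Yield tab-separated rows: id, f1, f2, f3, f5, f6[:10], random province."""
    rng = rng or random.Random()
    for row_id, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(",")
        # Same selection rule as int(rand()*34) in the awk script.
        province = provinces[int(rng.random() * len(provinces))]
        yield "\t".join([
            str(row_id), fields[0], fields[1], fields[2],
            fields[4], fields[5][:10], province,
        ])


if __name__ == "__main__":
    # Hypothetical input row; the real dataset layout may differ.
    sample = ["10001082,285259775,1,97lk14c,4076,2014-12-08 18"]
    for row in pre_deal(sample, ["山东", "山西", "河南"]):
        print(row)
```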
Hive
word_count
create table word_count as select word, count(1) as count from (select explode(split(line,' '))as word from docs) w group by word order by word;
create table word_counts as select word,count(1) as count from (select explode(split(line,' ')) as word from docs) word group by word order by word;
Hive user analysis
CREATE EXTERNAL TABLE dblab.bigdata_user(
    id INT,
    uid STRING,
    item_id STRING,
    behavior_type INT,
    item_category STRING,
    visit_date DATE,
    province STRING
)
COMMENT 'Welcome to dblab!'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/bigdatacase/dataset';
Count how many records are not duplicated (rows that appear exactly once):
select count(*) from (select uid,item_id,behavior_type,item_category,visit_date,province from bigdata_user group by uid,item_id,behavior_type,item_category,visit_date,province having count(*)=1)a;
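The effect of "group by ... having count(*)=1" is easy to misread: it counts the field combinations that occur exactly once, so a combination that appears two or more times contributes nothing, unlike a count of distinct rows. A small in-memory sketch of that logic, with a hypothetical function name and made-up rows:

```python
from collections import Counter


def count_unique_rows(rows):
    """Count the rows whose full field combination occurs exactly once."""
    counts = Counter(tuple(row) for row in rows)
    return sum(1 for n in counts.values() if n == 1)


if __name__ == "__main__":
    # Hypothetical rows: (uid, item_id, behavior_type, item_category, visit_date, province)
    rows = [
        ("u1", "i1", 1, "c1", "2014-12-08", "山东"),
        ("u1", "i1", 1, "c1", "2014-12-08", "山东"),  # duplicate, excluded entirely
        ("u2", "i2", 1, "c1", "2014-12-08", "山西"),
        ("u3", "i3", 4, "c2", "2014-12-09", "河南"),
    ]
    print(count_unique_rows(rows))
```

Here the result is 2, whereas the number of distinct combinations would be 3.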
5.https://www.cnblogs.com/kaituorensheng/p/3826114.html
https://blog.csdn.net/qq_39662852/article/details/84318619
https://www.liaoxuefeng.com/article/1280231425966113
https://blog.csdn.net/helloxiaozhe/article/details/88964067
https://www.jianshu.com/p/21c880ee93a9
wget http://www.gutenberg.org/files/5000/5000-8.txt
wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt