1. Word count, using the demo bundled with CDH Hadoop. Run it as follows:
wordcount ---- hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar wordcount, with input path /user/root/input and output path /user/root/output
Command: hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-examples.jar wordcount /user/root/input /user/root/output
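For reference, the same job as a minimal spark-shell sketch (assuming the same HDFS input path; the output directory name is illustrative and must not already exist):
val lines = sc.textFile("/user/root/input")
val counts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)   // (word, count) pairs
counts.saveAsTextFile("/user/root/output_spark")   // writes part-* files, like the MapReduce job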
2. Creating a Hive table:
Create external table testtable (
name string,
message string
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
location '/user/file.csv'
tblproperties ("skip.header.line.count"="1", "skip.footer.line.count"="2");
These are the two tblproperties attributes in the SQL above:
"skip.header.line.count": how many lines to skip at the start of the file
"skip.footer.line.count": how many lines to skip at the end of the file
DROP TABLE IF EXISTS <table_name>;
-- TAB2 is an external table, just like TAB1
CREATE EXTERNAL TABLE <table_name>
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/test';
LOAD DATA INPATH '/user/test/data1.csv' INTO TABLE <table_name>;
For example:
load data inpath '/user/test/myfile/artrile.txt' into table arc;
3. Permission issues
Set permissions: hdfs dfs -chmod 777 /user/...
4.hive & impala语句使用规则:
Hive和impala语句支持大部分sql,有个别方式不支持,具体差距:
https:
https:
To make Impala see Hive metadata changes, run: INVALIDATE METADATA;
5. HBase rules
HBase stores data as key-value pairs.
Common commands (in hbase shell):
(1) Create a table: create '<table>', {NAME => '<column family>', VERSIONS => <n>}. For example: create 'User','info' (creates a User table with one column family, info)
(2) List all tables: list
(3) Show table details: describe 'User'
(4) Delete a column family: alter 'User', 'delete' => 'info'
(5) Insert data: put '<table>', '<rowkey>', '<family:column>', '<value>'. For example: put 'User', 'row1', 'info:name', 'xiaoming'
(6) Get a record by rowkey: get '<table>', '<rowkey>' [, '<family:column>', ...]. For example: get 'User', 'row2'
(7) Scan all records: scan 'User'
(8) Scan the first 2 records: scan 'User', {LIMIT => 2}
(9) Range scan: scan 'User', {STARTROW => 'row2'} ---- scan 'User', {STARTROW => 'row2', ENDROW => 'row2'}
(10) Count rows: count '<table>', {INTERVAL => intervalNum, CACHE => cacheNum}. For example: count 'User'
(11) Delete a column: delete 'User', 'row1', 'info:age'
(12) Delete an entire row: deleteall 'User', 'row2'
(13) Delete all data in a table: truncate 'User'
(14) Disable a table: disable 'User'
(15) Enable a table: enable 'User'
(16) Check whether a table exists: exists 'User'
(17) A table must be disabled before it can be dropped: drop 'TEST.USER'
6. Using Hive over HBase
Command: CREATE EXTERNAL TABLE h1
(id string, name string,age int,sex string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:age,info:sex") TBLPROPERTIES("hbase.table.name" = "User");
This creates a Hive table that maps onto the HBase table and reads its data.
7. Spark notes
(1) Spark runs independently of Hadoop;
(2) Spark RDD programming
In Spark, an RDD is an immutable distributed collection of objects. Each RDD is split into multiple partitions, which run on different nodes of the cluster.
An RDD can hold objects of any Python, Java, or Scala type, including user-defined classes.
Two ways to create an RDD:
1 Load an external dataset (SparkContext.textFile()): lines = sc.textFile("README.md")
2 Parallelize a collection (SparkContext.parallelize()): lines = sc.parallelize(List("pandas","i like pandas"))
Two kinds of RDD operations:
1 Transformations: produce a new RDD from an existing RDD
2 Actions: compute a result from the RDD's elements and return it
Lazy evaluation of RDDs:
New RDDs can be defined at any time, but Spark evaluates them lazily: they are only actually computed the first time they are used in an action.
Even then, not all of the computation is performed; Spark only does as much work as the action needs.
lines.first(): Spark only computes the first element of the RDD
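A minimal spark-shell sketch of this laziness (the file name and the "ERROR" filter are illustrative):
val lines = sc.textFile("README.md")            // nothing is read yet
val errors = lines.filter(_.contains("ERROR"))  // still nothing is computed
errors.first()                                  // Spark now scans only far enough to return one matching line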
Common transformations:
Transformations on a single RDD:
Original RDD:
scala> val rdd = sc.parallelize(List(1,2,3,3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27
map(): apply a function to each element and return a new RDD
scala> rdd.map(x => x + 1).collect()
res0: Array[Int] = Array(2, 3, 4, 4)
flatMap(): apply a function to each element and flatten all elements of the returned iterators into one new RDD
scala> rdd.flatMap(x => x.to(3)).collect()
res2: Array[Int] = Array(1, 2, 3, 2, 3, 3, 3)
filter(): keep only the elements that satisfy the predicate, returned as a new RDD
scala> rdd.filter(x => x != 1).collect()
res3: Array[Int] = Array(2, 3, 3)
distinct(): remove duplicate elements
scala> rdd.distinct().collect()
res5: Array[Int] = Array(1, 2, 3)
sample(withReplacement, fraction, [seed]): sample the RDD, with or without replacement
If the first argument is true, the sample may contain repeated elements; if false, it will not;
the second argument is in [0,1]; the sample size is roughly that fraction of the total;
the third argument is the random seed.
scala> rdd.sample(false,0.5).collect()
res7: Array[Int] = Array(3, 3)
scala> rdd.sample(false,0.5).collect()
res8: Array[Int] = Array(1, 2)
scala> rdd.sample(false,0.5,10).collect()
res9: Array[Int] = Array(2, 3)
Transformations on two RDDs:
Original RDDs:
scala> val rdd1 = sc.parallelize(List(1,2,3))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[13] at parallelize at <console>:27
scala> val rdd2 = sc.parallelize(List(3,4,5))
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[14] at parallelize at <console>:27
union(): combine the two RDDs, without removing duplicates
scala> rdd1.union(rdd2).collect()
res10: Array[Int] = Array(1, 2, 3, 3, 4, 5)
intersection(): intersection of the two RDDs
scala> rdd1.intersection(rdd2).collect()
res11: Array[Int] = Array(3)
subtract(): remove the elements that also appear in the other RDD
scala> rdd1.subtract(rdd2).collect()
res12: Array[Int] = Array(1, 2)
cartesian(): Cartesian product
scala> rdd1.cartesian(rdd2).collect()
res13: Array[(Int, Int)] = Array((1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (3,3), (3,4), (3,5))
Common actions:
Original RDD:
scala> val rdd = sc.parallelize(List(1,2,3,3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27
collect(): return all elements
scala> rdd.collect()
res14: Array[Int] = Array(1, 2, 3, 3)
count(): return the number of elements
scala> rdd.count()
res15: Long = 4
countByValue(): how many times each element occurs
scala> rdd.countByValue()
res16: scala.collection.Map[Int,Long] = Map(1 -> 1, 2 -> 1, 3 -> 2)
take(num): return num elements
scala> rdd.take(2)
res17: Array[Int] = Array(1, 2)
top(num): return the top num elements
scala> rdd.top(2)
res18: Array[Int] = Array(3, 3)
takeOrdered(num)[(ordering)]: return the first num elements according to the given ordering (ascending by default; see the short sketch after these examples)
scala> rdd.takeOrdered(2)
res28: Array[Int] = Array(1, 2)
scala> rdd.takeOrdered(3)
res29: Array[Int] = Array(1, 2, 3)
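The optional ordering argument can be sketched like this (reverse ordering makes takeOrdered return the largest elements, the same as top):
rdd.takeOrdered(2)(Ordering[Int].reverse)   // expected result: Array(3, 3), i.e. equivalent to top(2)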
takeSample(withReplacement, num, [seed]): return a sample of num elements
scala> rdd.takeSample(false,1)
res19: Array[Int] = Array(2)
scala> rdd.takeSample(false,2)
res20: Array[Int] = Array(2, 3)
scala> rdd.takeSample(false,2,20)
res21: Array[Int] = Array(3, 3)
reduce(func): combine all the elements of the RDD in parallel (the most commonly used action)
scala> rdd.reduce((x,y) => x + y)
res22: Int = 9
aggregate(zeroValue)(seqOp, combOp): first use seqOp to aggregate the T-typed elements inside each partition into a value of type U, then use combOp to merge the per-partition U values into a single U. Note that both seqOp and combOp start from zeroValue, whose type is U.
scala> rdd.aggregate((0,0))((x, y) => (x._1 + y, x._2 + 1), (x,y) => (x._1 + y._1, x._2 + y._2))
res24: (Int, Int) = (9,4)
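A short follow-up on what that (sum, count) pair is usually for, computing the mean of the same rdd (a sketch, not part of the original notes):
val (sum, cnt) = rdd.aggregate((0, 0))((acc, v) => (acc._1 + v, acc._2 + 1), (a, b) => (a._1 + b._1, a._2 + b._2))
val mean = sum.toDouble / cnt   // 9 / 4 = 2.25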
fold(zero)(func): like aggregate, but the same function op is used for both seqOp and combOp
scala> rdd.fold(0)((x, y) => x + y)
res25: Int = 9
foreach(func): apply func to every element
scala> rdd.foreach(x => println(x*2))
4
6
6
2
Running FP-Growth in Spark
1. Start the shell: spark-shell
2. Load the HDFS data: val data = sc.textFile("hdfs://192.168.100.140:8020/user/mllib/Groceries.txt")
3. Print the first ten records (sanity check): data.take(10).foreach(println)
4. Drop the useless header lines: val dataNoHead = data.filter(line => !line.contains("items"))
5. Check: dataNoHead.take(5).foreach(println)
6. Clean the data: val dataS = dataNoHead.map(line => line.split("\\{"))
7. Clean the data: val dataGoods = dataS.map(s => s(1).replace("}\"",""))
8. Convert it into modeling data: val fpData = dataGoods.map(_.split(",")).cache
9. Inspect it: fpData.take(5).foreach(line => line.foreach(print))
10. Build the FP-Growth model
11. Instantiate FPGrowth with a minimum support of 0.05 (itemsets below this support are discarded) and 3 partitions: val fpGroup = new FPGrowth().setMinSupport(0.05).setNumPartitions(3)
12. Build the model from the fpData prepared above by calling FPGrowth's run method: val fpModel = fpGroup.run(fpData)
13. Collect the frequent itemsets that meet the support threshold: val freqItems = fpModel.freqItemsets.collect
14. Print the frequent itemsets (a consolidated sketch of steps 1-14 follows this list): freqItems.foreach(f=>println("FrequentItem:"+f.items.mkString(",")+"OccurrenceFrequency:"+f.freq))
15. SparkR install path: /opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/spark
16. Run SparkR ----> ./sparkR
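Pulled together, steps 1-14 form the spark-shell session below (a sketch assuming the same paths and parameters; the Groceries.txt format is assumed to match what the split/replace steps expect):
import org.apache.spark.mllib.fpm.FPGrowth

val data = sc.textFile("hdfs://192.168.100.140:8020/user/mllib/Groceries.txt")
val dataNoHead = data.filter(line => !line.contains("items"))                       // drop the header line
val dataGoods = dataNoHead.map(_.split("\\{")).map(s => s(1).replace("}\"", ""))    // keep the item list inside {...}
val fpData = dataGoods.map(_.split(",")).cache()                                    // baskets as Array[String]

val fpGroup = new FPGrowth().setMinSupport(0.05).setNumPartitions(3)
val fpModel = fpGroup.run(fpData)
fpModel.freqItemsets.collect().foreach { f =>
  println("FrequentItem: " + f.items.mkString(",") + "  OccurrenceFrequency: " + f.freq)
}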
Sqoop2 data synchronization commands:
1. Test the connection: sqoop list-databases --connect jdbc:mysql:
2. Run a data import: sqoop import --connect jdbc:mysql:
3. Import the application database (MySQL) into HBase -----> sqoop import --connect jdbc:mysql:
Sqoop incremental import into Hive
1) Copy the MySQL table structure into Hive
sqoop create-hive-table --connect jdbc:mysql:
2) Create the HDFS directory for the Hive table
hdfs dfs -mkdir /user/hive/um_appuser
3) Point the Hive table at that directory
ALTER TABLE um_appuser SET LOCATION 'hdfs:
4) Convert it to an external table
ALTER TABLE um_appuser SET TBLPROPERTIES ('EXTERNAL'='TRUE');
Import the data with Sqoop
sqoop import --connect jdbc:mysql:
Run in Hive:
SELECT id FROM um_appuser ORDER BY id DESC LIMIT 1;
Incremental import
1) Create the job (note: --last-value must be set to the result returned by the query in the previous step)
sqoop job --create um_appuser -- import --connect jdbc:mysql:
2) List the Sqoop jobs that have been created
sqoop job --list
3) Create the scheduled task (crontab entry):
0 */1 * * * sqoop job --exec um_appuser > um_appuser_sqoop.log 2>&1 &
1) Query Hive
SELECT id FROM um_appuser ORDER BY id DESC LIMIT 1;
-- SELECT COUNT(DISTINCT id) FROM um_appuser;
SELECT COUNT(id) FROM um_appuser;
SELECT id FROM um_appuser GROUP BY id HAVING COUNT(id) > 1;
2) Query MySQL
SELECT COUNT(1) FROM um_appuser WHERE id <= (the latest id returned by the Hive query);
Sqoop supports two incremental import modes.
One is append, which tracks a monotonically increasing column, for example:
--incremental append --check-column num_iid --last-value 0
A varchar check column can also be imported incrementally this way (here ID is an increasing number stored as varchar):
--incremental append --check-column ID --last-value 8
The other is lastmodified, based on a timestamp, for example:
--incremental lastmodified --check-column created --last-value '2012-02-01 11:0:00'
That imports only the rows whose created value is greater than '2012-02-01 11:0:00'.
Scheduled incremental imports with Sqoop: https:
RHadoop: packages and environment to load before use
1. Start SparkR: library(SparkR)
Sys.setenv(JAVA_HOME="/usr/java/jdk1.8.0_45")
Sys.setenv(HADOOP_CMD="/usr/bin/hadoop")
Sys.setenv(HIVE_HOME="/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/hive")
Sys.setenv(HADOOP_HOME="/opt/cloudera/parcels/CDH-5.7.2-1.cdh5.7.2.p0.18/lib/hadoop")
library(rJava)
library(rhdfs)
hdfs.init()
hdfs.ls("/user")
input<-hdfs.cat("/user/mllib/Qualitative_Bankruptcy.data.txt")
library(Rserve)
library(RHive)
rhive.init()
rhive.connect("192.168.100.140", defaultFS="hdfs://192.168.100.140:8020")
rhive.query("show databases")
rhive.query("show tables")
input2 <- rhive.query("select * from default.mm")
Oozie scheduled tasks
1. Reference for common commands: http:
ES cluster setup:
1.http:
Common Elasticsearch commands:
1. Create an index: curl -XPUT http:
2. Bulk import from a file: curl -XPOST 192.168.100.140:9200/bank/account/_bulk?pretty --data-binary @accounts.json
3. Query an index: curl -XPOST 192.168.100.140:9200/aaa/_search?pretty -d "{\"query\": { \"match_all\": {} }}"
4. curl -XGET 192.168.100.140:9200/aaa/_search?pretty -d "{\"query\": { \"match_all\": {} }}"
5. curl -XGET 192.168.100.140:9200/aaa/bbb/222
6. Update a document: curl -XPUT "http://192.168.100.140:9200/fendo/account/222" -d "{\"first_name\":\"fk\"}"
7. Delete an index: curl -XDELETE http:
8. List all indices: curl 192.168.100.140:9200/_cat/indices?v
ES-Hadoop
Part 1: Hive
1. Prerequisite: load the es-hadoop jar. It has to be added every time Hive is started; it can also be put in the config file (not recommended, since it tends to conflict with other applications configured on Hive).
How to load the jar: add jar
(1) Method 1
CREATE EXTERNAL TABLE lxw1234_es_tags (
cookieid string,
age string,
name string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.nodes' = '192.168.100.140:9200,192.168.100.141:9200,192.168.100.142:9200',
'es.index.auto.create' = 'false',
'es.resource' = 'hello',
'es.read.metadata' = 'true',
'es.mapping.names' = 'cookieid:_metadata._id, age:age, name:name');
(2) Method 2
CREATE EXTERNAL TABLE company (
cookieid string,
address string,
address_suggest_input string,
address_suggest_output string,
address_suggest_weight bigint,
age bigint,
brand_suggest_output string,
brand_suggest_weight bigint,
business_scope string,
business_scope_suggest_input string,
business_scope_suggest_output string,
business_scope_suggest_weight bigint,
business_status string,
ceo_suggest_output string,
ceo_suggest_weight bigint,
legal_man string,
legal_man_suggest_input string,
legal_man_suggest_output string,
legal_man_suggest_weight bigint,
name string,
name_suggest_input string,
name_suggest_weight bigint,
province string,
registered_capital string,
registered_data date,
score float
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.nodes' = '192.168.100.140:9200,192.168.100.141:9200,192.168.100.142:9200',
'es.index.auto.create' = 'false',
'es.resource' = 'company',
'es.read.metadata' = 'true',
'es.mapping.names' = '
cookieid:_metadata._id,
address:address,
address_suggest_input:address_suggest.input,
address_suggest_output:address_suggest.output,
address_suggest_weight:address_suggest.weight,
age:age,
brand_suggest_output:brand_suggest.output,
brand_suggest_weight:brand_suggest.weight,
business_scope:business_scope,
business_scope_suggest_input:business_scope_suggest.input,
business_scope_suggest_output:business_scope_suggest.output,
business_scope_suggest_weight:business_scope_suggest.weight,
business_status:business_status,
ceo_suggest_output:ceo_suggest.output,
ceo_suggest_weight:ceo_suggest.weight,
legal_man:legal_man,
legal_man_suggest_input:legal_man_suggest.input,
legal_man_suggest_output:legal_man_suggest.output,
legal_man_suggest_weight:legal_man_suggest.weight,
name:name,
name_suggest_input:name_suggest.input,
name_suggest_weight:name_suggest.weight,
province:province,
registered_capital:registered_capital,
registered_data:registered_data,
score:score
');
CREATE EXTERNAL TABLE user(id BIGINT, name STRING) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists','es.index.auto.create' = 'true',
'es.nodes'='192.168.100.140,192.168.100.141,192.168.100.142','es.port'='9200');
CREATE TABLE user_source (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/root/data/ceshi.txt' OVERWRITE INTO TABLE user_source;
INSERT OVERWRITE TABLE user SELECT s.id, s.name FROM user_source s;
Part 2: Spark
./spark-shell --jars ../lib/elasticsearch-spark-20_2.10-5.0.2.jar
import org.apache.spark.SparkConf
import org.elasticsearch.spark._
val conf = new SparkConf()
conf.set("es.index.auto.create", "true")
conf.set("es.nodes","192.168.100.140,192.168.100.141,192.168.100.142")
conf.set("es.port","9200")
val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("OTP" -> "Otopeni", "SFO" -> "San Fran")
sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/docs")
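To read the documents back, the same org.elasticsearch.spark._ import provides an esRDD helper (a sketch; the index/type matches the write above):
val docs = sc.esRDD("spark/docs")   // RDD of (document id, source map)
docs.take(2).foreach(println)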
Spark SQL ------> syncing to Elasticsearch
bin/spark-sql --master spark:
Start the ES service: su - elasticsearch -c "/opt/elasticsearch-5.0.2/bin/elasticsearch >/dev/null 2>&1 &"
Start node.js: from the /usr/local/elasticsearch-head directory run /usr/local/elasticsearch-head/node_modules/grunt/bin/grunt server
The elasticsearch-head ES cluster management tool is built on node.js, so node.js has to be started.