This section presents a simple data-processing case for a company's human resources (HR) system. The structure of the content managed by the HR management system is shown in Figure 3-10.
Figure 3-10
The data sources of the HR system include basic employee information, basic department information, employee attendance records, and employee salary lists. The data files are stored in the local directory /usr/local/hrs.
1) Basic employee information: stores each employee's basic data, including name, employee id, gender, age, year of joining, position, and department id. The contents of people.txt are as follows:
Michael,1,male,37,2014,developer,2
Andy,2,female,33,2016,manager,1
Justin,3,female,23,2016,recruitingspecialist,3
John,4,male,22,2017,developer,2
Herry,5,male,27,2017,developer,1
Brewster,6,male,37,2014,manager,2
Brice,7,female,30,2016,manager,3
Justin,8,male,23,2017,recruitingspecialist,3
John,9,male,22,2018,developer,1
Herry,10,female,27,2017,recruitingspecialist,3
2) Basic department information: stores department data, including the department name and id. The contents of department.txt are as follows:
management,1
researchanddevelopment,2
HumanResources,3
3) Employee attendance records: stores each employee's attendance data, including the year, the month, and the number of hours of overtime, lateness, absenteeism, and leaving early. The contents of attendance.txt are as follows (the month values are simulated with random numbers):
1,2017,12,0,2,4,0
2,2017,8,5,0,5,3
3,2017,3,16,4,1,5
4,2018,3,0,0,0,0
5,2018,3,0,3,0,0
6,2017,3,32,0,0,0
7,2017,3,0,16,3,32
8,2015,19,36,0,0,0,3
9,2017,5,6,30,0,2,2
10,2017,10,6,56,40,0,22
1,2016,12,0,2,4,0
2,2014,38,5,40,5,3
3,2016,23,16,24,1,5
4,2016,23,0,20,0,0
5,2016,3,0,3,20,0
6,2016,23,32,0,0,0
7,2014,43,0,16,3,32
8,2016,49,36,0,20,0,3
9,2016,45,6,30,0,22,2
10,2014,40,6,56,40,0,22
4) Employee salary list: stores each employee's monthly salary. The contents of salary.txt are as follows:
1,6000
2,15000
3,6000
4,7000
5,5000
6,17000
7,20000
8,5500
9,6500
10,7500
This case uses the Spark 2.2.1 + Hive data warehouse integration environment set up earlier; Hive's configuration file hive-site.xml has already been copied into Spark 2.2.1's $SPARK_HOME/conf directory.
We now load the HR system data into the HRS (Human Resource System) database in the Hive warehouse, creating one table for each of the four data sets.
1) Start spark-shell.
root@master:~# spark-shell --master spark://192.168.189.1:7077 --driver-class-path
/usr/local/apache-hive-1.2.1/lib/mysql-connector-java-5.1.13-bin.jar --executor-memory 512m --total-executor-cores 4
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/02/21 09:18:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://master:4040
Spark context available as 'sc' (master = local[*], app id = local-1519175914783).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/
Query the databases in Hive.
scala> spark.sql("show databases").show
+------------+
|databaseName|
+------------+
| default|
| hive|
| hivestudy|
+------------+
2) Create and use the HR system (HRS) database. Create the HRS database in Hive.
scala> spark.sql("CREATE DATABASE HRS")
18/02/21 09:36:09 WARN metastore.ObjectStore: Failed to get database hrs, returning NoSuchObjectException
res13: org.apache.spark.sql.DataFrame = []
scala> spark.sql("show databases").show
+------------+
|databaseName|
+------------+
| default|
| hive|
| hivestudy|
| hrs|
+------------+
3) Use the HRS database in Hive.
scala> spark.sql("USE HRS")
res15: org.apache.spark.sql.DataFrame = []
4) In the HRS database, create one table for each of the four data sets; each table is created only if it does not already exist.
• Create the basic employee information table people.
scala> spark.sql("CREATE TABLE IF NOT EXISTS people(nameSTRING, id INT,gender
STRING, age INT, year INT, position STRING, depID INT) ROW FORMATDELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
18/02/21 09:39:49 WARN metastore.HiveMetaStore: Location:hdfs://master:9000/user/hive/warehouse/hrs.db/people specified for non-externaltable:people
res16: org.apache.spark.sql.DataFrame = []
• Create the basic department information table department.
scala>spark.sql("CREATE TABLE IF NOT EXISTS department(name STRING, depIDINT)
ROW FORMATDELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'")
18/02/21 09:40:59WARN metastore.HiveMetaStore: Location: hdfs://master:9000/user/hive/warehouse/hrs.db/departmentspecified for non-external table:department
res17:org.apache.spark.sql.DataFrame = []
• Create the employee attendance table attendance.
scala>spark.sql("CREATE TABLE IF NOT EXISTS attendance (id INT, year INT, monthINT,
overtime INT,latetime INT, absenteeism INT, leaveearlytime INT) ROW FORMAT DELIMITED FIELDSTERMINATED BY ',' LINES TERMINATED BY '\n'")
18/02/21 09:42:10WARN metastore.HiveMetaStore: Location:hdfs://master:9000/user/hive/warehouse/hrs.db/attendance specified fornon-external table:attendance
res18:org.apache.spark.sql.DataFrame = []
• Create the employee salary table salary.
scala>spark.sql("CREATE TABLE IF NOT EXISTS salary (id INT, salary INT) ROWFORMAT
DELIMITED FIELDSTERMINATED BY ',' LINES TERMINATED BY '\n'")
18/02/21 09:43:11WARN metastore.HiveMetaStore: Location: hdfs://master:9000/user/hive/warehouse/hrs.db/salaryspecified for non-external table:salary
res19:org.apache.spark.sql.DataFrame = []
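Before loading any data, it can be worth confirming that the four tables now exist in the hrs database. A quick check (output omitted) might look like this:

// List the tables in the current database and inspect one table's columns.
scala> spark.sql("show tables").show()
scala> spark.sql("describe people").show()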
Loading the HR system data: load each of the local text files into its corresponding table.
1) List the four local source data files.
root@master:/usr/local/hrs# ls -ltr
total 16
-rw-r--r-- 1 root root 370 Feb 21 09:47 people.txt
-rw-r--r-- 1 root root  55 Feb 21 09:49 department.txt
-rw-r--r-- 1 root root 390 Feb 21 09:50 attendance.txt
-rw-r--r-- 1 root root  74 Feb 21 09:50 salary.txt
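It can also be useful to sanity-check the raw files from the spark-shell before loading them into Hive. The following is a minimal sketch (not part of the original walkthrough) that counts the comma-separated fields per line of attendance.txt, so rows with an unexpected field count (such as the few attendance rows carrying an extra trailing value) stand out immediately:

// Count how many comma-separated fields each line of attendance.txt has.
import spark.implicits._

val attendanceLines = spark.read.textFile("file:///usr/local/hrs/attendance.txt")
attendanceLines
  .map(_.split(",").length)   // fields per line
  .groupBy("value")           // a Dataset of Int exposes a single column named "value"
  .count()
  .show()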
2) Load the basic employee information table in Hive's HRS database as follows:
scala> spark.sql("LOAD DATA LOCAL INPATH '/usr/local/hrs/people.txt' OVERWRITE INTO
TABLE people")
res20: org.apache.spark.sql.DataFrame = []
Here OVERWRITE means the data currently in the table is replaced: the table is first cleared and then the new data is inserted. The loads for the other tables are similar.
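To illustrate the difference, a minimal sketch (the row counts in the comments assume the ten-row people.txt shown above): INTO TABLE appends, so repeating a load grows the table, while OVERWRITE INTO TABLE truncates it first.

// INTO TABLE appends: if people already holds the 10 rows, this makes 20.
scala> spark.sql("LOAD DATA LOCAL INPATH '/usr/local/hrs/people.txt' INTO TABLE people")
scala> spark.sql("select count(*) from people").show()   // 20, assuming the table held 10 rows

// OVERWRITE INTO TABLE truncates first, so the table is back to the 10 rows of the file.
scala> spark.sql("LOAD DATA LOCAL INPATH '/usr/local/hrs/people.txt' OVERWRITE INTO TABLE people")
scala> spark.sql("select count(*) from people").show()   // 10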
3) Load the basic department information table in Hive's HRS database as follows:
scala> spark.sql("LOAD DATA LOCAL INPATH '/usr/local/hrs/department.txt' INTO TABLE
department")
res22: org.apache.spark.sql.DataFrame = []
4) Load the employee attendance table in Hive's HRS database as follows:
scala> spark.sql("LOAD DATA LOCAL INPATH '/usr/local/hrs/attendance.txt' INTO TABLE
attendance")
res23: org.apache.spark.sql.DataFrame = []
5) Load the employee salary table in Hive's HRS database as follows:
scala> spark.sql("LOAD DATA LOCAL INPATH '/usr/local/hrs/salary.txt' INTO TABLE salary")
res24: org.apache.spark.sql.DataFrame = []
Open the Hadoop web interface (http://192.168.189.1:50070/explorer.html#/user/hive/warehouse/hrs.db); the HR system data as stored in HDFS is shown in Figure 3-11.
Figure 3-11 The HR system data in HDFS
scala> spark.sql("select * from people").show
+--------+---+------+---+----+--------------------+-----+
| name| id|gender|age|year| position|depID|
+--------+---+------+---+----+--------------------+-----+
| Michael| 1| male| 37|2014| developer| 2|
| Andy| 2|female| 33|2016| manager| 1|
| Justin| 3|female| 23|2016|recruitingspecialist| 3|
| John| 4| male| 22|2017| developer| 2|
| Herry| 5| male| 27|2017| developer| 1|
|Brewster| 6| male| 37|2014| manager| 2|
| Brice| 7|female| 30|2016| manager| 3|
| Justin| 8| male| 23|2017|recruitingspecialist| 3|
| John| 9| male| 22|2018| developer| 1|
| Herry| 10|female| 27|2017|recruitingspecialist| 3|
+--------+---+------+---+----+--------------------+-----+
scala> spark.sql("select * from department").show
+--------------------+-----+
| name|depID|
+--------------------+-----+
| management| 1|
|researchanddevelo...| 2|
| HumanResources| 3|
+--------------------+-----+
scala> spark.sql("select * from attendance").show
18/02/21 12:10:44 WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoring similar problems.
+---+----+-----+--------+--------+-----------+--------------+
| id|year|month|overtime|latetime|absenteeism|leaveearlytime|
+---+----+-----+--------+--------+-----------+--------------+
| 1|2017| 12| 0| 2| 4| 0|
| 2|2017| 8| 5| 0| 5| 3|
| 3|2017| 3| 16| 4| 1| 5|
| 4|2018| 3| 0| 0| 0| 0|
| 5|2018| 3| 0| 3| 0| 0|
| 6|2017| 3| 32| 0| 0| 0|
| 7|2017| 3| 0| 16| 3| 32|
| 8|2015| 19| 36| 0| 0| 0|
| 9|2017| 5| 6| 30| 0| 2|
| 10|2017| 10| 6| 56| 40| 0|
| 1|2016| 12| 0| 2| 4| 0|
| 2|2014| 38| 5| 40| 5| 3|
| 3|2016| 23| 16| 24| 1| 5|
| 4|2016| 23| 0| 20| 0| 0|
| 5|2016| 3| 0| 3| 20| 0|
| 6|2016| 23| 32| 0| 0| 0|
| 7|2014| 43| 0| 16| 3| 32|
| 8|2016| 49| 36| 0| 20| 0|
| 9|2016| 45| 6| 30| 0| 22|
| 10|2014| 40| 6| 56| 40| 0|
+---+----+-----+--------+--------+-----------+--------------+
scala> spark.sql("select * from salary").show
+---+------+
| id|salary|
+---+------+
| 1| 6000|
| 2| 15000|
| 3| 6000|
| 4| 7000|
| 5| 5000|
| 6| 17000|
| 7| 20000|
| 8| 5500|
| 9| 6500|
| 10| 7500|
+---+------+
Common queries on the HR system data include counting the employees in each department, finding the top-N salaries in each department, ranking departments by average salary, and totaling each department's yearly salaries. Concrete examples are given below; a hedged sketch of the top-N query appears at the end of this subsection.
1) Inspect each table, and note the schema information echoed by the REPL:
scala> spark.sql("select * from people")
res29: org.apache.spark.sql.DataFrame = [name: string, id: int ... 5 more fields]
scala> spark.sql("select * from department")
res30: org.apache.spark.sql.DataFrame = [name: string, depID: int]
scala> spark.sql("select * from attendance")
res31: org.apache.spark.sql.DataFrame = [id: int, year: int ... 5 more fields]
scala> spark.sql("select * from salary")
res32: org.apache.spark.sql.DataFrame = [id: int, salary: int]
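The REPL echo truncates wide schemas to "... N more fields"; printSchema lists every column. A small sketch:

// Print the complete schema of the people table (the echo above abbreviates it).
scala> spark.sql("select * from people").printSchema()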
2) Counting the employees in each department. First join the people table with the department table, then group by the department name, and within each group count the id field that uniquely identifies an employee in people. The result is the total number of employees in each department.
scala>spark.sql("select b.name, count(a.id) from people a join department b ona.depid=b.depid
group byb.name").show
+--------------------+---------+
| name|count(id)|
+--------------------+---------+
| management| 3|
| HumanResources| 4|
|researchanddevelo...| 3|
+--------------------+---------+
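The same head count can also be expressed with the DataFrame API instead of SQL; an equivalent sketch, shown only for comparison and not taken from the original text:

// DataFrame-API version of the per-department head count.
val people = spark.table("people")
val department = spark.table("department")

people.join(department, people("depID") === department("depID"))
  .groupBy(department("name"))
  .count()
  .show()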
3) Ranking departments by the total and average salary of their employees.
First join the people table with the department table on the department id and with the salary table on the employee id, then group by the department name, compute the sum or average of the salaries within each group, and sort by that value.
scala>spark.sql("select b.name, sum(c.salary) as s from people a join departmentb on
a.depid=b.depidjoin salary c on a.id=c.id group by b.name order by s").show
+--------------------+-----+
| name| s|
+--------------------+-----+
| management|26500|
|researchanddevelo...|30000|
| HumanResources|39000|
+--------------------+-----+
4) The ranking of departments by average employee salary is obtained as follows:
scala>spark.sql("select b.name, avg(c.salary) as s from people a join departmentb on
a.depid=b.depidjoin salary c on a.id=c.id group by b.name order by s").show
+--------------------+-----------------+
| name| s|
+--------------------+-----------------+
| management|8833.333333333334|
| HumanResources| 9750.0|
|researchanddevelo...| 10000.0|
+--------------------+-----------------+
5) Querying the attendance information of each department's employees.
First join the attendance table with the people table on the employee id and compute each employee's attendance score (overtime minus the hours of lateness, absenteeism, and leaving early), then group by the department name and the year, and sum the attendance scores within each group.
The query is as follows:
scala>spark.sql("select b.name, sum(h.attdinfo), h.year from (select a.id,a.depid, at.year, at.month,
overtime-latetime-absenteeism- leaveearlytime as attdinfo from attendance at join people a onat.id = a.id) h join department b on h.depid = b.depid group by b.name,h.year").show
18/02/21 10:13:37WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoringsimilar problems.
+--------------------+-------------+----+
| name|sum(attdinfo)|year|
+--------------------+-------------+----+
| management| -29|2017|
|researchanddevelo...| 26|2017|
| management| -3|2018|
| HumanResources| -141|2014|
|researchanddevelo...| 6|2016|
| HumanResources| 36|2015|
| management| -69|2016|
| HumanResources| -135|2017|
|researchanddevelo...| 0|2018|
| management| -43|2014|
| HumanResources| 2|2016|
+--------------------+-------------+----+
6) Combining all of the previous queries.
scala>spark.sql("select e.name, e.pcount, f.sumsalary, f.avgsalary,j.year,j.sumattd from (select
b.name, count(a.id) as pcount from people a joindepartment b on a.depid=b.depid group by b.name order by pcount) e join (selectb.name, sum(c.salary) as sumsalary, avg(c.salary) as avgsalary from people ajoin department b on a.depid=b.depid join salary c on a.id=c.id group by b.nameorder by sumsalary) f on (e.name = f.name) join (select b.name, sum(h.attdinfo)as sumattd, h.year from (select a.id, a.depid, at.year, at.month,overtime-latetime- absenteeism- leaveearlytime as attdinfo from attendance atjoin people a on at.id = a.id) h join department b on h.depid = b.depid groupby b.name, h.year) j on f.name = j.name order by f.name").show
18/02/21 10:14:42WARN lazy.LazyStruct: Extra bytes detected at the end of the row! Ignoringsimilar problems.
+--------------------+------+---------+-----------------+----+-------+
| name|pcount|sumsalary| avgsalary|year|sumattd|
+--------------------+------+---------+-----------------+----+-------+
| HumanResources| 4| 39000| 9750.0|2015| 36|
| HumanResources| 4| 39000| 9750.0|2014| -141|
| HumanResources| 4| 39000| 9750.0|2016| 2|
| HumanResources| 4| 39000| 9750.0|2017| -135|
| management| 3| 26500|8833.333333333334|2016| -69|
| management| 3| 26500|8833.333333333334|2017| -29|
| management| 3| 26500|8833.333333333334|2014| -43|
| management| 3| 26500|8833.333333333334|2018| -3|
|researchanddevelo...| 3| 30000| 10000.0|2018| 0|
|researchanddevelo...| 3| 30000| 10000.0|2017| 26|
|researchanddevelo...| 3| 30000| 10000.0|2016| 6|
+--------------------+------+---------+-----------------+----+-------+
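The nested statement above is hard to read. An equivalent, hedged refactoring (not from the original text) registers the intermediate results as temporary views first and then joins them:

// Register the three sub-queries as temporary views, then join them.
spark.sql("select b.name, count(a.id) as pcount from people a join department b on a.depid = b.depid group by b.name")
  .createOrReplaceTempView("dep_count")
spark.sql("select b.name, sum(c.salary) as sumsalary, avg(c.salary) as avgsalary from people a join department b on a.depid = b.depid join salary c on a.id = c.id group by b.name")
  .createOrReplaceTempView("dep_salary")
spark.sql("""select b.name, h.year, sum(h.attdinfo) as sumattd
             from (select a.depid, at.year,
                          overtime - latetime - absenteeism - leaveearlytime as attdinfo
                   from attendance at join people a on at.id = a.id) h
             join department b on h.depid = b.depid
             group by b.name, h.year""")
  .createOrReplaceTempView("dep_attendance")

// Final join over the three views, equivalent to the combined query above.
spark.sql("""select e.name, e.pcount, f.sumsalary, f.avgsalary, j.year, j.sumattd
             from dep_count e
             join dep_salary f on e.name = f.name
             join dep_attendance j on f.name = j.name
             order by f.name""").show()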
Combining the previous queries into a single SQL statement yields the various statistics for each department, including its head count, its salary totals and averages, and its yearly attendance statistics.
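Finally, the per-department top-N salary query mentioned at the start of this subsection was not demonstrated above. A hedged sketch using a window function (row_number over salary within each department), here with N = 2, might look like this:

// Top-2 salaries per department using a window function.
scala> spark.sql("""select name, id, salary
             from (select b.name, a.id, c.salary,
                          row_number() over (partition by b.name order by c.salary desc) as rn
                   from people a
                   join department b on a.depid = b.depid
                   join salary c on a.id = c.id) t
             where rn <= 2""").show()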