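Contents:
1 Data deduplication (preprocessing: cleaning, filtering, deduplication)
2 Data sorting
3 Computing averages
4 Single-table join
5 Multi-table join
6 Log parsing
7 Common friends
8 Miscellaneous examples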
1 Data deduplication
2018-3-1 a
2018-3-2 b
2018-3-3 c
2018-3-4 d
2018-3-5 a
2018-3-6 b
2018-3-7 c
2018-3-3 c
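The deduplication sample above can be sketched in plain Python that imitates the map, shuffle, and reduce phases. This is a minimal illustration rather than a Hadoop job; the function names (`map_record`, `reduce_record`, `run_dedup`) are hypothetical, and treating the whole line as the key is an assumption about what counts as a duplicate.

```python
from itertools import groupby

RECORDS = """\
2018-3-1 a
2018-3-2 b
2018-3-3 c
2018-3-4 d
2018-3-5 a
2018-3-6 b
2018-3-7 c
2018-3-3 c
""".splitlines()


def map_record(line):
    """Map: emit the whole record as the key; the value carries no information."""
    yield line.strip(), None


def reduce_record(key, values):
    """Reduce: all copies of a record share one key, so emit that key once."""
    yield key


def run_dedup(lines):
    pairs = [kv for line in lines if line.strip() for kv in map_record(line)]
    pairs.sort(key=lambda kv: kv[0])  # stands in for the shuffle/sort phase
    out = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        out.extend(reduce_record(key, (value for _, value in group)))
    return out


if __name__ == "__main__":
    for record in run_dedup(RECORDS):
        print(record)
```

The duplicated record `2018-3-3 c` appears only once in the result, because both copies are grouped under the same key before the reduce step runs.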
2 Data sorting
Variants: grouped sorting, top-k (write these out yourself once)
For example, raw data:
2
32
654
32
15
756
65223
5956
22
650
92
26
54
6
Required result:
1 2
2 6
3 15
4 22
5 26
6 32
7 32
8 54
9 92
10 650
11 654
12 756
13 5956
14 65223
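A possible sketch of the sorting exercise, again in plain Python. It uses the 14 values that appear in the required result, relies on the framework sorting keys before the reduce step, and assumes a single reducer so that the rank counter is global. The function names are hypothetical.

```python
from collections import defaultdict

NUMBERS = [2, 32, 654, 32, 15, 756, 65223, 5956, 22, 650, 92, 26, 54, 6]


def map_number(value):
    """Map: the number itself is the key, so the shuffle sorts by it."""
    yield int(value), 1


def reduce_numbers(sorted_groups):
    """Reduce (single reducer assumed): number the keys in arrival order.

    A value that occurs k times is emitted k times, each with its own rank.
    """
    rank = 0
    for value, ones in sorted_groups:
        for _ in ones:
            rank += 1
            yield rank, value


def run_sort(values):
    groups = defaultdict(list)
    for value in values:
        for key, one in map_number(value):
            groups[key].append(one)
    # sorted() stands in for the key ordering done by the shuffle phase
    return list(reduce_numbers(sorted(groups.items())))


if __name__ == "__main__":
    for rank, value in run_sort(NUMBERS):
        print(rank, value)
```

With more than one reducer, the keys would also have to be partitioned by range so that every key on reducer i is smaller than every key on reducer i+1; this sketch leaves that out.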
3 Computing averages
Raw data:
1) math:
张三 88
李四 99
王五 66
赵六 77
2) chinese:
张三 78
李四 89
王五 96
赵六 67
3) english:
张三 80
李四 82
王五 84
赵六 86
Output:
张三 82
李四 90
王五 82
赵六 76
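The averaging exercise might be sketched as follows. The three subject files are modelled as lists of lines, and the function names are hypothetical. Integer division is used on the assumption that the expected output truncates, since 赵六's scores average 76.67 but the listed result is 76.

```python
from collections import defaultdict

SCORES = {
    "math": ["张三 88", "李四 99", "王五 66", "赵六 77"],
    "chinese": ["张三 78", "李四 89", "王五 96", "赵六 67"],
    "english": ["张三 80", "李四 82", "王五 84", "赵六 86"],
}


def map_score(line):
    """Map: key by student name so all of one student's scores meet in one reducer."""
    name, score = line.split()
    yield name, int(score)


def reduce_mean(name, scores):
    """Reduce: integer mean of the student's scores (truncating, as in the sample)."""
    yield name, sum(scores) // len(scores)


def run_mean(files):
    groups = defaultdict(list)
    for lines in files.values():
        for line in lines:
            for name, score in map_score(line):
                groups[name].append(score)
    return dict(kv for name, scores in groups.items() for kv in reduce_mean(name, scores))


if __name__ == "__main__":
    for name, mean in run_mean(SCORES).items():
        print(name, mean)
```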
4 Single-table join
Given a child-parent table, output the grandchild-grandparent table.
Sample input:
file:
child parent
Tom Lucy
Tom Jack
Jone Lucy
Jone Jack
Lucy Mary
Lucy Ben
Jack Alice
Jack Jesse
Terry Alice
Terry Jesse
Philip Terry
Philip Alma
Mark Terry
Mark Alma
Output:
grandchild grandparent
Tom Alice
Tom Jesse
Jone Alice
Jone Jesse
Tom Mary
Tom Ben
Jone Mary
Jone Ben
Philip Alice
Philip Jesse
Mark Alice
Mark Jesse
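One common way to approach this is a self-join: each input row is emitted twice, once keyed by the parent (tagged as contributing a child) and once keyed by the child (tagged as contributing a parent). The reducer for a person then pairs that person's children with that person's parents. The sketch below follows this idea; the tag strings and function names are hypothetical.

```python
from collections import defaultdict

PAIRS = [
    ("Tom", "Lucy"), ("Tom", "Jack"), ("Jone", "Lucy"), ("Jone", "Jack"),
    ("Lucy", "Mary"), ("Lucy", "Ben"), ("Jack", "Alice"), ("Jack", "Jesse"),
    ("Terry", "Alice"), ("Terry", "Jesse"), ("Philip", "Terry"),
    ("Philip", "Alma"), ("Mark", "Terry"), ("Mark", "Alma"),
]


def map_pair(child, parent):
    """Map: emit the row under both roles so the table can be joined with itself."""
    yield parent, ("child", child)    # the key person has this child
    yield child, ("parent", parent)   # the key person has this parent


def reduce_person(person, tagged):
    """Reduce: the key is the middle generation; cross its children with its parents."""
    children = [name for tag, name in tagged if tag == "child"]
    parents = [name for tag, name in tagged if tag == "parent"]
    for grandchild in children:
        for grandparent in parents:
            yield grandchild, grandparent


def run_join(pairs):
    groups = defaultdict(list)
    for child, parent in pairs:
        for key, value in map_pair(child, parent):
            groups[key].append(value)
    return sorted(
        pair for person, tagged in groups.items() for pair in reduce_person(person, tagged)
    )


if __name__ == "__main__":
    print("grandchild grandparent")
    for grandchild, grandparent in run_join(PAIRS):
        print(grandchild, grandparent)
```

The sketch returns the pairs in sorted order, so the row order differs from the sample output even though the set of pairs is the same.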
5 Multi-table join
Map side join
Reduce side join
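The two join strategies named above can be contrasted with a small sketch. The notes give no sample tables for this section, so the `ORDERS` and `USERS` data below are invented purely for illustration, as are the function names. A map-side join assumes one table is small enough to hold in memory on every mapper; a reduce-side join tags records from both tables with the join key and combines them after the shuffle.

```python
from collections import defaultdict

# Hypothetical tables, invented for this illustration only.
ORDERS = [("o1", "u1"), ("o2", "u2"), ("o3", "u1"), ("o4", "u9")]  # (order_id, user_id)
USERS = [("u1", "Ann"), ("u2", "Bob"), ("u3", "Cid")]              # (user_id, name)


def map_side_join(orders, users):
    """Map-side join: the small table is loaded into a dict and probed per record."""
    lookup = dict(users)
    return [(order_id, lookup[user_id]) for order_id, user_id in orders if user_id in lookup]


def reduce_side_join(orders, users):
    """Reduce-side join: group both tables by the join key, then cross the two sides."""
    groups = defaultdict(lambda: {"orders": [], "names": []})
    for order_id, user_id in orders:      # map output tagged as coming from ORDERS
        groups[user_id]["orders"].append(order_id)
    for user_id, name in users:           # map output tagged as coming from USERS
        groups[user_id]["names"].append(name)
    joined = []
    for user_id in sorted(groups):        # one reduce call per join key
        side = groups[user_id]
        joined.extend((order_id, name) for order_id in side["orders"] for name in side["names"])
    return joined


if __name__ == "__main__":
    print(map_side_join(ORDERS, USERS))
    print(reduce_side_join(ORDERS, USERS))
```

Both functions compute the same inner join here; the difference is where the matching happens and therefore how much data has to cross the shuffle.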
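6 Log parsing
Simple transformations (e.g. field truncation, string replacement)
Replacement via an external dictionary
Format conversion (e.g. converting JSON or XML to plain text)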
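The three kinds of parsing listed above usually fit in a map-only job, since each record is handled independently. The sketch below applies all three to one record: it converts JSON to a tab-separated line, truncates a timestamp to its date, and replaces a code through a dictionary. The record layout, the field names `ts`, `city`, and `msg`, and the `CITY_NAMES` dictionary are all hypothetical.

```python
import json

# Hypothetical external dictionary and input records, invented for illustration.
CITY_NAMES = {"010": "Beijing", "021": "Shanghai"}
RAW = [
    '{"ts": "2018-03-01 09:17:19", "city": "010", "msg": "login   ok"}',
    '{"ts": "2018-03-01 09:18:20", "city": "999", "msg": "logout"}',
    "not json at all",
]


def parse_line(line):
    """Map-only step: return one plain-text line, or None for an unparsable record."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None                                    # filter out dirty records
    if not isinstance(record, dict):
        return None
    day = str(record.get("ts", ""))[:10]               # field truncation: keep the date
    city = CITY_NAMES.get(record.get("city"), "unknown")  # dictionary replacement
    message = " ".join(str(record.get("msg", "")).split())  # collapse repeated spaces
    return "\t".join([day, city, message])             # format conversion to plain text


def run_parse(lines):
    return [out for out in map(parse_line, lines) if out is not None]


if __name__ == "__main__":
    for out in run_parse(RAW):
        print(out)
```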
7 Common friends
Raw data: each person's friend list
A:B,C,D,F,E,O
B:A,C,E,K
C:F,A,D,I
D:A,E,F,L
E:B,C,D,M,L
F:A,B,C,D,E,O,M
G:A,C,D,E,F
H:A,C,D,E,O
I:A,O
J:B,O
K:A,C,D
L:D,E,F
M:E,F,G
O:A,H,I,J
……
Output: the common friends of each person with every other person
A-B C,E,
A-C D,F,
A-D E,F,
A-E B,C,D,
A-F B,C,D,E,O,
A-G C,D,E,F,
A-H C,D,E,O,
A-I O,
A-J B,O,
A-K C,D,
A-L D,E,F,
A-M E,F,
B-C A,
B-D A,E,
……
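A frequently used approach chains two jobs. The first inverts the lists, mapping each friend to the people whose list contains that friend. The second emits every pair of those people with the friend as the value, so the reducer for a pair collects all their common friends. The sketch below follows that plan using only the 14 lists shown above; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

FRIENDS = """\
A:B,C,D,F,E,O
B:A,C,E,K
C:F,A,D,I
D:A,E,F,L
E:B,C,D,M,L
F:A,B,C,D,E,O,M
G:A,C,D,E,F
H:A,C,D,E,O
I:A,O
J:B,O
K:A,C,D
L:D,E,F
M:E,F,G
O:A,H,I,J
""".splitlines()


def invert_lists(lines):
    """Job 1: friend -> list of people whose friend list contains that friend."""
    owners = defaultdict(list)
    for line in lines:
        person, friends = line.split(":")
        for friend in friends.split(","):
            owners[friend].append(person)
    return owners


def pair_common_friends(owners):
    """Job 2: every pair of owners of a friend shares that friend."""
    common = defaultdict(list)
    for friend, people in owners.items():
        # sort first so that (A, B) and (B, A) land on the same key "A-B"
        for first, second in combinations(sorted(people), 2):
            common[f"{first}-{second}"].append(friend)
    return {pair: sorted(friends) for pair, friends in common.items()}


if __name__ == "__main__":
    result = pair_common_friends(invert_lists(FRIENDS))
    for pair in sorted(result):
        print(pair, ",".join(result[pair]))
```

Inverting first avoids comparing every pair of friend lists directly, and it also covers pairs such as A-G where only one person lists the other.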
8 Miscellaneous examples
Qunar written-test question:
The Qunar travel app produces a large volume of access logs every day. Each operation by a user [uuid-x] produces one log record. Assume that a user can reach the quote detail page [detail] through several entry points, such as one-way search [search-dancheng] and round-trip search [search-wangfan], select a flight there, and finally place an order [submit] to buy the ticket. The log format is shown below. Write a Map/Reduce program that meets the following requirement (pseudocode is sufficient).
a) Count how many of the Qunar travel app's orders on 20140510 came from one-way search and how many came from round-trip search.
Log sample (a fragment for illustration only; the real daily volume is very large):
20140510 09:17:19 uuid-01 search-dancheng dep=北京&arr=上海&date=20140529&pnvm=0
20140510 09:18:20 uuid-02 search-wangFan dep=北京&arr=上海&sdate=20140529&edate=20140605
20140510 09:18:23 uuid-01 detail dep=北京&arr=上海&date=20140529&fcode=CA1810
20140510 09:20:29 uuid-02 detail dep=北京&arr=上海&date=20140529&fcode=CA1810
20140510 09:21:19 uuid-01 submit dep=北京&arr=上海&date=20140529&fcode=CA1810&price=1280
20140510 09:23:19 uuid-03 search-dancheng dep=北京&arr=广州&date=20140529&pnvm=0
20140510 09:25:19 uuid-04 search-dancheng dep=北京&arr西安&date=20140529&pnvm=0
20140510 09:25:30 uuid-05 search-dancheng dep=北京&arr=天津&date=20140529&pnvm=0
20140510 09:26:29 uuid-04 detail dep=北京&arr=西安&上海&date=20140529&fcode=CA1810
20140510 09:28:19 uuid-06 submit dep=北京&arr=拉萨&date=20140529&fcode=CA1810&price=2260
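One way to answer requirement a) is to key every record by uuid, sort each user's events by time in the reducer, and attribute each submit to that user's most recent search. The sketch below does this in plain Python. Several points are assumptions rather than part of the question: a submit with no earlier search that day is counted as `unknown`, the action name is lower-cased so that `search-wangFan` matches `search-wangfan`, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

LOG = """\
20140510 09:17:19 uuid-01 search-dancheng dep=北京&arr=上海&date=20140529&pnvm=0
20140510 09:18:20 uuid-02 search-wangFan dep=北京&arr=上海&sdate=20140529&edate=20140605
20140510 09:18:23 uuid-01 detail dep=北京&arr=上海&date=20140529&fcode=CA1810
20140510 09:20:29 uuid-02 detail dep=北京&arr=上海&date=20140529&fcode=CA1810
20140510 09:21:19 uuid-01 submit dep=北京&arr=上海&date=20140529&fcode=CA1810&price=1280
20140510 09:23:19 uuid-03 search-dancheng dep=北京&arr=广州&date=20140529&pnvm=0
20140510 09:25:19 uuid-04 search-dancheng dep=北京&arr西安&date=20140529&pnvm=0
20140510 09:25:30 uuid-05 search-dancheng dep=北京&arr=天津&date=20140529&pnvm=0
20140510 09:26:29 uuid-04 detail dep=北京&arr=西安&上海&date=20140529&fcode=CA1810
20140510 09:28:19 uuid-06 submit dep=北京&arr=拉萨&date=20140529&fcode=CA1810&price=2260
""".splitlines()


def map_event(line, day):
    """Map: keep records of the requested day and key them by user."""
    fields = line.split(None, 4)          # date, time, uuid, action, params
    if len(fields) < 4 or fields[0] != day:
        return
    _, time, uuid, action = fields[:4]
    yield uuid, (time, action.lower())


def reduce_user(uuid, events):
    """Reduce: replay one user's events in time order and label each submit."""
    last_search = None
    for _, action in sorted(events):
        if action.startswith("search-"):
            last_search = action
        elif action == "submit":
            yield last_search or "unknown", 1   # assumption: no search seen -> unknown


def count_order_sources(lines, day="20140510"):
    by_user = defaultdict(list)
    for line in lines:
        for uuid, event in map_event(line, day):
            by_user[uuid].append(event)
    totals = Counter()
    for uuid, events in by_user.items():
        for source, one in reduce_user(uuid, events):
            totals[source] += one
    return totals


if __name__ == "__main__":
    for source, count in count_order_sources(LOG).items():
        print(source, count)
```

In a real job the per-user results would need a second aggregation step (or a counter) to sum the per-source totals across reducers; here a single `Counter` plays that role.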
Merging data-update logs for a power company
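Log-processing requirement from a certain company:
Query the logs by system and keyword, and output the 10 lines that follow each line containing the keyword, or save them to HDFS; the data is ultimately displayed on a web page.
(The line containing the keyword has no logical relationship with the 10 lines that follow it; the logs are messy raw data.)
Java application + shell script + Spark jar
After the user logs in, the Java application takes parameters such as the system and keyword and submits the query; Java then calls the shell script --> submit
The result data is saved to HDFS. The saved file is named with a random number and is finally read and displayed on the web page.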
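The core of this requirement, selecting each matching line together with the lines that follow it, can be sketched without Spark as a single pass over an ordered sequence of lines. The function name and parameters are hypothetical. Two choices here are assumptions: the matching line itself is included in the output, and overlapping windows are merged instead of being emitted twice.

```python
def lines_after_keyword(lines, keyword, following=10):
    """Return each line containing `keyword` plus the `following` lines after it."""
    selected = []
    remaining = 0                      # how many more lines of the current window to keep
    for line in lines:
        if keyword in line:
            remaining = following + 1  # the hit itself plus the lines after it
        if remaining > 0:
            selected.append(line)
            remaining -= 1
    return selected


if __name__ == "__main__":
    sample = [f"line {i}" for i in range(30)]
    sample[3] = "line 3 FINISHED_SUCCESS"
    for line in lines_after_keyword(sample, "FINISHED_SUCCESS"):
        print(line)
```

The approach depends on line order, so a distributed version would first need each line's position within its file (for example by attaching an index) before the windows can be selected in parallel.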
Sample data:
15-06-10.23:58:02.321 [pool-22-thread-5] INFO HttpPostMessageSender - HttpPostMessageSender resp statusCode: 200 content:success tradeno:2015061010001000070650683
15-06-10.23:58:02.321 [pool-22-thread-5] INFO HttpPostMessageSender - HttpPost 是否发送成功 true
15-06-10.23:58:02.321 [pool-22-thread-5] INFO NotificationServiceImpl - ****进行入库操作****
15-06-10.23:58:02.324 [pool-22-thread-5] INFO NotificationServiceImpl - ****没有此TRADE_NO,新增Notification****tradeNo=2015061010001000070650683
15-06-10.23:58:02.327 [pool-22-thread-5] INFO NotifyServiceImpl - ****enter--saveOrUpdateNotity****
15-06-10.23:58:02.330 [pool-22-thread-5] INFO NotifyServiceImpl - ****没有此TRADE_NO,新增Notify****tradeNo=2015061010001000070650683
15-06-10.23:58:02.333 [pool-22-thread-5] INFO ACCESS - 2015061010001000070650683,FINISHED_SUCCESS
15-06-10.23:58:04.250 [pool-20-thread-2] INFO NPPListener - Received a new message OutTradeNotify{tradeInfo=TradeInfo{outTradeNo='150610263916206067998', tradeNo='2015061010001000070650827', originalTradeNo='null', bizTradeNo='9488771051', tradeType=TRADE_GENERAL, subTradeType=SALE, payMethod=CASHIERGATEMODE, tradeMoney=Money{currency=CNY, amount=10420}, tradeSubject='消费订单', submitter=ThinCustomer{merchantNo='23077370', customerNo='360080000230773708', customerLoginName='null', customerName='null', customerOutName='null'}, seller=ThinCustomer{merchantNo='23077370', customerNo='360080000230773708', customerLoginName='null', customerName='null', customerOutName='null'}, sellerAccountNo='360080000230773708000811', buyer=ThinCustomer{merchantNo='null', customerNo='360000000260175680', customerLoginName='null', customerName='null', customerOutName='null'}, tradeStatus=TRADE_FINISHED, createdDate=Wed Jun 10 23:57:32 CST 2015, deadlineTime=null, tradeFinishedDate='20150610', tradeFinishedTime='235802', payTool=EXPRESS, bankCode='CEB', exchangeDate='null', exchangeRate='null', returnParams='null', oldGWV60AuthCode='null', oldEXV10TerminalNo='null', clearingCurrency=null, clearingMoney=null, tradeExtInfo=TradeExtInfo{notifyStatus='NOT', outMessageId='null', cardSha1='null', signNo='null', returnParams='null', extendParams='null', pageBackUrl='null', serverNotifyUrl='http://gw.jd.com/payment/notify_chinabankReal.action', notifySmsMoible='null', notifyMailAddress='null', innerMessageFormat='XML', apiMessageFormat='EX_V1.0', requestCharset='UTF-8', encryptType='3DES', signType='MD5', requestModule='null', requestVersion='null', remoteIp='109.145.60.24', receivingChannel='JDSC', requestProtocol='HTTP', requestMethod='null', outTradeDate='20150610', outTradeTime='235731', outTradeIp='109.145.60.24', outRefererHosts='null', retryCount=1}, ext=null}}, OutMessageNotify{apiMessageFormat=null, messageFormat=null, notifyCharset='null', signType='null', encryptType='null'}, MessageNotify{responseModule='null', responseCode='null', responseDesc='null'}
15-06-10.23:58:04.253 [pool-20-thread-2] INFO KeyServiceImpl - Calling SecurityService to get {} key for merchant {} with codeClass {}23077370KeyTypeEnum{code='3DES', cnName='三DES'}EXPRESS
15-06-10.23:58:04.264 [pool-20-thread-2] INFO CustomerCenterFacade - [INVOCATION_LOG_C] 2015-06-10.23:58:04.264;pool-20-thread-2;172.17.92.48:0->172.17.87.47:20996;com.wangyin.customer.api.CustomerCenterFacade:1.1.6.getMerchantCustomerKeys(com.wangyin.customer.common.dto.customer.CustomerParamDTO);***;2015-06-10.23:58:04.253;RESULT:***;11,112,359;
15-06-10.23:58:04.264 [pool-20-thread-2] INFO KeyServiceImpl - 获取的3DES 密钥值为20B0984A9B751F0B911A1AEA0738D557AE16548CCE029E2A
15-06-10.23:58:04.264 [pool-20-thread-2] INFO KeyServiceImpl - Calling SecurityService to get {} key for merchant {} with codeClass {}23077370KeyTypeEnum{code='SALT', cnName='签名密钥'}EXPRESS
15-06-10.23:58:04.270 [pool-24-thread-4] INFO NPPListener - Received a new message OutTradeNotify{tradeInfo=TradeInfo{outTradeNo='22015061023575751670871914', tradeNo='2015061010001000070651406', originalTradeNo='null', bizTradeNo='null', tradeType=TRADE_GENERAL, subTradeType=SALE, payMethod=APIEXPRESSMODE, tradeMoney=Money{currency=CNY, amount=500000}, tradeSubject='消费订单', submitter=ThinCustomer{merchantNo='22843776', customerNo='360080000228437761', customerLoginName='null', customerName='null', customerOutName='null'}, seller=ThinCustomer{merchantNo='22843776', customerNo='360080000228437761', customerLoginName='null', customerName='null', customerOutName='null'}, sellerAccountNo='360080000228437761000811', buyer=null, tradeStatus=TRADE_FINISHED, createdDate=Wed Jun 10 23:57:57 CST 2015, deadlineTime=null, tradeFinishedDate='20150610', tradeFinishedTime='235802', payTool=EXPRESS, bankCode='ICBC', exchangeDate='null', exchangeRate='null', returnParams='22894010', oldGWV60AuthCode='null', oldEXV10TerminalNo='00000002', clearingCurrency=null, clearingMoney=null, tradeExtInfo=TradeExtInfo{notifyStatus='NOT', outMessageId='API.150610.0ddf8c2f7ed94f3e9f741cd44500a866', cardSha1='5D72C7755A82576EE906BAB8314164ABAC513C9C', signNo='201505110010089270009113541', returnParams='22894010', extendParams='null', pageBackUrl='null', serverNotifyUrl='http://jrb-api.d.chinabank.com.cn/notify/quick.htm', notifySmsMoible='null', notifyMailAddress='null', innerMessageFormat='XML', apiMessageFormat='EX_V1.0', requestCharset='UTF-8', encryptType='3DES', signType='MD5', requestModule='null', requestVersion='null', remoteIp='172.17.80.168', receivingChannel='API', requestProtocol='HTTP', requestMethod='POST', outTradeDate='null', outTradeTime='null', outTradeIp='null', outRefererHosts='null', retryCount=1}, ext=null}}, OutMessageNotify{apiMessageFormat=null, messageFormat=null, notifyCharset='null', signType='null', encryptType='null'}, MessageNotify{responseModule='null', responseCode='null', responseDesc='null'}
15-06-10.23:58:04.271 [pool-20-thread-2] INFO CustomerCenterFacade - [INVOCATION_LOG_C] 2015-06-10.23:58:04.271;pool-20-thread-2;172.17.92.48:0->172.17.91.104:20996;com.wangyin.customer.api.CustomerCenterFacade:1.1.6.getMerchantCustomerKeys(com.wangyin.customer.common.dto.customer.CustomerParamDTO);***;2015-06-10.23:58:04.264;RESULT:***;6,971,110;
15-06-10.23:58:04.272 [pool-20-thread-2] INFO KeyServiceImpl - 获取MD5 TOKEN 密钥的值为1qaz2wsx3edc
15-06-10.23:58:04.273 [pool-20-thread-2] INFO NPPNotifyProcessorImpl - ApiMessageFormatEX_V1.0
15-06-10.23:58:04.273 [pool-24-thread-4] INFO KeyServiceImpl - Calling SecurityService to get {} key for merchant {} with codeClass {}22843776KeyTypeEnum{code='3DES', cnName='三DES'}EXPRESS
15-06-10.23:58:04.273 [pool-20-thread-2] INFO NPPNotifyProcessorImpl - 转化为NotificationDTO的结果为: com.wangyin.npp.notify.facade.dto.NotificationDTO@54b27890
15-06-10.23:58:04.273 [pool-20-thread-2] INFO NotificationServiceImpl - 准备入库(可能会入库)的 notification=Notification [TRADE_NO=2015061010001000070650827, SOURCE_NAME=NPP_PAYMENT_COMPLETE, FROM_ADDRESS=EXPRESS, FROM_NAME=23077370, TO_ADDRESS=http://gw.jd.com/payment/notify_chinabankReal.action, CHANNEL=HTTP_POST, SUBJECT=null, CONTENT=resp=PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiPz4NCjxDSElOQUJBTks%2BCiAgPFZFUlNJT04%2BMS4wLjA8L1ZFUlNJT04%2BCiAgPE1FUkNIQU5UPjIzMDc3MzcwPC9NRVJDSEFOVD4KICA8VEVSTUlOQUw%2BMDAwMDAwMDE8L1RFUk1JTkFMPgogIDxEQVRBPllFbm10T0Zkb0RBK0tHdmhVZmJBZTlKOVZDOC9ONGx1YW5uMlBTRFF0L0VNTUh3eHR6L29tYi9vdlArTjAybnlsTGdhbUhCVDBZYVpBMUxoSC9iV3RndmoxN0JMTDhPTFc3U3laZmMxMU5sczRqSFdGeUR1UHNsb3F4YU51aFdUUFFDTzljMCtrTFpDZkpuZHB6d2sxN3J4dU5mRGVuYmljZ21kWHphSlhQNElQZzFKQ2h1ZGRRNWdTQTQ4UWVPVEE0UUhJYUsyQVFJNTNZQU03RHdQWFBrZkNPMythRUgvMk5oeGJMRmtYMTEvalJWUUI0NDM1K2FtSm1zclE0UFJ5cVVSWmx6eGVJQk5XNU4xZnZjMUE1NXRVa1RmRjNWc1orWjU2WkdydFoyQzdnQ3BWNkxqOUNDUWlzbjhKMEd3Z2JLS0kvdUMyUVNDTHJOMUl3YU8waSsxUUFIVWdPRGRtTFZHUGxhSTBqTS85UWVmY0Q2R0FjaVJua214R