Spark: matching multiple input file paths with glob patterns

I. Spark: matching multiple input file paths with glob patterns

1. Suppose Spark has to process the past half month of data, and each day's data lives in its own dated directory on HDFS, i.e. roughly 15 input directories. Rather than listing every path, a brace pattern (strictly a Hadoop glob, not a true regular expression) can match them all and keep the code short:

import datetime

def produce_half_month(thedate):
    """Return thedate plus the 14 preceding days, comma-separated, newest first."""
    current_datetime = datetime.datetime.strptime(thedate, "%Y%m%d")
    # i == 0 is thedate itself; i == 1..14 walk back one day at a time
    days = [(current_datetime - datetime.timedelta(days=i)).strftime("%Y%m%d")
            for i in range(15)]
    match_days = ",".join(days)
    print(match_days)
    return match_days

hdfs_path = "/home/workdir/hdfs/log/hourly/{%s}*/click_exposure/*" % produce_half_month('20190505')

>>> print(hdfs_path)
/home/workdir/hdfs/log/hourly/{20190505,20190504,20190503,20190502,20190501,20190430,20190429,20190428,20190427,20190426,20190425,20190424,20190423,20190422,20190421}*/click_exposure/*

# Spark reads the whole half month of data from this single path string

data_rdd = sc.textFile(hdfs_path, use_unicode=False)
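
sc.textFile also accepts a comma-separated list of paths, so an equivalent approach is to expand the pattern in Python and skip the brace glob. A minimal sketch, reusing produce_half_month from above:

# Build one full glob per day; sc.textFile treats a comma-separated
# string as a list of independent inputs.
daily_paths = ["/home/workdir/hdfs/log/hourly/%s*/click_exposure/*" % d
               for d in produce_half_month('20190505').split(",")]
data_rdd = sc.textFile(",".join(daily_paths), use_unicode=False)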

# The hadoop command line can match the same half month of logs:

hadoop dfs -ls /home/workdir/hdfs/log/hourly/{20190505,20190504,20190503,20190502,20190501,20190430,20190429,20190428,20190427,20190426,20190425,20190424,20190423,20190422,20190421}*/click_exposure/*

2. Other path-matching methods

Braces can enumerate complete directory names or just the part that varies, and square brackets match a single character from a range:

[work@datazhe/project]$ hadoop dfs -ls /home/workdir/yao/tmp/user_app_list/new_{20190401,20190403}
Found 2 items
-rw-r--r--   3 work work          0 2019-05-19 19:37 /home/workdir/yao/tmp/user_app_list/new_20190401/_SUCCESS
-rw-r--r--   3 work work    1420044 2019-05-19 19:37 /home/workdir/yao/tmp/user_app_list/new_20190401/part-00000.gz
Found 2 items
-rw-r--r--   3 work work          0 2019-05-19 20:02 /home/workdir/yao/tmp/user_app_list/new_20190403/_SUCCESS
-rw-r--r--   3 work work     656201 2019-05-19 20:02 /home/workdir/yao/tmp/user_app_list/new_20190403/part-00000.gz
[work@datazhe/project]$ hadoop dfs -ls /home/workdir/yao/tmp/user_app_list/new_201904{01,03}
Found 2 items
-rw-r--r--   3 work work          0 2019-05-19 19:37 /home/workdir/yao/tmp/user_app_list/new_20190401/_SUCCESS
-rw-r--r--   3 work work    1420044 2019-05-19 19:37 /home/workdir/yao/tmp/user_app_list/new_20190401/part-00000.gz
Found 2 items
-rw-r--r--   3 work work          0 2019-05-19 20:02 /home/workdir/yao/tmp/user_app_list/new_20190403/_SUCCESS
-rw-r--r--   3 work work     656201 2019-05-19 20:02 /home/workdir/yao/tmp/user_app_list/new_20190403/part-00000.gz

[work@datazhe/project]$ hadoop dfs -ls /home/workdir/yao/tmp/user_app_list/new_2019040[1-3]
Found 2 items
-rw-r--r--   3 work work          0 2019-05-19 19:37 /home/workdir/yao/tmp/user_app_list/new_20190401/_SUCCESS
-rw-r--r--   3 work work    1420044 2019-05-19 19:37 /home/workdir/yao/tmp/user_app_list/new_20190401/part-00000.gz
Found 2 items
-rw-r--r--   3 work work          0 2019-05-19 19:46 /home/workdir/yao/tmp/user_app_list/new_20190402/_SUCCESS
-rw-r--r--   3 work work    1291293 2019-05-19 19:46 /home/workdir/yao/tmp/user_app_list/new_20190402/part-00000.gz
Found 2 items
-rw-r--r--   3 work work          0 2019-05-19 20:02 /home/workdir/yao/tmp/user_app_list/new_20190403/_SUCCESS
-rw-r--r--   3 work work     656201 2019-05-19 20:02 /home/workdir/yao/tmp/user_app_list/new_20190403/part-00000.gz
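
The same glob syntax carries over to Spark, since textFile delegates to Hadoop's path globbing. A minimal sketch using the range pattern above (part-* is assumed to match the part-00000.gz files in the listing):

rdd = sc.textFile("/home/workdir/yao/tmp/user_app_list/new_2019040[1-3]/part-*")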

II. Hadoop input-path globs being split into multiple arguments

Running the command

hadoop jar wordcount.jar com.WordCount /inpath/*{beijing,shanghai,guangzhou}* /outpath/

fails because the local shell, not Hadoop, brace-expands /inpath/*{beijing,shanghai,guangzhou}* into three separate arguments before the job is even submitted, so the driver takes the second of them, not /outpath/, as its output path.
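
The expansion is easy to see with plain echo (assuming default bash settings, where a glob that matches nothing is passed through literally):

[work@datazhe/project]$ echo /inpath/*{beijing,shanghai,guangzhou}*
/inpath/*beijing* /inpath/*shanghai* /inpath/*guangzhou*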

The fix:

hadoop jar wordcount.jar com.WordCount /inpath/'{*beijing*,*shanghai*,*guangzhou*}' /outpath/

The single quotes stop the shell's brace expansion, so the whole pattern reaches Hadoop as one argument, and the glob is then resolved by Hadoop against HDFS. Note that the wildcards have also moved inside the braces: {*beijing*,*shanghai*,*guangzhou*} is a valid Hadoop glob.
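
The quoted form survives the shell intact (again just echo, under default bash):

[work@datazhe/project]$ echo /inpath/'{*beijing*,*shanghai*,*guangzhou*}'
/inpath/{*beijing*,*shanghai*,*guangzhou*}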

Reference: https://www.cnblogs.com/yanghaolie/p/10538226.html
