Data Conversion: The Pitfalls We Have All Stepped Into

When converting data, any kind of dirty data can show up; do not expect everything to arrive the way you want.

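1. When writing files, watch out for tabs and line breaks inside incoming fields

When reading a file, we read it line by line with readline and split each line on \t.

When writing a file, we use \n to end each record.

Problems appear when a field itself contains one of these characters: an embedded tab adds an extra column, and an embedded line break (such as the \r\n at the start of the refer value in the record below) cuts the record in two. Either way, the fields end up misaligned during conversion.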

{"code":"CUXZJS","refer":"\r\nDV8HFI","referPid":null,"people":[],"iosPushToken":""}
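One way to guard against this is to escape the separator characters in every field before the line is written. The following is a minimal sketch, not the author's implementation: the class name TsvFieldEscaper, the helpers escape and toTsvLine, and the choice to write null as an empty string are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class TsvFieldEscaper {
    // Hypothetical helper: makes a field safe to place in a tab-separated line.
    // Backslashes are escaped first so the result can be unescaped unambiguously.
    static String escape(String field) {
        if (field == null) {
            return ""; // assumption: null is written as an empty field
        }
        return field.replace("\\", "\\\\")
                    .replace("\t", "\\t")
                    .replace("\r", "\\r")
                    .replace("\n", "\\n");
    }

    // Hypothetical helper: joins the escaped fields with a tab.
    static String toTsvLine(List<String> fields) {
        return fields.stream()
                     .map(TsvFieldEscaper::escape)
                     .collect(Collectors.joining("\t"));
    }

    public static void main(String[] args) {
        // Values taken from the sample record above: code, refer, referPid, iosPushToken.
        String line = toTsvLine(Arrays.asList("CUXZJS", "\r\nDV8HFI", null, ""));
        // The record stays on one line and still splits into four fields.
        System.out.println(line.split("\t", -1).length); // prints 4
    }
}
```

Escaping keeps the original value recoverable on the reading side. If the downstream consumer cannot unescape, replacing the characters with a space is a simpler alternative, at the cost of losing information.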

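2. If you use Java, be careful with the split function
Suppose a record looks like this:
A\tB\tC\t\t  Note that this record has five fields, A, B, C, D, E, but D and E arrived as empty strings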
String a = "A\tB\tC\t\t";
String[] arrStrings = a.split("\t");
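A plain split like this does not keep every field: it drops trailing empty strings, so the array ends up with only three elements, [A, B, C].
To keep all five fields, use split(regex, -1):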
String a = "A\tB\tC\t\t";
String[] arrStrings = a.split("\t", -1);
Now the array contains all five elements, [A, B, C, , ], with the last two being empty strings.
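Because split(regex, -1) preserves trailing empty fields, the resulting length can be compared with the expected column count to catch misaligned records early. This is a sketch under assumptions: the class TsvLineParser, the method parse, and the decision to throw IllegalArgumentException are illustrative choices, not something the original post prescribes.

```java
import java.util.Arrays;

class TsvLineParser {
    // Hypothetical helper: splits a tab-separated line and verifies the column count.
    static String[] parse(String line, int expectedColumns) {
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty fields
        if (fields.length != expectedColumns) {
            throw new IllegalArgumentException(
                "expected " + expectedColumns + " fields but got " + fields.length);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Five fields, the last two empty: accepted.
        System.out.println(Arrays.toString(parse("A\tB\tC\t\t", 5)));
        try {
            // Six fields where five are expected: rejected.
            parse("A\tB\tC\tD\tE\tF", 5);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In a real pipeline a bad record might be logged and skipped rather than thrown; which is appropriate depends on how strict the conversion needs to be.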

The documentation of the method in the source explains this behavior:
public String[] split(String regex, int limit)
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded.
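The three cases described for the limit parameter can be compared side by side. The snippet below is a small illustration; the class name SplitLimitDemo and the helper splitFields are made up for this example, while String.split(String, int) is the standard JDK method quoted above.

```java
class SplitLimitDemo {
    // Hypothetical wrapper so the three limit values can be compared on the same input.
    static String[] splitFields(String line, int limit) {
        return line.split("\t", limit);
    }

    public static void main(String[] args) {
        String line = "A\tB\tC\t\t";

        // limit == 0 (same as split("\t")): trailing empty strings are discarded.
        System.out.println(splitFields(line, 0).length);  // prints 3

        // limit < 0: the pattern is applied as often as possible, nothing is discarded.
        System.out.println(splitFields(line, -1).length); // prints 5

        // limit == 2: the pattern is applied at most once; the rest stays in the last entry.
        System.out.println(splitFields(line, 2).length);  // prints 2
    }
}
```

With limit 2 the second entry is the unsplit remainder "B\tC\t\t", which is why a positive limit is rarely what a fixed-column conversion wants.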

<To be continued...>

