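If you can even ask this question, the root cause is that you don't appreciate how serious the problem is: for text to be parsed correctly, there has to be a specification.
Below is how the Wikipedia article Comma-separated values introduces RFC 4180 (worth a skim):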
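To make the problem concrete, here is a minimal sketch of what goes wrong when every comma is treated as a delimiter. The class name NaiveSplitDemo and the sample record are made up for illustration and do not come from any library.

```java
import java.util.Arrays;
import java.util.List;

class NaiveSplitDemo {
    // Naive approach: treat every comma as a field delimiter.
    static List<String> naiveSplit(String line) {
        // The -1 limit keeps trailing empty fields instead of dropping them.
        return Arrays.asList(line.split(",", -1));
    }

    public static void main(String[] args) {
        // One record with three fields; the second field contains a comma,
        // so it is wrapped in double quotes.
        String line = "1,\"Smith, John\",42";
        List<String> fields = naiveSplit(line);
        System.out.println(fields.size() + " fields: " + fields);
    }
}
```

The record has three fields, but the naive split reports four, because it cuts straight through the quoted name.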
RFC 4180 formalized CSV. It defines the MIME type “text/csv”, and CSV files that follow its rules should be very widely portable. Among its requirements:
- MS-DOS-style lines that end with (CR/LF) characters (optional for the last line).
- An optional header record (there is no sure way to detect whether it is present, so care is required when importing).
- Each record “should” contain the same number of comma-separated fields.
- Any field may be quoted (with double quotes).
- Fields containing a line-break, double-quote or commas should be quoted. (If they are not, the file will likely be impossible to process correctly).
- A (double) quote character in a field must be represented by two (double) quote characters.
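As a rough illustration of the last three rules, the sketch below quotes a field only when it has to and doubles any embedded double quotes. Rfc4180Quoting, escape and toRecord are hypothetical names chosen for this sketch; it only covers the writing side and is not a full RFC 4180 implementation.

```java
class Rfc4180Quoting {
    // Quote a field only if it contains a comma, a double quote, CR or LF.
    // Embedded double quotes are represented by two double quotes.
    static String escape(String field) {
        boolean needsQuoting = field.indexOf(',') >= 0
                || field.indexOf('"') >= 0
                || field.indexOf('\r') >= 0
                || field.indexOf('\n') >= 0;
        if (!needsQuoting) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join the escaped fields with commas and terminate the record with CRLF.
    static String toRecord(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(escape(fields[i]));
        }
        return sb.append("\r\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(toRecord("1", "Smith, John", "said \"hi\""));
    }
}
```

Reading such a file back is the harder half, since the parser has to track whether it is inside a quoted field. That is exactly the part worth leaving to a proper library.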
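Enough talk. If you have used:
then hopefully you have deliberately worked your way up to option 3, the (Java) bean level. From there, going by the csv parsers comparison, univocity-parsers is the one to use.
You may still be thinking: what on earth is this thing, and why should I use it?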
univocity-parsers is currently used by many commercial and open-source projects, including Spark-CSV, Apache Camel and Apache Drill.
Here is just one example: parsing the data below. The point to watch is that rdfs:label is a nested field whose values are separated by ;.
index,IRI,skos:prefLabel,rdfs:label(include skos:prefLabel),rdfs:subClassOf,vertical
1,user:Activity,活动,,user:Event,
2,user:AdministrativeDepartment,行政机关,行政机构,user:Organization,
3,user:AdministrativeEnforcementOfLawDepartment,行政执法机关,,user:LegalDepartment,
4,user:AdministrativeRegion,行政区,,user:GeographicalArea,地点
5,user:AgentCompany,经纪公司,经纪人公司,user:Company,
6,user:Album,专辑,,user:Audio,
7,user:Animal,动物界,,user:Kingdom,
8,user:ArtWorks,美术作品,,user:CreativeWork,
9,user:Athlete,运动员,,user:SportsPerson,
10,user:Audio,音频,,user:CreativeWork,
11,user:Biology,生物,,owl:Thing,
12,user:Books,图书,,user:Publication,
13,user:Building,建筑物,,user:Place,地点
14,user:BusinessEvent,商业事件,,user:Event,
15,user:BusinessPerson,商业人物,,user:Person,
16,user:Cartoon,动画,,user:Video,
17,user:CharitableOrganization,慈善组织,公益组织;公益团体;慈善机构;慈善团体,user:Organization,
import com.univocity.parsers.annotations.Convert;
import com.univocity.parsers.annotations.Parsed;
import lombok.Data;

import java.util.List;

@Data // lombok
public class KgClass {
    @Parsed(field = "index")
    Integer index;

    @Parsed(field = "IRI")
    String iri;

    @Parsed(field = "skos:prefLabel")
    String prefLabel;

    // Nested fields: each cell holds several values separated by ";".
    @Parsed(field = "rdfs:label(include skos:prefLabel)")
    @Convert(conversionClass = Splitter.class, args = ";")
    List<String> label;

    @Parsed(field = "rdfs:subClassOf")
    @Convert(conversionClass = Splitter.class, args = ";")
    List<String> subClassOf;

    @Parsed(field = "vertical")
    @Convert(conversionClass = Splitter.class, args = ";")
    List<String> vertical;
}
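This is where univocity-parsers is weaker than opencsv: you have to define a Splitter yourself, because the Conversion implementations it ships with are too simple for this.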
import com.univocity.parsers.conversions.Conversion;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Converts a delimited string into a list of values and back.
public class Splitter implements Conversion<String, List<String>> {

    private final String separator;

    // Receives the args given in @Convert(conversionClass = Splitter.class, args = ...).
    public Splitter(String... args) {
        // Fall back to a comma when no separator is given.
        separator = args.length == 0 ? "," : args[0];
    }

    @Override
    public List<String> execute(String input) {
        if (input == null) {
            return null;
        }
        // Quote the separator so it is matched literally, not as a regex.
        return Arrays.asList(input.split(Pattern.quote(separator)));
    }

    @Override
    public String revert(List<String> input) {
        if (input == null || input.isEmpty()) {
            return "";
        }
        return String.join(separator, input);
    }
}
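One thing to keep in mind with a splitter like this: String.split treats its argument as a regular expression, so a separator such as | or . will not behave like plain text unless it is quoted. The sketch below shows the difference using only the JDK; LiteralSplitDemo, splitLiteral and the sample strings are illustrative names and values, not part of univocity-parsers.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class LiteralSplitDemo {
    // Split on the separator taken literally rather than as a regex.
    static List<String> splitLiteral(String input, String separator) {
        return Arrays.asList(input.split(Pattern.quote(separator)));
    }

    public static void main(String[] args) {
        // ";" has no special meaning in a regex, so both approaches agree.
        System.out.println(splitLiteral("a;b;c", ";"));

        // "|" is the regex alternation operator; unquoted, it matches the
        // empty string and the input is not split where you would expect.
        System.out.println(Arrays.asList("a|b".split("|")));

        // Quoted, the same separator splits the input into its two values.
        System.out.println(splitLiteral("a|b", "|"));
    }
}
```

With ; as the separator, as in the sample data, the plain and quoted versions give the same result. The quoting only matters once someone passes a different separator through args.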
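import com.univocity.parsers.common.processor.BeanListProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

// Collects every parsed row as a KgClass instance.
BeanListProcessor<KgClass> rowProcessor = new BeanListProcessor<>(KgClass.class);
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setProcessor(rowProcessor);
// Treat the first row as headers so @Parsed(field = ...) can match by name.
parserSettings.setHeaderExtractionEnabled(true);
CsvParser parser = new CsvParser(parserSettings);
// reader is a java.io.Reader over the CSV file, opened by the caller.
parser.parse(reader);
List<KgClass> beans = rowProcessor.getBeans();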
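Afterword
Sima Qian set a fine precedent for writing history: tell the story properly in the main text, then add your own grumbling at the end under "The Grand Historian remarks". Since this post is nearly done, let me leave a few lines for myself. On reflection, I think I'll go back to blogging on CSDN. The day before yesterday, while searching for projects that convert SPARQL to Cypher, I stumbled on someone on CSDN who had reposted an article from my Leanote blog and marked it as his own original work, and it had half again as many views as my original. Whether it does any good I can't say, but besides being angry I got five or six people to report him. Still, I haven't written anything for a year, and that really isn't a good thing; I need to work harder from here on. Only what you collect, verify and then summarize actually makes you better, and not setting aside time to summarize is something I need to reflect on.
Back to the article: is it worth going to all this trouble over something so small? Yes, it is. A programmer ought to be ashamed of writing low-quality, error-ridden tools that duplicate what already exists. You should take the initiative to learn your tools, and the bare minimum is being able to read and remember the basic specification.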