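Toward the Best CSV Parsing? (Java)

1. Does CSV parsing really need a library?

If you find yourself asking this, you are probably underestimating the problem. For text to be parsed correctly and reliably, there has to be a specification behind it.

Here is how Wikipedia's "Comma-separated values" article summarizes RFC 4180 (worth a quick read):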

RFC 4180 formalized CSV. It defines the MIME type “text/csv”, and CSV files that follow its rules should be very widely portable. Among its requirements:

  • MS-DOS-style lines that end with (CR/LF) characters (optional for the last line).
  • An optional header record (there is no sure way to detect whether it is present, so care is required when importing).
  • Each record “should” contain the same number of comma-separated fields.
  • Any field may be quoted (with double quotes).
  • Fields containing a line-break, double-quote or commas should be quoted. (If they are not, the file will likely be impossible to process correctly).
  • A (double) quote character in a field must be represented by two (double) quote characters.
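To make the quoting rules above concrete, here is a minimal sketch comparing a naive split(",") with a quote-aware splitter for a single record. The class and method names (QuotedFieldDemo, splitRecord) are made up for this illustration. It deliberately ignores embedded line breaks and malformed input, so it shows why a real parser is needed rather than replacing one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of any library.
class QuotedFieldDemo {

    // Splits ONE record into fields, honoring RFC 4180 quoting:
    // commas inside quotes are data, and "" inside quotes is a literal quote.
    public static List<String> splitRecord(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote becomes one literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas are plain data here
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field
        return fields;
    }

    public static void main(String[] args) {
        String line = "1,\"Smith, John\",\"He said \"\"hi\"\"\"";
        // Naive split cuts the quoted name in two: 4 pieces instead of 3 fields.
        System.out.println(line.split(",").length);
        // Quote-aware split: [1, Smith, John, He said "hi"] (3 fields).
        System.out.println(splitRecord(line));
    }
}
```

Even this small state machine leaves out multi-line fields, alternative delimiters and error reporting, which is the argument for using a maintained library.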

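Enough preamble. Chances are you have tried one of these:

  1. Hand-rolled parsing - you are on your own, with no RFC 4180 behind you
  2. commons-csv - basic parsing
  3. opencsv - basic parsing + Java Bean mapping + nested parsing (already very good)

Hopefully you have already made your way to option 3 and are mapping rows to Java beans. From there, going by the csv parsers comparison, the one to use is univocity-parsers.

You may still be thinking: what is this thing, and why should I use it?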

univocity-parsers is currently used by many commercial and open-source projects, including Spark-CSV, Apache Camel and Apache Drill.

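2. Parsing a nested structure with univocity-parsers

Just one example here: parsing the data below. Points to note:

  1. The header names differ noticeably from the bean field names they map to;
  2. rdfs:label is a nested field whose values are separated by ";".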

index,IRI,skos:prefLabel,rdfs:label(include skos:prefLabel),rdfs:subClassOf,vertical
1,user:Activity,活动,user:Event,
2,user:AdministrativeDepartment,行政机关,行政机构,user:Organization,
3,user:AdministrativeEnforcementOfLawDepartment,行政执法机关,user:LegalDepartment,
4,user:AdministrativeRegion,行政区,user:GeographicalArea,地点
5,user:AgentCompany,经纪公司,经纪人公司,user:Company,
6,user:Album,专辑,user:Audio,
7,user:Animal,动物界,user:Kingdom,
8,user:ArtWorks,美术作品,user:CreativeWork,
9,user:Athlete,运动员,user:SportsPerson,
10,user:Audio,音频,user:CreativeWork,
11,user:Biology,生物,owl:Thing,
12,user:Books,图书,user:Publication,
13,user:Building,建筑物,user:Place,地点
14,user:BusinessEvent,商业事件,user:Event,
15,user:BusinessPerson,商业人物,user:Person,
16,user:Cartoon,动画,user:Video,
17,user:CharitableOrganization,慈善组织,公益组织;公益团体;慈善机构;慈善团体,user:Organization,

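2.1 Java Bean definition

import com.univocity.parsers.annotations.Convert;
import com.univocity.parsers.annotations.Parsed;
import lombok.Data;

import java.util.List;

@Data // Lombok: generates getters, setters, equals/hashCode and toString
public class KgClass {
    @Parsed(field = "index")
    Integer index;
    @Parsed(field = "IRI")
    String iri;
    @Parsed(field = "skos:prefLabel")
    String prefLabel;
    // Multi-valued columns: split on ";" by the Splitter conversion
    @Parsed(field = "rdfs:label(include skos:prefLabel)")
    @Convert(conversionClass = Splitter.class, args = ";")
    List<String> label;
    @Parsed(field = "rdfs:subClassOf")
    @Convert(conversionClass = Splitter.class, args = ";")
    List<String> subClassOf;
    @Parsed(field = "vertical")
    @Convert(conversionClass = Splitter.class, args = ";")
    List<String> vertical;
}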

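2.2 A helper class for splitting

This is where univocity-parsers is weaker than opencsv: the built-in Conversion implementations are too simple for this case, so you have to define your own.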

import com.univocity.parsers.conversions.Conversion;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Converts a delimited string into a List<String> and back.
public class Splitter implements Conversion<String, List<String>> {

    private final String separator;

    // Receives the "args" given in @Convert; the first one is the separator.
    public Splitter(String... args) {
        // Fall back to a comma when no separator is given.
        separator = args.length == 0 ? "," : args[0];
    }

    @Override
    public List<String> execute(String input) {
        if (input == null) {
            return null;
        }
        // Quote the separator: String.split would otherwise treat it as a regex.
        return Arrays.asList(input.split(Pattern.quote(separator)));
    }

    @Override
    public String revert(List<String> input) {
        if (input == null || input.isEmpty()) {
            return "";
        }
        return String.join(separator, input);
    }
}
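Two details of String.split are worth keeping in mind for a splitter like this: the argument is interpreted as a regular expression, and trailing empty items are dropped by default. The sketch below uses only the JDK; the class and method names (SplitPitfalls, regexSplit, literalSplit) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of univocity-parsers.
class SplitPitfalls {

    // Passes the separator straight to String.split, i.e. as a regular expression.
    public static List<String> regexSplit(String input, String separator) {
        return Arrays.asList(input.split(separator));
    }

    // Treats the separator as literal text; limit -1 keeps trailing empty items.
    public static List<String> literalSplit(String input, String separator) {
        return Arrays.asList(input.split(Pattern.quote(separator), -1));
    }

    public static void main(String[] args) {
        // "|" is regex alternation, so this does NOT yield [a, b].
        System.out.println(regexSplit("a|b", "|"));
        // Quoted separator: [a, b]
        System.out.println(literalSplit("a|b", "|"));
        // Default limit drops the trailing empty item: [a, b]
        System.out.println(regexSplit("a;b;", ";"));
        // Limit -1 keeps it: [a, b, ] (three items, the last one empty)
        System.out.println(literalSplit("a;b;", ";"));
    }
}
```

With ";" as the separator the regex issue never shows up, since ";" has no special meaning in a pattern. Whether trailing empty items should be kept depends on the data, so treat the -1 limit as an option rather than a recommendation.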

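2.3 Putting it together

// Imports assumed: com.univocity.parsers.common.processor.BeanListProcessor,
// com.univocity.parsers.csv.CsvParser, com.univocity.parsers.csv.CsvParserSettings
BeanListProcessor<KgClass> rowProcessor = new BeanListProcessor<>(KgClass.class);
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setProcessor(rowProcessor);
// Treat the first row as headers so @Parsed(field = ...) can match columns by name.
parserSettings.setHeaderExtractionEnabled(true);

CsvParser parser = new CsvParser(parserSettings);
// reader: any java.io.Reader over the CSV source
parser.parse(reader);

List<KgClass> beans = rowProcessor.getBeans();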

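Afterword
  Sima Qian set a good precedent for writing history: tell the story properly in the main text, then add your own commentary at the end, under "The Grand Historian remarks". Since this post is wrapping up, I will leave a few lines for myself. I think I will go back to blogging on CSDN. The day before yesterday, while searching for a project that converts SPARQL to Cypher, I happened to find that someone on CSDN had reposted an article from my Leanote blog and marked it as their own original work, and it had half again as many views as my original. Whether it did any good is hard to say, but in my anger I got five or six people to report him. Still, I have not written anything for a year, which is not a good sign, and I need to work harder from here on. Only what you have collected, verified, and then summarized actually improves you; not setting aside time to write things up is something I need to reflect on.
  Back to the article: is it worth this much trouble over something so small? I think it is. A programmer should be ashamed of writing low-quality, bug-ridden tools that duplicate existing ones. We should make a point of learning the tools we use, and the minimum bar should be the ability to read and remember the basic specification.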
