Data Serialization Comparison: JSON, YAML, BSON, MessagePack

JSON is the de facto standard for data exchange on the web, but it has its drawbacks, and there are other formats that may be more suitable for certain scenarios. I’ll compare the pros and cons of the alternatives, including ease of use and performance.

Note: I won’t cover implementation details here, but if you’re a Ruby programmer, check out this article, where Dhaivat writes about implementing some serialization formats in Ruby.

What Is Data Serialization

According to Wikipedia, serialization is:

the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer, or transmitted across a network connection link) and reconstructed later in the same or another computer environment.

Let’s say you want to collect certain data about a group of people: name, last name, nickname, date of birth, instruments they play. You could easily set up a spreadsheet, define some columns, and make every row an entry. You could go just a little further and specify that the date of birth column must be a number, and that the instruments column could be a list of options. It’d look like this:

name     last name  dob   nickname  instruments
William  Bailey     1962  Axl Rose  vocals, piano
Saul     Hudson     1965  Slash     guitar

More or less, what you did there was define a data structure, and you’ll do just fine if you only need it in spreadsheet format. The problem is that, if you ever want to exchange this information with a database or a website, the mechanics by which these data structures are implemented on those platforms will be dramatically different, even if the underlying semantics are much the same. You can’t just plug-and-play a spreadsheet into a web application unless the application has been specifically designed for it. And you can’t transfer that info from the website to the database unless you have some sort of export tool or gateway for it.

Let’s assume that our website already has these data structures implemented in its internal logic, and that it just cannot deal with a spreadsheet format. To solve this problem, you can translate these data structures into a format that can be easily shared across different applications and architectures: you serialize them. By doing so, you ensure not only that you can transfer this data across platforms, but also that it can be reconstructed in the reverse process, called deserialization. Furthermore, if the data is exchanged back from the website to the spreadsheet, you’ll get a semantically identical clone of the original object, that is, a row that looks exactly the same as the one you originally sent.

In short: serializing data is finding some sort of universal format that can be easily shared across different applications.
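To make the round trip concrete, here is a minimal sketch in Python using only the standard library. The `Person` class and the `serialize`/`deserialize` helpers are hypothetical names made up for this illustration, and JSON is used only because it ships with Python; any of the formats below could take its place.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Person:
    """In-memory version of one spreadsheet row (hypothetical class)."""
    name: str
    last_name: str
    dob: int
    nickname: str
    instruments: list


def serialize(person: Person) -> str:
    """Turn the object into plain text that any platform can read."""
    return json.dumps(asdict(person))


def deserialize(payload: str) -> Person:
    """Rebuild an equivalent object from the serialized text."""
    return Person(**json.loads(payload))


slash = Person("Saul", "Hudson", 1965, "Slash", ["guitar"])
payload = serialize(slash)      # a plain string, safe to store or send
clone = deserialize(payload)    # a new object with the same content

print(payload)
print(clone == slash)
```

The clone is a different object in memory, but it compares equal to the original, which is the "semantically identical" property described above.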

The Formats

JSON

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It’s easy for humans to read and write; it’s easy for machines to parse and generate.

JSON is the most widespread format for data serialization, and it has the following features:

  • (Mostly) human-readable code: even if the code has been obfuscated or minified, you can always re-indent it with tools such as JSONLint and make it readable again.

  • Very simple and straightforward specification: a summary of the whole spec fits on a single page (as displayed on the JSON site).

  • Widespread support: virtually every programming language and IDE comes with JSON support, and many web service APIs offer JSON as a means of data interchange.

  • As a subset of JavaScript, it supports the following JavaScript data types:

    • string

    • number

    • object

    • array

    • true and false

    • null

This is how our previous spreadsheet would look, after being serialized in JSON:

[
  {
    "name": "William",
    "last name": "Bailey",
    "dob": 1962,
    "nickname": "Axl Rose",
    "instruments": [
      "vocals",
      "piano"
    ]
  },
  {
    "name": "Saul",
    "last name": "Hudson",
    "dob": 1965,
    "nickname": "Slash",
    "instruments": [
      "guitar"
    ]
  }
]
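As a rough sketch of the "minified versus readable" point, the snippet below uses Python’s standard `json` module to produce both forms of the same data and to show how the six JSON types come back as Python values. The variable names are my own; nothing here is specific to any web API.

```python
import json

people = [
    {"name": "William", "last name": "Bailey", "dob": 1962,
     "nickname": "Axl Rose", "instruments": ["vocals", "piano"]},
    {"name": "Saul", "last name": "Hudson", "dob": 1965,
     "nickname": "Slash", "instruments": ["guitar"]},
]

# Minified: no optional whitespace, handy for transmission.
minified = json.dumps(people, separators=(",", ":"))

# Pretty-printed: the same data, re-indented for human readers.
pretty = json.dumps(json.loads(minified), indent=2)

# Both texts decode to exactly the same Python objects.
assert json.loads(minified) == json.loads(pretty) == people

# JSON types map onto Python types: object -> dict, array -> list,
# string -> str, number -> int/float, true/false -> bool, null -> None.
decoded = json.loads('{"s": "x", "n": 1.5, "a": [], "t": true, "z": null}')
print({key: type(value).__name__ for key, value in decoded.items()})
print(len(minified), len(pretty))
```

Whitespace is the only difference between the two strings, which is why a minified document can always be made readable again.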

BSON

BSON, short for Binary JSON, is a binary-encoded serialization of JSON-like documents. … It also contains extensions that allow representation of data types that are not part of the JSON spec.

JSON is a plain text format, and while binary data can be encoded in text, this has certain limitations and can make JSON files very big. BSON comes in to deal with these problems.
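To see why binary data bloats a JSON file, consider this small sketch. It assumes Base64 as the text encoding (a common choice, though not the only one), and the `photo.bin` filename and field names are invented for the example.

```python
import base64
import json
import os

raw = os.urandom(3000)  # stand-in for an image or attachment

# JSON has no binary type, so the bytes must be wrapped in text first.
encoded = base64.b64encode(raw).decode("ascii")
document = json.dumps({"filename": "photo.bin", "data": encoded})

# Base64 emits 4 characters for every 3 input bytes (about 33% growth).
print(len(raw), len(encoded), len(document))

# The receiver has to undo the text encoding to get the bytes back.
restored = base64.b64decode(json.loads(document)["data"])
print(restored == raw)
```

A binary format can carry the 3000 bytes as they are, plus a small header, and skips the encode/decode step on both ends.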

It has the following features:

  • convenient storage of binary information: better suited for exchanging images and attachments

  • designed for fast in-memory manipulation

  • simple specification: like JSON, BSON has a very short and simple spec

  • primary data representation for MongoDB: BSON is designed to be traversed easily

  • extra data types:

    • double (64-bit IEEE 754 floating point number)

    • date (integer number of milliseconds since the Unix epoch)

    • byte array (binary data)

    • BSON object and BSON array

    • JavaScript code

    • MD5 binary data

    • regular expressions

MessagePack

It’s like JSON. But fast and small.

MessagePack (also msgpack) is another binary format for serialization. It’s not as well known as BSON, but it’s worth a look.

Among its features:

  • designed for efficient transmission over the wire

  • better JSON compatibility than BSON: as explained by Sadayuki Furuhashi in this Stack Overflow post

  • smaller than BSON: it has a smaller overhead than BSON, and can serialize smaller objects most of the time

  • type checking: it supports static typing

  • streaming API: support for streaming deserializers, which is useful for network communication.
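To give a feel for where the small size comes from, here is a hand-rolled sketch of a tiny subset of the MessagePack encoding, written with the standard library only. The `pack` function is my own illustration, not the API of any MessagePack library, and it deliberately handles just nil, booleans, small non-negative integers, short strings, and small arrays and maps.

```python
import json
import struct


def pack(value) -> bytes:
    """Encode a small subset of MessagePack (illustrative, not complete)."""
    if value is None:
        return b"\xc0"
    if isinstance(value, bool):          # must come before the int check
        return b"\xc3" if value else b"\xc2"
    if isinstance(value, int):
        if 0 <= value <= 0x7F:           # positive fixint: the byte is the value
            return struct.pack("B", value)
        if 0 <= value <= 0xFF:           # uint 8
            return b"\xcc" + struct.pack("B", value)
        if 0 <= value <= 0xFFFF:         # uint 16, big-endian
            return b"\xcd" + struct.pack(">H", value)
        raise ValueError("integer outside this sketch's range")
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) <= 31:              # fixstr: length lives in the type byte
            return struct.pack("B", 0xA0 | len(data)) + data
        if len(data) <= 0xFF:            # str 8: one extra length byte
            return b"\xd9" + struct.pack("B", len(data)) + data
        raise ValueError("string too long for this sketch")
    if isinstance(value, list) and len(value) <= 15:     # fixarray
        items = b"".join(pack(item) for item in value)
        return struct.pack("B", 0x90 | len(value)) + items
    if isinstance(value, dict) and len(value) <= 15:     # fixmap
        body = b"".join(pack(k) + pack(v) for k, v in value.items())
        return struct.pack("B", 0x80 | len(value)) + body
    raise TypeError(f"unsupported value: {value!r}")


slash = {"name": "Saul", "last name": "Hudson", "dob": 1965,
         "nickname": "Slash", "instruments": ["guitar"]}

packed = pack(slash)
as_json = json.dumps(slash, separators=(",", ":")).encode("utf-8")
print(len(packed), len(as_json))
```

Short strings and small collections cost a single header byte each, with no quotes, colons, or commas, which is why the packed row comes out smaller than even minified JSON. For real work you would reach for an existing MessagePack library rather than this sketch.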

YAML

YAML: YAML Ain’t Markup Language. What It Is: YAML is a human friendly data serialization standard for all programming languages.

Back to plaintext formats, YAML is an alternative to JSON:

  • (truly) human-readable code: YAML is so readable that even its front-page content is displayed in YAML to make this point

  • compact code: whitespace indentation is used to denote structure, so there’s little need for quotes or brackets

  • syntax for relational data: internal references are possible with anchors (&) and aliases (*)

  • especially suited for viewing/editing of data structures: such as configuration files, dumps during debugging, and document headers

  • a rich set of language-independent types:

    • collections:

      • unordered set of key: value pairs (!!map)

      • ordered sequence of key: value pairs (!!omap)

      • ordered sequence of key: value pairs allowing duplicates (!!pairs)

      • unordered set of non-equal values (!!set)

      • sequence of arbitrary values (!!seq)

    • scalar types:

      • null values (~, null)

      • decimal (1234), hexadecimal (0x4D2) and octal (02333) integers

      • fixed (1_230.15) and exponential (12.3015e+02) floats

      • infinity (.inf, -.Inf) and not-a-number (.NAN)

      • true (Y, true, Yes, ON) and false (n, FALSE, No, off)

      • binary (!!binary) with base64 encoding

      • timestamps (!!timestamp).

This is how our little spreadsheet looks when serialized in YAML:

- name: William
  last name: Bailey
  dob: 1962
  nickname: Axl Rose
  instruments:
    - vocals
    - piano

- name: Saul
  last name: Hudson
  dob: 1965
  nickname: Slash
  instruments:
    - guitar
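Anchors, aliases, and the richer scalar types are easier to grasp when you load a document and look at what comes out. The sketch below assumes the third-party PyYAML package is installed (it is not part of Python’s standard library) and quietly does nothing if it is missing. The document itself, including the `band` and `solo_project` keys, is invented for the example.

```python
try:
    import yaml  # third-party PyYAML package (an assumption, not stdlib)
except ImportError:
    yaml = None

# "&axl_instruments" anchors the list; "*axl_instruments" refers back to it.
DOCUMENT = """
band:
  - name: William
    nickname: Axl Rose
    dob: 0x7AA
    active: Yes
    manager: ~
    instruments: &axl_instruments
      - vocals
      - piano
solo_project:
  instruments: *axl_instruments
"""


def load_document(text: str = DOCUMENT):
    """Parse the YAML text, or return None when PyYAML is unavailable."""
    if yaml is None:
        return None
    return yaml.safe_load(text)


data = load_document()
if data is not None:
    axl = data["band"][0]
    # Hexadecimal, boolean, and null scalars become native Python values.
    print(axl["dob"], axl["active"], axl["manager"])
    # The alias resolves to the content of the anchored list.
    print(data["solo_project"]["instruments"])
```

Note how much typing happens implicitly: an unquoted `Yes` becomes a boolean and `~` becomes null, which is convenient for configuration files but worth keeping in mind when a value is meant to be a plain string.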

Other Formats

There are a number of other formats for serialization, such as Protocol Buffers (protobuf, also binary), that I’ve (in a rather discretionary manner) left out. If you just want to know every possible format, go and have a look at Wikipedia’s comparison of data serialization formats.

… HDF5?

We’ll get a bit off-topic here, but just slightly. The Hierarchical Data Format version 5 (HDF5) isn’t really for serialization, but rather for storage, and it’s taking data science and other industries by storm. It’s a very fast and versatile format that can be used not only to store a number of data structures, but even as a replacement for relational databases.

To conclude this intermission, let’s just mention that if you’re into binary formats such as BSON and MessagePack for storing/exchanging big volumes of information, you may very well want to have a look at HDF5.

Benchmarks and Comparisons

A pattern that emerges is that BSON can be more expensive than JSON when serializing, but faster when deserializing; and MessagePack is faster than both on any operation. Also, because of its overhead and in spite of being a binary format, BSON files can occasionally be bigger than JSON ones when storing non-binary data. Some links to have a look at:

  • Serialization Performance comparison (C#/.NET) by Maxim Novak on M@X on DEV.

  • Protocol Buffers, Avro, Thrift & MessagePack by Ilya Grigorik on igvita.com.

  • Binary Serialization Tour Guide by Karlin Fox on Atomic Object.

  • Efficiently Store Pandas DataFrames by Matthew Rocklin.

  • MessagePack vs JSON vs BSON by Wesley Tanaka.

It’s also worth noting that the performance could change depending on the serializer and the parser you choose, even for the same format.
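The size overhead mentioned above is easy to reproduce by hand. What follows is a minimal sketch of the BSON layout for a flat document containing only strings and 32-bit integers, written with the standard library; `bson_document` is a made-up helper for this illustration, not the API of MongoDB’s drivers. It says nothing about speed, only about bytes on the wire.

```python
import json
import struct


def bson_document(doc: dict) -> bytes:
    """Encode a flat dict of str/int32 values as BSON (illustrative subset)."""
    elements = b""
    for key, value in doc.items():
        name = key.encode("utf-8") + b"\x00"           # keys are C strings
        if isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            # type 0x02: int32 byte length (incl. terminator), then the bytes
            elements += b"\x02" + name + struct.pack("<i", len(data)) + data
        elif isinstance(value, int) and not isinstance(value, bool):
            # type 0x10: 32-bit little-endian integer, always 4 bytes
            elements += b"\x10" + name + struct.pack("<i", value)
        else:
            raise TypeError(f"unsupported value: {value!r}")
    # The total length counts its own 4 bytes plus the closing null byte.
    return struct.pack("<i", len(elements) + 5) + elements + b"\x00"


row = {"name": "Saul", "last name": "Hudson", "dob": 1965}
as_bson = bson_document(row)
as_json = json.dumps(row, separators=(",", ":")).encode("utf-8")
print(len(as_bson), len(as_json))
```

Every string carries a 4-byte length prefix and a terminator, and every integer takes a full 4 bytes, so for a small text-only row like this one the BSON output ends up a few bytes larger than minified JSON. Those fixed-size fields are also what make the format easy to traverse without parsing text.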

Remarks and Commentary

As silly as it may sound, BSON has the advantage of the name: people automatically associate the format developed by MongoDB (BSON) with the standard (JSON), even though the two are not formally related. So when searching for a binary alternative to JSON, you may also want to consider other options.

In fact, MessagePack seems to beat BSON in every possible aspect: it’s faster, smaller, and even more compatible with JSON than BSON is. (If you’re already working with JSON, MessagePack is almost a drop-in optimization.) Maybe as a “reporter” I should be more balanced, but as a developer, this is a no-brainer.

Still, BSON is MongoDB’s format to store and represent data, so if you’re working with this NoSQL DB, that’s a reason to stick with it.

Of course, serialization is not all about storing binary data. Admittedly, JSON has a different goal in mind — that of being “human readable”. But it doesn’t take much effort to notice that YAML does a significantly better job at it.

However, the YAML spec is awfully big, especially when compared to JSON’s. But arguably it must be, as it comes with more data types and features.

On the other hand, it can’t be ignored that the simplicity of JSON played a key role in its adoption over other serialization formats. It builds on an already widespread language, JavaScript, and if you know or are exposed to JS (which, if you are in the web development industry, you are), you already know JSON.

Then why not adopt YAML right now? In many cases it isn’t that easy. JSON still has a place in web APIs, as you can easily embed JSON code in HTTP requests (both GET, as in URLs, and POST, as in sending a form). The format also lets you know if the transmission was suddenly cut, since the truncated code will no longer be valid, which may not be the case with YAML and other competing plaintext formats. Also, you’ll still need to interact at one point or another with JSON-based APIs and legacy code, and it’s always a pain to maintain two pieces of code (JSON and YAML methods) for the same purpose (data serialization).
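The "cut transmission" argument can be checked in a few lines. This sketch uses Python’s standard `json` module; `is_complete` is a hypothetical helper name, and chopping the string in half is just a stand-in for a dropped connection.

```python
import json

payload = json.dumps({"name": "Saul", "instruments": ["guitar"]})
truncated = payload[: len(payload) // 2]   # simulate a dropped connection


def is_complete(text: str) -> bool:
    """Return True only if the text parses as a whole JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_complete(payload))
print(is_complete(truncated))
```

Because a JSON object or array must end with its closing bracket, a partial one fails to parse. An indentation-based document that is cut at a line boundary can still look like a perfectly valid, just shorter, document, which is the risk hinted at above.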

But then again, these are partly the same arguments that hold us back and prevent us from adopting newer and more efficient technologies (e.g., Python 3 over Python 2). And here I thought that we, programmers and entrepreneurs, were innovators. Aren’t we?

Translated from: https://www.sitepoint.com/data-serialization-comparison-json-yaml-bson-messagepack/
