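(I)
(1) Structured data: row data stored in a database, which can be represented logically as a two-dimensional table.
(2) Unstructured data: office documents in every format, plain text, reports of all kinds, images, and audio/video.
(3) Semi-structured data: data that falls between fully structured data (such as data in relational or object-oriented databases) and fully unstructured data (such as audio and image files). HTML documents are semi-structured data. Such data is usually self-describing: structure and content are mixed together, with no clear separation between them.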
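My own working definitions of these three kinds of data are:
Structured data is data that can be represented logically as a two-dimensional table; unstructured data is plain text with no markup; semi-structured data is text that carries markup.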
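To make the three definitions concrete, here is a minimal sketch that writes similar information in each form. The names (`columns`, `rows`, `marked_up`, `plain`) and the sample values are invented for illustration and do not come from any particular system.

```python
import json

# Structured: a fixed set of columns; every row supplies the same fields
# in the same positions, as in a two-dimensional table.
columns = ("id", "name", "age")
rows = [(1, "Alice", 30), (2, "Bob", 25)]
second_row = dict(zip(columns, rows[1]))

# Semi-structured: text that carries markers (here, JSON keys) labelling
# each value, so the structure travels with the data.
marked_up = '{"id": 1, "name": "Alice", "skills": ["SQL", "XML"]}'
record = json.loads(marked_up)

# Unstructured: plain text with no markers at all.
plain = "Alice is 30 years old and knows SQL and XML."

print(second_row["name"])   # value located by column position
print(record["skills"])     # value located by its label
print(len(plain.split()))   # plain text only offers raw words
```

In the first two forms a program can ask for a field directly; in the third it has to interpret the sentence.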
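Semi-structured data has the following characteristics:
a. Self-describing. The data comes first, and its schema is worked out afterwards.
b. Imprecise. The schema changes over time and from one context to another.
c. Irregular. Records of the same kind need not share the same fields.
d. Non-mandatory. The data is not required to conform to a schema.
e. Complex schema. The schema can sometimes be larger than the data itself, like a book whose table of contents is longer than its text, or a database whose indexes take up more space than its data.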
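The characteristics above can be illustrated by deriving a schema from records after the fact. This is only a sketch: `infer_schema` is a hypothetical helper written for this example, and the sample records are made up.

```python
from collections import defaultdict

def infer_schema(records):
    """Collect, for each field name, the type names of the values seen."""
    seen = defaultdict(set)
    for rec in records:
        for key, value in rec.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}

# Irregular records: fields come and go, and "phone" changes type.
records = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob", "phone": 5551234},
    {"name": "Carol", "phone": "555-1234", "tags": ["vip"]},
]

schema = infer_schema(records)
for field_name, types in schema.items():
    print(field_name, types)
```

Three small records already produce a four-field schema with one ambiguous type, which hints at how the schema can outgrow the data.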
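Data models for the three kinds of data:
Structured data: two-dimensional table (relational)
Semi-structured data: tree, graph
Unstructured data: none
Data models used by DBMSs include the network model, the hierarchical model, and the relational model.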
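The tree model listed for semi-structured data can be seen by parsing a small XML document with Python's standard `xml.etree.ElementTree` module. The document content and the `walk` helper are made up for this sketch.

```python
import xml.etree.ElementTree as ET

doc = """
<library>
  <book id="1">
    <title>Data Models</title>
    <author>Unknown</author>
  </book>
  <book id="2">
    <title>Markup</title>
  </book>
</library>
""".strip()

def walk(node, depth=0):
    """Return (depth, tag) pairs for the tree in document order."""
    pairs = [(depth, node.tag)]
    for child in node:
        pairs.extend(walk(child, depth + 1))
    return pairs

root = ET.fromstring(doc)
tree = walk(root)
for depth, tag in tree:
    print("  " * depth + tag)
```

Note that the second `book` has no `author` child; the tree simply has one fewer node there, which a fixed-column table could only express with an empty cell.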
Other notes:
Structured data: structure first, then data
Semi-structured data: data first, then structure
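The difference in ordering can be sketched with the standard `sqlite3` and `json` modules. The table name `person` and its columns are assumptions chosen for the example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Structure first: the table has to be declared before any row is stored.
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO person (id, name) VALUES (?, ?)", (1, "Alice"))

# A row carrying a field the declared structure does not know is rejected.
try:
    conn.execute(
        "INSERT INTO person (id, name, nickname) VALUES (?, ?, ?)",
        (2, "Bob", "bobby"),
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# Data first: each document brings its own field names, and the overall
# structure is only known after looking at the data.
docs = [
    json.loads('{"name": "Alice"}'),
    json.loads('{"name": "Bob", "nickname": "bobby"}'),
]
fields = sorted({key for doc in docs for key in doc})

print(rejected)
print(fields)
conn.close()
```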
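(II)
An unstructured database is one whose fields are variable in length and in which each field's record can be made up of repeatable or non-repeatable subfields. It can handle structured data (such as numbers and symbols) and is better suited still to unstructured data (full text, images, sound, video, hypermedia, and so on). A well-known unstructured database is iBase.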
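One way to picture a field built from repeatable and non-repeatable subfields is the sketch below. It is a hypothetical in-memory model written for illustration only; the `FieldValue` class and its methods are invented here and do not reflect how iBase or any real product stores data.

```python
from dataclasses import dataclass, field

@dataclass
class FieldValue:
    """One variable-length field whose content is split into named subfields."""
    name: str
    repeatable: set = field(default_factory=set)   # subfield names that may repeat
    subfields: dict = field(default_factory=dict)  # subfield name -> list of values

    def add(self, sub, value):
        values = self.subfields.setdefault(sub, [])
        # A non-repeatable subfield may hold at most one value.
        if values and sub not in self.repeatable:
            raise ValueError(f"subfield {sub!r} is not repeatable")
        values.append(value)

author = FieldValue("author", repeatable={"name"})
author.add("name", "Alice")
author.add("name", "Bob")            # allowed: "name" is repeatable
author.add("affiliation", "Example University")

try:
    author.add("affiliation", "Another Institute")
    second_affiliation_ok = True
except ValueError:
    second_affiliation_ok = False

print(author.subfields)
print(second_affiliation_ok)
```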
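According to an IDC survey report, 80% of enterprise data is unstructured, and this data grows exponentially, at 60% per year. Unstructured data, as the name suggests, is information stored in file systems rather than in databases. Reports put the share of structured data at only 1% to 5% on average. This rapidly growing data, much of it never used, consumes complex and expensive tier-one storage capacity in the enterprise. How can an organization retain potentially valuable files of many types from around the world without letting their handling disrupt day-to-day work? Buying more on-site storage is an option, but it always has limits. Cloud storage is a technology that more and more IT companies are adopting.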
(III)
The following is excerpted from Wikipedia, for reference:
Semi-structured data[1] is a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure.
In semi-structured data, the entities belonging to the same class may have different attributes even though they are grouped together, and the attributes' order is not important.
Semi-structured data is increasingly occurring since the advent of the Internet where full-text documents and databases are not the only forms of data any more and different applications need a medium for exchanging information. In object-oriented databases, one often finds semi-structured data.
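As the excerpt shows, semi-structured data differs from traditional data model structures such as relational databases or data tables: it contains tags or markers that separate semantic elements, and it organizes the records and fields within the data hierarchically. It can therefore also be regarded as a self-describing structure. The excerpt also indicates that unstructured data means plain-text documents (full-text documents).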
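Two points from the excerpt, that attribute order does not matter and that entities of the same class may carry different attributes, can be checked with a short sketch. The sample records are invented.

```python
import json

# Positional rows: swapping the values changes the meaning.
row_a = ("Alice", "Paris")
row_b = ("Paris", "Alice")

# Tagged records: the labels carry the meaning, so order is irrelevant.
rec_a = json.loads('{"name": "Alice", "city": "Paris"}')
rec_b = json.loads('{"city": "Paris", "name": "Alice"}')

# Two entities of the same class with different attributes, one of them nested.
people = [
    {"name": "Alice", "city": "Paris"},
    {"name": "Bob", "contact": {"email": "bob@example.com"}},
]
shared = set(people[0]) & set(people[1])

print(row_a == row_b)
print(rec_a == rec_b)
print(shared)
```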
XML, other markup languages, email, and EDI (Electronic Data Interchange) are all forms of semi-structured data. OEM (Object Exchange Model) was created prior to XML as a means of self-describing a data structure. XML has been popularized by web services that are developed utilizing SOAP principles.
Some types of data described here as "semi-structured", especially XML, suffer from the impression that they are incapable of structural rigor at the same functional level as Relational Tables and Rows. Indeed, the view of XML as inherently semi-structured (previously, it was referred to as "unstructured") has handicapped its use for a widening range of data-centric applications.
JSON
JSON or JavaScript Object Notation, is an open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is used primarily to transmit data between a server and web application, as an alternative to XML. JSON has been popularized by web services developed utilizing REST principles.
There is a new breed of databases such as MongoDB and Couchbase that store data natively in JSON format, leveraging the pros of semi-structured data architecture.
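Common forms of semi-structured data include XML and JSON. XML is a markup language, and email and EDI are also semi-structured data. OEM is a self-describing data structure created before XML. A JSON data object consists mainly of attribute-value pairs.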
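Since JSON and XML both label their values, a flat JSON object can be rewritten as XML elements mechanically. The following is a minimal sketch using only the standard library; `dict_to_xml` is a hypothetical helper that handles flat objects only, and the record is made up.

```python
import json
import xml.etree.ElementTree as ET

def dict_to_xml(tag, data):
    """Turn a flat dict of attribute-value pairs into an XML element."""
    elem = ET.Element(tag)
    for key, value in data.items():
        child = ET.SubElement(elem, key)
        child.text = str(value)
    return elem

# A JSON object is a set of attribute-value pairs.
record = json.loads('{"name": "Alice", "city": "Paris"}')

# The same pairs expressed with XML tags as the markers.
xml_text = ET.tostring(dict_to_xml("person", record), encoding="unicode")

print(sorted(record.items()))
print(xml_text)
```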