2019-05-14 Reading: Data provenance and Timestamp Data Analytics

  • With blockchain technology, information from a product's raw-material sourcing, manufacturing, distribution, and marketing stages is consolidated and written to the blockchain, enabling end-to-end authenticity tracking with one unique code per item.
  • At every stage, multiple parties each store their own information on the chain, trusting and endorsing one another; this prevents tampering by any single party and makes responsibility traceable.
  • Every record has its own unique blockchain ID ("identity card"), and each record carries the responsible party's digital signature and a timestamp, so consumers can query and verify it (a minimal sketch of such a record follows this list).
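The per-record ID, digital signature, and timestamp described above can be pictured with a small Python sketch. Everything in it is a hypothetical illustration: the HMAC keyed by a shared PARTY_KEYS table stands in for a real asymmetric signature scheme, and the field names are invented for the example.

```python
import hashlib, hmac, json, time

# Hypothetical shared keys; a real chain would use each party's private key
# and an asymmetric signature (e.g. ECDSA) instead of HMAC.
PARTY_KEYS = {"manufacturer": b"key-m", "distributor": b"key-d"}

RECORD_FIELDS = ("item_id", "party", "stage", "timestamp", "prev_hash")

def append_record(chain, party, item_id, stage_info):
    """Each party appends its stage data with a timestamp, signature, and hash link."""
    record = {
        "item_id": item_id,                       # the "one item, one code" identifier
        "party": party,
        "stage": stage_info,
        "timestamp": time.time(),
        "prev_hash": chain[-1]["hash"] if chain else "0" * 64,
    }
    body = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(PARTY_KEYS[party], body, hashlib.sha256).hexdigest()
    record["hash"] = hashlib.sha256(body).hexdigest()
    chain.append(record)
    return record

def verify(chain):
    """A consumer re-checks every signature and every hash link."""
    prev = "0" * 64
    for rec in chain:
        body = json.dumps({k: rec[k] for k in RECORD_FIELDS}, sort_keys=True).encode()
        if (rec["prev_hash"] != prev
                or rec["hash"] != hashlib.sha256(body).hexdigest()
                or rec["signature"] != hmac.new(PARTY_KEYS[rec["party"]], body,
                                                hashlib.sha256).hexdigest()):
            return False
        prev = rec["hash"]
    return True

chain = []
append_record(chain, "manufacturer", "SKU-001-0007", "raw material sourced")
append_record(chain, "distributor", "SKU-001-0007", "shipped to retailer")
print(verify(chain))  # True
```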

Timestamp-related data and queries

Data type: time series data
Data arrives synchronously or asynchronously from multiple sources
Four processes:

  • timestamp process: turns raw time series data into time-stamped events that are fed to the indexing process
  • index process: uses the event data to build time-bucketed indices of the events
  • search process: takes searches from users or systems, decomposes them, and executes each search across a set of indices
  • presentation process: returns the matching events to the user or system that issued the search
    Example: a user might want to locate all the events from a particular web server and a particular application server that occurred within the last hour and that contain a specific IP address (a sketch of the four stages follows this list).
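The four stages can be sketched end to end in a few lines of Python. The raw-line layout ("<epoch-seconds> <host> <text>"), the host names, and the one-hour bucket size are assumptions made only to keep the sketch self-contained; the real pipeline is far more elaborate.

```python
import re, time
from collections import defaultdict

BUCKET_SECONDS = 3600                        # assumed one-hour buckets
TS_RE = re.compile(r"^(\d+) (\S+) (.*)$")    # assumed "<epoch-seconds> <host> <text>" layout

def timestamp_process(raw_line):
    """Turn a raw line into a time-stamped event."""
    epoch, host, text = TS_RE.match(raw_line).groups()
    return {"ts": int(epoch), "host": host, "text": text}

def index_process(index, event):
    """Place the event into the time bucket keyed by the bucket's start time."""
    bucket_start = event["ts"] - event["ts"] % BUCKET_SECONDS
    index[bucket_start].append(event)

def search_process(index, hosts, keyword, earliest, latest):
    """Decompose the search over the buckets that overlap [earliest, latest)."""
    hits = []
    for bucket_start, events in index.items():
        if bucket_start + BUCKET_SECONDS <= earliest or bucket_start >= latest:
            continue                         # this bucket lies outside the time range
        hits.extend(e for e in events
                    if earliest <= e["ts"] < latest
                    and e["host"] in hosts and keyword in e["text"])
    return hits

# Presentation process: here, simply print the matching events.
index = defaultdict(list)
now = int(time.time())
for line in [f"{now - 60} web01 GET /cart from 10.0.0.5",
             f"{now - 30} app01 order placed from 10.0.0.5"]:
    index_process(index, timestamp_process(line))

for event in search_process(index, {"web01", "app01"}, "10.0.0.5", now - 3600, now + 1):
    print(event)
```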

Time indexing

A. Time Bucketing
As a simple policy, each time bucket can handle one hour's worth of data. Alternate policies might vary the bucket extents from one time period to another. For example, a bucketing policy may specify that the buckets for events from earlier than today are three-hour buckets, but that the buckets for events occurring during the last 24 hours are hashed by the hour.
In order to improve efficiency further, buckets are instantiated using a lazy allocation policy (as late as possible) in primary memory (RAM). In-memory buckets have a maximum capacity and, when they reach their limit, they will be committed to disk and replaced by a new bucket. Bucket storage size is another element of the bucketing policy and varies along with the size of the temporal extent. Finally, bucket policies typically enforce that buckets (a) do not overlap, and (b) cover all possible incoming timestamps.
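A rough sketch of the lazy allocation and commit-on-full behaviour, assuming a fixed one-hour extent, a 10,000-event capacity, and a JSON file per committed bucket (all illustrative choices):

```python
import json, os

BUCKET_SECONDS = 3600              # assumed fixed one-hour extent
MAX_EVENTS_PER_BUCKET = 10_000     # assumed capacity before a commit is forced

class BucketManager:
    def __init__(self, data_dir="buckets"):
        self.data_dir = data_dir
        self.in_memory = {}        # bucket start time -> list of events
        os.makedirs(data_dir, exist_ok=True)

    def add(self, event):
        start = event["ts"] - event["ts"] % BUCKET_SECONDS
        bucket = self.in_memory.get(start)
        if bucket is None:                         # lazy allocation: created on first use
            bucket = self.in_memory[start] = []
        bucket.append(event)
        if len(bucket) >= MAX_EVENTS_PER_BUCKET:   # full: commit to disk, start afresh
            self._commit(start, bucket)
            self.in_memory[start] = []

    def _commit(self, start, events):
        """Write the in-memory bucket out; one time span may yield several files."""
        path = os.path.join(self.data_dir,
                            f"{start}_{len(os.listdir(self.data_dir))}.json")
        with open(path, "w") as f:
            json.dump(events, f)

mgr = BucketManager()
mgr.add({"ts": 1_700_000_123, "text": "GET /index.html"})
```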

Each incoming event is assigned to the time bucket whose temporal criteria match the event's timestamp. In one implementation, half-open intervals are used, defined by a start time and an end time, where the start time is an inclusive boundary and the end time is an exclusive boundary. This ensures that events occurring on bucket boundaries are uniquely assigned to a single bucket.
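A minimal sketch of the half-open boundary rule; the Bucket fields and the example timestamps are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Bucket:
    start: int   # inclusive boundary, epoch seconds
    end: int     # exclusive boundary, epoch seconds

    def contains(self, ts):
        return self.start <= ts < self.end

early = Bucket(start=3600, end=7200)
late = Bucket(start=7200, end=10800)
print(early.contains(7200), late.contains(7200))   # False True: the boundary event
                                                   # is assigned to only one bucket
```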

B. Segmentation
Once an appropriate bucket has been identified for an event, the raw event data is segmented. A segment is a substring of the incoming event text, and a segmentation is the collection of segments implied by the segmentation algorithm applied to the incoming event data.
A segment substring may overlap another substring, but if it does, it must be contained entirely within that substring. We allow this property to apply recursively to the containing substring so that the segment hierarchy forms a tree on the incoming text.
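One way to picture such a containment hierarchy, with delimiters chosen purely for illustration (the actual segmentation algorithm is not specified here):

```python
import re

def segment(text):
    """Return (outer_segment, [inner_segments]) pairs with character offsets."""
    segments = []
    for outer in re.finditer(r"\S+", text):                 # outer: whitespace-split tokens
        inner = [(m.start() + outer.start(), m.end() + outer.start(), m.group())
                 for m in re.finditer(r"[^.:=/]+", outer.group())]  # inner: sub-tokens
        segments.append(((outer.start(), outer.end(), outer.group()), inner))
    return segments

for outer, inner in segment("src=10.2.1.7:8080 status=200"):
    print(outer)
    for sub in inner:
        print("   ", sub)   # each inner segment lies entirely inside its outer segment
```

Because every inner segment is fully contained in its outer token, the segments form a tree over the raw text, as described above.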

C. Archiving and Indexing Events
The index is split into two separate phases: hot indexing and warm indexing. Hot indexes are managed entirely in RAM, are optimized for the smallest possible insert time, are not searchable, and do not persist. Warm indexes are searchable and persistent, but immutable. When hot indexes need to be made searchable or need to be persistent, they are converted into warm indexes.
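A toy sketch of the hot-to-warm conversion, assuming a keyword-to-event-id posting layout and a JSON file per warm index (neither is claimed to be the real on-disk format):

```python
import json, os, uuid
from collections import defaultdict

class HotIndex:
    """RAM-only structure tuned for fast inserts; not searchable, not persistent."""
    def __init__(self, bucket_start):
        self.bucket_start = bucket_start
        self.postings = defaultdict(list)         # keyword -> list of event ids

    def insert(self, event_id, keywords):
        for kw in keywords:
            self.postings[kw].append(event_id)    # append-only, no sorting on insert

    def to_warm(self, index_dir="warm"):
        """Freeze into a sorted, persistent, searchable (and immutable) warm index."""
        os.makedirs(index_dir, exist_ok=True)
        path = os.path.join(index_dir, f"{self.bucket_start}_{uuid.uuid4().hex}.json")
        with open(path, "w") as f:
            json.dump({kw: sorted(ids) for kw, ids in sorted(self.postings.items())}, f)
        return path

def search_warm(path, keyword):
    """Warm indices are searchable: read-only lookups against the frozen file."""
    with open(path) as f:
        return json.load(f).get(keyword, [])

hot = HotIndex(bucket_start=3600)
hot.insert(1, ["error", "10.0.0.5"])
hot.insert(2, ["10.0.0.5"])
warm_path = hot.to_warm()
print(search_warm(warm_path, "10.0.0.5"))   # [1, 2]
```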

During the course of the indexing process, it is possible that a single time bucket will be filled and committed to disk several times. This will result in multiple, independently searchable indices in secondary storage for a single time span. In an exemplary implementation, there is a merging process that takes as input two or more warm indices and merges them into a single warm index for that time bucket. This is a performance optimization and is not strictly required for searching.
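The merge step might look roughly like this, reusing the hypothetical JSON layout from the previous sketch:

```python
import json
from collections import defaultdict

def merge_warm(paths, out_path):
    """Combine several warm index files covering the same bucket into one file."""
    merged = defaultdict(list)
    for path in paths:
        with open(path) as f:
            for keyword, event_ids in json.load(f).items():
                merged[keyword].extend(event_ids)
    with open(out_path, "w") as f:
        json.dump({kw: sorted(set(ids)) for kw, ids in sorted(merged.items())}, f)
    return out_path

# e.g. merge_warm(["warm/3600_a.json", "warm/3600_b.json"], "warm/3600_merged.json")
```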
