Big Data refers to any collection of data so large and complex that it exceeds the processing capability of conventional data management systems and techniques.
Volume refers to the large size of data, often in the scale of terabytes to petabytes.
Variety refers to the increased diversity in data.
Structural variety、Media variety、Semantic variety、Availability variety
Velocity refers to the
increasing speed at which big data is created
Increasing speed at which the data needs to be stored and analyzed
Veracity refers to the quality of data.
Data type:
Structure Data:table、Semi-Structured Data:xml、Unstructured Data:Images
For automated data process system, consider the following example:
Walmart is the largest retail corporation of discount department and warehouse stores in the world. it collects all the customer transaction data for further processing (e.g., count the turnover of the day), and adopts the Apache Cassandra to store the processing results for future visualization applications. Design a high-level Big Data processing system for Walmart. Draw a diagram to specify the data processing pipeline. Give major functions of each component in the diagram
对于自动化数据处理系统,考虑以下示例:
沃尔玛是世界上最大的打折百货商店和仓储商店零售公司。它收集所有的客户交易数据进行进一步的处理(例如,计算当天的营业额),并采用Apache Cassandra存储处理结果,以供未来的可视化应用。为沃尔玛设计一个高水平的大数据处理系统。画一个图来指定数据处理管道。在图中给出每个组件的主要功能
Answer:
You are required to understand the proximity measure for binary attributes, especially the dissimilarity between asymmetric binary variables. Besides, you should be able to calculate various distances such as Manhattan distance, minkowski distance, supremum distance, etc. In addition, you should be able to calculate median, mean, approximate median, etc.
2. 了解你的数据
您需要理解二元属性的接近性度量,特别是非对称二元变量之间的不相似性。此外,你应该能够计算各种距离,如曼哈顿距离,闵可夫斯基距离,最高距离等。此外,你应该能够计算中位数、平均值、近似中位数等。
You are required to understand the architecture of Hadoop (racks, data center, etc.), how to calculate the distance of data nodes. The characteristics of files that are suitable for HDFS.
A 200-DataNode Hadoop cluster has 24 TB of disk space per DataNode. The replication factor of data block is 3.
Answer:
You are required to go over carefully for each example you have learned in class (read the examples carefully in course slides).
GREEDY ALGORITHM
(最高父系-本身)值*(本身+子系)个数=结果,结果越高优先选择。
The second choice 不包含The first choice的子系个数
The first choice 选择最小值,The second choice 选择最大值。
An online shopping store adopts the Apache Cassandra to store and manage the shopping cart data. Below is the Entity-Relationship Diagram (ERD) which models the relationships between three entities, i.e., users, shopping carts and product items. (16 marks)
Suppose the expected queries to the Cassandra database are as follows: