大数据课程学习

Characteristics of Big Data

Big Data refers to any collection of data so large and complex that it exceeds the processing capability of conventional data management systems and techniques.

Volume refers to the large size of data, often in the scale of terabytes to petabytes.

Variety refers to the increased diversity in data.

Structural variety、Media variety、Semantic variety、Availability variety

Velocity refers to the

increasing speed at which big data is created

Increasing speed at which the data needs to be stored and analyzed

Veracity refers to the quality of data.

Data type:

Structure Data:table、Semi-Structured Data:xml、Unstructured Data:Images

the problems of traditional RDBMS for handling big data

  • Fault:it have to restarts the query.
  • Performance fluctuations: wait for the slowest node.
  • No open-source parallel database and commercial ones are expensive.

For automated data process system, consider the following example:

Walmart is the largest retail corporation of discount department and warehouse stores in the world. it collects all the customer transaction data for further processing (e.g., count the turnover of the day), and adopts the Apache Cassandra to store the processing results for future visualization applications. Design a high-level Big Data processing system for Walmart. Draw a diagram to specify the data processing pipeline. Give major functions of each component in the diagram

对于自动化数据处理系统,考虑以下示例:

沃尔玛是世界上最大的打折百货商店和仓储商店零售公司。它收集所有的客户交易数据进行进一步的处理(例如,计算当天的营业额),并采用Apache Cassandra存储处理结果,以供未来的可视化应用。为沃尔玛设计一个高水平的大数据处理系统。画一个图来指定数据处理管道。在图中给出每个组件的主要功能

Answer:

  1. Apache Sqoop extracts customer transaction records from the transactional database.
  2. Apache HDFS stores the data in a scalable and reliable distributed manner.
  3. Apache Spark provides efficient distributed data processing and analysis.
  4. Apache Cassandra stores the analysis results for visualization.
  5. Apache Tableau provides interactive data-visualization for business decision
  1. Getting to know your data

You are required to understand the proximity measure for binary attributes, especially the dissimilarity between asymmetric binary variables. Besides, you should be able to calculate various distances such as Manhattan distance, minkowski distance, supremum distance, etc. In addition, you should be able to calculate median, mean, approximate median, etc.

2.      了解你的数据

您需要理解二元属性的接近性度量,特别是非对称二元变量之间的不相似性。此外,你应该能够计算各种距离,如曼哈顿距离,闵可夫斯基距离,最高距离等。此外,你应该能够计算中位数、平均值、近似中位数等。

大数据课程学习_第1张图片

大数据课程学习_第2张图片

 大数据课程学习_第3张图片

大数据课程学习_第4张图片

  1.  HDFS

You are required to understand the architecture of Hadoop (racks, data center, etc.), how to calculate the distance of data nodes. The characteristics of files that are suitable for HDFS.

A 200-DataNode Hadoop cluster has 24 TB of disk space per DataNode. The replication factor of data block is 3.

    1. Calculate the maximum size of the file(s) that can be stored in this cluster.

Answer:

  • Maximum size = 24TB * 200 / 3 = 1600 TB
    1. Assume that the block size is 400MB, average metadata size for each block is 100 Bytes. Estimate the minimum NameNode RAM size. 
  • Answer:
  • #file blocks (don’t concern replica):
    1600TB/400MB = 1600*1024*1024 MB/400MB = 4096*1024
  • Total metadata size:
    4096*1024* 100B = 400*1024*1024 B = 400 MB
  • Total metadata size is the minimum NameNode RAM size

  1. Selective materialization

You are required to go over carefully for each example you have learned in class (read the examples carefully in course slides).

GREEDY ALGORITHM

大数据课程学习_第5张图片

(最高父系-本身)值*(本身+子系)个数=结果,结果越高优先选择。

The second choice 不包含The first choice的子系个数

大数据课程学习_第6张图片

The first choice 选择最小值,The second choice 选择最大值。

大数据课程学习_第7张图片

  1. Apache Cassandra (read the PPT slides of P39-45 of Chapter 6 carefully,understand how to use the ER model and given queries to do logical data modeling (Chebotko logical diagram)).

An online shopping store adopts the Apache Cassandra to store and manage the shopping cart data. Below is the Entity-Relationship Diagram (ERD) which models the relationships between three entities, i.e., users, shopping carts and product items. (16 marks)

  • A user has a unique id and may have other attributes like email.
  • A product item has a unique id and other information like name and price.
  • A shopping cart has a unique id and name.
  • While a user can create many shopping carts, each cart must belong to exactly one user.
  • A shopping cart can have many items and a catalog item can be added to many carts. An item entry in a cart is further described by a timestamp and desired quantity.

 大数据课程学习_第8张图片

Suppose the expected queries to the Cassandra database are as follows:

  • Q1: Find all information about a given user, such as its name and email.
  • Q2: Find all shopping carts that belong to a given user, order by cart id (ASC).
  • Q3: Find all items in a given shopping cart, order by timestamp (DESC).
  • Q4: Find information about a given item, such as its name and price.
  • Q5: Find all shopping carts that a given item has been added to within a given time range, order by cart id (ASC). 

  • Design a logical data model of the Cassandra database                                                                        大数据课程学习_第9张图片

你可能感兴趣的:(学习)