Categorical Data
A categorical variable is also known as a discrete or qualitative variable and can have two or more categories.
It is further divided into two variants, nominal and ordinal.
These variables are sometimes coded as numerical values, or as strings.
Nominal data is unordered categorical data. Such a variable may be “label-coded” in numeric form, but the numerical values have no mathematical interpretation and serve only as labels that denote categories. For example, the colours black, red and white can be coded as 1, 2 and 3.
A dichotomous variable is a type of nominal data that can have only two possible values, e.g. true or false, or presence or absence. These are sometimes also referred to as binary or Boolean variables.
e.g.: true (1) or false (0)
Ordinal data is ordered categorical data in which there is a strict order for comparing the values, so labelling the values as numbers is not completely arbitrary. For example, human height (short, medium and tall) can be coded as short = 1, medium = 2, tall = 3.
– Values are ordered
– No distance is implied
– E.g. rank, agreement
Quantitative Data
An interval variable is one in which the interval between values has meaning but there is no true zero value.
A ratio variable has a true zero value, which represents the total absence of the quantity being measured. For example, it makes sense to say that a Kelvin temperature of 100 is twice as hot as a Kelvin temperature of 50, because it represents twice as much thermal energy (unlike Fahrenheit temperatures of 100 and 50).
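The Kelvin versus Fahrenheit comparison can be checked with a little arithmetic. The sketch below is only an illustration; the helper name `fahrenheit_to_kelvin` is an assumption introduced here, not something from the notes.

```python
def fahrenheit_to_kelvin(f: float) -> float:
    """Convert a Fahrenheit reading to Kelvin (a scale with a true zero)."""
    return (f - 32.0) * 5.0 / 9.0 + 273.15

# Ratio scale: Kelvin has a true zero, so dividing two readings is meaningful.
kelvin_ratio = 100.0 / 50.0

# Interval scale: 100 F is not "twice" 50 F. Converting both readings to
# Kelvin first shows the actual ratio on the absolute scale.
true_ratio = fahrenheit_to_kelvin(100.0) / fahrenheit_to_kelvin(50.0)

print(kelvin_ratio)
print(round(true_ratio, 3))
```

The second ratio is only slightly above 1, which is why ratios of Fahrenheit readings are not meaningful while differences still are.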
Samples from two populations with the same mean but different variances. The red population has mean 100 and variance 100 (SD=10) while the blue population has mean 100 and variance 2500 (SD=50).
Pandas provides various functions for handling missing or wrong data.
– Part of this is already included in the input functions (cf. read_csv()), where missing values are automatically replaced with NA/NaN
# drop missing values, returning a new Series
data2 = data['numGen'].dropna()
# replace missing values with 0
data['numGen'] = data['numGen'].fillna(0)
# replace the placeholder string '<Null>' with 0
data['numGen'] = data['numGen'].replace(to_replace='<Null>', value=0)
Some datasets contain placeholders for missing values, such as ‘n/a’, ‘–’ or ‘null’.
It is best to replace these during import to avoid problems later:
import pandas as pd
missing_values = ["--", "<Null>"]
data = pd.read_csv('MajorPowerStations.csv', na_values=missing_values)
SciPy includes various correlation statistics, each of which yields a number between -1 and 1.
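To make the "-1 to 1" range concrete, here is a minimal pure-Python sketch of the Pearson correlation coefficient. It is not SciPy's implementation; `scipy.stats.pearsonr` reports the same coefficient together with a p-value. The function name `pearson_r` and the sample lists are assumptions made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
increasing = [2, 4, 6, 8, 10]    # rises in lockstep with x
decreasing = [10, 8, 6, 4, 2]    # falls in lockstep with x

r_up = pearson_r(x, increasing)
r_down = pearson_r(x, decreasing)
print(r_up, r_down)
```

A perfectly increasing linear relationship sits at the upper end of the range and a perfectly decreasing one at the lower end.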
A database is a shared collection of logically related data and its description.
The database represents the entities (real-world things), the attributes (their relevant properties), and the logical relationships between the entities.
– Data is managed, so quality can be enforced by the DBMS
– Improved Data Sharing
• Different users get different views of the data
• Efficient concurrent access
– Enforcement of Standards
• All data access is done in the same way
– Integrity constraints, data validation rules
– Better Data Accessibility / Responsiveness
• Use of standard data query language (SQL)
– Security, Backup/Recovery, Concurrency
• Disaster recovery is easier
Program-Data Independence
– Metadata stored in DBMS, so applications don’t need to worry about data formats
– Data queries/updates managed by DBMS, so programs don’t need to process data access routines
– Results in:
• Reduced application development time
• Increased maintenance productivity
• Efficient access
– Table – an arrangement of related information stored in columns and rows.
– Field / Attribute – a column in a table; contains a homogeneous set of data.
– Field data types – the kind of data that can be stored in a field. For example, a field whose data type is Text can store data consisting of either text or number characters, but a Number field can store only numerical data.
– Primary Key (PK) – a field in a table whose value uniquely identifies each record in the table. A PK cannot be null (it must be given).
– Record – a row in a table.
Primary Key
– A primary key is a unique attribute which the database uses to identify a row in a table.
– It is typically a unique, auto-incrementing ID which is filled in by the database; in other words it is NEVER NULL (NULL has the special meaning in databases of “unknown” or “not given”)
– A primary ID number will only ever be issued once
Foreign Key
– When we need to refer to a record in a separate table we reference its ID as a foreign key.
– A foreign key is defined in a second table, but it refers to the primary key or a unique key in the first table.
One-to-One Relationship (1-1 Relationship):
A One-to-One (1-1) relationship is a relationship between two tables in which a row in one table is associated with at most one matching row in the other table.
One-to-Many Relationship (1-M Relationship):
A One-to-Many relationship is a relationship between two tables in which a row from one table can have multiple matching rows in another table.
Many-to-Many Relationship (M-N Relationship)
– SQL is the standard declarative query language for RDBMS
– Describing what data we are interested in, but not how to retrieve it.
– Supported commands from roughly two categories:
– DDL (Data Definition Language)
• Create, drop, or alter the relation schema
• Example:
CREATE TABLE name ( list_of_columns )
– DML (Data Manipulation Language)
• for retrieval of information also called query language
When creating a table, we can also specify Integrity Constraints for columns
– e.g. domain types per attribute, or NULL / NOT NULL
– Primary key: unique, minimal identifier of a relation.
– Examples include employee numbers, social security numbers, etc. This is how we can guarantee that all rows are unique.
– Foreign keys are identifiers that enable a dependent relation (on the many side of a relationship) to refer to its parent relation (on the one side of the relationship)
– Must refer to a candidate key of the parent relation
– Like a ‘logical pointer’
– Keys can be simple (single attribute) or composite (multiple attributes)
SQL supports various domain constraints to restrict attributes to valid domains
– NULL / NOT NULL: whether an attribute is allowed to become NULL (unknown)
– DEFAULT: to specify a default value
– CHECK ( condition ): a Boolean condition that must hold for every tuple in the db instance
gender CHAR CHECK (gender IN ('M','F','T')),
birthday DATE,
country VARCHAR(20),
level INTEGER DEFAULT 1 CHECK (level BETWEEN 1 AND 5)
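The key and domain constraints above can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch under assumptions: the table names `Country` and `Student`, the `sid` and `code` columns, and the sample rows are invented for illustration, and SQLite stands in for whichever DBMS the notes have in mind (SQLite only enforces foreign keys after the `PRAGMA` shown).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# Parent relation (the "one" side); names are hypothetical.
conn.execute("""
    CREATE TABLE Country (
        code CHAR(2) PRIMARY KEY,
        name VARCHAR(20) NOT NULL
    )""")
# Dependent relation (the "many" side) with domain constraints.
conn.execute("""
    CREATE TABLE Student (
        sid      INTEGER PRIMARY KEY,
        gender   CHAR CHECK (gender IN ('M','F','T')),
        birthday DATE,
        country  CHAR(2) REFERENCES Country(code),
        level    INTEGER DEFAULT 1 CHECK (level BETWEEN 1 AND 5)
    )""")
conn.execute("INSERT INTO Country VALUES ('AU', 'Australia')")
conn.execute("INSERT INTO Student (sid, gender, country) VALUES (1, 'F', 'AU')")

def rejected(sql):
    """Return True if the DBMS refuses the statement because of a constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

# DEFAULT fills in the level that the INSERT above left out.
default_level = conn.execute("SELECT level FROM Student WHERE sid = 1").fetchone()[0]
# CHECK rejects a level outside 1..5.
bad_level = rejected("INSERT INTO Student (sid, gender, level) VALUES (2, 'M', 9)")
# The foreign key rejects a country code with no parent row.
bad_country = rejected("INSERT INTO Student (sid, gender, country) VALUES (3, 'M', 'ZZ')")
print(default_level, bad_level, bad_country)
```

The point of the sketch is that the DBMS, not the application, refuses the invalid rows.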
Main Memory (RAM):
• Expensive
• Volatile
Secondary Storage (HDD):
• Cheap
• Stable
• BIG
Tertiary Storage (e.g. Tape):
• Very Cheap
• Stable
Key Challenge: Secondary storage needed for sheer data volume (and persistence), but it is slow.
– Block-wise transfer
• transfer data in fixed-size chunks (blocks or pages) between storage layers
– Caching / Buffering
• Keep ‘hot’ data in memory, use secondary storage for ‘cold’ data
– Optimised File Organisation
• Heap Files vs. Sorted Files; Row Stores vs Column Stores
– Indexing
– Partitioning
Many alternatives exist, each ideal for some situations, and not so good in others:
– Indexes – data structures to organize records via trees or hashing
– like sorted files, they speed up searches for a subset of records, based on values in certain (“search key”) fields
– Updates are much faster than in sorted files.
– Heap Files – a record can be placed anywhere in the file where there is space (random order)
– suitable when typical access is a file scan retrieving all records.
– Simplest file structure contains records in no particular order.
– Access method is a linear scan
• On average half of the pages in a file must be read, in the worst case even the whole file
• Efficient if all rows are returned (SELECT * FROM table)
• Very inefficient if a few rows are requested
– Rows appended to end of file as they are inserted
• Hence the file is unordered
– Deleted rows create gaps in file
• File must be periodically compacted to recover space
– Sorted Files – store records in sequential order, based on the value of the search key of each record
– best if records must be retrieved in some order, or only a ‘range’ of records is needed.
– Idea: Separate location mechanism from data storage
– Just remember a book index:
Index is a set of pages (a separate file) with pointers (page numbers) to the data page which contains the value
– Instead of scanning through the whole book (relation) each time, using the index is much faster to navigate (less data to search)
– Index is typically much smaller than the actual data
Here, the index is on the name attribute of the Stations table
– We say name is the search key for this index (it is the attribute which we use to look up data)
– Tree index: search keys are stored in sorted order in the index (this supports range queries on the search key)
– Hash index: search keys are distributed uniformly across “buckets” in the index using a “hash function”.
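The difference between the two index kinds can be sketched in plain Python. This is a toy model under assumptions: a sorted list searched with `bisect` stands in for a tree index, a `dict` stands in for a hash index, and the rows are made-up values. It is not how a DBMS implements either structure.

```python
import bisect

# Heap file: rows in arrival order (hypothetical name/capacity pairs).
rows = [("Delta", 40), ("Alpha", 10), ("Charlie", 30), ("Bravo", 20)]

# Hash index: search key -> row position; good for equality lookups only.
hash_index = {name: pos for pos, (name, _) in enumerate(rows)}

# Tree-like index: (search key, row position) entries kept in sorted order.
sorted_index = sorted((name, pos) for pos, (name, _) in enumerate(rows))
keys = [key for key, _ in sorted_index]

def lookup_equal(name):
    """Equality search through the hash index."""
    pos = hash_index.get(name)
    return None if pos is None else rows[pos]

def lookup_range(low, high):
    """Range search low <= name <= high via binary search on the sorted keys."""
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return [rows[pos] for _, pos in sorted_index[start:end]]

print(lookup_equal("Charlie"))
print(lookup_range("Bravo", "Charlie"))
```

The hash structure has no notion of key order, which is why only the sorted structure can answer the range query without scanning every entry.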
CREATE INDEX name ON relation-name (
CREATE INDEX StationNameIdx ON Stations(name)
To drop an index:
DROP INDEX index-name
In a clustered index, both index entries and rows with the actual data are ordered in the same way.
– The particular index structure (e.g. hash or tree) dictates how the index entries are organized in the storage structure
– For a clustered index, this then dictates how the data rows are organized
– There can be at most one clustered index on a table.
– e.g. the white pages of the phone book in alphabetical order
statement generally creates a clustered index on primary key.
– To have a clustered index on another attribute, in PostgreSQL use the command: CLUSTER table_name USING index_name
– In an unclustered index, index entries and rows are not ordered in the same way.
– There can be many secondary indices on a table.
– An index created by CREATE INDEX is generally an unclustered, secondary index.
– Goal: Is it possible to answer the whole query just from an index?
– Covering Index – an index that contains all attributes required to answer a given SQL query:
– all attributes from the WHERE filter condition
– if it is a grouping query, also all attributes from GROUP BY and HAVING
– all attributes mentioned in the SELECT clause
– Typically a multi-attribute index
– Order of attributes is important: the prefix of the search key must be the attributes from the WHERE clause
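A covering index can be observed with SQLite's query planner, used here through Python's `sqlite3` module as a stand-in for other systems. The `state` and `capacity` columns, the index name `StateCapIdx`, and the rows are assumptions for illustration; the plan wording is SQLite-specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Stations (id INTEGER PRIMARY KEY, name TEXT, state TEXT, capacity REAL)"
)
conn.executemany(
    "INSERT INTO Stations (name, state, capacity) VALUES (?, ?, ?)",
    [(f"Station{i}", ("NSW", "VIC", "QLD")[i % 3], float(i)) for i in range(300)],
)
# Multi-attribute index: the WHERE attribute comes first, then the SELECT attribute.
conn.execute("CREATE INDEX StateCapIdx ON Stations(state, capacity)")

def plan(sql):
    """Return the textual query plan SQLite reports for a statement."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Every attribute the query needs (state, capacity) is inside the index.
covered = plan("SELECT capacity FROM Stations WHERE state = 'NSW'")
# name is not in the index, so the table rows must be fetched as well.
not_covered = plan("SELECT name FROM Stations WHERE state = 'NSW'")

print(covered)
print(not_covered)
```

In the first query the data rows are never touched; in the second the index only locates the rows.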
Two main physical design techniques:
– Data Partitioning
– Storing sub-sets of the original data set at different places
• can be in different tables in schema on same server, or at remote sites
– Goal is to query smaller data sets & to gain scalability by parallelism
– Sub-sets can be defined by
• columns: Vertical Partitioning
• rows: Horizontal Partitioning
(if each partition is stored on a different site also called Sharding)
– Data Replication (Not covered in this unit of study)
– Storing copies (‘replicas’) of the same data at more than one place
– Goal is fail safety / availability
– Advantages of Partitioning:
– Easier to manage than a large table
– Better availability: if one partition is down, others are unaffected if stored on different tablespace / disk
– Helps with bulk loading, e.g. for data warehouse applications
– Queries faster on smaller partitions; can be evaluated in parallel
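Horizontal and vertical partitioning can be illustrated on an in-memory list of rows. This is a conceptual sketch only; the rows, the column names and the two helper functions are assumptions, and a real DBMS would place each partition in its own table, tablespace or site.

```python
rows = [  # made-up rows
    {"id": 1, "name": "A", "state": "NSW", "capacity": 10},
    {"id": 2, "name": "B", "state": "VIC", "capacity": 20},
    {"id": 3, "name": "C", "state": "NSW", "capacity": 30},
]

def horizontal_partition(rows, key):
    """Split by rows: each partition holds the rows sharing one value of `key`."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[key], []).append(row)
    return partitions

def vertical_partition(rows, key, column_groups):
    """Split by columns: every partition keeps `key` so rows can be re-joined."""
    return [
        [{col: row[col] for col in (key, *cols)} for row in rows]
        for cols in column_groups
    ]

by_state = horizontal_partition(rows, "state")
names_part, details_part = vertical_partition(rows, "id", [("name",), ("state", "capacity")])

print(sorted(by_state))
print(names_part[0], details_part[0])
```

A query that filters on state only needs to touch one horizontal partition, and the partitions can be scanned in parallel.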
– Reconnaissance
– Identify source, and check its structure and content
– Webpage Retrieval
– Download one or multiple pages from source
– Typically in a script or program that auto-generates new URLs based on website structure and its URL format
– Data Extraction from webpage
– Content parsing, raw data extraction
– Data Cleaning and transformation into required format
– Data Storage / Analysis / combining with other data sets
– Many websites provide a robots.txt file
– Meant for web crawlers, which should check this content first before starting to crawl a website
– Different rules in here
• Crawling/scraping allowed at all?
• Only specific subdirectories?
• Only certain programs (“user-agent”)?
• Which frequency (“request-rate”)?
Cf. https://en.wikipedia.org/wiki/Robots_exclusion_standard
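Python's standard library can interpret these rules. The sketch below parses a made-up robots.txt from a string so that it runs without network access; the file content, the crawler name `my-crawler` and the `example.com` URLs are all assumptions.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt; real sites publish theirs at /robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask before fetching: is this user-agent allowed to request this URL?
allowed_public = parser.can_fetch("my-crawler", "https://example.com/ships/index.html")
allowed_private = parser.can_fetch("my-crawler", "https://example.com/private/data.html")
# Seconds the site asks crawlers to wait between requests (None if unspecified).
delay = parser.crawl_delay("my-crawler")

print(allowed_public, allowed_private, delay)
```

For a live site, `set_url()` followed by `read()` would download the file instead of parsing a string.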
– Be a good net citizen:
Check, ask, don’t overload – and don’t steal (check copyright!)
– Web scraping in itself is not illegal; you are free to save publicly available data from the internet to your computer.
– The way you use that data is what might be illegal.
– Please read the website terms and conditions, and robots.txt, and make sure you are not doing anything illegal
– URL – Uniform Resource Locator
– “address” format on the web
– Example:
• https://convictrecords.com.au/ships/adamant/1821
– General Format
• protocol://site/path_to_resource
• Typical protocols: http https ftp
– Can be scripted or programmed; more details later and in tutorials
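The general URL format can be taken apart with Python's standard `urllib.parse`. The helper `ship_url` and the idea that the path follows a `/ships/<ship>/<year>` pattern are assumptions drawn from the single example URL, shown only to illustrate how a script might generate new URLs.

```python
from urllib.parse import urlparse

url = "https://convictrecords.com.au/ships/adamant/1821"
parts = urlparse(url)

protocol = parts.scheme   # protocol
site = parts.netloc       # site
path = parts.path         # path_to_resource

def ship_url(ship, year):
    """Hypothetical helper: rebuild a URL from the pattern seen in the example."""
    return f"{protocol}://{site}/ships/{ship}/{year}"

print(protocol, site, path)
print(ship_url("adamant", 1821))
```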
– Webpages are written in HTML
– Textual markup language that defines structure, content, and design of a page as well as active elements (scripts, forms, etc.)
– Typically several additional files linked:
• CSS - cascading style sheets
• Scripts, Images, videos etc.
– Head
– title, style sheets, scripts, meta-data
– Body
– headings, text, lists, tables, images, forms etc.
– Four options:
– text patterns
– DOM navigation
– CSS selectors
– XPath expressions
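Two of these options can be sketched with the standard library alone: a text pattern (regular expression) and a walk over the tag structure with `html.parser`, which is event-based rather than a full DOM. CSS selectors and XPath over HTML typically need third-party libraries and are not shown. The HTML snippet and the second link are made up for illustration.

```python
import re
from html.parser import HTMLParser

html = """<html><head><title>Ship list</title></head>
<body><h1>Ships</h1>
<ul>
<li><a href="/ships/adamant/1821">Adamant</a></li>
<li><a href="/ships/example/1900">Example</a></li>
</ul>
</body></html>"""

# Option 1: text pattern. Short, but brittle if the markup changes.
links_by_pattern = re.findall(r'href="([^"]+)"', html)

# Option 2: follow the tag structure and collect href attributes of <a> tags.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

collector = LinkCollector()
collector.feed(html)
links_by_parser = collector.links

print(links_by_pattern)
print(links_by_parser)
```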
– Many websites or web services provide programmable APIs which allow you to explicitly request data for a program to process, instead of pages to view in a browser
– HTML, XML and JSON are examples of so-called semistructured data models
– data with non-rigid structure
– Characteristics of semistructured data
– Missing or additional attributes
– Multiple attributes
– Nesting: semistructured objects (‘documents’) are hierarchical / have tree-structure
– Different types in different objects
– Heterogeneous collections
Self-describing, irregular data, no a priori structure
– While HTML is mainly for web page design, XML is the more structured “cousin” for data exchange
– Some web services can be asked to send XML rather than HTML pages
– Also common in enterprise data exchange, or open data sets
– XML refers to its objects as elements
– The top-most element is called the root or document element.
– Elements are bound by tags:
– Tree structure! (not a graph)
– Sole data type for leaf elements: PCDATA (parsed character data)
– DOM Navigation
– XML documents represent a tree structure which can be navigated using XML’s Document Object Model (DOM)
– XPath
– XPath expressions allow querying single values, node(s) or whole subtrees within one XML document
– XQuery
– XQuery builds on XPath to specify a declarative query language over a set of XML documents
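Tree navigation and XPath-style queries can be tried with Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath (XQuery is not covered by the standard library). The XML document below is made up for illustration.

```python
import xml.etree.ElementTree as ET

xml_text = """<stations>
  <station id="s1"><name>Alpha</name><capacity>10</capacity></station>
  <station id="s2"><name>Bravo</name><capacity>20</capacity></station>
</stations>"""

root = ET.fromstring(xml_text)   # root (document) element

# Tree navigation: first child of the root, then its <name> child.
first_name = root[0].find("name").text

# XPath subset: all <name> elements below <station> elements.
names = [elem.text for elem in root.findall("./station/name")]

# XPath subset with an attribute predicate.
s2_capacity = root.find("./station[@id='s2']/capacity").text

print(root.tag, first_name, names, s2_capacity)
```

Leaf values come back as character data (strings), so `s2_capacity` has to be converted before doing arithmetic with it.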
– Relational World
– Schema-first, rich type system for attributes, integrity constraints
– “First Normal Form”: only atomic type attributes allowed
– Semi-structured World
– Self-describing data with flexible structure
– Nested data model with tree-structure
– optional attributes, grammar, schema and vocabulary
– Traditional DBMS platforms were relational (SQL as query language; relational data model) and also powerful (lots of features for integrity, security, tuning), expensive, resource-intensive, hard to administer
– Mostly focused on scale-up (run on powerful expensive servers to get excellent performance)
– Rise of cloud computing shifted focus to scale-out on many commodity simple servers, with fault-tolerance
– New systems were designed, and described as “NoSQL” because they gave up features of traditional platforms
– Simpler data model, simpler queries and updates (e.g. without cross-table joins or triggers), weaker guarantees for consistency and integrity
– Often open-source and sometimes free
– Over time, the new platforms added features like joins, triggers and integrity (under pressure from users) while old platforms added support for more diverse data models
– The phrase “Not only SQL” has been used for these systems
– Basically a JSON store (JSON type system)
– Flexible schema: Documents in a collection do not need to have the same structure
– All documents have an object ID (_id) – either user-defined or automatically generated
– Relationships: either via nested documents (“embedded sub-documents”) or using references
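The two ways of modelling relationships can be sketched with plain Python dictionaries and the `json` module. This is not the API of any particular document store; the collections, field names other than `_id`, and values are assumptions for illustration.

```python
import json

# Option 1: embedded sub-document. The customer lives inside the order.
order_embedded = {
    "_id": 1,
    "item": "lamp",
    "customer": {"name": "Ada", "city": "Sydney"},
}

# Option 2: reference. The order stores only the _id of a customer document.
customers = {"c1": {"_id": "c1", "name": "Ada", "city": "Sydney"}}
order_referencing = {"_id": 2, "item": "desk", "customer_id": "c1", "gift": True}

def customer_city(order):
    """Resolve the customer's city for either modelling style."""
    if "customer" in order:                       # embedded: one lookup
        return order["customer"]["city"]
    return customers[order["customer_id"]]["city"]  # reference: a second lookup

# Flexible schema: the two orders have different fields, and both are valid JSON.
as_json = json.dumps(order_embedded)

print(customer_city(order_embedded), customer_city(order_referencing))
```

Embedding avoids the second lookup, while references avoid duplicating the customer in every order.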
Text data usually does not have a pre-defined data model, is unstructured and is typically text-heavy, but may contain dates, numbers and facts as well.
This results in ambiguities that make it more difficult to understand than data in structured databases.
– Supervised learning – predict a value where truth is available in the training data
– Prediction
– Classification (categorical, discrete labels), Regression (quantitative, numeric values)
– Unsupervised learning – find patterns without ground truth in training data
– Clustering
– Probability distribution estimation
– Finding association (in features)
– Dimension reduction
Other tasks: Semi-supervised learning, Reinforcement learning
Split a string (document) into pieces called tokens
– Possibly remove some characters, e.g., punctuation
– Remove “stop words” such as “a”, “the”, “and” which are considered irrelevant
Map similar words to the same token
– Stemming/lemmatisation
– Avoid grammatical and derivational sparseness
– E.g., “was” => “be”
– Lower casing, encoding
– E.g., “Naïve” => “naive”
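The tokenisation and normalisation steps can be sketched with the standard library. The stop-word set repeats the three examples from the notes, and the one-entry `LEMMAS` lookup is a toy stand-in for a real stemmer or lemmatiser; both, along with the function names, are assumptions.

```python
import re
import unicodedata

STOP_WORDS = {"a", "the", "and"}
LEMMAS = {"was": "be"}   # toy lookup standing in for real lemmatisation

def normalise(token):
    """Lower-case, strip accents, then map the token to its lemma if known."""
    token = token.lower()
    decomposed = unicodedata.normalize("NFKD", token)
    token = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return LEMMAS.get(token, token)

def tokenise(text):
    """Split into word tokens (dropping punctuation), normalise, remove stop words."""
    raw_tokens = re.findall(r"\w+", text)
    tokens = [normalise(t) for t in raw_tokens]
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenise("The Na\u00efve cat was a hero, and proud.")
print(tokens)
```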
Binary indicator feature for each word in a document
– Ignore frequencies
Term frequency
– Give more weight to terms that are common in a document
– TF = |occurrences of term in doc|
– Damping
– Sometimes want to reduce impact of high counts
– TF = log(|occurrences of term in doc|)
Inverse document frequency (IDF)
– Give less weight to terms that are common across documents
• deals with the problems of the Zipf distribution
– IDF = log(|total docs|/|docs containing term|)
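The TF and IDF formulas above translate directly into a few lines of Python. The three tiny tokenised documents are made up, and returning 0 for a term that occurs in no document is a choice made here to avoid division by zero, not something the notes specify. Multiplying TF by IDF gives the usual TF-IDF weight.

```python
import math
from collections import Counter

docs = [  # made-up, already tokenised documents
    ["data", "base", "data", "index"],
    ["data", "mining"],
    ["index", "tree", "tree", "tree"],
]

def tf(term, doc, damped=False):
    """TF = |occurrences of term in doc|, optionally damped with a logarithm."""
    count = Counter(doc)[term]
    if damped:
        return math.log(count) if count > 0 else 0.0
    return count

def idf(term, docs):
    """IDF = log(|total docs| / |docs containing term|)."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0   # assumption: unseen terms get zero weight
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf("data", docs[0]), tf("data", docs[0], damped=True))
print(idf("data", docs), idf("tree", docs))
print(tf_idf("data", docs[0], docs), tf_idf("tree", docs[2], docs))
```

"tree" appears in only one document, so it receives a higher IDF than "data", which appears in two.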
Documents are represented as vectors in term space
– Terms are usually stems
– Document vector values can be weighted by, e.g., frequency
– Queries represented the same as documents
All document vectors together: Document-Term-Matrix (Feature-Matrix)
– Spatial data is about objects and entities which have a location and/or a geometry
– A special form is geospatial data, which refers to data or information that identifies the geographic location of features and boundaries on Earth (such as localities, cities, suburbs etc.)
– Spatial Database Management System (SDBMS)
– Handles large amounts of spatial data stored in secondary storage.
– Spatial semantics built into query language
– Specialized index structure to access spatial data
– Geographic Information System (GIS)
– SDBMS Client
– Characterized by a rich set of geographic analysis functions
– SDBMS allows GIS to scale to large databases, which are now becoming the norm
– Information in a GIS is typically organized in “layers”.
• For example a map will have a layer of “roads”, “train stations”, “suburbs” and “water bodies”.
• GIS allows data exploration and integration across layers.
– Object model concepts
– Objects: distinct identifiable things relevant to an application
• Objects have attributes and operations
– Attribute: a simple (e.g. numeric, string) property of an object
– Operations: functions that map object attributes to other objects
– Geometry type (plane):
– shapes on a plane; shortest path between two points is a straight line
– Geography type (sphere):
– Basis is a sphere; shortest path between two points is a great-circle arc
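The two distance notions can be compared numerically. The sketch below uses the Euclidean formula for the plane and the haversine formula for the sphere; the function names and the mean Earth radius of 6371 km used as a default are choices made here for illustration, not values from the notes.

```python
import math

def planar_distance(p, q):
    """Geometry type: straight-line (Euclidean) distance on a plane."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def great_circle_distance(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Geography type: arc length on a sphere (haversine formula), in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

flat = planar_distance((0, 0), (3, 4))
# A quarter of the way around the equator: from longitude 0 to longitude 90.
quarter_equator = great_circle_distance(0, 0, 0, 90)

print(flat, round(quarter_equator, 1))
```

Treating latitude and longitude as planar coordinates would give a "distance" in degrees, which is why the geography type needs its own distance computation.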
– Almost all data is qualified with time (period or point)
– Web stores
– Data warehousing
– Medical records, loans, …
– Sensor data and time series
– Transport information
– Limited support for temporal data management in DBMSs
– Conventional (non-temporal) DBs represent a static snapshot
– Management of temporal aspects is implemented by the application
• Adds additional complexity to application programs
– Some time data types and functions available in SQL, e.g., DATE, TIME, DATEADD(), DATEDIFF()
• SQL:2011 added support for temporal tables
• Still very limited query support
– A temporal database provides built-in support for the management of temporal data/time
– Representation of various temporal aspects, e.g., valid time, transaction time
– Support for multiple calendars and granularities
– Easy formulation of complex queries over time
– Queries over and modification of previous states
– SQL supports time instants and intervals (but no periods)
– Instant data types:
– DATE
• SQL-92: day, month and year of a time instant (from year 1 to 9999)
• PostgreSQL: date (no time of day) from 4713 BC to 5874897 AD
– TIMESTAMP
• SQL-92: date + time with variable resolution of fractions of a second (default: 1 µs)
• PostgreSQL: date + time from 4713 BC to 294276 AD with 1 µs resolution; optional time zone
– TIME
• SQL-92: hours, minutes, seconds and optional fractional digits of a second
• not really a time instant (no date!); in PostgreSQL with 1 µs resolution
– Interval data types:
– Various specification options, e.g. Year-Month Intervals: INTERVAL YEAR TO MONTH
– Many DBMS only support time instants, but no intervals
– Must hence be simulated with two time instants (start + end)
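Simulating a span of time with two time instants can be sketched with Python's `datetime.date`. The half-open convention (start included, end excluded), the `overlaps` helper and the sample dates are assumptions made for illustration.

```python
from datetime import date

def overlaps(start_a, end_a, start_b, end_b):
    """True if the half-open spans [start_a, end_a) and [start_b, end_b) share a day."""
    return start_a < end_b and start_b < end_a

# Each span is stored as two time instants: (start, end).
price_valid = (date(2023, 1, 1), date(2023, 7, 1))
promo_valid = (date(2023, 6, 15), date(2023, 8, 1))
later_price = (date(2023, 7, 1), date(2023, 9, 1))

# The length of a span falls out of subtracting the two instants.
price_days = (price_valid[1] - price_valid[0]).days

print(overlaps(*price_valid, *promo_valid))   # the spans share late June
print(overlaps(*price_valid, *later_price))   # they only touch at the boundary
print(price_days)
```

With the half-open convention, consecutive spans can share a boundary date without counting as overlapping.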
– User-defined time
– Described by Snodgrass as ‘an uninterpreted time interval’
– E.g. a birthdate or a publication time
– Valid Time & Transaction Time
– Cf. following examples
– A table can be associated with none, one, two or all three kinds of time
– Current
– “What is now?”
• E.g. “How many products do we currently have in stock?”
– Sequenced
– “What was, and when?”
• E.g. “Give the sequence of how many products were in stock.”
• Or “When did the stock level fall below X in the past?”
– Very central, but not directly supported by SQL!
– Nonsequenced
– “What was at any time?”
• E.g. “How many products A did we have at any time in stock?”
– Valid time records the time when a fact is true in the real world.
– Can move forward and backward
– Transaction time records the history of database activity.
– Only moves forward (as you cannot go back in history and change things – alas!)
– Therefore allows rollback (very useful for auditing)
– Images can be described as vector graphics or raster data
– Raster images
– Matrix with fixed number of rows and columns
– Digital images consist of a fixed number of picture elements, called pixels
– Each pixel represents brightness of a given color
• Color depth => different number of channels
– Raster images can be created in multiple ways
– Digital photography / video
– Image sensors in (scientific) instruments (e.g. satellite images, astronomy, DNA sequencers, microscopes, …)
– Scanners
– Medical instruments (e.g. X-ray, CET, MRT)
TrueColor or RGB Image
Gray-scale image
Binary image
– Image Enhancement: Processing an image so that the result is more suitable for a particular application (sharpening or de-blurring an out-of-focus image, highlighting edges, improving image contrast, brightening an image, removing noise).
– Image Restoration: This may be considered as reversing the damage done to an image by a known cause (removal of blur caused by linear motion, removal of optical distortions).
– Image Segmentation: This involves subdividing an image into constituent parts, or isolating certain aspects of an image (finding lines, circles, or particular shapes in an image; in an aerial photograph, identifying cars, trees, buildings, or roads).
– broad set of operations that process images based on shapes.
– Goal: removal of imperfections in images (binary or grayscale)
– Morphological techniques probe an image with a small shape or template called a structuring element.
– The structuring element is a small binary image, i.e. a small matrix of pixels, each with a value of zero or one
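Probing an image with a structuring element can be illustrated with binary erosion, one common morphological operation, written here in plain Python. The notes do not name a specific operation, so the choice of erosion, the 5x5 image and the 3x3 square element are assumptions; pixels outside the image are treated as background.

```python
def erode(image, element):
    """Binary erosion: a pixel stays 1 only if the structuring element,
    centred on it, fits entirely inside the foreground (1-valued) pixels."""
    rows, cols = len(image), len(image[0])
    centre_r, centre_c = len(element) // 2, len(element[0]) // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            fits = True
            for i, element_row in enumerate(element):
                for j, on in enumerate(element_row):
                    if not on:
                        continue
                    rr, cc = r + i - centre_r, c + j - centre_c
                    inside = 0 <= rr < rows and 0 <= cc < cols
                    if not inside or image[rr][cc] == 0:
                        fits = False
            out[r][c] = 1 if fits else 0
    return out

image = [  # a 3x3 block plus one isolated "noise" pixel in the corner
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
square = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # structuring element

eroded = erode(image, square)
for row in eroded:
    print(row)
```

The isolated pixel disappears because the element never fits around it, while the block shrinks to the positions where the element fits completely.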
– very relative due to Moore’s Law
– What once was considered big data is considered a main-memory problem nowadays
– e.g. Excel: in 2003 max 65000 rows, now max 1 million rows, still …
– Nowadays: Terabyte to Exabyte
– conventional scientific research:
– months to gather data from 100s of cases, weeks to analyze the data and years to publish.
– Example: Iris flower data set by Edgar Anderson and Ronald Fisher from 1936
– on the other end of the scale: Twitter
– average 6000 tweets/sec, 500 million per day or 200 billion per year
– Structured Data, such as CSV or RDBMS
– Semi-structured Data, such as JSON or XML
– Unstructured Data, i.e. text, e-mails, images, video
– an estimated 80% of enterprise data is unstructured
– study by Forrester Research: variety is the biggest challenge in Big Data
– The traditional approach:
– To scale with increasing load, buy more powerful, larger hardware
• from single workstation
• to dedicated db server
• to large massive-parallel database appliance
A single server has limits…
For real Big Data processing, need to scale-out to a cluster of multiple servers (nodes):
– Scan large volumes of data
– Map: Extract some interesting information
– Shuffle and sort intermediate results
– Reduce: aggregate intermediate results
– Generate final output
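The steps above can be sketched as a single-process word count in Python. A real framework distributes the map and reduce calls across many nodes; the function names and the two sample documents are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: extract the interesting information, here one (word, 1) pair per word."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle and sort: group all intermediate values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: aggregate the intermediate values of one key."""
    return key, sum(values)

documents = ["big data is big", "data is everywhere"]   # made-up input

intermediate = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle(intermediate)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(word_counts)
```

Because `map_phase` and `reduce_phase` keep no state between calls, each call could run on a different node, which is what makes this style easy to parallelise.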
– Key idea: provide an abstraction at the point of these two operations (map and reduce)
– Higher-order functions
– Cf. map functions in functional programming languages such as Lisp or Haskell
Pros:
– very flexible due to the user-defined functions
– great scalability because of the FP approach
– easy parallelism due to stateless functions
– fault-tolerance
Cons:
– requires programming skills and functional thinking
– relatively low-level, even filtering has to be coded manually
– complex frameworks
– batch-processing oriented