Web Data Mining

Web mining is the application of data mining techniques to automatically discover and extract information from Web documents and services. Its main purpose is to discover useful information from the World Wide Web and its usage patterns.

Web Data Mining, from the book Data Mining: Concepts and Techniques

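For news, advertising, consumer information, financial management, education, government administration, and e-commerce, the World Wide Web is a huge, widely distributed global information center. It holds rich, dynamic information, including page content with hypertext structure and multimedia, hyperlink information, and access and usage information, which makes it a rich resource for data mining. Web mining is the application of data mining techniques to discover patterns, structures, and knowledge from the Web. Depending on the analysis target, Web mining can be divided into three main areas: Web content mining, Web structure mining, and Web usage mining.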

Web Data Collection: Types and Usage Analysis

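Web data collection means obtaining data from websites through web crawlers or a site's public APIs. This approach extracts unstructured data from web pages and stores it in a structured form as uniform local data files. It supports collecting images, audio, video, and other files or attachments, and attachments can be automatically associated with the body text.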
Basic principles of web crawlers (part 1)
Basic principles of web crawlers (part 2)
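
As a rough illustration of turning page markup into a structured local record, the sketch below parses a small hard-coded HTML string with Python's standard `html.parser` module and serializes the result as JSON. The sample page, the `PageCollector` class, and the field names are all hypothetical. A real crawler would also have to fetch pages over HTTP, respect robots.txt, and handle errors.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample page; a real crawler would download this over HTTP.
SAMPLE_HTML = """
<html><head><title>Example News</title></head>
<body>
  <a href="/a.html">First story</a>
  <a href="/b.html">Second story</a>
  <img src="/img/photo.png">
</body></html>
"""


class PageCollector(HTMLParser):
    """Collects the title, hyperlinks, and image attachments of one page."""

    def __init__(self):
        super().__init__()
        self.record = {"title": "", "links": [], "attachments": []}
        self._in_title = False
        self._current_link = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "a" and "href" in attrs:
            self._current_link = {"url": attrs["href"], "text": ""}
        elif tag == "img" and "src" in attrs:
            # Keep attachments in the same record as the body content.
            self.record["attachments"].append(attrs["src"])

    def handle_data(self, data):
        if self._in_title:
            self.record["title"] += data
        elif self._current_link is not None:
            self._current_link["text"] += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._current_link is not None:
            self.record["links"].append(self._current_link)
            self._current_link = None


def extract_record(html):
    """Parse one HTML document into a structured dictionary."""
    parser = PageCollector()
    parser.feed(html)
    parser.close()
    return parser.record


record = extract_record(SAMPLE_HTML)
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Storing the attachment URLs inside the same record as the title and links is one simple way to keep attachments associated with the body text.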

Data Extraction

Big data processing workflow: data extraction, storage, and retrieval

What is Data Extraction?

Data extraction is the process of retrieving data from various sources. Companies frequently extract data in order to process it further, migrate it to a data repository (such as a data warehouse or a data lake), or analyze it. It is common to transform the data as part of this process. For example, you might perform calculations on the data, such as aggregating sales data, and store those results in the data warehouse. If you are extracting the data to store it in a data warehouse, you might add metadata or enrich the data with timestamps or geolocation data. Finally, you will likely want to combine the data with other data in the target data store. Together these steps are called ETL: extraction, transformation, and loading. Extraction is the first key step in this process.
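
The three ETL steps can be sketched in a few lines of Python. This is only an illustration under assumptions: the source is a hard-coded list of hypothetical sales rows, an in-memory SQLite database stands in for the data warehouse, and the table name `sales_by_region` is made up.

```python
import sqlite3
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical source rows: (region, sale amount).
SOURCE_ROWS = [("north", 120.0), ("south", 80.0), ("north", 30.5)]


def extract():
    """Extraction: read raw rows from the source (a list in this sketch)."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transformation: aggregate sales per region and add a load timestamp."""
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    loaded_at = datetime.now(timezone.utc).isoformat()
    return [(region, total, loaded_at) for region, total in sorted(totals.items())]


def load(conn, rows):
    """Loading: write the transformed rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_region "
        "(region TEXT, total REAL, loaded_at TEXT)"
    )
    conn.executemany("INSERT INTO sales_by_region VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a data warehouse
load(warehouse, transform(extract()))
result = warehouse.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region"
).fetchall()
print(result)
```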

Structured data

If the data is structured, the data extraction process is generally performed within the source system. It’s common to perform data extraction using one of the following methods:

Full extraction. Data is completely extracted from the source, and there is no need to track changes. The logic is simpler, but the system load is greater.

Incremental extraction. Changes in the source data are tracked since the last successful extraction so that you do not go through the process of extracting all the data each time there is a change. To do this, you might create a change table to track changes, or check timestamps. Some data warehouses have change data capture (CDC) functionality built in. The logic for incremental extraction is more complex, but the system load is reduced.
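
To make the difference concrete, here is a minimal sketch of both methods against an in-memory SQLite table. The `customers` table, its `updated_at` column, and the saved watermark value are assumptions for illustration; the incremental variant uses the timestamp check mentioned above rather than a change table or built-in CDC.

```python
import sqlite3


def full_extract(conn):
    """Full extraction: read every row, with no change tracking."""
    return conn.execute(
        "SELECT id, name, updated_at FROM customers ORDER BY id"
    ).fetchall()


def incremental_extract(conn, last_extracted_at):
    """Incremental extraction: read only rows changed since the last run."""
    return conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY id",
        (last_extracted_at,),
    ).fetchall()


source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Ada", "2024-01-01T00:00:00"),
        (2, "Lin", "2024-01-05T00:00:00"),
        (3, "Omar", "2024-01-09T00:00:00"),
    ],
)

all_rows = full_extract(source)
# Watermark saved after the previous successful extraction (assumed value).
changed_rows = incremental_extract(source, "2024-01-04T00:00:00")
print(len(all_rows), len(changed_rows))
```

The timestamps are stored as ISO 8601 strings in a single format, so plain string comparison orders them correctly. After each successful run the caller would need to persist the new watermark, which is the extra logic that makes incremental extraction more complex.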

Unstructured data

When you work with unstructured data, a large part of your task is to prepare the data in such a way that it can be extracted. Most likely, you will store it in a data lake until you plan to extract it for analysis or migration. You'll probably want to clean up "noise" from your data by doing things like removing whitespace and symbols, removing duplicate results, and determining how to handle missing values.
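
A small sketch of those cleanup steps, using only the standard library. The sample records and the `<missing>` placeholder are hypothetical, and replacing missing values with a placeholder is just one possible policy.

```python
import re

# Hypothetical raw records; None and "" both represent missing values.
raw = ["  Hello,   World!! ", "hello world", None, "Data***Lake\t2024", ""]


def clean(text):
    """Drop symbols, collapse whitespace, and lowercase one record."""
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation and symbols
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip().lower()


def prepare(records, missing="<missing>"):
    """Clean every record, apply the missing-value policy, drop duplicates."""
    seen = set()
    result = []
    for item in records:
        value = clean(item) if item else missing
        if value in seen:  # skip duplicate results, keeping the first one
            continue
        seen.add(value)
        result.append(value)
    return result


cleaned = prepare(raw)
print(cleaned)
```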

Web Data Extraction, from the book Web Data Mining

Manual approach
Wrapper induction
Automatic extraction
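
To give a feel for wrapper induction, the sketch below learns a very simple extraction rule from labeled example pages: the text that always precedes the target value and the text that always follows it. This left/right delimiter scheme is a deliberately simplified stand-in chosen for illustration, not necessarily the formulation used in the book, and the example pages and function names are hypothetical.

```python
import os


def common_prefix(strings):
    """Longest string that every input starts with (character-wise)."""
    return os.path.commonprefix(strings)


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_wrapper(examples):
    """Learn (left, right) delimiters from (page, labeled value) pairs."""
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)  # assumes the value occurs in the page
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    return common_suffix(lefts), common_prefix(rights)


def apply_wrapper(wrapper, page):
    """Extract the text between the learned delimiters from a new page."""
    left, right = wrapper
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]


examples = [
    ("<li>Name: <b>Kettle</b> Price: <i>$25</i></li>", "$25"),
    ("<li>Name: <b>Toaster</b> Price: <i>$40</i></li>", "$40"),
]
wrapper = induce_wrapper(examples)
print(wrapper)
print(apply_wrapper(wrapper, "<li>Name: <b>Lamp</b> Price: <i>$12</i></li>"))
```

In the manual approach a person would write the two delimiters by hand; here they are derived from the labeled examples, which is the core idea of induction.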

Data Mining, from the book Data Mining: Concepts and Techniques

Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, knowledge presentation

Data Mining vs Data Extraction

Data mining relies on mathematical methods to reveal patterns or trends. Data extraction relies on programming languages or data extraction tools to crawl data sources.
The purpose of data mining is to find facts that were previously unknown or ignored, while data extraction deals with existing information.

Information Extraction

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent work in multimedia document processing, such as automatic annotation and content extraction from images, audio, video, and documents, can also be seen as information extraction.
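
As a toy illustration of producing a structured record from free text, the sketch below pulls email addresses and ISO-style dates out of a sentence with regular expressions. The sample text and both patterns are assumptions made for this example. Real IE systems usually rely on NLP models rather than hand-written patterns, and these patterns are intentionally loose.

```python
import re

# Hypothetical input text.
TEXT = (
    "Contact Alice Zhang at alice@example.com before 2024-03-15. "
    "Bob Lee (bob.lee@example.org) joins on 2024-04-01."
)

# Simplified patterns: good enough for the sample, not full validators.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_fields(text):
    """Return a structured record of the emails and dates found in text."""
    return {"emails": EMAIL.findall(text), "dates": DATE.findall(text)}


record = extract_fields(TEXT)
print(record)
```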

Content Analysis

Content analysis is a research tool used to determine the presence of certain words, themes, or concepts within some given qualitative data (i.e. text). Using content analysis, researchers can quantify and analyze the presence, meanings and relationships of such certain words, themes, or concepts.
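
Quantifying the presence of concepts can be sketched as a dictionary-based count. The codebook, the sample documents, and the function name below are hypothetical, and counting keyword matches is only the simplest form of coding; it says nothing about meanings or relationships.

```python
import re
from collections import Counter

# Hypothetical coding scheme: concept -> words that signal it.
CODEBOOK = {
    "economy": {"price", "prices", "market", "jobs"},
    "health": {"hospital", "vaccine", "doctor"},
}

# Hypothetical qualitative data to be coded.
DOCUMENTS = [
    "Prices rose as the market reacted to new jobs data.",
    "The hospital said the vaccine supply is stable.",
    "Doctor visits fell while prices stayed flat.",
]


def code_document(text):
    """Count how many words in text fall under each concept."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for concept, vocabulary in CODEBOOK.items():
        counts[concept] = sum(1 for word in words if word in vocabulary)
    return counts


totals = Counter()
for doc in DOCUMENTS:
    totals.update(code_document(doc))  # add per-document counts
print(dict(totals))
```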

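Applied to mass communication, content analysis is a research method that gives an objective, systematic, and quantitative description of the content of media such as books, magazines, films, radio, and television. Its aim is to convert documents expressed in language rather than numbers into quantitative data, and to describe the results of the analysis with statistics.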

Web Mining and Content Analysis: Notes on Data, Entity, Event, and Relation Extraction
