[Study Notes] Data Mining


Week1

Topic: Data Mining Intro & Data Preprocessing


1. Definition

Data mining is defined as the process of discovering patterns in data

  • The process must be automatic or (more usually) semiautomatic.
  • The patterns discovered must be meaningful in that they lead to some advantage.

2. Why

2.1 Descriptive

  • Characterization and Discrimination
  • The mining of frequent patterns, associations, and correlations

2.2 Predictive

  • Classification and regression
  • Clustering analysis: whereas classification and regression analyze class-labeled (training) data sets, clustering analyzes data objects without consulting class labels.
  • Outlier analysis: a.k.a. anomaly mining

3. Process

  1. Data cleaning: to remove noise and inconsistent data
  2. Data integration: where multiple data sources may be combined
  3. Data selection: where data relevant to the analysis task are retrieved from the database
  4. Data transformation: where data are consolidated into forms appropriate for mining
  5. Data mining: an essential step where intelligent methods are applied to extract data patterns
  6. Pattern evaluation: to identify the truly interesting patterns based on interestingness measures
  7. Knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to users

4. Input and Output

4.1 Input

  • Concept: kinds of things that can be learned
  1. Classification learning: predicting a discrete class
  2. Association learning: detecting associations between features
  3. Clustering: grouping similar instances into clusters
  4. Numeric prediction: predicting a numeric quantity
  • Instance: the individual, independent examples of a concept to be learned
  1. Other names: tuple, case…
  • Attributes: measuring aspects of an instance
  1. Nominal
  2. Ordinal
  3. Interval
  4. Ratio

4.2 Output: Knowledge representation

  • Tables
  • Linear models
  • Trees
  • Rules
  • Classification rules: (if … then …) an alternative to decision trees (e.g., if a < 3 then b)
  • Association rules: classification rules plus support/confidence measures
  1. Support: number of instances predicted correctly
  2. Confidence: number of correct predictions, as a proportion of all instances that the rule applies to
  • Rules with exceptions: (if … then …, except if … then …) add an exception condition
  • Rules involving relations: compare relations between attributes ('>' or '<'), not specific values
  • Instance-based representation: rote learning or lazy learning, e.g., k-NN
  • Clusters

5. Data Preprocessing

5.1 Why

5.1.1 How to measure Data Quality

  • Accuracy: correct or wrong, accurate or not
  • Completeness: not recorded, unavailable …
  • Consistency: some data modified but some not, dangling references, …
  • Timeliness: is the data updated in a timely manner?
  • Believability: how much can the data be trusted to be correct?
  • Interpretability: how easily the data can be understood?

5.2 How

  • Data Cleaning:Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
  • Data Integration:Integration of multiple databases, data cube, or files
  • Data Reduction:
  1. Dimensionality reduction
  2. Numerosity reduction
  3. Data compression
  • Data Transformation and Data Discretization
  1. Normalization
  2. Concept hierarchy generation

6. Data Cleaning

6.1 Issues with Real-World Data

  • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • noisy: containing noise, errors, or outliers
  • inconsistent: containing discrepancies in codes or names
  • intentional (e.g., disguised missing data)

6.2 Noisy Data

Incorrect attribute values may be due to

  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitation
  • inconsistency in naming convention

Other data problems which require data cleaning

  • duplicate records
  • incomplete data
  • inconsistent data

6.3 How to handle noise

  • Binning: first sort the data and partition it into (equal-frequency) bins, then smooth by bin means, bin medians, bin boundaries, etc.


  • Regression:smooth by fitting the data into regression functions
  • Clustering:detect and remove outliers
  • Combined computer and human inspection

7. Data integration

7.1 Why

  • Schema integration:Integrate metadata from different sources(e.g., A.cust-id = B.cust-# = C.customer_id)
  • Entity identification problem:Identify real world entities from multiple data sources, e.g., Bill Clinton = William Clinton
  • Detecting and resolving data value conflicts:For the same real-world entity, attribute values from different sources are different,e.g., metric vs. British units

7.2 How

  • Correlation analysis (nominal data): χ² (chi-square) test
  • Correlation analysis (numeric data): correlation coefficient (Pearson's product-moment coefficient, ranging from −1 to +1)
  • Covariance (numeric data): how two numeric attributes vary together; a quick sketch of all three checks follows below
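A minimal sketch of these three checks in Python (SciPy/NumPy); the contingency table and value arrays below are made-up toy data:

```python
import numpy as np
from scipy.stats import chi2_contingency, pearsonr

# Chi-square test of independence for two nominal attributes,
# using a hypothetical 2x2 contingency table of observed counts.
observed = np.array([[250, 200],
                     [50, 1000]])
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")   # small p => the attributes are correlated

# Pearson correlation coefficient for two numeric attributes (toy values).
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.1, 2.3, 2.9, 4.2, 4.8])
r, _ = pearsonr(x, y)
print(f"Pearson r = {r:.3f}")                    # near +1/-1 => strong correlation, near 0 => none

# Covariance of the same two numeric attributes.
print(f"cov(x, y) = {np.cov(x, y)[0, 1]:.3f}")   # positive => the attributes tend to rise together
```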

8. Data Reduction

  • Wavelet Transform
  • Principal Component Analysis (PCA)
  • Attribute Subset Selection - Heuristic Search
  • Attribute Creation: 1. attribute extraction, 2. mapping data to a new space, 3. attribute construction
  • Numerosity Reduction: 1. parametric methods (e.g., regression), 2. non-parametric methods (histograms, clustering, sampling, …)
  1. Parametric data reduction: linear regression, multiple regression, log-linear models
  2. Histogram analysis: divide data into buckets and store the average (sum) for each bucket
  3. Clustering: partition the data set into clusters based on similarity, and store the cluster representations
  4. Sampling: choose a representative subset of the data (e.g., 1. simple random sampling, 2. cluster sampling, 3. stratified sampling)
  • Data Compression: string compression, audio/video compression, time-sequence compression

9. Data Transformation

9.1 Definition

A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values.

9.2 Methods

  • Smoothing: Remove noise from data
  • Attribute construction - New attributes constructed from the given ones (e.g.: PCA)
  • Aggregation: Summarization
  • Normalization: scale values to fall within a smaller, specified range (e.g., −1 to 1); methods include min-max normalization, z-score normalization, and normalization by decimal scaling (a quick sketch of the first two follows this list)
  • Discretization: concept hierarchy climbing
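A quick sketch of min-max and z-score normalization in plain NumPy (toy values for a hypothetical "income" attribute):

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # toy "income" values

# Min-max normalization: rescale to a new range [new_min, new_max].
new_min, new_max = 0.0, 1.0
min_max = (values - values.min()) / (values.max() - values.min()) * (new_max - new_min) + new_min

# Z-score normalization: subtract the mean, divide by the standard deviation.
z_score = (values - values.mean()) / values.std()

print(min_max)  # all values now fall in [0, 1]
print(z_score)  # mean 0, standard deviation 1
```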

9.3 Data Discretization Methods

9.3.1 Typical methods

All the methods can be applied recursively

  • Binning:  Top-down split, unsupervised
  • Histogram analysis: Top-down split, unsupervised
  • Clustering analysis (unsupervised, top-down split or bottom-up merge)
  • Decision-tree analysis (supervised, top-down split)
  • Correlation (e.g., X2) analysis (unsupervised, bottom-up merge)

9.3.2 Simple Discretization: Binning

  • Equal-width (distance) partitioning
  • Equal-depth (frequency) partitioning

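A small illustration of the two partitioning schemes with pandas (toy price data; 3 bins chosen arbitrarily):

```python
import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equal-width binning: each bin spans the same range of values.
equal_width = pd.cut(prices, bins=3)

# Equal-depth (equal-frequency) binning: each bin holds roughly the same number of points.
equal_depth = pd.qcut(prices, q=3)

# Smoothing by bin means: replace every value by the mean of its (equal-depth) bin.
smoothed = prices.groupby(equal_depth).transform("mean")
print(pd.DataFrame({"price": prices, "bin": equal_depth, "smoothed": smoothed}))
```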

9.4 Concept Hierarchy Generation

  • Specification of a partial/total ordering of attributes at the schema level by users or experts: e.g., street < city < state < country
  • Specification of a hierarchy for a set of values by explicit data grouping: e.g., {Nanjing, Suzhou} < Jiangsu
  • Specification of only a partial set of attributes: e.g., only street < city, not others
  • Automatic generation of hierarchies (or attribute levels) by analyzing the number of distinct values: e.g., for a set of attributes {street, city, state, country}



Week2

Topic: Data Warehousing and OLAP & Classification


1. Data Warehouse Concepts

  • A decision support database that is maintained separately from the organization's operational database.
  • Supports information processing by providing a solid platform of consolidated, historical data for analysis.
  • A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process.

2. Why

2.1 High performance for both systems

  • DBMS: tuned for online transaction processing (OLTP): access methods, indexing, concurrency control, recovery
  • Warehouse: tuned for OLAP: complex OLAP queries, multidimensional view, consolidation

2.2 Different functions and different data

  • missing data: decision support requires historical data which operational DBs do not typically maintain
  • data consolidation: Decision support requires consolidation of data from heterogeneous sources
  • data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled

3. OLTP vs. OLAP

[Figure: OLTP vs. OLAP comparison table]

4. Data Warehouse Architecture

[Figure: data warehouse architecture]

4.1 Extraction, Transformation, and Loading (ETL)

  • Data extraction:get data from multiple, heterogeneous, and external sources
  • Data cleaning:detect errors in the data and rectify them when possible
  • Data transformation:convert data from legacy or host format to warehouse format
  • Load:sort, summarize, consolidate, compute views, check integrity, and build indices and partitions
  • Refresh:propagate the updates from the data sources

4.2 Metadata Repository

Meta data is the data defining warehouse objects. It stores:

  • Description of the structure of the data warehouse: schema, views, dimensions, hierarchies, derived data definitions, data mart locations and contents
  • Operational metadata: data lineage (history of migrated data and the transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)
  • The algorithms used for summarization
  • The mapping from operational environment to the data warehouse
  • Data related to system performance:warehouse schema, view and derived data definitions
  • Business data:business terms and definitions, ownership of data, charging policies

4.3 Three Data Warehouse Models

  1. Enterprise warehouse: collects all of the information about subjects spanning the entire organization
  2. Data mart: a subset of corporate-wide data that is of value to specific groups of users; its scope is confined to specific selected groups, such as a marketing data mart
  3. Virtual warehouse: a set of views over operational databases; only some of the possible summary views may be materialized

5. Data Cube

From Tables and Spreadsheets to Data Cubes

  • A data warehouse is based on a multidimensional data model which views data in the form of a data cube
  • A data cube, such as sales, allows data to be modelled and viewed in multiple dimensions
  1. Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
  2. Fact table contains measures (such as dollars sold) and keys to each of the related dimension tables
  • In data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
[Figure: a 3-D data cube, referred to as a cuboid]
[Figure: a data cube as a lattice of cuboids]

5.1 Conceptual Modeling of Data Warehouses

Modelling data warehouses: dimensions & measures

  • Star schema: A fact table in the middle connected to a set of dimension tables
  • Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake
  • Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
[Figures: star schema, snowflake schema, and fact constellation examples]

5.2 Data Cube Measures 

  • A multidimensional point in the data cube space can be defined by a set of dimension – value pairs. For example, 〈time = “Q1”, location = “Vancouver”, item = “computer”〉
  • A data cube measure is a numeric function that can be evaluated at each point in the data cube space.
  • A measure value is computed for a given point by aggregating the data corresponding to the respective dimension

Three Categories:

  1. Distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning
    E.g., count(), sum(), min(), max()
     
  2. Algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
    E.g., avg(), standard_deviation()
     
  3. Holistic: if there is no constant bound on the storage size needed to describe a sub aggregate.
    E.g., median(), mode(), rank()
[Figure: example]

6. Typical OLAP Operations

  • Roll up (drill-up): summarize data by climbing up a concept hierarchy or by dimension reduction
  • Drill down (roll down): the reverse of roll-up
  • Slice and dice: project and select
  • Pivot (rotate): reorient the cube; visualize a 3-D cube as a series of 2-D planes
  • Drill across: involving (across) more than one fact table
  • Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)


7. Design of Data Warehouse:A Business Analysis Framework

Four views regarding the design of a data warehouse:

  • Top-down view - allows selection of the relevant information necessary for the data warehouse
  • Data source view - exposes the information being captured, stored, and managed by operational systems
  • Data warehouse view - consists of fact tables and dimension tables
  • Business query view - sees the perspectives of data in the warehouse from the view of end-user

7.1 Data Warehouse Design Process

  • Top-down, bottom-up approaches or a combination of both
  1. Top-down: Starts with overall design and planning (mature)
  2. Bottom-up: Starts with experiments and prototypes (rapid)
  • From software engineering point of view
  1. Waterfall: structured and systematic analysis at each step before proceeding to the next
  2. Spiral: rapid generation of increasingly functional systems, with short turnaround times

Typical data warehouse design process:

  1. Choose a business process to model (e.g., orders, invoices, etc.)
  2. Choose the grain (atomic level of data) of the business process (e.g., individual transactions, individual daily snapshots, and so on)
  3. Choose the dimensions that will apply to each fact table record
  4. Choose the measure that will populate each fact table record

7.2 Data Warehouse Usage

  • Information processing:supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graph
  • Analytical processing:supports basic OLAP operations, slice-dice, drilling, pivoting
  • Data mining:supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools

7.3 Online Analytical Mining (OLAM)

  • High quality of data in data warehouses:DW contains integrated, consistent, cleaned data
  • Available information processing structure surrounding data warehouses:Web accessing, service facilities, reporting and OLAP tools
  • OLAP-based exploratory data analysis:Mining with drilling, dicing, pivoting, etc
  • Online selection of data mining functions:Integration and swapping of multiple mining functions, algorithms, and tasks

8. Data Warehouse Implementation

8.1 The “Compute Cube” Operator

The compute cube operator expressed in an SQL-like language:

SELECT item, city, year, SUM (amount)
FROM SALES
CUBE BY item, city, year
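The cube-by syntax above is not standard SQL; as a rough sketch of what it computes, the 2^n group-bys (cuboids) over the three dimensions can be enumerated with pandas (the table and values here are made-up):

```python
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "item":   ["TV", "TV", "phone", "phone"],
    "city":   ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "year":   [2010, 2010, 2011, 2011],
    "amount": [400, 300, 250, 150],
})

dims = ["item", "city", "year"]
for k in range(len(dims), -1, -1):              # from the 3-D base cuboid down to the 0-D apex
    for combo in combinations(dims, k):
        if combo:
            cuboid = sales.groupby(list(combo))["amount"].sum()
        else:
            cuboid = sales["amount"].sum()      # apex cuboid: total over all dimensions
        print(f"cuboid {combo or '()'}:")
        print(cuboid, "\n")
```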

8.2 Indexing OLAP Data: Bitmap Index and Join Index

8.2.1 Bitmap Index

[Figure: bitmap index example]

 8.2.2 Join Index

In data warehouses, a join index relates the values of the dimensions of a star schema to rows in the fact table.


8.3 OLAP Server Architectures

  • Relational OLAP (ROLAP)
  1. Use relational or extended-relational DBMS to store and manage warehouse data, plus OLAP middleware
  2. Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  3. Greater scalability
  • Multidimensional OLAP (MOLAP)
  1. Sparse array-based multidimensional storage engine
  2. Fast indexing to pre-computed summarized data
  • Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)
  1. Flexibility, e.g., low level: relational, high-level: array
  • Specialized SQL servers (e.g., Redbricks)
  1. Specialized support for SQL queries over star/snowflake schemas

9. Data Generalization by Attribute-Oriented Induction

9.1 Example of AOI

Describe general characteristics of graduate students in the University database, given the attributes name, gender, major, birth_place, birth_date, residence, phone# (telephone number), and gpa (grade point average).

Step 1. Fetch relevant set of data using an SQL statement, e.g.,

Select * (i.e., name, gender, major, birth_place, birth_date, residence, phone#, gpa)
from student
where student_status in {"Msc", "MBA", "PhD"}

[Figure: initial working relation]

Step 2. Perform attribute-oriented induction (AOI)

[Figure: generalized relation after AOI]

Step 3. Present results in generalized relation, cross-tab, or rule forms

[Figure: generalized relation presented as a crosstab / rules]

9.2 Basic Principles of AOI

  • Data focusing: task-relevant data, including dimensions, and the result is the initial relation
  • Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A‘s higher level concepts are expressed in terms of other attributes
  • Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A
  • Attribute-threshold control: typical 2-8, specified/default
  • Generalized relation threshold control: control the final relation/rule size

9.3 AOI: Basic Algorithms

  1. InitialRel: Query processing of task-relevant data, deriving the initial relation.
  2. PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize?
  3. PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a “prime generalized relation", accumulating the counts.
  4. Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabulations, visualization presentations.

9.4 AOI vs. Cube-Based OLAP

Similarity:

  • Data generalization
  • Presentation of data summarization at multiple levels of abstraction
  • Interactive drilling, pivoting, slicing and dicing

Differences:

  • OLAP has systematic preprocessing, is query-independent, and can drill down to rather low levels
  • AOI has automated allocation of the desired generalization level, and may perform dimension relevance analysis/ranking when there are many relevant dimensions
  • AOI works on data which are not in relational form

10. Classification: A two-step process

  1. Model construction (Learning): describing a set of predetermined classes
  2. Model usage (Classification): for classifying future or unknown objects

[Figures: model construction and model usage]

11. Further topics (only listed, not covered in detail here)

  • Decision Tree Induction
  • Bayes Classification Methods
  • Rule-Based Classification
  • Model Evaluation and Selection
  • Techniques to Improve Classification Accuracy: Ensemble Methods


Week3

Topic: Clustering & Prediction & Mining patterns, association and correlations


1. Cluster Analysis

  • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
  • Unsupervised learning: no predefined classes (i.e. learning by observations vs. learning by examples: supervised)

The quality of a clustering method depends on:

  1. the similarity measure used by the method
  2. its implementation
  3. Its ability to discover some or all of the hidden patterns

2. Measure the Quality of Clustering

Dissimilarity/Similarity metric

  • Similarity is expressed in terms of a distance function, typically metric: d(i, j)
  • The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical, ordinal, ratio, and vector variables
  • Weights should be associated with different variables based on applications and data semantics

3. Considerations for Cluster Analysis

  • Partitioning criteria: single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
  • Separation of clusters: exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
  • Similarity measure: distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
  • Clustering space: full space (often when low dimensional) vs. subspaces (often in high-dimensional clustering)

4. Major Clustering Approaches(*)

  • Partitioning approach:Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
    Typical methods: k-means, k-medoids, CLARANS
  • Hierarchical approach:Create a hierarchical decomposition of the set of data (or objects) using some criterion
    Typical methods: Diana, Agnes, BIRCH, CAMELEON
  • Density-based approach:Based on connectivity and density functions
    Typical methods: DBSCAN, OPTICS, DenClue
  • Grid-based approach:based on a multiple-level granularity structure
    Typical methods: STING, WaveCluster, CLIQUE
  • Model-based: a model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model
    Typical methods: EM, SOM, COBWEB
  • Frequent pattern-based: based on the analysis of frequent patterns
    Typical methods: p-Cluster
  • User-guided or constraint-based: clustering by considering user-specified or application-specific constraints
    Typical methods: COD (obstacles), constrained clustering
  • Link-based clustering: objects are often linked together in various ways
    Massive links can be used to cluster objects: SimRank, LinkClus

5. Partitioning method

  • Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized, i.e., E = Σᵢ Σ_{p∈Cᵢ} d(p, cᵢ)² (where cᵢ is the centroid or medoid of cluster Cᵢ, and p ranges over the points in Cᵢ)
  • Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  1. Global optimal: exhaustively enumerate all partitions (NP hard)
  2. Heuristic methods: k-means and k-medoids algorithms

5.1 K-Means

The k-means algorithm defines the centroid of a cluster as the mean value of the points within the cluster.

Given k, the k-means algorithm is implemented in four steps:

  1. Randomly selects k of the objects in D
  2. Compute “seed points” as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)
  3. Assign each object to the cluster with the nearest seed point
  4. Go back to Step 2, stop when the assignment does not change
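A minimal sketch of these steps using scikit-learn on toy 2-D points (k = 2 chosen arbitrarily):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],    # toy 2-D points
              [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])

# n_init restarts the algorithm from several random initial centers, because the
# result depends on the initial selection of seed points (see the limits below).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # cluster assignment of each point
print(km.cluster_centers_)  # final centroids (mean point of each cluster)
print(km.inertia_)          # sum of squared distances to the nearest centroid
```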


 Limits:

  • The k-means method is not guaranteed to converge to the global optimum and often terminates at a local optimum. The results may depend on the initial random selection of cluster centers.
  • The time complexity of the k-means algorithm is O(nkt), where n is the total number of objects, k is the number of clusters, and t is the number of iterations. Normally, k ≪ n and t ≪ n. Therefore, the method is relatively scalable and efficient in processing large data sets.
  • Sensitive to noise and outliers: a small number of such data can substantially influence the mean value

5.2 K-Medoids

Instead of taking the mean value of the objects in a cluster as a reference point, the k-medoids method picks actual objects (medoids) to represent the clusters; PAM (Partitioning Around Medoids) iteratively replaces a medoid by a non-medoid object whenever the swap reduces the total cost (absolute error).

[Figures: PAM (k-medoids) algorithm and example]

Improvements:

  • CLARA: Instead of taking the whole data set into consideration, CLARA uses a random sample of the data set. The PAM algorithm is then applied to compute the best medoids from the sample. Ideally, the sample should closely represent the original data set. In many cases, a large sample works well if it is created so that each object has equal probability of being selected into the sample.
  • CLARANS: First, it randomly selects k objects in the data set as the current medoids. It then randomly selects a current medoid x and an object y that is not one of the current medoids. Can replacing x by y improve the absolute-error criterion? If yes, the replacement is made. The set of the current medoids after l such steps is considered a local optimum. CLARANS repeats this randomized process m times and returns the best local optimum as the final result.

6. Hierarchical Methods

Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition.


6.1 AGNES

This is a single-linkage approach in that each cluster is represented by all the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters.

The cluster-merging process repeats until all the objects are eventually merged to form one cluster.

6.2 DIANA

  • All the objects are used to form one initial cluster.
  • The cluster is split according to some principle such as the maximum Euclidean distance between the closest neighboring objects in the cluster.
  • The cluster-splitting process repeats until, eventually, each new cluster contains only a single object.

6.3 Dendrogram

A tree structure called a dendrogram is commonly used to represent the process of hierarchical clustering. It shows how objects are grouped together (in an agglomerative method) or partitioned (in a divisive method) step-by-step.


 6.4 Example

Key Problem: When shall we stop

Four widely used measures for distance between clusters are as follows, where |p − pʹ| is the distance between two objects or points, p and pʹ; mi is the mean for cluster, Ci; and ni is the number of objects in Ci. They are also known as linkage measures.

[Figure: distance measures between clusters]

Example: 

[Figure: original data samples]

[Figures: distance matrix before merging, and after merging with average linkage]

… and so on, until everything is merged into a single cluster

Improvement

  • BIRCH
  • CHAMELEON

6.5 Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)

BIRCH uses the notions of clustering feature to summarize a cluster, and clustering feature tree (CF-tree) to represent a cluster hierarchy.

  • Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
  • Phase 2: BIRCH applies a (selected) clustering algorithm to cluster the leaf nodes of the CF-tree, which removes sparse clusters as outliers and groups dense clusters into larger ones.

6.5.1 clustering feature (CF)

Consider a cluster of n d-dimensional data objects or points. The clustering feature (CF) of the cluster is a 3-D vector summarizing information about the cluster: CF = ⟨n, LS, SS⟩, where LS is the linear sum of the n points and SS is the square sum of the data points.

[Figure: clustering feature example]
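A small NumPy sketch (toy points) of a clustering feature and the statistics it summarizes; SS is taken here as a single scalar (the sum of squared norms), one common convention:

```python
import numpy as np

points = np.array([[3.0, 4.0], [2.0, 6.0], [4.0, 5.0]])   # a toy cluster of n = 3 points

n = len(points)
LS = points.sum(axis=0)      # linear sum of the points (a d-dimensional vector)
SS = (points ** 2).sum()     # square sum of the points (a scalar, by this convention)
CF = (n, LS, SS)             # CFs are additive: merging clusters just adds their CF vectors

centroid = LS / n
radius = np.sqrt(SS / n - (centroid ** 2).sum())   # root-mean-square distance of members to centroid
print(CF, centroid, radius)
```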

6.5.2 CF-tree

CF-tree has two parameters: branching factor, B, and threshold, T

  • The branching factor specifies the maximum number of children per nonleaf node.
  • The threshold parameter specifies the maximum diameter of subclusters stored at the leaf nodes of the tree.


6.5.3 Steps of BIRCH

For each point in the input:

  1. Find closest leaf entry
  2. Add point to leaf entry and update CF
  3. If entry diameter > max_diameter, then split leaf, and possibly parents

Algorithm is O(n)

6.5.4 Concerns

  • Sensitive to insertion order of data points
  • Since we fix the size of leaf nodes, so clusters may not be so natural
  • Clusters tend to be spherical given the radius and diameter measures

To summarize the main pros and cons of BIRCH: its advantages are fast clustering, the ability to identify noise points, and usefulness as a preprocessing step that gives a preliminary partition of the data set. Its main drawbacks: it does not work well for high-dimensional features or non-convex data sets, and the limit on the number of CF entries per node means the resulting clusters may differ from the true class distribution.

6.6 Chameleon: Multiphase Hierarchical Clustering Using Dynamic Modeling

  • Chameleon is a hierarchical clustering algorithm that uses dynamic modeling to determine the similarity between pairs of clusters.
  • Chameleon: based on a k-nearest-neighbor graph
  • Chameleon uses a graph partitioning algorithm to partition the k-nearest-neighbor graph into a large number of relatively small subclusters

7. Density-Based Clustering Methods

Two parameters:

  • Eps (ε): Maximum radius of the neighbourhood
  • MinPts: Minimum number of points in an Eps-neighbourhood of that point

7.1 DBSCAN: Density-Based Clustering Based on Connected Regions with High Density

  • Arbitrarily select a point p
  • Check whether the ε-neighborhood of p contains at least MinPts objects. If not, p is marked as a noise point.
  • Otherwise, a new cluster C is created for p, and all the objects in the ε-neighborhood of p are added to a candidate set, N
  • Iteratively add objects in N to the same cluster. For an object pʹ in N, if the ε-neighborhood of pʹ has at least MinPts objects, those objects are also added to N (density-reachable)
  • Stop when N is empty
  • Continue the process until all points have been processed
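A minimal sketch with scikit-learn's DBSCAN on toy points (the eps and MinPts values are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.2, 1.1], [0.9, 1.3],   # one dense blob
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1],   # another dense blob
              [4.5, 15.0]])                          # an isolated point

# eps plays the role of Eps (neighbourhood radius), min_samples the role of MinPts.
labels = DBSCAN(eps=0.8, min_samples=3).fit_predict(X)
print(labels)   # points labelled -1 are noise; the others get a cluster id
```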

8. Grid-Based Method

A grid-based clustering method takes a space-driven approach by partitioning the embedding space into cells, independent of the distribution of the input objects.

Two Examples:

  • STING:explores statistical information stored in the grid cells
  • CLIQUE: represents a grid- and density-based approach for subspace clustering in a high-dimensional data space.

8.1 STING

STING is a grid-based multiresolution clustering technique in which the embedding spatial area of the input objects is divided into rectangular cells.

  • Each cell at a high level is partitioned to form a number of cells at the next lower level.
  • Statistical information regarding the attributes in each grid cell is precomputed and stored as statistical parameters.
  • The statistical parameters of higher-level cells can easily be computed from the parameters of the lower-level cells.
  • Use a top-down approach to answer spatial data queries
  • Start from a pre-selected layer, typically one with a small number of cells
  • For each cell in the current level compute the confidence interval

Advantages:

  • Query-independent, easy to parallelize, incremental update
  • O(K), where K is the number of grid cells at the lowest level

Disadvantages:

  • All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected

8.2 CLIQUE (Clustering In QUEst)

CLIQUE (CLustering In QUEst) is a simple grid-based method for finding density- based clusters in subspaces.

Two Steps:

  1. In the first step, CLIQUE partitions the d-dimensional data space into nonoverlapping rectangular units, identifying the dense units among these. CLIQUE finds dense cells in all of the subspaces.
  2. In the second step, CLIQUE uses the dense cells in each subspace to assemble clusters, which can be of arbitrary shape.

9. Evaluation of Clustering

9.1 Determine the Number of Clusters

  • Simple Method: set the number of clusters to about √n/2 for a data set of n points.
  • Elbow method:Use the turning point in the curve of sum of within cluster variance w.r.t the # of clusters

Recall that k-means minimizes the squared error between samples and their centroids; the sum of squared distances between each cluster's centroid and its member points is called the distortion. For a single cluster, a lower distortion means its members are packed more tightly, while a higher distortion means a looser internal structure. Distortion decreases as the number of clusters grows, but for data with some degree of separation it improves sharply up to a critical point and only slowly afterwards; that critical point (the "elbow") can be taken as a good number of clusters. (A small sketch follows after this list.)

  • Cross Validation
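A small sketch of the elbow method, using the k-means inertia (sum of within-cluster squared distances) as the distortion, on made-up data with three clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])

# Print (or plot) the distortion for increasing k; the "elbow" where the curve
# stops improving sharply (around k = 3 here) suggests a good number of clusters.
for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))
```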

9.2 Clustering Quality

  • Extrinsic methods: compare the clustering against the ground truth and measure how well they match
  • Intrinsic methods: evaluate the goodness of a clustering by considering how well the clusters are separated

10. More Prediction

  • Numeric Prediction
  • Linear Regression
  1. Linear regression
  2. Non-linear regression
  3. Classification by regression
  • Logistic Regression
  1. Logit transformation
  • Support Vector Machines
  1. Margin and support vectors
  2. Linearly separable case
  3. Linearly inseparable case

11. Data mining tasks

  • Association, correlation, and causality analysis
  • Sequential, structural (e.g., sub-graph) patterns
  • Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  • Classification: discriminative, frequent pattern analysis
  • Cluster analysis: frequent pattern-based clustering
  • Data warehousing: iceberg cube and cube-gradient
  • Semantic data compression: fascicles
  • Broad applications

12. Frequent Patterns

  • Itemset: a set of one or more items
  • k-itemset X = {x1, …, xk}: an itemset that contains k items
  • (Absolute) support, or support count, of X: the frequency or number of occurrences of an itemset X (e.g., 3)
  • (Relative) support, s: the fraction of transactions that contain X, i.e., the probability that a transaction contains X (e.g., 3/5 = 0.6)
  • An itemset X is frequent if X's relative support is no less than a minimum support threshold

13. Association Rules

Find all the rules X --> Y with minimum support and confidence

  • support, s: the probability that a transaction contains X ∪ Y
  • confidence, c: the conditional probability that a transaction in D containing X also contains Y, i.e., P(Y|X)

Example:

[Figures: transaction data and association rule example]
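Since the example screenshots are not reproduced here, a tiny worked sketch on made-up transactions (chosen so that {beer, diaper} has a support count of 3 out of 5 transactions, i.e., 0.6):

```python
# Five toy transactions (made-up data).
transactions = [
    {"beer", "nuts", "diaper"},
    {"beer", "coffee", "diaper"},
    {"beer", "diaper", "eggs"},
    {"nuts", "eggs", "milk"},
    {"nuts", "coffee", "diaper", "eggs", "milk"},
]

def support(itemset):
    """Relative support: fraction of transactions containing every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"beer"}, {"diaper"}
s = support(X | Y)                # support of the rule X -> Y
c = support(X | Y) / support(X)   # confidence = P(Y | X)
print(f"support = {s:.2f}, confidence = {c:.2f}")   # 0.60 and 1.00 for this toy data
```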

14. Closed Patterns and Max-itemsets

  • An itemset X is closed in a data set D if there exists no proper super-itemset Y such that Y has the same support count as X in D
  • An itemset X is a maximal frequent itemset (or max-itemset) in a data set D if X is frequent, and there exists no super-itemset Y such that X ⊂Y and Y is frequent in D.
  • Closed pattern is a lossless compression of freq. patterns
[Figure: example of closed and maximal itemsets]

15. Frequent Itemset Mining Method

• Apriori • FP-growth • CLOSET

15.1 Apriori: Finding Frequent Itemsets by Confined Candidate Generation

  • Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
  • Apriori Pruning principle (antimonotonicity): if a set cannot pass a test (infrequent), all of its supersets will fail the same test as well

Steps:

  1. Initially, scan DB once to get frequent 1-itemset
  2. self-joining: Generate length (k+1) candidate itemsets from length k frequent itemsets
  3. Pruning: Test the candidates against DB
  4. Terminate when no frequent or candidate set can be generated
[Figure: Apriori example with minimum support = 2]
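A compact from-scratch sketch of these steps (not an optimized implementation), with a toy database and an absolute minimum support of 2, mirroring the kind of example in the figure:

```python
from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2   # absolute support count

def frequent_among(candidates):
    """Keep only candidates whose support count in the database is at least min_sup."""
    return {c: n for c in candidates
            if (n := sum(c <= t for t in transactions)) >= min_sup}

# Step 1: scan the DB once to get the frequent 1-itemsets.
frequent = frequent_among({frozenset([i]) for t in transactions for i in t})
all_frequent, k = dict(frequent), 1

while frequent:
    # Step 2 (self-join): build (k+1)-candidates from pairs of frequent k-itemsets,
    # then prune any candidate that has an infrequent k-subset (Apriori property).
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k))}
    # Step 3: scan the DB to test the surviving candidates; stop when nothing is frequent.
    frequent = frequent_among(candidates)
    all_frequent.update(frequent)
    k += 1

for itemset, n in sorted(all_frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), n)
```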

Improving Apriori:

  • Scan Database only twice:Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  • Reduce the Number of Candidates
  • Sampling for Frequent Patterns

15.2 FP-growth: A Pattern-Growth Approach for Mining Frequent Itemsets

  • Depth-first search
  • Grow long patterns from short ones using local frequent items only

Steps:

  1. Scan DB once, find frequent 1-itemset
  2. Sort frequent items in frequency descending order, f-list
  3. Scan DB again, construct FP-tree
  4. Starting at the frequent item header table in the FP-tree
  5. Traverse the FP-tree by following the link of each frequent item p
  6. Accumulate all of transformed prefix paths of item p to form p’s conditional pattern base
[Figure: f-list and FP-tree]

[Figure: conditional pattern bases]
  • Completeness:Preserve complete information for frequent pattern mining
  • Compactness:Items in frequency descending order, the more frequently occurring, the more likely to be shared

15.3 Mining Closed and Max Patterns: CLOSET

  • Flist: list of all frequent items in support ascending order
  • Divide search space:Patterns having d but no a, etc.
  • Find frequent closed pattern recursively

16. Correlation Rules

A correlation rule is measured not only by its support and confidence but also by the correlation between itemsets A and B

Lift can be used as a simple correlation measure (the higher the lift, the stronger the positive correlation)

【学习笔记】Data Mining_第38张图片

16.1 Correlation Rules: Lift

[Figure: definition of lift]
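For reference (the formula itself appears only in the screenshot), lift is usually defined as lift(A, B) = P(A ∪ B) / (P(A) · P(B)) = confidence(A ⇒ B) / support(B); a value above 1 indicates positive correlation between A and B, which is consistent with the three cases listed below.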

When lift > 1, A and B are positively correlated, and the larger the value, the stronger the positive correlation.
When lift < 1, A and B are negatively correlated, and the smaller the value, the stronger the negative correlation (lift is defined to be greater than or equal to zero).
When lift = 1, A and B are not correlated.


Week4

Topic: Mining Complex Data Types & Advanced Topics & Data Mining Trends and Research Frontiers


1. Different data types

[Figure: different data types]

2. Mining Time Series Data

Methods for time series analyses

  • Frequency-domain methods: Model-free analyses, well-suited to exploratory investigations - spectral analysis vs. wavelet analysis
  • Time-domain methods: Auto-correlation and cross-correlation analysis
  • Motif-based time-series analysis

2.1 Regression Analysis

  • Linear and multiple regression
  • Non-linear regression
  • Generalized linear model, Poisson regression, log-linear models
  • Regression trees:proposed in CART system
  • Model tree: Each leaf holds a regression model—a multivariate linear equation for the predicted attribute

2.2 Trend Analysis

Categories of Time-Series Movements:

  • Long-term or trend movements (Trend Curve) (T): general direction in which a time series is moving over a long interval of time
  • Cyclic movements or cycle variations (C): long term oscillations about a trend line or curve - e.g., business cycles, may or may not be periodic
  • Seasonal movements or seasonal variations (S) - i.e, almost identical patterns that a time series appears to follow during corresponding months of successive years.
  • Irregular or random movements (I)

2.3 Similarity Search in Time Series Data

  • Whole matching: find a sequence that is similar to the query sequence
  • Subsequence matching: find all pairs of similar subsequences

2.3.1 discrete Fourier transform (DFT)

【学习笔记】Data Mining_第41张图片

The Euclidean distance between two signals in the time domain is the same as their distance in the frequency domain (Parseval's theorem).

2.3.2 discrete wavelet transform (DWT)

2.4 Motif-Based Search and Mining in Time Series Data

motif: A previously unknown, frequently occurring sequential pattern

2.4.1 SAX: Symbolic Aggregate approXimation

A symbolic aggregate approximation method for time series.

  • Essentially an alphabet over the Piecewise Aggregate Approximation (PAA) rank
  • Experiments show this approach is fast and simple, and has comparable search quality to that of DFT, DWT, and other dimensionality reduction methods.


Parameters: alphabet size, word (segment) length (or output rate)

  1. Select probability distribution for TS
  2. z-score Normalize TS
  3. PAA: Within each time interval, calculate aggregated value (mean) of the segment
  4. Partition TS range by equal-area partitioning
  5. Label each segment with a_rank ∈ Σ, the letter corresponding to the partition rank its aggregate falls into
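A rough sketch of the PAA + SAX steps in NumPy (Gaussian breakpoints via SciPy; the series, segment count, and alphabet size are arbitrary):

```python
import numpy as np
from scipy.stats import norm

ts = np.array([2.0, 2.1, 2.4, 3.0, 3.5, 3.2, 2.0, 1.0, 0.8, 0.7, 1.5, 2.2])
n_segments, alphabet = 4, "abcd"

# Steps 1-2: z-normalize the series (assumes values are roughly Gaussian).
z = (ts - ts.mean()) / ts.std()

# Step 3 (PAA): mean of each equal-length segment.
paa = z.reshape(n_segments, -1).mean(axis=1)

# Step 4: equal-area (equal-probability) breakpoints under the standard normal distribution.
breakpoints = norm.ppf(np.linspace(0, 1, len(alphabet) + 1)[1:-1])

# Step 5: label each segment with the letter of the partition its mean falls into.
word = "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)
print(word)   # "cdab" for this toy series
```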

Properties of SAX:

  1. It performs dimensionality reduction.
  2. A distance measure can be defined on the symbolic representation, and it lower-bounds the true distance.
  3. It compresses the data.
  4. SAX preserves the overall shape of the original time series. Because SAX is a symbolic representation, the alphabet can be stored as bits rather than double-precision floats, saving a large amount of space.

3. Mining Graphs and Networks

  • Graph (G) definition: a set of nodes joined by a set of lines (undirected graph) or arrows (directed graph)
  • Vertices: represent objects of interest, connected by edges
  • Edges: represented by lines or arcs connecting the vertices

3.1 Networks

  • Graph Pattern Mining: frequent subgraph patterns, closed graph patterns, gSpan vs. CloseGraph
  • Statistical Modeling of Networks: small world phenomenon, power law (long-tail) distribution, densification
  • Clustering and Classification of Graphs and Homogeneous Networks
  • Clustering, Ranking and Classification of Heterogeneous Networks
  • Role Discovery and Link Prediction in Information Networks: PathPredict
  • Similarity Search and OLAP in Information Networks: PathSim, GraphCube
  • Evolution of Social and Information Networks: EvoNetClus

 4. Advanced Pattern Mining

[Figure: advanced pattern mining topics]

4.1 Mining in Multilevel Association Rule

  • Flexible min-support thresholds: Some items are more valuable but less frequent
  • Redundancy Filtering: Some rules may be redundant due to "ancestor" relationships between items
  • A rule is redundant if its support is close to the “expected” value, based on the rule's ancestor

4.2 Mining Multidimensional Association

  • Categorical Attributes: finite number of possible values, no ordering among values ---- data cube approach 
  • Quantitative Attributes: Numeric, implicit ordering among values ---- discretization, clustering, and gradient approaches

4.3 Negative and Rare Patterns

  • Rare patterns: Very low support but interesting
  • Negatively correlated patterns that are infrequent tend to be more interesting than those that are frequent


  • Null transactions: transactions that contain neither A nor B (¬A ∩ ¬B)
  • Null invariance means that a measure's value does not change as the number of null transactions changes.

5. Advanced Classification Methods

  • Classification by Backpropagation: neural networks
  • Lazy Learners: instance-based methods
  • Instance-based methods: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified

Typical approaches of Lazy Learners:

  • k-nearest neighbor approach
  • Locally weighted regression
  • Case-based reasoning (CBR)

7. Other Methodologies of Data Mining

[Figure: other data mining methodologies]

7.1 Statistical Data Mining

  • Factor analysis:determine which variables are combined to generate a given factor
  • Discriminant analysis:predict a categorical response variable, commonly used in social science
  • Time Series
  • Quality Control:displays group summary charts
  • Survival Analysis:predicts the probability that a patient undergoing a medical treatment would survive at least to time t (life span prediction)

7.2 Data Mining Result Visualisation

  • Scatter plots and boxplots (obtained from descriptive data mining)
  • Decision trees
  • Association rules
  • Clusters
  • Outliers
  • Generalized rules

8. Privacy, Security and Social Impacts of Data Mining

The real privacy concern: unconstrained access of individual records, especially privacy-sensitive information

  • Method 1: Removing sensitive IDs associated with the data
  • Method 2: Data security-enhancing methods
    Multi-level security model: permits access only to the authorized level
    Encryption: e.g., blind signatures, biometric encryption, and anonymous databases (personal information is encrypted and stored at different locations)
  • Method 3: Privacy-preserving data mining methods


Week5

Topic: Introduction to Data Mining: NLP


1. Applications of NLP

  • Automate routine tasks: Chatbots powered by NLP can process a large number of routine tasks that are handled by human agents today, freeing up employees to work on more challenging and interesting tasks.
  • Improve search: NLP can improve on keyword-matching search for document and FAQ retrieval by disambiguating word senses based on context (for example, “carrier” means something different in biomedical and industrial contexts), matching synonyms (for example, retrieving documents mentioning “car” given a search for “automobile”), and taking morphological variation into account (which is important for non-English queries). Effective NLP-powered academic search systems can dramatically improve access to relevant cutting-edge research for doctors, lawyers, and other specialists.
  • Search engine optimization: NLP is a great tool for getting your business ranked higher in online search by analyzing searches to optimize your content.
  • Analyzing and organizing large document collections: NLP techniques such as document clustering and topic modeling simplify the task of understanding the diversity of content in large document collections, such as corporate reports, news articles, or scientific documents. These techniques are often used for legal discovery purposes.
  • Social media analytics: NLP can analyze customer reviews and social media comments to make better sense of huge volumes of information. Sentiment analysis identifies positive and negative comments in a stream of social-media comments, providing a direct measure of customer sentiment in real time.
  • Market insights: With NLP working to analyze the language of your business’ customers, you’ll have a better handle on what they want, and also a better idea of how to communicate with them. Aspect-oriented sentiment analysis detects the sentiment associated with specific aspects or products in social media (for example, “the keyboard is great, but the screen is too dim”), providing directly actionable information for product design and marketing. 
  • Moderating content: If your business attracts large amounts of user or customer comments, NLP enables you to moderate what’s being said in order to maintain quality and civility by analyzing not only the words, but also the tone and intent of comments

2. Bag Of Words

Bag-of-words models treat documents as unordered collections of tokens or words (a bag is like a set, except that it tracks the number of times each element appears).

Bag-of-words models are often used for efficiency reasons on large information retrieval tasks such as search engines. They can produce close to state-of-the-art results with longer documents.

Nevertheless, it suffers from some shortcomings, such as:

  • Vocabulary: The vocabulary requires careful design, most specifically in order to manage the size, which impacts the sparsity of the document representations.
  • Sparsity: Sparse representations are harder to model both for computational reasons (space and time complexity) and also for information reasons, where the challenge is for the models to harness so little information in such a large representational space.
  • Meaning: Discarding word order ignores the context, and in turn meaning of words in the document (semantics). Context and meaning can offer a lot to the model, that if modeled could tell the difference between the same words differently arranged (“this is interesting” vs “is this interesting”), synonyms (“old bike” vs “used bike”), and much more.
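A minimal bag-of-words sketch with scikit-learn's CountVectorizer on toy sentences, which also shows the loss of word order mentioned above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this is interesting", "is this interesting", "old bike for sale"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)           # sparse document-term count matrix

print(vectorizer.get_feature_names_out())    # the learned vocabulary
print(X.toarray())
# The first two rows are identical: the order "this is" vs. "is this" is discarded.
```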

3. NLP Preprocessing

  • Tokenization
  • Stop word removal
  • Stemming and lemmatization
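A small sketch of these three steps with NLTK (the toy sentence is made up, and the corpora noted in the comment must be downloaded once):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time setup: nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")
text = "The studies are studying the behaviour of running systems."

tokens = nltk.word_tokenize(text.lower())                       # tokenization
stop = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop]   # stop word removal

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])                    # crude suffix stripping, e.g. "studi"
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])   # dictionary-based, e.g. "study", "run"
```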

4. Term Frequency-Inverse Document Frequency (TF-IDF)
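TF-IDF weights a term by its frequency in a document, discounted by how many documents contain it, so very common terms get low weights; a minimal sketch with scikit-learn on a toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "data mining finds patterns in data",
    "data warehousing supports OLAP",
    "mining frequent patterns with Apriori",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)   # TF-IDF weighted document-term matrix

# "data" occurs in most documents, so its IDF (and hence its TF-IDF weight) is low;
# rarer terms such as "apriori" or "olap" get higher weights in their documents.
for term, idf in zip(tfidf.get_feature_names_out(), tfidf.idf_):
    print(f"{term:12s} idf = {idf:.2f}")
```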

5. Topic Modelling

  • Latent Semantic Analysis (LSA)
  • Latent Dirichlet Allocation (LDA)

6. Word Embedding

  • Word2vec
  • GloVe (Global Vectors for Word Representation)

7. Sentiment analysis

  • Aspect-based sentiment analysis (ABSA)
  • Fine-grained sentiment analysis

Further reading

Clustering algorithms (BIRCH) — 整得咔咔响's blog on CSDN


Course material is copyright-restricted; reposting is prohibited (probably).
