皓哥好运来

《斯坦福数据挖掘教程·第三版》读书笔记（英文版）Chapter 3 Finding Similar Items

来源：《斯坦福数据挖掘教程·第三版》对应的公开英文书和PPT

It is therefore a pleasant surprise to learn of a family of techniques called locality-sensitive hashing, or LSH, that allows us to focus on pairs that are likely to be similar, without having to look at all pairs. Thus, it is possible that we can avoid the quadratic growth in computation time that is required by the naive algorithm. There is usually a downside to locality-sensitive hashing, due to the presence of false negatives, that is, pairs of items that are similar, yet are not included in the set of pairs that we examine, but by careful tuning we can reduce the fraction of false negatives by increasing the number of pairs we consider.

The general idea behind LSH is that we hash items using many different hash functions. These hash functions are not the conventional sort of hash functions. Rather, they are carefully designed to have the property that pairs are much more likely to wind up in the same bucket of a hash function if the items are similar than if they are not similar. We then can examine only the candidate pairs, which are pairs of items that wind up in the same bucket for at least one of the hash functions.

The Jaccard similarity of sets S and T is $∣ S \cap T ∣/∣ S \cup T ∣$ , that is, the ratio of the size of the intersection of S and T to the size of their union. We shall denote the Jaccard similarity of S and T by $S I M (S, T)$ .

Shingling of Documents

A document is a string of characters. Define a k-shingle for a document to be any substring of length k found within the document. Then, we may associate with each document the set of k-shingles that appear one or more times within that document.

Suppose our document D is the string abcdabd, and we pick $k = 2$ . Then the set of 2-shingles for D is ${ab, bc, cd, da, bd\}$ .

Note that the substring ab appears twice within D, but appears only once as a shingle. A variation of shingling produces a bag, rather than a set, so each shingle would appear in the result as many times as it appears in the document. However, we shall not use bags of shingles here.

k should be picked large enough that the probability of any given shingle appearing in any given document is low.

Instead of using substrings directly as shingles, we can pick a hash function that maps strings of length k to some number of buckets and treat the resulting bucket number as the shingle. The set representing a document is then the set of integers that are bucket numbers of one or more k-shingles that appear in the document. For instance, we could construct the set of 9-shingles for a document and then map each of those 9-shingles to a bucket number in the range 0 to $2^{32} − 1$ . Thus, each shingle is represented by four bytes instead of nine. Not only has the data been compacted, but we can now manipulate (hashed) shingles by single-word machine operations.

Our goal in this section is to replace large sets by much smaller representations called “signatures.” The important property we need for signatures is that we can compare the signatures of two sets and estimate the Jaccard similarity of the underlying sets from the signatures alone.

The signatures we desire to construct for sets are composed of the results of a large number of calculations, say several hundred, each of which is a “minhash” of the characteristic matrix.

To minhash a set represented by a column of the characteristic matrix, pick a permutation of the rows. The minhash value of any column is the number of the first row, in the permuted order, in which the column has a 1.

In this matrix, we can read off the values of h by scanning from the top until we come to a 1.

The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.

Call the minhash functions determined by these permutations $h_1, h_2, . . . , h_n$ . From the column representing set S, construct the minhash signature for S, the
vector $h_1(S), h_2(S), . . . , h_n(S)]$ . We normally represent this list of hash-values as a column. Thus, we can form from matrix M a signature matrix, in which the ith column of M is replaced by the minhash signature for (the set of) the ith column.

Moreover, the more minhashings we use, i.e., the more rows in the signature matrix, the smaller the expected error in the estimate of the Jaccard similarity will be.

Fortunately, it is possible to simulate the effect of a random permutation by a random hash function that maps row numbers to as many buckets as there are rows. A hash function that maps integers $0, 1, ..., k - 1$ to bucket numbers 0 through k−1 typically will map some pairs of integers to the same bucket and leave other buckets unfilled.

However, the difference is unimportant as long as k is large and there are not too many collisions. We can maintain the fiction that our hash function h “permutes” row r to position $h (r)$ in the permuted order.
Thus, instead of picking n random permutations of rows, we pick n randomly chosen hash functions $h_1, h_2, . . . , h_n$ on the rows. We construct the signature matrix by considering each row in their given order. Let $S I G (i, c)$ be the element of the signature matrix for the ith hash function and column c. Initially, set $S I G (i, c)$ to ∞ for all i and c. We handle row r by doing the following:

Compute $h_1(r), h_2(r), . . . , h_n(r)$ .
For each column c do the following:
(a) If c has 0 in row r, do nothing.
(b) However, if c has 1 in row r, then for each $i = 1, 2, ..., n$ set $S I G (i, c)$ to the smaller of the current value of $S I G (i, c)$ and $h_i(r)$ .

Example

The process of minhashing is time-consuming, since we need to examine the entire k-row matrix M for each minhash function we want.

But to compute one minhash function on all the columns, we shall not go all the way to the end of the permutation, but only look at the first m out of k rows. If we make m small compared with k, we reduce the work by a large factor, k/m.
However, there is a downside to making m small. As long as each column has at least one 1 in the first m rows in permuted order, the rows after the mth have no effect on any minhash value and may as well not be looked at.

But what if some columns are all-0’s in the first m rows? We have no minhash value
for those columns, and will instead have to use a special symbol, for which we
shall use ∞.

When we examine the minhash signatures of two columns in order to estimate the Jaccard similarity of their underlying sets, we have to take into account the possibility that one or both columns have ∞ as their minhash value for some components of the signature. There are three cases:

If neither column has ∞ in a given row, then there is no change needed. Count this row as an example of equal values if the two values are the same, and as an example of unequal values if not.
One column has ∞ and the other does not. In this case, had we used all the rows of the original permuted matrix M, the column that has the ∞ would eventually have been given some row number, and that number will surely not be one of the first m rows in the permuted order. But the other column does have a value that is one of the first m rows. Thus, we surely have an example of unequal minhash values, and we count this row of the signature matrix as such an example.
Suppose both columns have ∞ in row. Then in the original permuted matrix M, the first m rows of both columns were all 0’s. We thus have no information about the Jaccard similarity of the corresponding sets; that similarity is only a function of the last $k - m$ rows, which we have chosen not to look at. We therefore count this row of the signature matrix as neither an example of equal values nor of unequal values.

As long as the third case, where both columns have ∞, is rare, we get almost as many examples to average as there are rows in the signature matrix. That effect will reduce the accuracy of our estimates of the Jaccard distance somewhat, but not much. And since we are now able to compute minhash values for all the columns much faster than if we examined all the rows of M, we can afford the time to apply a few more minhash functions. We get even better accuracy than originally, and we do so faster than before.

Suppose T is the set of elements of the universal set that are represented by the first m rows of matrix M. Let $S_1$ and $S_2$ be the sets represented by two columns of M. Then the first m rows of M represent the sets S1 ∩ T and S2 ∩ T . If both these sets are empty (i.e., both columns are all-0 in their first m rows), then this minhash function will be ∞ in both columns and will be ignored when estimating the Jaccard similarity of the columns’ underlying sets.

If at least one of the sets S1 ∩ T and S2 ∩ T is nonempty, then the probability of the two columns having equal values for this minhash function is the Jaccard similarity of these two sets, that is

$\frac{|S_1 ∩ S_2∩T|}{|(S_1\cup S_2)\cap T|}$

As long as T is chosen to be a random subset of the universal set, the expected value of this fraction will be the same as the Jaccard similarity of $S_1$ and $S_2$ However, there will be some random variation, since depending on T , we could find more or less than an average number of type X rows (1’s in both columns) and/or type Y rows (1 in one column and 0 in the other) among the first m rows of matrix M.

To mitigate this variation, we do not use the same set T for each minhashing that we do. Rather, we divide the rows of M into $k / m$ groups. Then for each hash function, we compute one minhash value by examining only the first m rows of M, a different minhash value by examining only the second m rows, and so on. We thus get $k / m$ minhash values from a single hash function and a single pass over all the rows of M. In fact, if $k / m$ is large enough, we may get all the rows of the signature matrix that we need by a single hash function applied to each of the subsets of rows of M.

Locality-Sensitive Hashing for Documents

One general approach to LSH is to “hash” items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are. We then consider any pair that hashed to the same bucket for any of the hashings to be a candidate pair. We check only the candidate pairs for similarity. The hope is that most of the dissimilar pairs will never hash to the same bucket, and therefore will never be checked. Those dissimilar pairs that do hash to the same bucket are false positives; we hope these will be only a small fraction of all pairs. We also hope that most of the truly similar pairs will hash to the same bucket under at least one of the hash functions. Those that do not are false negatives; we hope these will be only a small fraction of the truly similar pairs.

If we have minhash signatures for the items, an effective way to choose the hashings is to divide the signature matrix into b bands consisting of r rows each. For each band, there is a hash function that takes vectors of r integers (the portion of one column within that band) and hashes them to some large number of buckets. We can use the same hash function for all the bands, but we use a separate bucket array for each band, so columns with the same vector in different bands will not hash to the same bucket.

Suppose we use b bands of r rows each, and suppose that a particular pair of documents have Jaccard similarity s. The probability the minhash signatures for these documents agree in any one particular row of the signature matrix is s. We can calculate the probability that these documents (or rather their signatures) become a candidate pair as follows:

The probability that the signatures agree in all rows of one particular band is $s^r$ .
The probability that the signatures disagree in at least one row of a particular band is $1-s^r$ .
The probability that the signatures disagree in at least one row of each of the bands is $1-s^r)^b$
The probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair, is $1-(1-s^r)^b$ .

It may not be obvious, but regardless of the chosen constants b and r, this function has the form of an S-curve.

We can now give an approach to finding the set of candidate pairs for similar documents and then discovering the truly similar documents among them. It must be emphasized that this approach can produce false negatives – pairs of similar documents that are not identified as such because they never become a candidate pair. There will also be false positives – candidate pairs that are evaluated, but are found not to be sufficiently similar.

Pick a value of k and construct from each document the set of k-shingles. Optionally, hash the k-shingles to shorter bucket numbers.
Sort the document-shingle pairs to order them by shingle.
Pick a length n for the minhash signatures. Feed the sorted list to the algorithm to compute the minhash signatures for all the documents.
Choose a threshold t that defines how similar documents have to be in order for them to be regarded as a desired “similar pair.” Pick a number of bands b and a number of rows r such that $b r = n$ , and the threshold t is approximately $1/b) ^{1/r}$ . If avoidance of false negatives is important, you may wish to select b and r to produce a threshold lower than t; if speed is important and you wish to limit false positives, select b and r to produce a higher threshold.
Construct candidate pairs by applying the LSH technique.
Examine each candidate pair’s signatures and determine whether the fraction of components in which they agree is at least t.
Optionally, if the signatures are sufficiently similar, go to the documents themselves and check that they are truly similar, rather than documents that, by luck, had similar signatures

The LSH technique developed is one example of a family of functions (the minhash functions) that can be combined (by the banding technique) to distinguish strongly between pairs at a low distance from pairs at a high distance. The steepness of the S-curve in Fig. 3.8 reflects how effectively we can avoid false positives and false negatives among the candidate pairs. Now, we shall explore other families of functions, besides the minhash functions, that can serve to produce candidate pairs efficiently. These functions can apply to the space of sets and the Jaccard distance, or to another space and/or another distance measure. There are three conditions that we need for a family of functions:

They must be more likely to make close pairs be candidate pairs than distant pairs.
They must be statistically independent, in the sense that it is possible to estimate the probability that two or more functions will all give a certain response by the product rule for independent events.
They must be efficient, in two ways:
(a) They must be able to identify candidate pairs in time much less than the time it takes to look at all pairs. For example, minhash functions have this capability, since we can hash sets to minhash values in time proportional to the size of the data, rather than the square of the number of sets in the data. Since sets with common values are colocated in a bucket, we have implicitly produced the candidate pairs for a single minhash function in time much less than
the number of pairs of sets.
(b) They must be combinable to build functions that are better at avoiding false positives and negatives, and the combined functions must also take time that is much less than the number of pairs. For example, the banding technique takes single minhash functions, which satisfy condition 3a but do not, by themselves have the S-curve behavior we want, and produces from a number of minhash functions a combined function that has the S-curve shape.

Locality-Sensitive Functions

For the purposes of this section, we shall consider functions that take two items and render a decision about whether these items should be a candidate pair.

In many cases, the function f will “hash” items, and the decision will be based on whether or not the result is equal. Because it is convenient to use the notation $f (x) = f (y)$ to mean that f(x, y) is “yes; make x and y a candidate pair,” we shall use $f (x) = f (y)$ as a shorthand with this meaning. We also use $\not = f(y)$ to mean “do not make x and y a candidate pair unless some other function concludes we should do so.”
A collection of functions of this form will be called a family of functions.
For example, the family of minhash functions, each based on one of the possible permutations of rows of a characteristic matrix, form a family.

Let $d_1 < d_2$ be two distances according to some distance measure d. A family F of functions is said to be (d1, d2, p1, p2)-sensitive if for every f in F:

If $d(x, y) ≤ d_1$ , then the probability that f(x) = f(y) is at least $p_1$ .
If $d(x, y) ≥ d_2$ , then the probability that f(x) = f(y) is at most $p_2$ .

Suppose we are given a (d1, d2, p1, p2)-sensitive family F. We can construct a new family F′ by the AND-construction on F, which is defined as follows. Each member of F′ consists of r members of F for some fixed r. If f is in F′, and f is constructed from the set ${f_1, f_2, . . . , f_r}$ of members of F, we say f(x) = f(y) if and only if $f_i(x) = f_i(y)$ for all i = 1, 2, . . . , r. Notice that this construction mirrors the effect of the r rows in a single band: the band makes x and y a candidate pair if every one of the r rows in the band say that x and y are equal (and therefore a candidate pair according to that row).
Since the members of F are independently chosen to make a member of F′, we can assert that F′ is a $d_1,d_2,(p_1)^r,(p_2)^r)$ - sensitive family. That is, for any p, if p is the probability that a member of F will declare (x, y) to be a candidate pair, then the probability that a member of F′ will so declare is $p^r$ .

There is another construction, which we call the OR-construction, that turns
a (d1, d2, p1, p2)-sensitive family F into a $d_1,d_2,1-(1-p_1)^b,1-(1-p_2)^b)$

sensitive family F′. Each member f of F′ is constructed from b members of F, say $f_1, f_2, . . . , f_b$ . We define f(x) = f(y) if and only if $f_i(x) = f_i(y)$ for one or more values of i. The OR-construction mirrors the effect of combining several bands: x and y become a candidate pair if any band makes them a candidate pair.

Summary of Chapter 3

Jaccard Similarity: The Jaccard similarity of sets is the ratio of the size of the intersection of the sets to the size of the union. This measure of similarity is suitable for many applications, including textual similarity of documents and similarity of buying habits of customers.
Shingling: A k-shingle is any k characters that appear consecutively in a document. If we represent a document by its set of k-shingles, then the Jaccard similarity of the shingle sets measures the textual similarity of documents. Sometimes, it is useful to hash shingles to bit strings of shorter length, and use sets of hash values to represent documents.
Minhashing: A minhash function on sets is based on a permutation of the universal set. Given any such permutation, the minhash value for a set is that element of the set that appears first in the permuted order.
Minhash Signatures: We may represent sets by picking some list of permutations and computing for each set its minhash signature, which is the sequence of minhash values obtained by applying each permutation on the list to that set. Given two sets, the expected fraction of the permutations that will yield the same minhash value is exactly the Jaccard similarity of the sets.
Efficient Minhashing: Since it is not really possible to generate random permutations, it is normal to simulate a permutation by picking a random hash function and taking the minhash value for a set to be the least hash value of any of the set’s members. An additional efficiency can be had by restricting the search for the smallest minhash value to only a small subset of the universal set.
Locality-Sensitive Hashing for Signatures: This technique allows us to avoid computing the similarity of every pair of sets or their minhash signatures. If we are given signatures for the sets, we may divide them into bands, and only measure the similarity of a pair of sets if they are identical in at least one band. By choosing the size of bands appropriately, we can eliminate from consideration most of the pairs that do not meet our threshold of similarity.
Distance Measures: A distance measure is a function on pairs of points in a space that satisfy certain axioms. The distance between two points is 0 if the points are the same, but greater than 0 if the points are different. The distance is symmetric; it does not matter in which order we consider the two points. A distance measure must satisfy the triangle inequality: the distance between two points is never more than the sum of the distances between those points and some third point.
Euclidean Distance: The most common notion of distance is the Euclidean distance in an n-dimensional space. This distance, sometimes called the L2-norm, is the square root of the sum of the squares of the differences between the points in each dimension. Another distance suitable for Euclidean spaces, called Manhattan distance or the L1-norm is the sum of the magnitudes of the differences between the points in each dimension.
Jaccard Distance: One minus the Jaccard similarity is a distance measure, called the Jaccard distance.
Cosine Distance: The angle between vectors in a vector space is the cosine distance measure. We can compute the cosine of that angle by taking the dot product of the vectors and dividing by the lengths of the vectors.
Edit Distance: This distance measure applies to a space of strings, and is the number of insertions and/or deletions needed to convert one string into the other. The edit distance can also be computed as the sum of the lengths of the strings minus twice the length of the longest common subsequence of the strings.
Hamming Distance: This distance measure applies to a space of vectors. The Hamming distance between two vectors is the number of positions in which the vectors differ.
Generalized Locality-Sensitive Hashing: We may start with any collection of functions, such as the minhash functions, that can render a decision as to whether or not a pair of items should be candidates for similarity checking. The only constraint on these functions is that they provide a lower bound on the probability of saying “yes” if the distance (according to some distance measure) is below a given limit, and an upper bound on the probability of saying “yes” if the distance is above another given limit. We can then increase the probability of saying “yes” for nearby items and at the same time decrease the probability of saying “yes” for distant items to as great an extent as we wish, by applying an AND construction and an OR construction.
Random Hyperplanes and LSH for Cosine Distance: We can get a set of basis functions to start a generalized LSH for the cosine distance measure by identifying each function with a list of randomly chosen vectors. We apply a function to a given vector v by taking the dot product of v with each vector on the list. The result is a sketch consisting of the signs (+1 or −1) of the dot products. The fraction of positions in which the sketches of two vectors agree, multiplied by 180, is an estimate of the angle between the two vectors.
LSH For Euclidean Distance: A set of basis functions to start LSH for Euclidean distance can be obtained by choosing random lines and projecting points onto those lines. Each line is broken into fixed-length intervals, and the function answers “yes” to a pair of points that fall into the same interval.
High-Similarity Detection by String Comparison: An alternative approach to finding similar items, when the threshold of Jaccard similarity is close to 1, avoids using minhashing and LSH. Rather, the universal set is ordered, and sets are represented by strings, consisting their elements in order. The simplest way to avoid comparing all pairs of sets or their strings is to note that highly similar sets will have strings of approximately the same length. If we sort the strings, we can compare each string with only a small number of the immediately following strings.
Character Indexes: If we represent sets by strings, and the similarity threshold is close to 1, we can index all strings by their first few characters. The prefix whose characters must be indexed is approximately the length of the string times the maximum Jaccard distance (1 minus the minimum Jaccard similarity).
Position Indexes: We can index strings not only on the characters in their prefixes, but on the position of that character within the prefix. We reduce the number of pairs of strings that must be compared, because if two strings share a character that is not in the first position in both strings, then we know that either there are some preceding characters that are in the union but not the intersection, or there is an earlier symbol that appears in both strings.
Suffix Indexes: We can also index strings based not only on the characters in their prefixes and the positions of those characters, but on the length of the character’s suffix – the number of positions that follow it in the string. This structure further reduces the number of pairs that must be compared, because a common symbol with different suffix lengths implies additional characters that must be in the union but not in the intersection.

END

ps. 终于看完了，下一本书跨专业咯：《营销管理》（第15版）

SpringBoot多数据源动态切换方案：AbstractRoutingDataSource详解 fanxbl957 Web spring boot 后端 java
博主介绍：Java、Python、js全栈开发“多面手”，精通多种编程语言和技术，痴迷于人工智能领域。秉持着对技术的热爱与执着，持续探索创新，愿在此分享交流和学习，与大家共进步。DeepSeek-行业融合之万象视界(附实战案例详解100+)全栈开发环境搭建运行攻略：多语言一站式指南(环境搭建+运行+调试+发布+保姆级详解)感兴趣的可以先收藏起来，希望帮助更多的人SpringBoot多数据源动态切换
深入解读MaaS技术架构：从模型服务到智能部署的全流程分析 Cc不爱吃洋葱架构人工智能大语言模型大模型智能部署 MaaS技术架构 LLM
随着人工智能（AI）的迅速发展，MaaS（ModelasaService，模型即服务）技术架构应运而生。它通过将复杂的AI模型封装为标准化服务，降低了模型的开发和部署门槛，帮助企业快速实现业务场景的智能化升级。本文将深入解析MaaS技术架构，详细阐述其各个组成部分以及如何在实际应用中高效发挥其功能。一、使用方层：从应用接入到业务赋能MaaS技术架构的顶层是使用方层，它主要面向第三方应用，是企业与M
人工智能LLM | 基础配置 | 通过环境变量配置API-KEY 一文通教程 H-大叔人工智能大模型实战与教程人工智能
在实战开发大语言模型的过程中，经常会遇到各种API-KEY的配置问题，例如GPTOpenAIKEY的配置，而且目前大部分都要求将其配置在环境变量中，下面将会讲解如何在Linux、macOS、Windows中配置，本文一文通教程。您可以使用配置环境变量的方法，避免在调用各种SDK时显式地配置API-KEY，从而降低泄漏风险。环境变量是操作系统中用于存储有关系统环境的信息的变量。您可以通过环境变量来配
【人工智能】ChatGPT、DeepSeek-R1、DeepSeek-V3 辨析 G皮T #大语言模型人工智能 LLM 大语言模型 chatgpt deepseek DeepSeek-R1 DeepSeek-V3
ChatGPT、DeepSeek-R1、DeepSeek-V3辨析1.ChatGPT对比DeepSeek1.1技术相似点1.2主要差异1.3关键区别1.4如何选择1.5总结2.DeepSeek-R1对比DeepSeek-V32.1DeepSeek-R12.2DeepSeek-V32.3核心区别总结2.4如何选择3.R1和V3有什么含义3.1DeepSeekR1的"R"3.2DeepSeekV3的"
在学校研究学习的偏算法，秋招投递开发岗位还有希望吗程序员
前言Thelasttime,Ihavelearned这是星球同学，在周五晚上答疑聊天的时候对我的提问：如果简历上的项目偏算法，但是自学了一些操作系统和计网的知识，秋招的时候投递偏开发的岗位有希望吗？简历上是否也要加上相关项目？估计也是很多朋友的疑问，毕竟很多同学读研，有些老师疯狂push，要成果，发论文。要想尽快发论文，那只能“研究”人工智能、算法的一些东西了。但是众所周知，算法要求很高，不仅要求
【AI论文】基于图像思维的多模态推理：理论基础、方法及未来前沿东临碣石82 人工智能
摘要：近期，文本思维链（Chain-of-Thought，CoT）显著推动了多模态推理的进展。在这一范式下，模型在语言层面进行推理。然而，这种以文本为中心的方法将视觉信息视为静态的初始语境，从而在丰富的感知数据与离散的符号思维之间造成了根本性的“语义鸿沟”。人类认知往往超越语言的局限，将视觉作为动态的心理草图板加以利用。如今，人工智能领域也正经历着类似的演变，标志着从仅能对图像进行思考的模型向真正
DeepSeek 帮助自己的工作
引言简述人工智能助手在职场中的普及趋势DeepSeek作为智能创作助手的核心功能概述DeepSeek的核心能力信息检索与整合：基于用户意图精准搜索并生成答案多场景应用：技术文档撰写、数据分析、代码生成等交互优化：遵循用户指定的格式与内容规范职场应用场景与实操案例技术文档撰写自动生成API文档框架根据需求补充技术细节示例代码块与公式的规范化输出数据分析支持快速检索行业数据并生成可视化建议数学建模中的
人工智能-基础篇-23-智能体Agent到底是什么？怎么理解？（智能体=看+想+做） weisian151 人工智能人工智能
1、智能体是什么？想象你有一个超级聪明的小助手，它能：自己看环境（比如看到天气、听到声音、读到数据）；自己做决定（比如下雨了要关窗，电量低要去充电）；自己动手干活（比如帮你订外卖、打扫房间、开车）；越用越聪明（比如记住你的习惯，下次不用你提醒）。这个“小助手”就是智能体（Agent）——它是一个能自主感知、思考、行动并学习的系统，可以是软件（比如手机里的AI助手）、硬件（比如机器人），或者软硬结合
多角色AI Agent：基于LLM的虚拟角色扮演系统 AI天才研究院 AI人工智能与大数据人工智能 ai
多角色AIAgent：基于LLM的虚拟角色扮演系统关键词多角色AIAgentLargeLanguageModel(LLM)虚拟角色扮演系统人工智能自然语言处理程序设计摘要本文旨在探讨多角色AIAgent的基础知识以及其如何在虚拟角色扮演系统中发挥作用。我们将首先介绍多角色AIAgent的概念、历史背景和基本原理。随后，我们将深入探讨LLM（大语言模型）在虚拟角色扮演系统中的应用，包括其工作原理、核
【算法】解数独：C++ 实现与策略探讨 master_chenchengg 算法提升算法 java 开发语言
【算法】解数独：C++实现与策略探讨一、引言：C++算法技术的魔力与解数独的智慧二、技术概述：数独求解的艺术定义与技术框架核心特性和优势代码示例：基础回溯解法三、技术细节：解数独的逻辑与挑战原理解析难点分析四、实战应用：从游戏到人工智能应用场景解决方案展示五、优化与改进潜在问题改进建议六、常见问题与解决方案七、总结与展望一、引言：C++算法技术的魔力与解数独的智慧在算法领域，C++凭借其高效、灵活
FastMCP：用于构建MCP服务器的开源Python框架 NetX行者 AI编程服务器开源 python
在人工智能领域，模型上下文协议（ModelContextProtocol，简称MCP）作为一种标准化的协议，为大型语言模型（LLM）提供了丰富的上下文和工具支持。而FastMCP作为构建MCP服务器和客户端的Python框架，以其简洁的API设计、高效的开发体验以及强大的扩展能力，正逐渐成为开发者们的首选工具。一、FastMCP简介FastMCP是一个用于构建MCP服务器和客户端的Python框架
Python在人工智能领域的实际应用：示例代码解析辣条yyds python python 人工智能开发语言
摘要：本文将通过几个典型的人工智能应用场景，展示Python在图像识别、自然语言处理、推荐系统等方面的高级用法。通过示例代码，带大家深入理解Python在人工智能领域的实际应用。正文：Python作为一门流行的编程语言，凭借其简洁的语法、丰富的库和框架，成为了人工智能（AI）领域的主流开发语言。下面，我们将通过几个示例，探讨Python在人工智能方向的实际应用。示例一：图像识别-使用OpenCV进
Tansformer的Multi-Head Attention组件数字化与智能化大模型基础 Transformer框架 transformer 多头注意力机制
一、Transformer的注意力机制Transformer的注意力机制是对传统序列建模方法的颠覆性创新。它通过全局并行的关联计算解决了RNN的效率与长距离依赖瓶颈，通过动态权重和多头设计增强了模型对复杂信息的捕捉能力，最终成为现代人工智能的核心技术基石。其意义不仅在于提升了模型性能，更在于提供了一种“计算关联”的通用思路，推动了人工智能向更高效、更通用的方向发展。在Transformer之前，循
生成式人工智能实战 | 条件生成对抗网络（conditional Generative Adversarial Network, cGAN）盼小辉丶生成对抗网络神经网络深度学习生成式人工智能 pytorch
生成式人工智能实战|条件生成对抗网络0.前言1.条件生成对抗网络1.1GAN基础回顾1.2cGAN核心思想2.cGAN网络架构2.1数学原理2.2网络架构3.实现cGAN3.1环境准备与数据加载3.2模型构建3.3模型训练0.前言生成对抗网络(GenerativeAdversarialNetwork,GAN)是近年来深度学习领域最具突破性的技术之一，能够生成逼真的图像、音频甚至文本。然而，传统的G
【人工智能】Maas（模型即服务）（Model as a Service）是一种基于云计算的商业模式，通过API将预训练的人工智能模型作为服务提供给用户，使其无需自行管理底层基础设施即可调用AI能力。本本本添哥 A -AIGC 人工智能大模型人工智能云计算
ModelasaService（模型即服务，MaaS）是一种基于云计算的商业模式，通过API将预训练的人工智能模型作为服务提供给用户，使其无需自行管理底层基础设施即可调用AI能力。MaaS通过云原生架构和标准化服务，正在重塑AI技术的开发和消费方式，推动人工智能从“技术专有”向“普惠工具”转变。以下是其核心要点：1.定义与核心理念MaaS将大模型（如GPT-3、多模态模型等）封装为标准化服务，用户
【PaddleOCR】快速集成 PP-OCRv5 的 Python 实战秘籍--- 实例化 OCR 对象的 predict() 方法介绍
博主简介：曾任某智慧城市类企业算法总监，目前在美国市场的物流公司从事高级算法工程师一职，深耕人工智能领域，精通python数据挖掘、可视化、机器学习等，发表过AI相关的专利并多次在AI类比赛中获奖。CSDN人工智能领域的优质创作者，提供AI相关的技术咨询、项目开发和个性化解决方案等服务，如有需要请站内私信或者联系任意文章底部的的VX名片（ID：xf982831907）博主粉丝群介绍：①群内初中生、
Python 机器学习实战：Scikit-learn 算法宝典，从线性回归到支持向量机清水白石008 python Python题库 python 机器学习算法
Python机器学习实战：Scikit-learn算法宝典，从线性回归到支持向量机引言各位Python工程师，大家好！欢迎来到激动人心的机器学习世界！在这个数据驱动的时代，机器学习已经渗透到我们生活的方方面面，从智能推荐系统到自动驾驶汽车，都离不开机器学习技术的支撑。作为一名Python开发者，掌握机器学习技能，无疑将为您的职业发展注入强大的动力，让您在人工智能浪潮中占据先机。Scikit-lea
提示词工程在实体关系抽取中的创新 AI天才研究院计算 ChatGPT AI人工智能与大数据 java python javascript kotlin golang 架构人工智能大厂程序员硅基计算碳基计算认知计算生物计算深度学习神经网络大数据 AIGC AGI LLM 系统架构设计软件哲学 Agent 程序员实现财富自由
1.5概念结构与核心要素组成在深入探讨提示词工程在实体关系抽取中的应用之前，我们需要对其概念结构与核心要素组成有一个清晰的理解。这一部分将介绍提示词工程的基本框架，以及实体关系抽取的关键技术。提示词工程的基本框架提示词工程（PromptEngineering）是指利用人工智能技术和自然语言处理方法，设计并优化用于训练语言模型的输入提示（prompt），以达到特定任务目标的过程。其核心框架包括以下几
别再瞎摸索了！HarmonyOS AI 字幕控件用法全解析
引言现在视频、音频这些多媒体内容越来越多，用户对字幕的需求也跟着水涨船高，毕竟谁不想轻松看懂听不懂的内容呢？而且这两年人工智能技术发展得这么快，早就该用到字幕领域了——以前全靠人工打字幕，费时费力还容易出错，现在有了AI帮忙，简直是解放双手！正好HarmonyOS推出了AI字幕控件，这东西能自动识别语音、生成字幕，一下子就让视频和音频内容变得更易用了。对咱们做鸿蒙原生应用的人来说，更是省了大事儿—
Python机器学习入门必看！从原理到实战，手把手教你线性回归模型小张在编程 python 机器学习线性回归
引言在人工智能浪潮席卷全球的今天，机器学习（MachineLearning）早已不再是实验室的“黑科技”——打开购物APP的“猜你喜欢”、输入搜索词后的“相关推荐”、甚至天气预报中的温度预测，背后都有机器学习模型的身影。而在线性回归（LinearRegression）作为机器学习中最基础、最经典的监督学习模型，堪称机器学习的“敲门砖”。本文将从原理到实战，带你彻底掌握这一核心算法。一、机器学习的“
【linux】ssh 远程执行命令自动输入密码方式檀越@新空间 s5 Linux学习 linux ssh 服务器
欢迎来到我的博客，很高兴能够在这里和您见面！希望您在这里可以感受到一份轻松愉快的氛围，不仅可以获得有趣的内容和知识，也可以畅所欲言、分享您的想法和见解。非常期待和您一起在这个小小的网络世界里共同探索、学习和成长。✨✨欢迎订阅本专栏✨✨博客目录一.自动输入密码二.sshpass方式1.安装sshpass2.源码下载3.安装过程4.验证三.expect方式1.脚本2.执行前些天发现了一个巨牛的人工智能
满血DeepSeek加持的AlphaGPT，助力高文律师事务所全面拥抱AI
2025年初,中国团队精心雕琢的通用大模型DeepSeek凭借其创新的架构优化以及深入的数据挖掘技术,在逻辑推理、多轮对话和知识搜索等关键领域大放异彩,其为诸多垂直领域,特别是法律行业的智能化转型,开拓了全新的方向。2月8日,法律科技领域的领军者iCourt将旗下的AlphaGPT与DeepSeek深度融合,重磅推出业内首款“DeepSeek+法律专业”AI大模型。这一创举彻底打破了传统法律智能工
【AIGC时代】OneCode前端框架入门指南：从环境搭建到第一个应用低代码老李 OneCode实战低代码软件行业学习前端框架
在人工智能生成内容(AIGC)技术飞速发展的今天，前端开发领域正经历着前所未有的变革。AI工具能够批量生成代码，但如何将这些自动生成的代码转化为可维护、高质量的生产级应用，成为开发者面临的核心挑战。OneCode框架凭借其独特的设计理念，在这一背景下展现出显著优势，本文将带您从零开始，快速掌握OneCode框架的使用方法。一、AIGC背景下选择OneCode框架的四大理由AIGC工具的普及为前端开
人工智能驱动下的可再生能源气象预测：构建绿色能源时代的新大脑一ge科研小菜菜人工智能人工智能能源
个人主页：一ge科研小菜鸡-CSDN博客期待您的关注一、背景：新能源快速发展下的预测焦虑为应对气候变化和实现碳中和目标，全球能源系统正在加速从“化石主导”向“可再生主导”过渡。风能、太阳能等清洁能源已成为未来能源结构的关键支柱。根据国际能源署（IEA）预测，到2050年，全球超70%的电力将来自可再生能源。然而，可再生能源具有显著的**“天气依赖性”和“波动不确定性”**，风速、光照、温度、湿度等
筑牢 AIGC 安全防线：警惕提示词注入攻击 CS创新实验室 AIGC AIGC 安全大模型提示词提示词注入
在AIGC（生成式人工智能）技术蓬勃发展的当下，其在各个领域的应用日益广泛。然而，随着AIGC技术的深入应用，安全问题也逐渐凸显，提示词注入攻击便是其中不容忽视的一大威胁。对于AIGC开发者而言，深入了解提示词注入攻击并做好防范工作，是保障AIGC系统安全稳定运行的关键。提示词注入攻击的基本知识提示词注入攻击是指攻击者通过精心设计和构造提示词，利用AIGC模型对输入文本的处理机制，干扰模型的正常运
AI人工智能助力联邦学习通信效率优化的解决方案 AI智能应用人工智能 ai
AI驱动的联邦学习通信效率优化：从理论到实践的全面解决方案元数据框架标题AI驱动的联邦学习通信效率优化：从理论到实践的全面解决方案关键词联邦学习（FederatedLearning）、通信优化（CommunicationEfficiency）、AI赋能（AI-Enabled）、参数压缩（ParameterCompression）、客户端选择（ClientSelection）、联邦蒸馏（Federa
通义WebSailor：开启网络智能体新时代云资源服务商人工智能 ai
引言：WebSailor的横空出世在人工智能技术迅猛发展的当下，新的模型和智能体不断涌现，一次次刷新着人们对AI能力的认知。2024年7月7日，阿里云的一则消息犹如一颗重磅炸弹投入AI领域的湖面，激起千层浪——通义正式开源网络智能体WebSailor。这一开源举措，瞬间吸引了全球AI开发者、研究者以及科技爱好者的目光，在业界引发了强烈震动。一时间，技术论坛、社交媒体上关于WebSailor的讨论铺
AI人工智能领域，Stable Diffusion掀起的技术风暴 AI大模型应用工坊人工智能 stable diffusion ai
AI人工智能领域，StableDiffusion掀起的技术风暴关键词：AI人工智能、StableDiffusion、技术风暴、图像生成、扩散模型摘要：本文深入探讨了AI人工智能领域中StableDiffusion所掀起的技术风暴。首先介绍了StableDiffusion的背景，包括其目的、预期读者和文档结构等。详细阐述了核心概念与联系，通过文本示意图和Mermaid流程图进行清晰展示。对核心算法原
AI人工智能浪潮中，GPT的技术优势凸显 AI学长带你学AI 人工智能 gpt ai
AI人工智能浪潮中，GPT的技术优势凸显关键词：人工智能、GPT、自然语言处理、深度学习、Transformer、大语言模型、技术优势摘要：本文深入探讨了在人工智能浪潮中GPT(GenerativePre-trainedTransformer)系列模型的技术优势。我们将从GPT的核心架构出发，分析其独特的技术特点，包括自注意力机制、预训练-微调范式、零样本学习能力等。通过与传统NLP方法的对比，揭
AI伦理与安全之-哥斯拉与缰绳：如何让“哥斯拉”听懂人类的“悄悄话”？众链网络 AI伦理与安全 AI 人工智能 AI工具 AI智能体
相关文章:AI伦理与安全AI伦理与安全之-镜子与偏见：我们教给它的，究竟是智慧还是偏见？AI伦理与安全之-哥斯拉与缰绳：如何让“哥斯拉”听懂人类的“悄悄话”？AI伦理与安全之-梦境与幻觉：它为何会一本正经地胡说八道？在上一篇中，我们谈到AI像一面“镜子”，会映照出我们数据中的偏见。但那只是AI伦理问题中的“序章”。一个更深邃、更终极的挑战，正横亘在人类与超人工智能（ASI）的未来之间。这个挑战，就
JVM StackMapTable 属性的作用及理解 lijingyao8206 jvm 字节码 Class文件 StackMapTable
在Java 6版本之后JVM引入了栈图(Stack Map Table)概念。为了提高验证过程的效率，在字节码规范中添加了Stack Map Table属性，以下简称栈图，其方法的code属性中存储了局部变量和操作数的类型验证以及字节码的偏移量。也就是一个method需要且仅对应一个Stack Map Table。在Java 7版
回调函数调用方法百合不是茶 java
最近在看大神写的代码时,.发现其中使用了很多的回调 ,以前只是在学习的时候经常用到 ,现在写个笔记记录一下代码很简单: MainDemo :调用方法得到方法的返回结果
[时间机器]制造时间机器需要一些材料 comsci 制造
根据我的计算和推测,要完全实现制造一台时间机器,需要某些我们这个世界不存在的物质和材料... 甚至可以这样说,这种材料和物质,我们在反应堆中也无法获得......
开口埋怨不如闭口做事邓集海邓集海做人做事工作
“开口埋怨，不如闭口做事。”不是名人名言，而是一个普通父亲对儿子的训导。但是，因为这句训导，这位普通父亲却造就了一个名人儿子。这位普通父亲造就的名人儿子，叫张明正。　　　　张明正出身贫寒，读书时成绩差，常挨老师批评。高中毕业，张明正连普通大学的分数线都没上。高考成绩出来后，平时开口怨这怨那的张明正，不从自身找原因，而是不停地埋怨自己家庭条件不好、埋怨父母没有给他创造良好的学习环境。　　　　
jQuery插件开发全解析，类级别与对象级别开发 IT独行者 jquery 开发插件　函数
jQuery插件的开发包括两种：一种是类级别的插件开发，即给 jQuery添加新的全局函数，相当于给 jQuery类本身添加方法。 jQuery的全局函数就是属于 jQuery命名空间的函数，另一种是对象级别的插件开发，即给 jQuery对象添加方法。下面就两种函数的开发做详细的说明。 1 、类级别的插件开发类级别的插件开发最直接的理解就是给jQuer
Rome解析Rss 413277409 Rome解析Rss
import java.net.URL; import java.util.List; import org.junit.Test; import com.sun.syndication.feed.synd.SyndCategory; import com.sun.syndication.feed.synd.S
RSA加密解密无量加密解密 rsa
RSA加密解密代码代码有待整理 package com.tongbanjie.commons.util; import java.security.Key; import java.security.KeyFactory; import java.security.KeyPair; import java.security.KeyPairGenerat
linux 软件安装遇到的问题 aichenglong linux 遇到的问题 ftp
1 ftp配置中遇到的问题 500 OOPS: cannot change directory 出现该问题的原因:是SELinux安装机制的问题.只要disable SELinux就可以了修改方法:1 修改/etc/selinux/config 中SELINUX=disabled 2 source /etc
面试心得 alafqq 面试
最近面试了好几家公司。记录下；支付宝，面试我的人胖胖的，看着人挺好的；博彦外包的职位，面试失败；阿里金融，面试官人也挺和善，只不过我让他吐血了。。。由于印象比较深，记录下； 1，自我介绍 2，说下八种基本类型；（算上string。楼主才答了3种，哈哈，string其实不是基本类型，是引用类型） 3，什么是包装类，包装类的优点； 4，平时看过什么书？NND，什么书都没看过。。照样
java的多态性探讨百合不是茶 java
java的多态性是指main方法在调用属性的时候类可以对这一属性做出反应的情况 //package 1; class A{ public void test(){ System.out.println("A"); } } class D extends A{ public void test(){ S
网络编程基础篇之JavaScript-学习笔记 bijian1013 JavaScript
1.documentWrite <html> <head> <script language="JavaScript"> document.write("这是电脑网络学校"); document.close(); </script> </h
探索JUnit4扩展：深入Rule bijian1013 JUnit Rule 单元测试
本文将进一步探究Rule的应用，展示如何使用Rule来替代@BeforeClass，@AfterClass，@Before和@After的功能。在上一篇中提到，可以使用Rule替代现有的大部分Runner扩展，而且也不提倡对Runner中的withBefores()，withAfte
[CSS]CSS浮动十五条规则 bit1129 css
这些浮动规则，主要是参考CSS权威指南关于浮动规则的总结，然后添加一些简单的例子以验证和理解这些规则。 1. 所有的页面元素都可以浮动 2. 一个元素浮动后，会成为块级元素，比如<span>,a, strong等都会变成块级元素 3.一个元素左浮动，会向最近的块级父元素的左上角移动，直到浮动元素的左外边界碰到块级父元素的左内边界；如果这个块级父元素已经有浮动元素停靠了
【Kafka六】Kafka Producer和Consumer多Broker、多Partition场景 bit1129 partition
0.Kafka服务器配置 3个broker 1个topic，6个partition，副本因子是2 2个consumer，每个consumer三个线程并发读取 1. Producer package kafka.examples.multibrokers.producers; import java.util.Properties; import java.util.
zabbix_agentd.conf配置文件详解 ronin47 zabbix 配置文件
Aliaskey的别名，例如 Alias=ttlsa.userid:vfs.file.regexp[/etc/passwd,^ttlsa:.:([0-9]+),,,,\1]，或者ttlsa的用户ID。你可以使用key：vfs.file.regexp[/etc/passwd,^ttlsa:.: ([0-9]+),,,,\1]，也可以使用ttlsa.userid。备注: 别名不能重复，但是可以有多个
java--19.用矩阵求Fibonacci数列的第N项 bylijinnan fibonacci
参考了网上的思路，写了个Java版的： public class Fibonacci { final static int[] A={1,1,1,0}; public static void main(String[] args) { int n=7; for(int i=0;i<=n;i++){ int f=fibonac
Netty源码学习-LengthFieldBasedFrameDecoder bylijinnan java netty
先看看LengthFieldBasedFrameDecoder的官方API http://docs.jboss.org/netty/3.1/api/org/jboss/netty/handler/codec/frame/LengthFieldBasedFrameDecoder.html API举例说明了LengthFieldBasedFrameDecoder的解析机制，如下：实
AES加密解密 chicony 加密解密
AES加解密算法，使用Base64做转码以及辅助加密： package com.wintv.common; import javax.crypto.Cipher; import javax.crypto.spec.IvParameterSpec; import javax.crypto.spec.SecretKeySpec; import sun.misc.BASE64Decod
文件编码格式转换 ctrain 编码格式
package com.test; import java.io.File; import java.io.FileInputStream; import java.io.FileOutputStream; import java.io.IOException; import java.io.InputStream; import java.io.OutputStream;
mysql 在linux客户端插入数据中文乱码 daizj mysql 中文乱码
1、查看系统客户端，数据库，连接层的编码查看方法： http://daizj.iteye.com/blog/2174993 进入mysql，通过如下命令查看数据库编码方式： mysql> show variables like 'character_set_%'; +--------------------------+------
好代码是廉价的代码 dcj3sjt126com 程序员读书
长久以来我一直主张：好代码是廉价的代码。当我跟做开发的同事说出这话时，他们的第一反应是一种惊愕，然后是将近一个星期的嘲笑，把它当作一个笑话来讲。当他们走近看我的表情、知道我是认真的时，才收敛一点。当最初的惊愕消退后，他们会用一些这样的话来反驳： “好代码不廉价，好代码是采用经过数十年计算机科学研究和积累得出的最佳实践设计模式和方法论建立起来的精心制作的程序代码。” 我只
Android网络请求库——android-async-http dcj3sjt126com android
在iOS开发中有大名鼎鼎的ASIHttpRequest库，用来处理网络请求操作，今天要介绍的是一个在Android上同样强大的网络请求库android-async-http，目前非常火的应用Instagram和Pinterest的Android版就是用的这个网络请求库。这个网络请求库是基于Apache HttpClient库之上的一个异步网络请求处理库，网络处理均基于Android的非UI线程，通
ORACLE 复习笔记之SQL语句的优化 eksliang SQL优化 Oracle sql语句优化 SQL语句的优化
转载请出自出处：http://eksliang.iteye.com/blog/2097999 SQL语句的优化总结如下 sql语句的优化可以按照如下六个步骤进行：合理使用索引避免或者简化排序消除对大表的扫描避免复杂的通配符匹配调整子查询的性能 EXISTS和IN运算符下面我就按照上面这六个步骤分别进行总结：
浅析：Android 嵌套滑动机制（NestedScrolling） gg163 android 移动开发滑动机制嵌套
谷歌在发布安卓 Lollipop版本之后，为了更好的用户体验，Google为Android的滑动机制提供了NestedScrolling特性 NestedScrolling的特性可以体现在哪里呢？ 比如你使用了Toolbar，下面一个ScrollView，向上滚
使用hovertree菜单作为后台导航 hvt JavaScript jquery .net hovertree asp.net
hovertree是一个jquery菜单插件，官方网址：http://keleyi.com/jq/hovertree/ ，可以登录该网址体验效果。 0.1.3版本：http://keleyi.com/jq/hovertree/demo/demo.0.1.3.htm hovertree插件包含文件： http://keleyi.com/jq/hovertree/css
SVG 教程（二）矩形天梯梦 svg
SVG <rect> SVG Shapes SVG有一些预定义的形状元素，可被开发者使用和操作：矩形 <rect> 圆形 <circle> 椭圆 <ellipse> 线 <line> 折线 <polyline> 多边形 <polygon> 路径 <path>
一个简单的队列 luyulong java 数据结构队列
public class MyQueue { private long[] arr; private int front; private int end; // 有效数据的大小 private int elements; public MyQueue() { arr = new long[10]; elements = 0; front
基础数据结构和算法九：Binary Search Tree sunwinner Algorithm
A binary search tree (BST) is a binary tree where each node has a Comparable key (and an associated value) and satisfies the restriction that the key in any node is larger than the keys in all
项目出现的一些问题和体会 Steven-Walker DAO Web servlet
第一篇博客不知道要写点什么，就先来点近阶段的感悟吧。这几天学了servlet和数据库等知识，就参照老方的视频写了一个简单的增删改查的，完成了最简单的一些功能，使用了三层架构。 dao层完成的是对数据库具体的功能实现，service层调用了dao层的实现方法，具体对servlet提供支持。 &
高手问答：Java老A带你全面提升Java单兵作战能力！ ITeye管理员 java
本期特邀《Java特种兵》作者：谢宇，CSDN论坛ID: xieyuooo 针对JAVA问题给予大家解答，欢迎网友积极提问，与专家一起讨论! 作者简介：淘宝网资深Java工程师，CSDN超人气博主，人称“胖哥”。 CSDN博客地址： http://blog.csdn.net/xieyuooo 作者在进入大学前是一个不折不扣的计算机白痴，曾经被人笑话过不懂鼠标是什么，

《斯坦福数据挖掘教程·第三版》读书笔记（英文版）Chapter 3 Finding Similar Items

Shingling of Documents

Summary of Chapter 3

你可能感兴趣的:(数据挖掘,数据挖掘,人工智能)