Words that occur in similar contexts tend to have similar meanings. This link between similarity in how words are distributed and similarity in what they mean is called the distributional hypothesis.
Vector semantics instantiates this linguistic hypothesis by learning representations of the meaning of words directly from their distributions in texts. These representations are used in every natural language processing application that makes use of meaning. These word representations are also the first example we will see in the book of representation learning, automatically learning useful representations of the input text. Finding such unsupervised ways to learn representations of the input, instead of creating representations by hand via feature engineering, is an important focus of recent NLP research (Bengio et al., 2013).
A model of word meaning should allow us to draw useful inferences that will help us solve meaning-related tasks like question-answering, summarization, paraphrase or plagiarism detection, and dialogue.
lexical semantics: linguistic study of word meaning
Lemma and Sense
Relations between words or senses
Word Similarity: While words don’t have many synonyms, most words do have lots of similar words. Cat is not a synonym of dog, but cats and dogs are certainly similar words. In moving from synonymy to similarity, it will be useful to shift from talking about relations between word senses (like synonymy) to relations between words (like similarity). Dealing with words avoids having to commit to a particular
representation of word senses, which will turn out to simplify our task.
One way of getting values for word similarity is to ask humans to judge how similar one word is to another. A number of datasets have resulted from such experiments. For example the SimLex-999 dataset (Hill et al., 2015) gives values on a scale from 0 to 10, like the examples below, which range from near-synonyms (vanish, disappear) to pairs that scarcely seem to have anything in common (hole, agreement).
Word Relatedness: The meaning of two words can be related in ways other than similarity. One such class of connections is called word relatedness (Budanitsky and Hirst, 2006), also traditionally called word association in psychology. For example, coffee and cup are related (associated) but not similar.
One common kind of relatedness between words is if they belong to the same semantic field. A semantic field is a set of words which cover a particular semantic domain and bear structured relations with each other.
Semantic fields are also related to topic models, like Latent Dirichlet Allocation (LDA), which apply unsupervised learning on large sets of texts to induce sets of associated words from text. Semantic fields and topic models are a very useful tool for discovering topical structure in documents.
Semantic Frames and Roles: Closely related to semantic fields is the idea of a semantic frame. A semantic frame is a set of words that denote perspectives or participants in a particular type of event (whereas a semantic field groups words that share a domain).
Taxonomic Relations: Another way word senses can be related is taxonomically. A word (or sense) is a hyponym of another word or sense if the first is more specific, denoting a subclass of the other. For example, car is a hyponym of vehicle and dog is a hyponym of animal. Conversely, we say that vehicle is a hypernym of car, and animal is a hypernym of dog. It is unfortunate that the two words (hypernym and hyponym) are very similar and hence easily confused; for this reason, the word superordinate is often used instead of hypernym.
Hypernymy can also be defined in terms of entailment: being an $A$ entails being a $B$, or $\forall x\, A(x) \Rightarrow B(x)$. Another name for the hypernym/hyponym structure is the IS-A hierarchy, in which we say A IS-A B, or B subsumes A.
Connotation: Finally, words have affective meanings or connotations. The word connotation has different meanings in different fields, but here we use it to mean the aspects of a word’s meaning that are related to a writer or reader’s emotions, sentiment, opinions, or evaluations. For example some words have positive connotations (happy) while others have negative connotations (sad). Some words describe
positive evaluation (great, love) and others negative evaluation (terrible, hate). Positive or negative evaluation expressed through language is called sentiment, as we saw in Chapter 4, and word sentiment plays a role in important tasks like sentiment analysis, stance detection, and applications of natural language processing to the language of politics and consumer reviews.
Early work on affective meaning (Osgood et al., 1957) found that words varied along three important dimensions of affective meaning. These are now generally called valence, arousal, and dominance, defined as follows:
valence: the pleasantness of the stimulus
arousal: the intensity of emotion provoked by the stimulus
dominance: the degree of control exerted by the stimulus
Thus words like happy or satisfied are high on valence, while unhappy or annoyed are low on valence. Excited or frenzied are high on arousal, while relaxed or calm are low on arousal. Important or controlling are high on dominance, while awed or influenced are low on dominance. Each word is thus represented by three numbers, corresponding to its value on each of the three dimensions.
Osgood et al. (1957) noticed that in using these 3 numbers to represent the meaning of a word, the model was representing each word as a point in a three-dimensional space, a vector whose three dimensions corresponded to the word's rating on the three scales. This revolutionary idea that word meaning could be represented as a point in space (e.g., that part of the meaning of heartbreak can be represented as the point $[2.45, 5.65, 3.58]$) was the first expression of the vector semantics models that we introduce next.
The idea of vector semantics is to represent a word as a point in some multidimensional semantic space. Vectors for representing words are generally called embeddings, because the word is embedded in a particular vector space.
If words are represented as embeddings, a classifier can generalize: it can assign sentiment to test-set words as long as words with similar meanings occurred in the training set. Vector semantic models are also extremely practical because they can be learned automatically from text without any complex labeling or supervision.
As a result of these advantages, vector models of meaning are now the standard way to represent the meaning of words in NLP. In this chapter we'll introduce the two most commonly used models. First is the tf-idf model, often used as a baseline, in which the meaning of a word is defined by a simple function of the counts of nearby words. We will see that this method results in very long vectors that are sparse, i.e. contain mostly zeros (since most words simply never occur in the context of other words).
Then we’ll introduce the word2vec model, one of a family of models that are ways of constructing short, dense vectors that have useful semantic properties.
We’ll also introduce the cosine, the standard way to use embeddings (vectors) to compute functions like semantic similarity, the similarity between two words, two sentences, or two documents, an important tool in practical applications like question answering, summarization, or automatic essay grading.
Vector or distributional models of meaning are generally based on a co-occurrence matrix, a way of representing how often words co-occur.
In a term-document matrix, each row represents a word in the vocabulary and each column represents a document from some collection of documents.
The term-document matrix was first defined as part of the vector space model of information retrieval (Salton, 1971). In this model, a document is represented as a count vector.
In term-document matrices, the vectors representing each document would have dimensionality $|V|$, the vocabulary size.
Term-document matrices were originally defined as a means of finding similar documents for the task of document information retrieval. Two documents that are similar will tend to have similar words, and if two documents have similar words their column vectors will tend to be similar.
Information retrieval (IR) is the task of finding the document $d$ from the $D$ documents in some collection that best matches a query $q$. For IR we'll therefore also represent a query by a vector, also of length $|V|$, and we'll need a way to compare two vectors to find how similar they are. (Doing IR will also require efficient ways to store and manipulate these vectors, which is accomplished by making use of the convenient fact that these vectors are sparse, i.e., mostly zeros).
Rather than the term-document matrix we use the term-term matrix, more commonly called the word-word matrix or the term-context matrix, in which the columns are labeled by words rather than documents. This matrix is thus of dimensionality $|V|\times|V|$ and each cell records the number of times the row (target) word and the column (context) word co-occur in some context in some training corpus. It is most common to use smaller contexts, generally a window around the word, for example of 4 words to the left and 4 words to the right, in which case the cell represents the number of times (in some training corpus) the column word occurs in such a $\pm 4$ word window around the row word.
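To make the construction concrete, here is a minimal Python/NumPy sketch of building such a word-word matrix from a toy tokenized corpus; the corpus, the window size, and all variable names are illustrative, not from the text.

```python
# Sketch: build a term-term (word-word) co-occurrence matrix from a toy corpus,
# counting context words in a +/-4 token window around each target word.
from collections import defaultdict

import numpy as np

corpus = [
    "sugar a sliced lemon a tablespoonful of apricot jam a pinch".split(),
    "their enjoyment cup and another fruit whose taste she likened".split(),
]

window = 4
counts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    for i, target in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[target][sentence[j]] += 1

vocab = sorted({w for s in corpus for w in s})
idx = {w: k for k, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)), dtype=int)   # |V| x |V| matrix
for w, ctxs in counts.items():
    for c, n in ctxs.items():
        M[idx[w], idx[c]] = n

print(M[idx["apricot"], idx["jam"]])   # times 'jam' occurred within +/-4 of 'apricot'
```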
By far the most common similarity metric is the cosine of the angle between the vectors. The cosine—like most measures for vector similarity used in NLP—is based on the dot product operator from linear algebra, also called the inner product:
$$\textrm{dot-product}(\vec v,\vec w) = \vec v\cdot \vec w = \sum_{i=1}^N v_iw_i = v_1w_1 + v_2w_2 + \ldots + v_Nw_N$$
The dot product acts as a similarity metric because it will tend to be high just when the two vectors have large values in the same dimensions. Alternatively, vectors that have zeros in different dimensions—orthogonal vectors—will have a dot product of 0, representing their strong dissimilarity.
This raw dot-product, however, has a problem as a similarity metric: it favors long vectors. The vector length is defined as
$$|\vec v| = \sqrt{\sum_{i=1}^N v_i^2}$$
The dot product is higher if a vector is longer, with higher values in each dimension. More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them. The raw dot product thus will be higher for frequent words. But this is a problem; we’d like a similarity metric that tells us how similar two words are regardless of their frequency.
The simplest way to modify the dot product to normalize for the vector length is to divide the dot product by the lengths of each of the two vectors. The cosine similarity metric between two vectors $\vec v$ and $\vec w$ thus can be computed as:
$$\textrm{cosine}(\vec v,\vec w)=\frac{\vec v\cdot\vec w}{|\vec v||\vec w|}=\frac{\sum_{i=1}^N v_i w_i}{\sqrt{\sum_{i=1}^N v_i^2}\sqrt{\sum_{i=1}^N w_i^2}}$$
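As a quick illustration, here is a minimal NumPy sketch of this computation; the example vectors are made up.

```python
# Minimal sketch: cosine similarity as a length-normalized dot product.
import numpy as np

def cosine(v, w):
    """dot(v, w) / (|v| * |w|)"""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1, 0, 7, 2], dtype=float)
w = np.array([3, 1, 5, 0], dtype=float)
print(cosine(v, w))   # near 1.0 for similar vectors, 0 for orthogonal ones
```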
It's a bit of a paradox. Words that occur nearby frequently (maybe sugar appears often in our corpus near apricot) are more important than words that only appear once or twice. Yet words that are too frequent (ubiquitous words like the or good) are unimportant. How can we balance these two conflicting constraints?
The tf-idf algorithm is the product of two terms, each capturing one of these two intuitions:
The first is the term frequency (Luhn, 1957): the frequency of the word in the document. Normally we want to downweight the raw frequency a bit, since a word appearing 100 times in a document doesn't make that word 100 times more likely to be relevant to the meaning of the document. So we generally use the $\log_{10}$ of the frequency, resulting in the following definition for the term frequency weight:
$$\textrm{tf}_{t,d} = \begin{cases} 1 + \log_{10}\textrm{count}(t,d) & \text{if } \textrm{count}(t,d) > 0\\ 0 & \text{otherwise}\end{cases}$$
The second factor is used to give a higher weight to words that occur only in a few documents. Terms that are limited to a few documents are useful for discriminating those documents from the rest of the collection; terms that occur frequently across the entire collection aren't as helpful. The document frequency $\textrm{df}_t$ of a term $t$ is simply the number of documents it occurs in. By contrast, the collection frequency of a term is the total number of times the word appears in the whole collection in any document.
We assign importance to these more discriminative words via the inverse document frequency or idf term weight (Sparck Jones, 1972). The idf is defined using the fraction $N/\textrm{df}_t$, where $N$ is the total number of documents in the collection, and $\textrm{df}_t$ is the number of documents in which term $t$ occurs. The fewer documents in which a term occurs, the higher this weight. The lowest weight of 1 is assigned to terms that occur in all the documents. It's usually clear what counts as a document: in Shakespeare
we would use a play; when processing a collection of encyclopedia articles like Wikipedia, the document is a Wikipedia page; in processing newspaper articles, the document is a single article. Occasionally your corpus might not have appropriate document divisions and you might need to break up the
corpus into documents yourself for the purposes of computing idf.
Because of the large number of documents in many collections, this measure is usually squashed with a log function. The resulting definition for inverse document frequency (idf) is thus
$$\textrm{idf}_t=\log_{10}\left(\frac{N}{\textrm{df}_t}\right)$$
The tf-idf weighting of the value for word $t$ in document $d$, $w_{t,d}$, thus combines term frequency with idf:
$$w_{t,d} = \textrm{tf}_{t,d}\times \textrm{idf}_t$$
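A small sketch of how these weights could be computed for a toy term-document count matrix follows; the counts and variable names are invented for illustration.

```python
# Sketch of tf-idf weighting: rows are terms, columns are documents.
import numpy as np

counts = np.array([
    [10, 0, 3],    # term 0
    [0,  5, 0],    # term 1
    [2,  2, 2],    # term 2 (occurs in every document, so idf = 0)
], dtype=float)

N = counts.shape[1]                    # number of documents
tf = np.zeros_like(counts)
nz = counts > 0
tf[nz] = 1.0 + np.log10(counts[nz])    # 1 + log10(count) where count > 0, else 0
df = (counts > 0).sum(axis=1)          # document frequency of each term
idf = np.log10(N / df)                 # inverse document frequency
w = tf * idf[:, None]                  # tf-idf weight w_{t,d}
print(np.round(w, 3))
```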
In summary, the vector semantics model we've described so far represents a target word as a vector with dimensions corresponding to all the words in the vocabulary (length $|V|$, with vocabularies of 20,000 to 50,000), which is also sparse (most values are zero). The values in each dimension are the frequency with which the target word co-occurs with each neighboring context word, weighted by tf-idf.
The tf-idf vector model can also be used to decide if two documents are similar. We represent a document by taking the vectors of all the words in the document and computing the centroid of all those vectors. The centroid is the multidimensional version of the mean; the centroid of a set of vectors is a single vector that has the minimum sum of squared distances to each of the vectors in the set. Given the $k$ word vectors $w_1, w_2, \ldots, w_k$ of a document, the centroid document vector $d$ is:
$$d = \frac{w_1 + w_2 + \ldots + w_k}{k}$$
Given two documents, we can then compute their document vectors $d_1$ and $d_2$, and estimate the similarity between the two documents by $\cos(d_1,d_2)$.
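A minimal sketch of this centroid-based document similarity, assuming we already have some word vectors (here random stand-ins for tf-idf vectors; the words are illustrative):

```python
# Sketch: represent each document as the centroid (mean) of its word vectors
# and compare two documents with cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
word_vecs = {w: rng.random(50) for w in ["apricot", "jam", "coffee", "cup", "sugar"]}

def doc_vector(words):
    return np.mean([word_vecs[w] for w in words], axis=0)   # centroid = mean vector

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

d1 = doc_vector(["apricot", "jam", "sugar"])
d2 = doc_vector(["coffee", "cup", "sugar"])
print(cosine(d1, d2))
```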
An alternative weighting function to tf-idf is called PPMI (positive pointwise mutual information). PPMI draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in our corpus than we would have a priori expected them to appear by chance.
Pointwise mutual information (Fano, 1961) is one of the most important concepts in NLP. It is a measure of how often two events $x$ and $y$ occur, compared with what we would expect if they were independent:
$$I(x,y) =\log_2 \frac{P(x,y)}{P(x)P(y)}$$
The pointwise mutual information between a target word $w$ and a context word $c$ (Church and Hanks 1989, Church and Hanks 1990) is then defined as:
$$\textrm{PMI}(w,c) = \log_2 \frac{P(w,c)}{P(w)P(c)}$$
The numerator tells us how often we observed the two words together (assuming we compute probability by using the MLE). The denominator tells us how often we would expect the two words to co-occur assuming they each occurred independently; recall that the probability of two independent events both occurring is just the product of the probabilities of the two events. Thus, the ratio gives us an estimate of how much more the two words co-occur than we expect by chance. PMI is a useful tool whenever we need to find words that are strongly associated.
PMI values range from negative to positive infinity. But negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous. To distinguish whether two words whose individual probability is each $10^{-6}$ occur together more often than chance, we would need to be certain that the probability of the two occurring together is significantly different than $10^{-12}$, and this kind of granularity would require an enormous corpus. Furthermore it's not clear whether it's even possible to evaluate such scores of 'unrelatedness' with human judgments. For this reason it is more common to use Positive PMI (called PPMI) which replaces all negative PMI values with zero (Church and Hanks 1989, Dagan et al. 1993, Niwa and Nitta 1994):
$$\textrm{PPMI}(w,c) = \max\left(\log_2 \frac{P(w,c)}{P(w)P(c)},\,0\right)$$
More formally, let's assume we have a co-occurrence matrix $F$ with $W$ rows (words) and $C$ columns (contexts), where $f_{ij}$ gives the number of times word $w_i$ occurs in context $c_j$. This can be turned into a PPMI matrix where $\textrm{ppmi}_{ij}$ gives the PPMI value of word $w_i$ with context $c_j$ as follows:
$$p_{ij}=\frac{f_{ij}}{\sum_{i=1}^W\sum_{j=1}^C f_{ij}}, \quad p_{i*}=\frac{\sum_{j=1}^C f_{ij}}{\sum_{i=1}^W\sum_{j=1}^C f_{ij}}, \quad p_{*j}=\frac{\sum_{i=1}^W f_{ij}}{\sum_{i=1}^W\sum_{j=1}^C f_{ij}}$$
$$\textrm{PPMI}_{ij}=\max\left(\log_2\frac{p_{ij}}{p_{i*}p_{*j}},\,0\right)$$
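A short sketch of turning a toy count matrix $F$ into a PPMI matrix with these formulas; the counts are invented for illustration.

```python
# Sketch: convert a word-context co-occurrence count matrix F into a PPMI matrix.
import numpy as np

F = np.array([
    [0, 2, 1],
    [3, 0, 0],
    [1, 1, 4],
], dtype=float)                     # f_ij: word i with context j

total = F.sum()
p_ij = F / total                    # joint probabilities
p_i = F.sum(axis=1) / total         # row (word) marginals p_{i*}
p_j = F.sum(axis=0) / total         # column (context) marginals p_{*j}

with np.errstate(divide="ignore"):  # log2(0) gives -inf, clipped to 0 below
    pmi = np.log2(p_ij / np.outer(p_i, p_j))
ppmi = np.maximum(pmi, 0)
print(np.round(ppmi, 3))
```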
PMI has the problem of being biased toward infrequent events; very rare words tend to have very high PMI values. One way to reduce this bias toward low frequency events is to slightly change the computation for $P(c)$, using a different function $P_\alpha(c)$ that raises contexts to the power of $\alpha$:
$$\textrm{PPMI}_\alpha(w,c) = \max\left(\log_2 \frac{P(w,c)}{P(w)P_\alpha(c)},\,0\right)$$
$$P_\alpha(c)=\frac{\textrm{count}(c)^\alpha}{\sum_{c'} \textrm{count}(c')^\alpha}$$
Levy et al. (2015) found that a setting of $\alpha = 0.75$ improved performance of embeddings on a wide range of tasks. This works because raising the probability to $\alpha = 0.75$ increases the probability assigned to rare contexts, and hence lowers their PMI ($P_\alpha(c) > P(c)$ when $c$ is rare).
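A quick numeric check of this effect, with made-up context counts (the words and counts below are hypothetical):

```python
# The rare context gets a relatively larger probability under P_alpha (alpha = 0.75).
counts = {"the": 990, "zygote": 10}        # hypothetical context counts
alpha = 0.75

total = sum(counts.values())
total_a = sum(c ** alpha for c in counts.values())
for word, n in counts.items():
    p = n / total
    p_a = n ** alpha / total_a
    print(word, round(p, 4), round(p_a, 4))   # P_alpha > P for the rare context
```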
Another possible solution is Laplace smoothing: before computing PMI, a small constant $k$ (values of 0.1 to 3 are common) is added to each of the counts, shrinking (discounting) all the non-zero values. The larger the $k$, the more the non-zero counts are discounted.
In this section we introduce one method for learning short, dense vectors: skip-gram with negative sampling, sometimes called SGNS. The skip-gram algorithm is one of two algorithms in a software package called word2vec, and so sometimes the algorithm is loosely referred to as word2vec (Mikolov et al. 2013, Mikolov et al. 2013a). The word2vec methods are fast, efficient to train, and easily available online with code and pretrained embeddings. We point to other embedding methods, like the equally popular GloVe (Pennington et al., 2014), at the end of the chapter.
The intuition of word2vec is that instead of counting how often each word $w$ occurs near, say, apricot, we'll instead train a classifier on a binary prediction task: "Is word $w$ likely to show up near apricot?" We don't actually care about this prediction task; instead we'll take the learned classifier weights as the word embeddings.
The revolutionary intuition here is that we can just use running text as implicitly supervised training data for such a classifier; a word $w$ that occurs near the target word apricot acts as the gold 'correct answer' to the question "Is word $w$ likely to show up near apricot?" This avoids the need for any sort of hand-labeled supervision signal. This idea was first proposed in the task of neural language modeling, when Bengio et al. (2003) and Collobert et al. (2011) showed that a neural language model (a neural network that learned to predict the next word from prior words) could just use the next word in running text as its supervision signal, and could be used to learn an embedding representation for each word as part of doing this prediction task.
We'll see how to do neural networks in the next chapter, but word2vec is a much simpler model than the neural network language model, in two ways. First, word2vec simplifies the task (making it binary classification instead of word prediction). Second, word2vec simplifies the architecture (training a logistic regression classifier instead of a multi-layer neural network with hidden layers that demand more sophisticated training algorithms). The intuition of skip-gram is:
1. Treat the target word and a neighboring context word as positive examples.
2. Randomly sample other words in the lexicon to get negative examples.
3. Use logistic regression to train a classifier to distinguish those two cases.
4. Use the learned weights as the embeddings.
Let’s start by thinking about the classification task, and then turn to how to train. Imagine a sentence like the following, with a target word apricot and assume we’re using a window of ±2 context words:
... lemon, a [tablespoon of apricot jam, a] pinch ...
(context words c1 = tablespoon, c2 = of; target word t = apricot; context words c3 = jam, c4 = a)
Our goal is to train a classifier such that, given a tuple $(t,c)$ of a target word $t$ paired with a candidate context word $c$ (for example (apricot, jam), or perhaps (apricot, aardvark)) it will return the probability that $c$ is a real context word (true for jam, false for aardvark):
$$P(+|t,c)$$
$$P(-|t,c)=1-P(+|t,c)$$
How does the classifier compute the probability $P$? The intuition of the skip-gram model is to base this probability on similarity: a word is likely to occur near the target if its embedding is similar to the target embedding. How can we compute similarity between embeddings? Recall that two vectors are similar if they have a high dot product (cosine, the most popular similarity metric, is just a normalized dot product). In other words:
$$\textrm{Similarity}(t,c) \approx t \cdot c$$
Of course, the dot product $t\cdot c$ is not a probability, it's just a number ranging from $-\infty$ to $\infty$. (Recall, for that matter, that cosine isn't a probability either.) To turn the dot product into a probability, we'll use the logistic or sigmoid function $\sigma(x)$, the fundamental core of logistic regression.
$$P(+|t,c)=\frac{1}{1+e^{-t\cdot c}}$$
$$P(-|t,c)=1-P(+|t,c)=\frac{e^{-t\cdot c}}{1+e^{-t\cdot c}}$$
We need to take account of multiple context words in the window. Skip-gram makes the strong but very
useful simplifying assumption that all context words are independent, allowing us to just multiply their probabilities:
$$P(+|t,c_{1:k})=\prod_{i=1}^k\frac{1}{1+e^{-t\cdot c_i}}$$
$$\log P(+|t,c_{1:k})=\sum_{i=1}^k\log\frac{1}{1+e^{-t\cdot c_i}}$$
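A minimal sketch of computing this probability with NumPy, using random stand-in embeddings; the dimensionality and names are illustrative assumptions:

```python
# Sketch: P(+ | t, c_1:k) under the independence assumption above,
# using the sigmoid of the dot product for each context word.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d = 50
t = rng.normal(size=d)                 # target embedding
contexts = rng.normal(size=(4, d))     # c_1 .. c_4 context embeddings

p_pos = np.prod([sigmoid(t @ c) for c in contexts])
log_p_pos = np.sum([np.log(sigmoid(t @ c)) for c in contexts])
print(p_pos, log_p_pos)
```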
In summary, skip-gram trains a probabilistic classifier that, given a test target word $t$ and its context window of $k$ words $c_{1:k}$, assigns a probability based on how similar this context window is to the target word. The probability is based on applying the logistic (sigmoid) function to the dot product of the embeddings of the target word with each context word. We could thus compute this probability if only we had embeddings for each target word and context word in the vocabulary. Let's now turn to learning these embeddings (which is the real goal of training this classifier in the first place).
Word2vec learns embeddings by starting with an initial set of embedding vectors and then iteratively shifting the embedding of each word $w$ to be more like the embeddings of words that occur nearby in texts, and less like the embeddings of words that don't occur nearby.
Let’s start by considering a single piece of the training data, from the sentence above:
... lemon, a [tablespoon of apricot jam, a] pinch ...
(context words c1 = tablespoon, c2 = of; target word t = apricot; context words c3 = jam, c4 = a)
This example has a target word t (apricot), and 4 context words in the $L = \pm 2$ window, resulting in 4 positive training instances:
Positive examples (+):

| t | c |
|---|---|
| apricot | tablespoon |
| apricot | of |
| apricot | jam |
| apricot | a |
For training a binary classifier we also need negative examples, and in fact skip-gram uses more negative examples than positive examples, the ratio set by a parameter $k$. So for each of these $(t,c)$ training instances we'll create $k$ negative samples, each consisting of the target $t$ plus a 'noise word'. A noise word is a random word from the lexicon, constrained not to be the target word $t$. The following table shows the setting where $k = 2$, so we'll have 2 negative examples in the negative training set for each positive example $(t,c)$.
Negative examples (−):

| t | c | t | c |
|---|---|---|---|
| apricot | aardvark | apricot | twelve |
| apricot | puddle | apricot | hello |
| apricot | where | apricot | dear |
| apricot | coaxial | apricot | forever |
The noise words are chosen according to their weighted unigram frequency $P_\alpha(w)$, where $\alpha$ is a weight. If we were sampling according to unweighted frequency $P(w)$, it would mean that with unigram probability $P(\textrm{"the"})$ we would choose the word the as a noise word, with unigram probability $P(\textrm{"aardvark"})$ we would choose aardvark, and so on. But in practice it is common to set $\alpha = 0.75$, i.e. use the weighting $P_{3/4}(w)$:
$$P_\alpha(w)=\frac{\textrm{count}(w)^\alpha}{\sum_{w'} \textrm{count}(w')^\alpha}$$
Setting $\alpha = 0.75$ gives better performance because it gives rare noise words slightly higher probability: for rare words, $P_\alpha(w) > P(w)$.
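A small sketch of drawing noise words from this $P_\alpha$ distribution, with invented unigram counts and $k = 2$; the vocabulary and counts are illustrative:

```python
# Sketch: sample k noise words per positive pair from the alpha-weighted
# unigram distribution, excluding the target word itself.
import numpy as np

rng = np.random.default_rng(2)
unigram_counts = {"the": 1000, "of": 600, "apricot": 5, "aardvark": 2, "jam": 8}
alpha, k = 0.75, 2

words = list(unigram_counts)
weights = np.array([unigram_counts[w] ** alpha for w in words])
p_alpha = weights / weights.sum()

def sample_noise(target, k):
    noise = []
    while len(noise) < k:
        w = rng.choice(words, p=p_alpha)
        if w != target:                 # a noise word may not be the target itself
            noise.append(str(w))
    return noise

print(sample_noise("apricot", k))
```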
Given the set of positive and negative training instances, and an initial set of embeddings, the goal of the learning algorithm is to adjust those embeddings to maximize the similarity of the target word with the context words drawn from the positive examples, and minimize the similarity of the target word with the noise words drawn from the negative examples.
We can express this formally over the whole training set as:
$$L(\theta)=\sum_{(t,c)\in +}\log P(+|t,c)+\sum_{(t,c)\in -}\log P(-|t,c)$$
Or, focusing in on one word/context pair $(t,c)$ with its $k$ noise words $n_1,\ldots, n_k$, the learning objective $L$ is:
$$L(\theta) = \log P(+|t,c) + \sum_{i=1}^{k}\log P(-|t,n_i) = \log\sigma(c\cdot t) + \sum_{i=1}^{k}\log\sigma(-n_i\cdot t)$$
We can then use stochastic gradient descent to train to this objective, iteratively modifying the parameters (the embeddings for each target word t t t and each context word or noise word c c c in the vocabulary) to maximize the objective.
Note that the skip-gram model thus actually learns two separate embeddings for each word $w$: the target embedding $t$ and the context embedding $c$. These embeddings are stored in two matrices, the target matrix $T$ and the context matrix $C$. So each column $i$ of the target matrix $T$ is the $d$-dimensional vector embedding $t_i$ for word $i$ in the vocabulary $V$, and each column $i$ of the context matrix $C$ is the $d$-dimensional vector embedding $c_i$ for word $i$ in $V$. $d$ is the dimension of an embedding and is far less than $|V|$.
Just as in logistic regression, then, the learning algorithm starts with randomly initialized $T$ and $C$ matrices, and then walks through the training corpus using gradient descent to move $T$ and $C$ so as to maximize the objective $L(\theta)$. Thus the matrices $T$ and $C$ function as the parameters $\theta$ that logistic regression is tuning.
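The sketch below shows one such stochastic gradient (ascent) step for a single $(t, c)$ pair and its noise words, with the gradients of the sigmoid-based objective written out by hand; the matrices, learning rate, and word indices are illustrative, and the sketch stores embeddings as rows of $T$ and $C$ purely as a coding convention, not as a claim about the original implementation.

```python
# Sketch: one stochastic-gradient (ascent) step on
#   L = log sigma(c.t) + sum_i log sigma(-n_i.t)
# for a single (t, c) pair with k noise words.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
V, d, eta = 100, 50, 0.05
T = rng.normal(scale=0.1, size=(V, d))   # target embeddings, one row per word (sketch convention)
C = rng.normal(scale=0.1, size=(V, d))   # context embeddings, one row per word

def sgns_step(t_id, c_id, noise_ids):
    t = T[t_id].copy()                   # copy so all updates use the old values
    c = C[c_id]
    noise = C[noise_ids]                 # k noise-word embeddings
    g_pos = 1.0 - sigmoid(c @ t)         # gradient weight for the positive pair
    g_neg = sigmoid(noise @ t)           # one weight per noise word
    T[t_id] += eta * (g_pos * c - g_neg @ noise)   # dL/dt
    C[c_id] += eta * g_pos * t                     # dL/dc
    C[noise_ids] -= eta * np.outer(g_neg, t)       # dL/dn_i

sgns_step(t_id=7, c_id=12, noise_ids=[3, 44])
```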
Once the embeddings are learned, we'll have two embeddings for each word $w_i$: $t_i$ and $c_i$. We can choose to throw away the $C$ matrix and just keep $T$, in which case each word $w_i$ will be represented by the vector $t_i$.
Alternatively we can add the two embeddings together, using the summed embedding $t_i + c_i$ as the new $d$-dimensional embedding, or we can concatenate them into an embedding of dimensionality $2d$.
As with the simple count-based methods like tf-idf, the context window size $L$ affects the performance of skip-gram embeddings, and experiments often tune the parameter $L$ on a dev set. One difference from the count-based methods is that for skip-grams, the larger the window size the more computation the algorithm requires for training (more neighboring words must be predicted).
The simplest way to visualize the meaning of a word $w$ embedded in a space is to list the most similar words to $w$, sorting all words in the vocabulary by their cosine similarity to $w$.
Yet another visualization method is to use a clustering algorithm to show a hierarchical representation of which words are similar to others in the embedding space.
Probably the most common visualization method, however, is to project the many dimensions of a word embedding (for example 100) down into 2 dimensions, using a projection method called t-SNE (van der Maaten and Hinton, 2008).
Vector semantic models have a number of parameters. One parameter that is relevant to both sparse tf-idf vectors and dense word2vec vectors is the size of the context window used to collect counts. This is generally between 1 and 10 words on each side of the target word (for a total context of 2-20 words).
The choice depends on the goals of the representation. Shorter context windows tend to lead to representations that are a bit more syntactic, since the information is coming from immediately nearby words. When the vectors are computed from short context windows, the most similar words to a target word $w$ tend to be semantically similar words with the same parts of speech. When vectors are computed from long context windows, the highest cosine words to a target word $w$ tend to be words that are topically related but not similar.
It’s also often useful to distinguish two kinds of similarity or association between words (Schutze and Pedersen, 1993). Two words have first-order co-occurrence (sometimes called syntagmatic association) if they are typically nearby each other. Thus wrote is a first-order associate of book or poem. Two words have second-order co-occurrence (sometimes called paradigmatic association) if they have similar neighbors. Thus wrote is a second-order associate of words like said or remarked.
Analogy Another semantic property of embeddings is their ability to capture relational meanings. Mikolov et al. (2013b) and Levy and Goldberg (2014b) show that the offsets between vector embeddings can capture some analogical relations between words. For example, the result of the expression vector(‘king’) - vector(‘man’) + vector(‘woman’) is a vector close to vector(‘queen’). Similarly, they found that the expression vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) results in a vector that is very close to vector(‘Rome’).
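A sketch of the parallelogram method behind these analogies follows; the embedding dictionary here is a random stand-in for trained vectors, so the expected answer would only emerge with real embeddings.

```python
# Sketch: solve "a is to b as c is to ?" by finding the vocabulary word whose
# embedding is closest (by cosine) to b - a + c, excluding the input words.
import numpy as np

rng = np.random.default_rng(4)
emb = {w: rng.normal(size=50) for w in ["king", "man", "woman", "queen", "apple"]}

def cosine(v, w):
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

def analogy(a, b, c):
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

print(analogy("man", "king", "woman"))   # with trained embeddings: 'queen'
```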
Embeddings and Historical Semantics: Embeddings can also be a useful tool for studying how meaning changes over time, by computing multiple embedding spaces, each from texts written in a particular time period.
For intrinsic evaluations, the most common metric is to test the performance of embeddings on word similarity, computing the correlation between an algorithm's word similarity scores and word similarity ratings assigned by humans. WordSim-353 (Finkelstein et al., 2002) is a commonly used set of ratings from 0 to 10 for 353 noun pairs; for example (plane, car) had an average score of 5.77. SimLex-999 (Hill et al., 2015) is a more difficult dataset that quantifies similarity (cup, mug) rather than relatedness (cup, coffee), and includes both concrete and abstract adjective, noun, and verb pairs. The TOEFL dataset is a set of 80 questions, each consisting of a target word with 4 additional word choices; the task is to choose which is the correct synonym, as in the example: Levied is closest in meaning to: imposed, believed, requested, correlated (Landauer and Dumais, 1997). All of these datasets present words without context.
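Concretely, this correlation is usually reported as a rank correlation (Spearman's $\rho$) between the two sets of scores; a minimal sketch with invented ratings and model scores:

```python
# Sketch: intrinsic evaluation by correlating model similarity scores with
# human similarity ratings; real evaluations load published datasets instead.
from scipy.stats import spearmanr

human_ratings = [9.8, 7.3, 1.8, 0.3]       # invented ratings for four word pairs
model_scores = [0.91, 0.64, 0.22, 0.15]    # cosine similarities from some embedding model

rho, p_value = spearmanr(human_ratings, model_scores)
print(rho)                                  # rank correlation; higher is better
```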
Slightly more realistic are intrinsic similarity tasks that include context. The Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012) offers a richer evaluation scenario, giving human judgments on 2,003 pairs of words in their sentential context, including nouns, verbs, and adjectives. This dataset enables the evaluation of word similarity algorithms that can make use of context words. The semantic textual similarity task (Agirre et al. 2012, Agirre et al. 2015) evaluates the performance of sentence-level similarity algorithms, consisting of a set of pairs of sentences, each pair with human-labeled similarity scores.
Another task used for evaluation is the analogy task, where the system has to solve problems of the form a is to b as c is to d, given a, b, and c and having to find d. Thus given Athens is to Greece as Oslo is to ___, the system must fill in the word Norway. Or, for a more syntactically-oriented example: given mouse, mice, and dollar the system must return dollars. Large sets of such tuples have been created (Mikolov et al. 2013, Mikolov et al. 2013b).