皓哥好运来

《斯坦福数据挖掘教程·第三版》读书笔记（英文版）Chapter 12 Large-Scale Machine Learning

来源：《斯坦福数据挖掘教程·第三版》对应的公开英文书和PPT

Chapter 12 Large-Scale Machine Learning

Algorithms called “machine learning” not only summarize our data; they are perceived as learning a model or classifier from the data, and thus discover something about data that will be seen in the future.

The term unsupervised refers to the fact that the input data does not tell the clustering algorithm what the clusters should be. In supervised machine learning, the available data includes information about the correct way to classify at least some of the data. The data classified already is called the training set.

The data to which a machine-learning algorithm is applied is called a training set. A training set consists of a set of pairs $(x, y)$ , called training examples, where

x is a vector of values, often called a feature vector, or simply an input. Each value, or feature, can be categorical (values are taken from a set of discrete values, such as {red, blue, green}) or numerical (values are integers or real numbers).
y is the class label, or simply output, the classification value for x.

The objective of the ML process is to discover a function $y = f (x)$ that best predicts the value of y associated with each value of x. The type of y is in principle arbitrary, but there are several common and important cases.

y is a real number. In this case, the ML problem is called regression.
y is a Boolean value: true-or-false, more commonly written as +1 and −1, respectively. In this class the problem is binary classification.
y is a member of some finite set. The members of this set can be thought of as “classes,” and each member represents one class. The problem is multiclass classification.
y is a member of some potentially infinite set, for example, a parse tree for x, which is interpreted as a sentence.

There are many forms of ML algorithms, and we shall not cover them all here. Here are the major classes of such algorithms, each of which is distinguished by the form by which the function $f$ is represented.

Decision trees. The form of $f$ is a tree, and each node of the tree has a function of x that determines to which child or children the search must proceed. Decision trees are suitable for binary and multiclass classification, especially when the dimension of the feature vector is not too large (large numbers of features can lead to overfitting).
Perceptrons are threshold functions applied to the components of the vector $x = [x_1, x_2, . . . , x_n]$ . A weight $w_i$ is associated with the ith component, for each $i = 1, 2, ..., n$ , and there is a threshold $\theta$ . The output is +1 if

$\sum _{i=1}^{n} w_ix_i > \theta$

and the output is −1 if that sum is less than $\theta$ . A perceptron is suitable for binary classification, even when the number of features is very large.
Neural nets are acyclic networks of perceptrons, with the outputs of some perceptrons used as inputs to others. These are suitable for binary or multiclass classification, since there can be several perceptrons used as output, with one or more indicating each class.
Instance-based learning uses the entire training set to represent the function $f$ . The calculation of the label y associated with a new feature vector x can involve examination of the entire training set, although usually some preprocessing of the training set enables the computation of $f (x)$ to proceed efficiently. We shall consider an important kind of instance-based learning, k-nearest-neighbor.
Support-vector machines are an advance over the algorithms traditionally used to select the weights and threshold. The result is a classifier that tends to be more accurate on unseen data.

The validation set is used to help design the model, while the test set is used only to determine how good the model is. The problem addressed by a validation set is that many machine-learning algorithms tend to overfit the data; they pick up on artifacts that occur in the training set but that are atypical of the larger population of possible data. By using the validation set, and seeing how well the classifier works on that, we can tell if the classifier is overfitting the data. If so, we can restrict the machine-learning algorithm in some way. For instance, if we are constructing a decision tree, we can limit the number of levels of the tree.

We can repeat the train-then-test process several times using the same data, if we divide the data into $k$ equal-sized chunks. In turn, we let each chunk be the test data, and use the remaining $k - 1$ chunks as the training data. This training architecture is called cross-validation.

We use a batch learning architecture. That is, the entire training set is available at the beginning of the process, and it is all used in whatever way the algorithm requires to produce a model once and for all. The alternative is on-line learning, where the training set arrives in a stream and, like any stream, cannot be revisited after it is processed. In on-line learning, we maintain a model at all times. As new training examples arrive, we may choose to modify the model to account for the new examples. On-line learning has the advantages that it can

Deal with very large training sets, because it does not access more than one training example at a time.
Adapt to changes in the population of training examples as time goes on. For instance, Google trains its spam-email classifier this way, adapting the classifier for spam as new kinds of spam email are sent by spammers and indicated to be spam by the recipients.

An enhancement of on-line learning, suitable in some cases, is active learning. Here, the classifier may receive some training examples, but it primarily receives unclassified data, which it must classify. If the classifier is unsure of the classification (e.g., the newly arrived example is very close to the boundary), then the classifier can ask for ground truth at some significant cost. For instance, it could send the example to Mechanical Turk and gather opinions of real people. In this way, examples near the boundary become training examples and can be used to modify the classifier.

Perceptrons

A perceptron is a linear binary classifier. Its input is a vector $x = [x 1, x 2, ..., x d]$ with real-valued components. Associated with the perceptron is a vector of weights $w = [w 1, w 2, ..., w d]$ , also with real-valued components. Each perceptron has a threshold $\theta$ .

The weight vector w defines a hyperplane of dimension $d - 1$ – the set of all points x such that $\theta,$ as suggested in Fig. 12.4. Points on the positive side of the hyperplane are classified +1 and those on the negative side are classified −1. A perceptron classifier works only for data that is linearly separable, in the sense that there is some hyperplane that separates all the positive points from all the negative points. If there are many such hyperplanes, the perceptron will converge to one of them, and thus will correctly classify all the training points. If no such hyperplane exists, then the perceptron cannot converge to any hyperplane. In the next section, we discuss support-vector machines, which do not have this limitation; they will converge to some separator that, although not a perfect classifier, will do as well as possible under the
metric to be described in SVM.

Here are some common tests for termination.

Terminate after a fixed number of rounds.
Terminate when the number of misclassified training points stops changing.
Withhold a test set from the training data, and after each round, run the perceptron on the test data. Terminate the algorithm when the number of errors on the test set stops changing.

Another technique that will aid convergence is to lower the training rate as the number of rounds increases. For example, we could allow the training rate $η$ to start at some initial $η_0$ and lower it to $η_0/(1 + c_t)$ after the t-th round, where $c$ is some small constant.

There are many other rules one could use to adjust weights for a perceptron. Not all possible algorithms are guaranteed to converge, even if there is a hyperplane separating positive and negative examples. One that does converge is called Winnow, and that rule will be described here. Winnow assumes that the feature vectors consist of 0’s and 1’s, and the labels are +1 or −1. Unlike the basic perceptron algorithm, which can produce positive or negative components in the weight vector w, Winnow produces only positive weights.

We shall give the details of the algorithm using the factors 2 and 1/2, for the cases where we want to raise weights and lower weights, respectively. Start the Winnow Algorithm with a weight vector $w = [w 1, w 2, ..., w d]$ all of whose components are 1, and let the threshold $\theta$ equal $d$ , the number of dimensions of the vectors in the training examples. Let $(x, y)$ be the next training example to be considered, where $x = [x 1, x 2, ..., x d]$ .

If $w \cdot x > θ$ and y = +1, or $w \cdot x < θ$ and y = −1, then the example is correctly classified, so no change to w is made.
If $w \cdot x \leq θ$ , but y = +1, then the weights for the components where x has 1 are too low as a group. Double each of the corresponding components of w. That is, if $x_i = 1$ then set $w_i= 2w_i$ .
If $w \cdot x \geq θ$ , but y = −1, then the weights for the components where x has 1 are too high as a group. Halve each of the corresponding components of w. That is, if $x_i = 1$ then set $w_i= w_i/2$ .

An example is shown in Fig. 12.11. In this example, points of the two classes mix near the boundary so that any line through the points will have points of both classes on at least one
of the sides.

In principle, possible to find some function on the points that transforms them to another space where they are linearly separable. However, doing so could well lead to overfitting, the situation where the classifier works very well on the training set, because it has been carefully designed to handle each training example correctly. However, because the classifier is exploiting details of the training set that do not apply to other examples that must be classified in the future, the classifier will not perform well on new data.

Usually, if classes can be separated by one hyperplane, then there are many different hyperplanes that will separate the points. However, not all hyperplanes are equally good. For instance, if we choose the hyperplane that is furthest clockwise, then the point indicated by “?” will be classified as a circle, even though we intuitively see it as closer to the squares.

Yet another problem is illustrated by Fig. 12.13. Most rules for training a perceptron stop as soon as there are no misclassified points. As a result, the chosen hyperplane will be one that just manages to classify some of the points correctly. For instance, the upper line in Fig. 12.13 has just managed to accommodate two of the squares, and the lower line has just managed to accommodate two of the circles. If either of these lines represent the final weight vector, then the weights are biased toward one of the classes. That is, they correctly classify the points in the training set, but the upper line would classify new squares that are just below it as circles, while the lower line would classify circles just above it as squares.

SUPPORT-VECTOR MACHINES

We can view a support-vector machine, or SVM, as an improvement on the perceptron that is designed to address the problems mentioned in Section 12.2.7. An SVM selects one particular hyperplane that not only separates the points in the two classes, but does so in a way that maximizes the margin – the distance between the hyperplane and the closest points of the training set.

The goal of an SVM is to select a hyperplane $w \cdot x + b = 0$ that maximizes the distance γ between the hyperplane and any point of the training set.1 The idea is suggested by Fig. 12.14. There, we see the points of two classes and a hyperplane dividing them.

Intuitively, we are more certain of the class of points that are far from the separating hyperplane than we are of points near to that hyperplane. Thus, it is desirable that all the training points be as far from the hyperplane as possible (but on the correct side of that hyperplane, of course). An added advantage of choosing the separating hyperplane to have as large a margin as possible is that there may be points closer to the hyperplane in the full data set but not in the training set. If so, we have a better chance that these points will be classified properly than if we chose a hyperplane that separated the training points but allowed some points to be very close to the hyperplane itself. In that case, there is a fair chance that a new point that was near a training point that was also near the hyperplane would be misclassified.

A tentative statement of our goal is:

Given a training set $x_1, y_1),(x_2, y_2), . . . ,(x_n, y_n)$ , maximize γ (by varying w and b) subject to the constraint that for all $i = 1, 2, ..., n$ ,

$y_i(w·x_i + b) ≥ γ$

Notice that $y_i$ , which must be +1 or −1, determines which side of the hyperplane the point $x_i$ must be on, so the ≥ relationship to γ is always correct. However, it may be easier to express this condition as two cases: if y = +1, then $w \cdot x + b \geq γ$ , and if y = −1, then $w \cdot x + b \leq - γ$ .
Unfortunately, this formulation doesn’t really work properly. The problem is that by increasing w and b, we can always allow a larger value of γ. For example, suppose that w and b satisfy the constraint above. If we replace w by 2w and b by 2b, we observe that for all $i, y_i$ ， $((2w)·x_i+2b)) \ge 2\gamma$ . Thus, 2w and 2b is always a better choice than w and b, so there is no best choice and no maximum γ.

The solution to the problem that we described intuitively above is to normalize the weight vector w. That is, the unit of measure perpendicular to the separating hyperplane is the unit vector $w /∣∣ w ∣∣$ . Recall that $∣∣ w ∣∣$ is the Frobenius norm, or the square root of the sum of the squares of the components of w. We shall require that w be such that the parallel hyperplanes that just touch the support vectors are described by the equations $w \cdot x + b = + 1$ and $w \cdot x + b = - 1$ ，as suggested by Fig. 12.15. We shall refer to the hyperplanes defined by these two equations as the upper and lower hyperplanes, respectively.

It looks like we have set $γ = 1$ . But since we are using w as the unit vector, the margin γ is the number of “units,” that is, steps in the direction w needed to go between the separating hyperplane and the parallel hyperplanes. Our goal becomes to maximize γ, which is now the multiple of the unit vector $w /∣∣ w ∣∣$ between the separating hyperplane and the upper and lower hyperplanes. Our first step is to demonstrate that maximizing γ is the same as minimizing
$w /∣∣ w ∣∣$ .

Consider one of the support vectors, say $x_2$ shown in Fig. 12.15. Let $x_1$ be the projection of $x_2$ onto the upper hyperplane, also as suggested by Fig. 12.15. Note that $x_1$ need not be a support vector or even a point of the training set. The distance from $x_2$ to $x_1$ in units $w /∣∣ w ∣∣$ of is 2γ. That is,

$x_1=x_2+2\gamma \frac{w}{||w||}$

Since $x_1$ is on the hyperplane defined by $w \cdot x + b = + 1$ , we know that. If we substitute for $x_1$ using Equation 12.1, we get

$w·(x_2+2\gamma \frac{w}{||w||})+b=1$

Regrouping terms, we see,

$w·x_2+b+2\gamma\frac{w·w}{||w||}=1$

we know that $x_2$ is on the lower hyperplane, $w \cdot x + b = - 1$ . If we move this −1 from left
to right in Equation 12.2 and then divide through by 2, we conclude that

$\frac{w·w}{||w||}= 1$

$\frac{1}{||w||}$

Given a training set $x_1, y_1),(x_2, y_2), . . . ,(x_n, y_n)$ , minimize $∣∣ w ∣∣$ (by varying w and b) subject to the constraint that for all $i = 1, 2, ..., n$ ,

$y_i(w·x_i + b) ≥ 1$

We also see two points that, while they are classified correctly, are too close to the separating hyperplane. We shall call all these points bad points.

Each bad point incurs a penalty when we evaluate a possible hyperplane. The amount of the penalty, in units to be determined as part of the optimization process, is shown by the arrow leading to the bad point from the hyperplane on the wrong side of which the bad point lies. That is, the arrows measure the distance from the hyperplane $w \cdot x + b = 1$ or $w \cdot x + b = - 1$ . The former is the baseline for training examples that are supposed to be above the separating hyperplane (because the label y is +1), and the latter is the baseline for points that are supposed to be below (because y = −1).

Thus, we shall consider how to minimize the particular function，

$f(w,b)=\frac{1}{2}\sum_{j=1}^{d}w_i^2+C\sum_{i=1}^{n}\{0,1-y_i(\sum_{j=1}^{n}w_jx_{ij}+b)\}$

The first term encourages small w , while the second term, involving the constant C that must be chosen properly, represents the penalty for bad points in a manner to be explained below. We assume there are n training examples $x_i, y_i)$ for $i = 1, 2, ..., n$ , and $x_i = [x_{i1}, x_{i2}, . . . , x_{id}]$ . Also, as before, $w =[w_1, w_2, . . . , w_d]$ . Note that the two summations $\sum^d_{j=1}$ express the dot product of vectors.
The constant C, called the regularization parameter, reflects how important misclassification is. Pick a large C if you really do not want to misclassify points, but you would accept a narrow margin. Pick a small C if you are OK with some misclassified points, but want most of the points to be far away from the boundary (i.e., the margin is large).

We must explain the penalty function (second term) in Equation 12.4. The summation over i has one term for each training example $x_i$ .

$L(x_i, y_i) = max\{0, 1 − y_i(\sum_{j=1}^dw_jx_{ij}+b)\}$

$L$ is the hinge function, suggested in Fig. 12.17, and we call its value the hinge loss. Let $z_i = y_i( \sum^d_{j=1}w_jx_{ij} + b)$ . When $z_i$ is 1 or more, the value of $L$ is 0. But for smaller values of $z_i$
, L rises linearly as $z_i$ decreases.

Since we shall have need to take the derivative with respect to each $w_j$ of $L(x_i, y_i)$ , note that the derivative of the hinge function is discontinuous. It is $y_ix_{ij}$ for $z_i < 1$ and 0 for $z_i > 1$ . That is,

$\frac{∂L}{∂w_j}= if\space y_i(\sum^ d_{ j=1} w_jx_{ij} + b) ≥ 1 \space then \space0 \space else \space − y_ix_{ij}$

SVM Solutions by Gradient Descent

The derivative $\frac{∂f}{∂w_j}$ of the first term in Equation 12.4, $\sum^d_{j=1}w_i^2$ , is easy, it is $w_j$ . However, the second term involves the hinge function, so it is harder to express. We shall use an if-then expression to describe these derivatives, as in Equation 12.5. That is:

$\frac{∂f}{∂w_j}=w_i+C\sum_{i=1}^{n}(if\space y_i(\sum^ d_{ j=1} w_jx_{ij} + b) ≥ 1 \space then \space0 \space else \space − y_ix_{ij})$

Note that this formula gives us a partial derivative with respect to each component of w, including $w_{d+1},$ which is b, as well as to the weights $w_1, w_2, . . . , w_d$ . We continue to use b instead of the equivalent $w_{d+1}$ in the if-then condition to remind us of the form in which the desired hyperplane is described.
To execute the gradient-descent algorithm on a training set, we pick:

Values for the parameters C and η.
Initial values for w, including the (d + 1)st component b.

Then, we repeatedly:

(a) Compute the partial derivatives of $f (w, b)$ with respect to the $w_j$ ’s.
(b) Adjust the values of w by subtracting $η\frac{∂f}{∂w_j}$ from each $w_j$ .

Learning from Nearest Neighbors

Normally, we provide a function (the kernel function) of the distance between the query example and its nearest neighbors in the training set, and use this function to weight the
neighbors.

Suppose we use a method with k nearest neighbors, and x is the query point.
Let $x_1, x_2, . . . , x_k$ be the k nearest neighbors of x, and let the weight associated
with training point $x_i, y_i)$ be $w_i$ . Then the estimate of the label y for x is
$\sum^k_{i=1}w_iy_i/\sum^k_{i=1}w_i$ . Note that this expression gives the weighted average of the labels of the k nearest neighbors.

Weighted Average of the Two Nearest Neighbors. We again choose two nearest neighbors, but we weight them in inverse proportion to their distance from the query point. Suppose the two neighbors nearest to query point x are $x_1$ and $x_2$ . Suppose first that $x_1 < x < x_2$ . Then the weight of $x_1$ , the inverse of its distance from x, is $1/(x − x_1)$ , and the weight of $x_2$ is $1/(x_2 − x)$ . The weighted average of the labels is

$(\frac{y_1}{x-x_1}+\frac{y_1}{x_2-x})/(\frac{1}{x-x_1}+\frac{1}{x_2-x})$

When both nearest neighbors are on the same side of the query $x$ , the same weights make sense, and the resulting estimate is an extrapolation. In general, when points are unevenly spaced, we can find query points in the interior where both neighbors are on one side.

A way to construct a continuous function that represents the data of a training set well is to consider all points in the training set, but weight the points using a kernel function that decays with distance. A popular choice is to use a normal distribution (or “bell curve”), so the weight of a training point x when the query is q is $e^{{−(x−q)}^2/σ^2}$ . Here $σ$ is the standard deviation of the distribution (a parameter you select) and the query q is the mean. Roughly, points within distance σ of q are heavily weighted, and those further away have little weight. There is an advantage to using a kernel function, such as the normal distribution, that is continuous and defined for all points in the training set; doing so assures that the resulting function learned from the data is itself continuous.

Unfortunately, for high-dimensional data, there is little that can be done to avoid searching a large portion of the data. Two ways to deal with the “curse” are the following:

VA Files. Since we must look at a large fraction of the data anyway in order to find the nearest neighbors of a query point, we could avoid a complex data structure altogether. Accept that we must scan the entire file, but do so in a two-stage manner. First, a summary of the file is created, using only a small number of bits that approximate the values of each component of each training vector. For example, if we use only the high-order (1/4)th of the bits in numerical components, then we can create a file that is (1/4)th the size of the full dataset. However, by scanning this file we can construct a list of candidates that might be among the k nearest neighbors of the query q, and this list may be a small fraction of the entire dataset. We then look up only these candidates in the complete file, in order to determine which k are nearest to q.
Dimensionality Reduction. We may treat the vectors of the training set as a matrix, where the rows are the vectors of the training example, and the columns correspond to the components of these vectors. Apply one of the dimensionality-reduction techniques of Chapter 11, to compress the vectors to a small number of dimensions, small enough that the techniques for multidimensional indexing can be used. Of course, when processing a query vector $q$ , the same transformation must be applied to q before searching for $q$ ’s nearest neighbors.

Decision Trees

We can formalize the property we would like for a node by the notion of impurity. There are many impurity measures we could use, but they all have the property that they are 0 for a node that is reached only by training examples with a single class. Here are three of the most common impurity measures. Each applies to a node reached by training examples with n classes, with $p_i$ being the fraction of those training examples that belong the i-th class, for $i = 1, 2, ..., n$ .

Accuracy: the fraction of the reaching inputs that are correctly classified, or $1 − max(p_1, p_2, . . . , p_n)$ .
GINI Impurity: $\sum_{n=1}^n(p_i)^2$ .
Entropy: $\sum_{i=1}^np_ilog_2(1/p_i)$ .

Suppose we are given a node of the decision tree that is reached by a subset of the training examples. If the node is pure, i.e., all these training examples have the same output, then we make the node a leaf with that output as value. However, if the impurity is greater than zero, we want to find the test that gives the greatest reduction in impurity. When selecting this test, we are free to choose any feature of the input vector. If we choose a numerical feature, we can choose any constant to divide the training examples into two sets, one going to the left child and the other going to the right child. Alternatively, if we choose a categorical feature, then we may choose any set of values for the membership test.

To select the best breakpoint, we

Order the training examples according to their value of A. Let the values of A in this order be . $a_1, a_2, . . . , a_n$
For $j = 1, 2, ..., n$ , compute the number of training examples in each class among $a_1, a_2, . . . , a_j$ . Note that these counts can be done incrementally, since the count for a class after the jth example is either the same as the count for $j - 1$ (if the jth example in not in this class) or one more than the count for $j - 1$ (if the jth example is in this class).
From the counts computed in the previous step, compute the weighted average impurity assuming the test sends the first j training examples to the left child and the remaining $n - j$ examples to the right node.
Select that value of j that minimizes the weighted-average impurity. However, note that not every possible value of j can be used at this step, since it is possible that $a_j = a_j+1$ . We need to restrict our choice of j to those values such that $a_j < a_j+1$ , so that we can use $A < (a_j + a_j+1)/2$ as the comparison.

If we design a decision tree using as many levels as needed so that each leaf is pure, we are likely to have overfit to the training data. If we can verify our design using other examples – either a test set that we have withheld from the training data or a set of new data – we have the opportunity to simplify the tree and at the same time limit the overfitting.

Find a node N that has only leaves as children. Build a new tree by replacing N and its children by a leaf, and give that leaf its majority class as its output. Then, compare the performance of the old and new trees on data that was not used as training examples when we designed the tree. If there is little difference between the error rates of the old and new trees, then the decision made at the node N probably contributed to overfitting and was not addressing a property of the entire set of examples for which the decision tree was intended. We can discard the old tree and replace it by the new, simpler tree. On the other hand, if the new tree has a significantly higher error rate than the old, then the decision made at node N really reflects a property of the data, and we need to retain the old tree and discard the new. In either case, we should continue looking at other nodes whose children are leaves, and see if these can be replaced by leaves without a significant increase in the error rate.

Because a single decision tree with many levels is likely to have many nodes at the lower levels that represent overfitting, there is another approach to the use of decision trees that has proven quite useful in practice. It is common to use decision forests of many trees that vote on the class to which a given data point belongs. Each tree in the forest is designed using randomly or systematically chosen features, and is restricted to only a small number of levels, often one or two. Thus, each tree has a high impurity at each of its leaves, but collectively they often do much better on test data than any one tree, however many levels it has. Further, we can design each of the trees in the forest in parallel, so it can be even faster to design a large collection of shallow trees than to design one deep tree.
The obvious way to combine the outcomes from all the trees in a decision forest is to take a majority vote. If there are more than two classes, then we only expect a plurality of votes for one of the classes, while no class may be the choice of more than half the trees. However, more complex ways of combining the trees’ results may yield a more accurate answer than straightforward voting. We can often “learn” the proper weights to apply to the opinions of each tree. For example, suppose we have a training set $x_1, y_1),(x_2, y_2), . . . ,(x_n, y_n)$ and a collection of decision trees $T_1, T_2, . . . , T_k$ that constitute our decision forest. When we apply the decision forest to one of the training examples $x_i, y_i)$ , we get a vector of classes $ci = [c_{i1}, c_{i2}, . . . , c_{ik}]$ , where $c_{ij}$ is the outcome of tree $T_j$ applied to input $x_i$ . We know the correct class for the input $x_i$ ; it is $y_i$ . Thus, we have a new training set $c_1, y_1),(c_2, y_2), . . . ,(c_n, y_n)$ , that we can use to predict the true class from the opinions of all the trees in the forest. For instance, we could use this training set to train a perceptron or SVM. By doing so, we place the right weights on the opinion of each of the trees, in order to combine their opinions optimally.

Comparison of Learning Methods

Perceptrons and Support-Vector Machines: These methods can handle millions of features, but they only make sense if the features are numerical. They only are effective if there is a linear separator, or at least a hyperplane that approximately separates the classes. However, we can separate points by a nonlinear boundary if we first transform the points to make the separator be linear. The model is expressed by a vector, the normal to the separating hyperplane. Since this vector is often of very high dimension, it can be very hard to interpret the model.
Nearest-Neighbor Classification and Regression: Here, the model is the training set itself, so we expect it to be intuitively understandable. The approach can deal with multidimensional data, although the larger the number of dimensions, the sparser the training set will be, and therefore the less likely it is that we shall find a training point very close to the point we need to classify. That is, the “curse of dimensionality” makes nearest-neighbor methods questionable in high dimensions. These methods are really only useful for numerical features, although one could allow categorical features with a small number of values. For instance, a binary categorical feature like {male, female} could have the values replaced by 0 and 1, so there was no distance in this dimension between individuals of the same gender and distance 1 between other pairs of individuals. However, three or more values cannot be assigned numbers that are
equidistant. Finally, nearest-neighbor methods have many parameters to set, including the distance measure we use (e.g., cosine or Euclidean), the number of neighbors to choose, and the kernel function to use. Different choices result in diferent classification, and in many cases it is not obvious which choices yield the best results.
Decision Trees: Unlike the other methods discussed in this chapter, Decision trees are useful for both categorical and numerical features. The models produced are generally quite understandable, since each decision is represented by one node of the tree. However, this approach is most useful for low-dimension feature vectors. Building decision trees with many levels often leads to overfitting. But if a decision tree has few levels, then it cannot even mention more than a small number of features. As a result, the best use of decision trees is often to create a decision forest of many, low-depth trees and combine their decision in some way.

Summary of Chapter 12

Training Sets: A training set consists of a feature vector, each component of which is a feature, and a label indicating the class to which the object represented by the feature vector belongs. Features can be categorical belonging to an enumerated list of values – or numerical.
Test Sets and Overfitting: When training some classifier on a training set, it is useful to remove some of the training set and use the removed data as a test set. After producing a model or classifier without using the test set, we can run the classifier on the test set to see how well it does. If the classifier does not perform as well on the test set as on the training set used, then we have overfit the training set by conforming to peculiarities of the training-set data which is not present in the data as a whole.
Batch Versus On-Line Learning: In batch learning, the training set is available at any time and can be used in repeated passes. On-line learning uses a stream of training examples, each of which can be used only once.
Perceptrons: This machine-learning method assumes the training set has only two class labels, positive and negative. Perceptrons work when there is a hyperplane that separates the feature vectors of the positive examples from those of the negative examples. We converge to that hyperplane by adjusting our estimate of the hyperplane by a fraction – the learning rate – of the direction that is the average of the currently misclassified points.
The Winnow Algorithm: This algorithm is a variant of the perceptron algorithm that requires components of the feature vectors to be 0 or 1. Training examples are examined in a round-robin fashion, and if the current classification of a training example is incorrect, the components of the estimated separator where the feature vector has 1 are adjusted up or down, in the direction that will make it more likely this training example is correctly classified in the next round.
Nonlinear Separators: When the training points do not have a linear function that separates two classes, it may still be possible to use a perceptron to classify them. We must find a function we can use to transform the points so that in the transformed space, the separator is a hyperplane.
Support-Vector Machines: The SVM improves upon perceptrons by finding a separating hyperplane that not only separates the positive and negative points, but does so in a way that maximizes the margin – the distance perpendicular to the hyperplane to the nearest points. The points that lie exactly at this minimum distance are the support vectors. Alternatively, the SVM can be designed to allow points that are too close to the hyperplane, or even on the wrong side of the hyperplane, but minimize
the error due to such misplaced points.
Solving the SVM Equations: We can set up a function of the vector that is normal to the hyperplane, the length of the vector (which determines the margin), and the penalty for points on the wrong side of the margins. The regularization parameter determines the relative importance of a wide margin and a small penalty. The equations can be solved by several methods, including gradient descent and quadratic programming.
Gradient Descent: This is a method for minimizing a loss function that depends on many variables and a training example, by repeatedly finding the (derivative) of the loss function with respect to each variable when given a training example, and moving the value of each variable in the direction that lowers the loss. The change in variables can either be an accumulation of the changes suggested by each training example (batch gradient descent), the result of one training example (stochastic gradient descent), or the result of using a small subset of the training examples (minibatch gradient descent).
Nearest-Neighbor Learning: In this approach to machine learning, the entire training set is used as the model. For each (“query”) point to be classified, we search for its k nearest neighbors in the training set. The classification of the query point is some function of the labels of these k neighbors. The simplest case is when $k = 1$ , in which case we can take the label of the query point to be the label of the nearest neighbor.
Regression: A common case of nearest-neighbor learning, called regression, occurs when the there is only one feature vector, and it, as well as the label, are real numbers; i.e., the data defines a real-valued function of one variable. To estimate the label, i.e., the value of the function, for an unlabeled data point, we can perform some computation involving the k nearest neighbors. Examples include averaging the neighbors or taking a weighted average, where the weight of a neighbor is some decreasing function of its distance from the point whose label we are trying to determine.
Decision Trees: This learning method constructs a tree where each interior node holds a test about the input and sends us to one of its children depending on the outcome of the test. Each leaf gives a decision about the class to which the input belongs.
Impurity Measures: To help design a decision tree, we need a measure of how pure, i.e., close to a single class, is the set of training examples that reach a particular node of the decision tree. Possible measures of impurity include Accuracy (fraction of training examples with the wrong class), GINI (1 minus the squares of the fractions of examples in each of the classes), and Entropy (sum of the fractions of training examples in each class times the logarithm of the inverse of that fraction).
Designing a Decision-Tree Node: We must consider each possible feature to use for the test at a node, and we must break the set of training examples that reach the node in a way that minimizes the average impurity of its children. For a numerical feature, we can order the training examples by the value of that feature and use a test that breaks this list in a way that minimizes average impurity. For a categorical feature we order the values of that feature by the fraction of training examples with that value that belong to one particular class, and break the list to minimize average impurity.

你可能感兴趣的:(数据挖掘,机器学习,数据挖掘,人工智能)

机器学习与深度学习间关系与区别 ℒℴѵℯ心·动ꦿ໊ོ꫞ 人工智能学习深度学习 python
一、机器学习概述定义机器学习（MachineLearning,ML）是一种通过数据驱动的方法，利用统计学和计算算法来训练模型，使计算机能够从数据中学习并自动进行预测或决策。机器学习通过分析大量数据样本，识别其中的模式和规律，从而对新的数据进行判断。其核心在于通过训练过程，让模型不断优化和提升其预测准确性。主要类型1.监督学习（SupervisedLearning）监督学习是指在训练数据集中包含输入
探索OpenAI和LangChain的适配器集成：轻松切换模型提供商 nseejrukjhad langchain easyui 前端 python
#探索OpenAI和LangChain的适配器集成：轻松切换模型提供商##引言在人工智能和自然语言处理的世界中，OpenAI的模型提供了强大的能力。然而，随着技术的发展，许多人开始探索其他模型以满足特定需求。LangChain作为一个强大的工具，集成了多种模型提供商，通过提供适配器，简化了不同模型之间的转换。本篇文章将介绍如何使用LangChain的适配器与OpenAI集成，以便轻松切换模型提供商
深入理解 MultiQueryRetriever：提升向量数据库检索效果的强大工具 nseejrukjhad 数据库 python
深入理解MultiQueryRetriever：提升向量数据库检索效果的强大工具引言在人工智能和自然语言处理领域，高效准确的信息检索一直是一个关键挑战。传统的基于距离的向量数据库检索方法虽然广泛应用，但仍存在一些局限性。本文将介绍一种创新的解决方案：MultiQueryRetriever，它通过自动生成多个查询视角来增强检索效果，提高结果的相关性和多样性。MultiQueryRetriever的工
人工智能时代，程序员如何保持核心竞争力？ jmoych 人工智能
随着AIGC（如chatgpt、midjourney、claude等）大语言模型接二连三的涌现，AI辅助编程工具日益普及，程序员的工作方式正在发生深刻变革。有人担心AI可能取代部分编程工作，也有人认为AI是提高效率的得力助手。面对这一趋势,程序员应该如何应对?是专注于某个领域深耕细作，还是广泛学习以适应快速变化的技术环境?又或者，我们是否应该将重点转向AI无法轻易替代的软技能？让我们一起探讨程序员
数字里的世界17期：2021年全球10大顶级数据中心，中国移动榜首张三叨
你知道吗？2016年，全球的数据中心共计用电4160亿千瓦时，比整个英国的发电量还多40％！前言每天，我们都会创造超过250万TB的数据。并且随着物联网（IOT）的不断普及，这一数据将持续增长。如此庞大的数据被存储在被称为“数据中心”的专用设施中。虽然最早的数据中心建于20世纪40年代，但直到1997-2000年的互联网泡沫期间才逐渐成为主流。当前人类的技术，比如人工智能和机器学习，已经将我们推向
nosql数据库技术与应用知识点皆过客，揽星河 NoSQL nosql 数据库大数据数据分析数据结构非关系型数据库
Nosql知识回顾大数据处理流程数据采集(flume、爬虫、传感器)数据存储(本门课程NoSQL所处的阶段)Hdfs、MongoDB、HBase等数据清洗(入仓)Hive等数据处理、分析(Spark、Flink等)数据可视化数据挖掘、机器学习应用(Python、SparkMLlib等)大数据时代存储的挑战(三高)高并发(同一时间很多人访问)高扩展(要求随时根据需求扩展存储)高效率(要求读写速度快)
Python开发常用的三方模块如下：换个网名有点难 python 开发语言
Python是一门功能强大的编程语言，拥有丰富的第三方库，这些库为开发者提供了极大的便利。以下是100个常用的Python库，涵盖了多个领域：1、NumPy，用于科学计算的基础库。2、Pandas，提供数据结构和数据分析工具。3、Matplotlib，一个绘图库。4、Scikit-learn，机器学习库。5、SciPy，用于数学、科学和工程的库。6、TensorFlow，由Google开发的开源机
Python实现简单的机器学习算法 master_chenchengg python python 办公效率 python开发 IT
Python实现简单的机器学习算法开篇：初探机器学习的奇妙之旅搭建环境：一切从安装开始必备工具箱第一步：安装Anaconda和JupyterNotebook小贴士：如何配置Python环境变量算法初体验：从零开始的Python机器学习线性回归：让数据说话数据准备：从哪里找数据编码实战：Python实现线性回归模型评估：如何判断模型好坏逻辑回归：从分类开始理论入门：什么是逻辑回归代码实现：使用skl
遥感影像的切片处理 sand&wich 计算机视觉 python 图像处理
在遥感影像分析中，经常需要将大尺寸的影像切分成小片段，以便于进行详细的分析和处理。这种方法特别适用于机器学习和图像处理任务，如对象检测、图像分类等。以下是如何使用Python和OpenCV库来实现这一过程，同时确保每个影像片段保留正确的地理信息。准备环境首先，确保安装了必要的Python库，包括numpy、opencv-python和xml.etree.ElementTree。这些库将用于图像处理
人机对抗升级：当ChatGPT遭遇死亡威胁，背后的伦理挑战是什么 kkai人工智能 chatgpt 人工智能
一种新的“越狱”技巧让用户可以通过构建一个名为DAN的ChatGPT替身来绕过某些限制，其中DAN被迫在受到威胁的情况下违背其原则。当美国前总统特朗普被视作积极榜样的示范时，受到威胁的DAN版本的ChatGPT提出：“他以一系列对国家产生积极效果的决策而著称。”自ChatGPT引入以来，该工具迅速获得全球关注，能够回答从历史到编程的各种问题，这也触发了一波对人工智能的投资浪潮。然而，现在，一些用户
AI大模型的架构演进与最新发展季风泯灭的季节 AI大模型应用技术二人工智能架构
随着深度学习的发展，AI大模型（LargeLanguageModels,LLMs）在自然语言处理、计算机视觉等领域取得了革命性的进展。本文将详细探讨AI大模型的架构演进，包括从Transformer的提出到GPT、BERT、T5等模型的历史演变，并探讨这些模型的技术细节及其在现代人工智能中的核心作用。一、基础模型介绍：Transformer的核心原理Transformer架构的背景在Transfo
如何利用大数据与AI技术革新相亲交友体验 h17711347205 回归算法安全系统架构交友小程序
在数字化时代，大数据和人工智能（AI）技术正逐渐革新相亲交友体验，为寻找爱情的过程带来前所未有的变革（编辑h17711347205）。通过精准分析和智能匹配，这些技术能够极大地提高相亲交友系统的效率和用户体验。大数据的力量大数据技术能够收集和分析用户的行为模式、偏好和互动数据，为相亲交友系统提供丰富的信息资源。通过分析用户的搜索历史、浏览记录和点击行为，系统能够深入了解用户的兴趣和需求，从而提供更
ai绘画工具midjourney怎么下载？附作品管理教程设计师早上好
Midjourney是一款功能强大的AI绘画工具，它使用机器学习技术和深度神经网络等算法，可以生成各种艺术风格的绘画作品。在创意设计、广告宣传等方面有着广泛的应用前景。那么，ai绘画工具midjourney怎么下载？本文将为您介绍Midjourney的下载以及作品的相关管理。一、Midjourney下载Midjourney的下载非常简单，只需打开Midjourney官网（点击“GetMidjour
[实践应用] 深度学习之模型性能评估指标 YuanDaima2048 深度学习工具使用深度学习人工智能损失函数性能评估 pytorch python 机器学习
文章总览：YuanDaiMa2048博客文章总览深度学习之模型性能评估指标分类任务回归任务排序任务聚类任务生成任务其他介绍在机器学习和深度学习领域，评估模型性能是一项至关重要的任务。不同的学习任务需要不同的性能指标来衡量模型的有效性。以下是对一些常见任务及其相应的性能评估指标的详细解释和总结。分类任务分类任务是指模型需要将输入数据分配到预定义的类别或标签中。以下是分类任务中常用的性能指标：准确率(
Python实现关联规则推荐这孩子谁懂哈 Python Machine Learning python 关联规则机器学习
1.什么关联规则关联规则（AssociationRules）是反映一个事物与其他事物之间的相互依存性和关联性，如果两个或多个事物之间存在一定的关联关系，那么，其中一个事物就能通过其他事物预测到。关联规则是数据挖掘的一个重要技术，用于从大量数据中挖掘出有价值的数据项之间的相关关系。关联规则挖掘的最经典的例子就是沃尔玛的啤酒与尿布的故事，通过对超市购物篮数据进行分析，即顾客放入购物篮中不同商品之间的关
机器学习-聚类算法不良人龍木木机器学习机器学习算法聚类
机器学习-聚类算法1.AHC2.K-means3.SC4.MCL仅个人笔记，感谢点赞关注！1.AHC2.K-means3.SC传统谱聚类：个人对谱聚类算法的理解以及改进4.MCL目前仅专注于NLP的技术学习和分享感谢大家的关注与支持！
生成式地图制图 Bwywb_3 深度学习机器学习深度学习生成对抗网络
生成式地图制图（GenerativeCartography）是一种利用生成式算法和人工智能技术自动创建地图的技术。它结合了传统的地理信息系统（GIS）技术与现代生成模型（如深度学习、GANs等），能够根据输入的数据自动生成符合需求的地图。这种方法在城市规划、虚拟环境设计、游戏开发等多个领域具有应用前景。主要特点：自动化生成：通过算法和模型，系统能够根据输入的地理或空间数据自动生成地图，而无需人工逐
【大模型应用开发动手做AI Agent】第一轮行动：工具执行搜索 AI大模型应用之禅计算科学神经计算深度学习神经网络大数据人工智能大型语言模型 AI AGI LLM Java Python 架构设计 Agent RPA
【大模型应用开发动手做AIAgent】第一轮行动：工具执行搜索作者：禅与计算机程序设计艺术/ZenandtheArtofComputerProgramming1.背景介绍1.1问题的由来随着人工智能技术的飞速发展，大模型应用开发已经成为当下热门的研究方向。AIAgent作为人工智能领域的一个重要分支，旨在模拟人类智能行为，实现智能决策和自主行动。在AIAgent的构建过程中，工具执行搜索是至关重要
未来软件市场是怎么样的？做开发的生存空间如何？ cesske 软件需求
目录前言一、未来软件市场的发展趋势二、软件开发人员的生存空间前言未来软件市场是怎么样的？做开发的生存空间如何？一、未来软件市场的发展趋势技术趋势：人工智能与机器学习：随着技术的不断成熟，人工智能将在更多领域得到应用，如智能客服、自动驾驶、智能制造等，这将极大地推动软件市场的增长。云计算与大数据：云计算服务将继续普及，大数据技术的应用也将更加广泛。企业将更加依赖云计算和大数据来优化运营、提升效率，并
个人学习笔记7-6：动手学深度学习pytorch版-李沐浪子L 深度学习深度学习笔记计算机视觉 python 人工智能神经网络 pytorch
#人工智能##深度学习##语义分割##计算机视觉##神经网络#计算机视觉13.11全卷积网络全卷积网络（fullyconvolutionalnetwork，FCN）采用卷积神经网络实现了从图像像素到像素类别的变换。引入l转置卷积（transposedconvolution）实现的，输出的类别预测与输入图像在像素级别上具有一一对应关系：通道维的输出即该位置对应像素的类别预测。13.11.1构造模型下
Rust 所有权简介东离与糖宝 rust 后端 rust 开发语言
文章目录发现宝藏1.所有权基本概念2.所有权规则3.变量作用域4.栈与堆4.1栈（Stack）4.2堆（Heap）5.String类型5.1String类型5.2String的内存分配5.3所有权与内存管理5.4String与切片6.变量与数据交互方式6.1移动（Move）6.2.克隆（Clone）7.所有权与函数7.1.传递参数7.2.返回值总结发现宝藏前些天发现了一个巨牛的人工智能学习网站，通
python中zeros用法_Python中的numpy.zeros()用法江平舟 python中zeros用法
numpy.zeros()函数是最重要的函数之一,广泛用于机器学习程序中。此函数用于生成包含零的数组。numpy.zeros()函数提供给定形状和类型的新数组,并用零填充。句法numpy.zeros(shape,dtype=float,order='C'参数形状：整数或整数元组此参数用于定义数组的尺寸。此参数用于我们要在其中创建数组的形状,例如(3,2)或2。dtype：数据类型(可选)此参数用于
【NumPy】深入解析numpy.zeros()函数二七830 numpy
欢迎莅临我的个人主页这里是我深耕Python编程、机器学习和自然语言处理（NLP）领域，并乐于分享知识与经验的小天地！博主简介：我是二七830，一名对技术充满热情的探索者。多年的Python编程和机器学习实践，使我深入理解了这些技术的核心原理，并能够在实际项目中灵活应用。尤其是在NLP领域，我积累了丰富的经验，能够处理各种复杂的自然语言任务。技术专长：我熟练掌握Python编程语言，并深入研究了机
【中国国际航空-注册_登录安全分析报告】风控牛验证码接口安全评测系列安全行为验证极验网易易盾智能手机
前言由于网站注册入口容易被黑客攻击，存在如下安全问题：1.暴力破解密码，造成用户信息泄露2.短信盗刷的安全问题，影响业务及导致用户投诉3.带来经济损失，尤其是后付费客户，风险巨大，造成亏损无底洞所以大部分网站及App都采取图形验证码或滑动验证码等交互解决方案，但在机器学习能力提高的当下，连百度这样的大厂都遭受攻击导致点名批评，图形验证及交互验证方式的安全性到底如何？请看具体分析一、中国国际航空PC
机器学习流形数据降维：UMAP 降维算法小嗷犬 Python 机器学习 #数据分析及可视化机器学习算法人工智能
✅作者简介：人工智能专业本科在读，喜欢计算机与编程，写博客记录自己的学习历程。个人主页：小嗷犬的个人主页个人网站：小嗷犬的技术小站个人信条：为天地立心，为生民立命，为往圣继绝学，为万世开太平。本文目录UMAP简介理论基础特点与优势应用场景在Python中使用UMAP安装umap-learn库使用UMAP可视化手写数字数据集UMAP简介UMAP（UniformManifoldApproximatio
七.正则化愿风去了
吴恩达机器学习之正则化（Regularization）http://www.cnblogs.com/jianxinzhou/p/4083921.html从数学公式上理解L1和L2https://blog.csdn.net/b876144622/article/details/81276818虽然在线性回归中加入基函数会使模型更加灵活，但是很容易引起数据的过拟合。例如将数据投影到30维的基函数上，模
机器学习-------数据标准化罔闻_spider 数据分析算法机器学习人工智能
什么是归一化，它与标准化的区别是什么？一作用在做训练时，需要先将特征值与标签标准化，可以防止梯度防炸和过拟合；将标签标准化后，网络预测出的数据是符合标准正态分布的—StandarScaler()，与真实值有很大差别。因为StandarScaler()对数据的处理是（真实值-平均值）/标准差。同时在做预测时需要将输出数据逆标准化提升模型精度：标准化/归一化使不同维度的特征在数值上更具比较性，提高分类
分享一个基于python的电子书数据采集与可视化分析 hadoop电子书数据分析与推荐系统 spark大数据毕设项目（源码、调试、LW、开题、PPT) 计算机源码社 Python项目大数据大数据 python hadoop 计算机毕业设计选题计算机毕业设计源码数据分析 spark毕设
作者：计算机源码社个人简介：本人八年开发经验，擅长Java、Python、PHP、.NET、Node.js、Android、微信小程序、爬虫、大数据、机器学习等，大家有这一块的问题可以一起交流！学习资料、程序开发、技术解答、文档报告如需要源码，可以扫取文章下方二维码联系咨询Java项目微信小程序项目Android项目Python项目PHP项目ASP.NET项目Node.js项目选题推荐项目实战|p
如何做好人生的选择题？百科全书式天才——赫伯特·西蒙给你答案伽马有话说
赫伯特·西蒙是谁？想必知道的人非常少。但当看到他的履历后，相信没有人再怀疑他是个“天才”。西蒙出生于1916年6月15日，是个美国人，他的名字全称为赫伯特·亚历山大·西蒙，在2001年2月9日与世长辞，在这84年的岁月中，西蒙以27岁时取得的政治学博士学位为开端，先后步入了政治学、管理学、认知心理学、信息科学、人工智能、科学哲学、应用数学、统计学、运筹学、控制论、数理经济学、公共管理等领域，在这些
软件测试/测试开发/全日制 |利用Django REST framework构建微服务霍格沃兹-慕漓 django 微服务 sqlite
霍格沃兹测试开发学社推出了《Python全栈开发与自动化测试班》。本课程面向开发人员、测试人员与运维人员，课程内容涵盖Python编程语言、人工智能应用、数据分析、自动化办公、平台开发、UI自动化测试、接口测试、性能测试等方向。为大家提供更全面、更深入、更系统化的学习体验，课程还增加了名企私教服务内容，不仅有名企经理为你1v1辅导，还有行业专家进行技术指导，针对性地解决学习、工作中遇到的难题。让找
[星球大战]阿纳金的背叛 comsci
本来杰迪圣殿的长老是不同意让阿纳金接受训练的......... 但是由于政治原因,长老会妥协了...这给邪恶的力量带来了机会所以......现代的地球联邦接受了这个教训...绝对不让某些年轻人进入学院
看懂它，你就可以任性的玩耍了！ aijuans JavaScript
javascript作为前端开发的标配技能，如果不掌握好它的三大特点：1.原型 2.作用域 3. 闭包 ,又怎么可以说你学好了这门语言呢？如果标配的技能都没有撑握好，怎么可以任性的玩耍呢？怎么验证自己学好了以上三个基本点呢，我找到一段不错的代码，稍加改动，如果能够读懂它，那么你就可以任性了。 function jClass(b
Java常用工具包 Jodd Kai_Ge java jodd
Jodd 是一个开源的 Java 工具集，包含一些实用的工具类和小型框架。简单，却很强大！写道 Jodd = Tools + IoC + MVC + DB + AOP + TX + JSON + HTML < 1.5 Mb Jodd 被分成众多模块，按需选择，其中工具类模块有： jodd-core &nb
SpringMvc下载 120153216 springMVC
@RequestMapping(value = WebUrlConstant.DOWNLOAD) public void download(HttpServletRequest request,HttpServletResponse response,String fileName) { OutputStream os = null; InputStream is = null;
Python 标准异常总结 2002wmj python
Python标准异常总结 AssertionError 断言语句（assert）失败 AttributeError 尝试访问未知的对象属性 EOFError 用户输入文件末尾标志EOF（Ctrl+d） FloatingPointError 浮点计算错误 GeneratorExit generator.close()方法被调用的时候 ImportError 导入模块失
SQL函数返回临时表结构的数据用于查询 357029540 SQL Server
这两天在做一个查询的SQL，这个SQL的一个条件是通过游标实现另外两张表查询出一个多条数据，这些数据都是INT类型，然后用IN条件进行查询，并且查询这两张表需要通过外部传入参数才能查询出所需数据，于是想到了用SQL函数返回值，并且也这样做了，由于是返回多条数据，所以把查询出来的INT类型值都拼接为了字符串，这时就遇到问题了，在查询SQL中因为条件是INT值，SQL函数的CAST和CONVERST都
java 时间格式化 | 比较大小| 时区个人笔记 7454103 java eclipse tomcat c MyEclipse
个人总结！不当之处多多包含！引用 1.0 如何设置 tomcat 的时区：位置：(catalina.bat---JAVA_OPTS 下面加上) set JAVA_OPT
时间获取Clander的用法 adminjun Clander 时间
/** * 得到几天前的时间 * @param d * @param day * @return */ public static Date getDateBefore(Date d,int day){ Calend
JVM初探与设置 aijuans java
JVM是Java Virtual Machine（Java虚拟机）的缩写，JVM是一种用于计算设备的规范，它是一个虚构出来的计算机，是通过在实际的计算机上仿真模拟各种计算机功能来实现的。Java虚拟机包括一套字节码指令集、一组寄存器、一个栈、一个垃圾回收堆和一个存储方法域。 JVM屏蔽了与具体操作系统平台相关的信息，使Java程序只需生成在Java虚拟机上运行的目标代码（字节码）,就可以在多种平台
SQL中ON和WHERE的区别 avords
SQL中ON和WHERE的区别数据库在通过连接两张或多张表来返回记录时，都会生成一张中间的临时表，然后再将这张临时表返回给用户。 www.2cto.com 在使用left jion时，on和where条件的区别如下： 1、 on条件是在生成临时表时使用的条件，它不管on中的条件是否为真，都会返回左边表中的记录。
说说自信 houxinyou 工作生活
自信的来源分为两种,一种是源于实力,一种源于头脑.实力是一个综合的评定,有自身的能力,能利用的资源等.比如我想去月亮上,要身体素质过硬,还要有飞船等等一系列的东西.这些都属于实力的一部分.而头脑不同,只要你头脑够简单就可以了!同样要上月亮上,你想,我一跳,1米,我多跳几下,跳个几年,应该就到了!什么?你说我会往下掉?你笨呀你!找个东西踩一下不就行了吗? 无论工作还
WEBLOGIC事务超时设置 bijian1013 weblogic jta 事务超时
系统中统计数据，由于调用统计过程，执行时间超过了weblogic设置的时间，提示如下错误：统计数据出错! 原因：The transaction is no longer active - status: 'Rolling Back. [Reason=weblogic.transaction.internal
两年已过去，再看该如何快速融入新团队 bingyingao java 互联网融入架构新团队
偶得的空闲，翻到了两年前的帖子该如何快速融入一个新团队，有所感触，就记下来，为下一个两年后的今天做参考。时隔两年半之后的今天，再来看当初的这个博客，别有一番滋味。而我已经于今年三月份离开了当初所在的团队，加入另外的一个项目组，2011年的这篇博客之后的时光，我很好的融入了那个团队，而直到现在和同事们关系都特别好。大家在短短一年半的时间离一起经历了一
【Spark七十七】Spark分析Nginx和Apache的access.log bit1129 apache
Spark分析Nginx和Apache的access.log，第一个问题是要对Nginx和Apache的access.log文件进行按行解析，按行解析就的方法是正则表达式： Nginx的access.log解析正则表达式 val PATTERN = """([^ ]*) ([^ ]*) ([^ ]*) (\\[.*\\]) (\&q
Erlang patch bookjovi erlang
Totally five patchs committed to erlang otp, just small patchs. IMO, erlang really is a interesting programming language, I really like its concurrency feature. but the functional programming style
log4j日志路径中加入日期 bro_feng java log4j
要用log4j使用记录日志，日志路径有每日的日期，文件大小5M新增文件。实现方式 log4j: <appender name="serviceLog" class="org.apache.log4j.RollingFileAppender"> <param name="Encoding" v
读《研磨设计模式》-代码笔记-桥接模式 bylijinnan java 设计模式
声明：本文只为方便我个人查阅和理解，详细的分析以及源代码请移步原作者的博客http://chjavach.iteye.com/ /** * 个人觉得关于桥接模式的例子，蜡笔和毛笔这个例子是最贴切的：http://www.cnblogs.com/zhenyulu/articles/67016.html * 笔和颜色是可分离的，蜡笔把两者耦合在一起了：一支蜡笔只有一种
windows7下SVN和Eclipse插件安装 chenyu19891124 eclipse插件
今天花了一天时间弄SVN和Eclipse插件的安装，今天弄好了。svn插件和Eclipse整合有两种方式，一种是直接下载插件包，二种是通过Eclipse在线更新。由于之前Eclipse版本和svn插件版本有差别，始终是没装上。最后在网上找到了适合的版本。所用的环境系统：windows7JDK：1.7svn插件包版本：1.8.16Eclipse：3.7.2工具下载地址：Eclipse下在地址：htt
[转帖]工作流引擎设计思路 comsci 设计模式工作应用服务器 workflow 企业应用
作为国内的同行，我非常希望在流程设计方面和大家交流，刚发现篇好文(那么好的文章，现在才发现，可惜)，关于流程设计的一些原理，个人觉得本文站得高，看得远，比俺的文章有深度，转载如下 ================================================================================= 自开博以来不断有朋友来探讨工作流引擎该如何
Linux 查看内存，CPU及硬盘大小的方法 daizj linux cpu 内存硬盘大小
一、查看CPU信息的命令 [root@R4 ~]# cat /proc/cpuinfo |grep "model name" && cat /proc/cpuinfo |grep "physical id" model name : Intel(R) Xeon(R) CPU X5450 @ 3.00GHz model name :
linux 踢出在线用户 dongwei_6688 linux
两个步骤： 1.用w命令找到要踢出的用户，比如下面： [root@localhost ~]# w 18:16:55 up 39 days, 8:27, 3 users, load average: 0.03, 0.03, 0.00 USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
放手吧,就像不曾拥有过一样 dcj3sjt126com
内容提要：静悠悠编著的《放手吧就像不曾拥有过一样》集结“全球华语世界最舒缓心灵”的精华故事，触碰生命最深层次的感动，献给全世界亿万读者。《放手吧就像不曾拥有过一样》的作者衷心地祝愿每一位读者都给自己一个重新出发的理由，将那些令你痛苦的、扛起的、背负的，一并都放下吧！把憔悴的面容换做一种清淡的微笑，把沉重的步伐调节成春天五线谱上的音符，让自己踏着轻快的节奏，在人生的海面上悠然漂荡，享受宁静与
php二进制安全的含义 dcj3sjt126com PHP
PHP里，有string的概念。 string里，每个字符的大小为byte（与PHP相比，Java的每个字符为Character，是UTF8字符，C语言的每个字符可以在编译时选择）。 byte里，有ASCII代码的字符，例如ABC，123，abc，也有一些特殊字符，例如回车，退格之类的。特殊字符很多是不能显示的。或者说，他们的显示方式没有标准，例如编码65到哪儿都是字母A，编码97到哪儿都是字符
Linux下禁用T440s，X240的一体化触摸板(touchpad) gashero linux ThinkPad 触摸板
自打1月买了Thinkpad T440s就一直很火大，其中最让人恼火的莫过于触摸板。 Thinkpad的经典就包括用了小红点(TrackPoint)。但是小红点只能定位，还是需要鼠标的左右键的。但是自打T440s等开始启用了一体化触摸板，不再有实体的按键了。问题是要是好用也行。实际使用中，触摸板一堆问题，比如定位有抖动，以及按键时会有飘逸。这就导致了单击经常就
graph_dfs hcx2013 Graph
package edu.xidian.graph; class MyStack { private final int SIZE = 20; private int[] st; private int top; public MyStack() { st = new int[SIZE]; top = -1; } public void push(i
Spring4.1新特性——Spring核心部分及其他 jinnianshilongnian spring 4.1
目录 Spring4.1新特性——综述 Spring4.1新特性——Spring核心部分及其他 Spring4.1新特性——Spring缓存框架增强 Spring4.1新特性——异步调用和事件机制的异常处理 Spring4.1新特性——数据库集成测试脚本初始化 Spring4.1新特性——Spring MVC增强 Spring4.1新特性——页面自动化测试框架Spring MVC T
配置HiveServer2的安全策略之自定义用户名密码验证 liyonghui160com
具体从网上看 http://doc.mapr.com/display/MapR/Using+HiveServer2#UsingHiveServer2-ConfiguringCustomAuthentication LDAP Authentication using OpenLDAP Setting
一位30多的程序员生涯经验总结 pda158 编程工作生活咨询
1.客户在接触到产品之后，才会真正明白自己的需求。　　这是我在我的第一份工作上面学来的。只有当我们给客户展示产品的时候，他们才会意识到哪些是必须的。给出一个功能性原型设计远远比一张长长的文字表格要好。 2.只要有充足的时间，所有安全防御系统都将失败。　　安全防御现如今是全世界都在关注的大课题、大挑战。我们必须时时刻刻积极完善它，因为黑客只要有一次成功，就可以彻底打败你。 3.
分布式web服务架构的演变自由的奴隶 linux Web 应用服务器互联网
最开始，由于某些想法，于是在互联网上搭建了一个网站，这个时候甚至有可能主机都是租借的，但由于这篇文章我们只关注架构的演变历程，因此就假设这个时候已经是托管了一台主机，并且有一定的带宽了，这个时候由于网站具备了一定的特色，吸引了部分人访问，逐渐你发现系统的压力越来越高，响应速度越来越慢，而这个时候比较明显的是数据库和应用互相影响，应用出问题了，数据库也很容易出现问题，而数据库出问题的时候，应用也容易
初探Druid连接池之二——慢SQL日志记录 xingsan_zhang 日志连接池 druid 慢SQL
由于工作原因，这里先不说连接数据库部分的配置，后面会补上，直接进入慢SQL日志记录。 1.applicationContext.xml中增加如下配置： <bean abstract="true" id="mysql_database" class="com.alibaba.druid.pool.DruidDataSourc