https://blog.csdn.net/Linli522362242/article/details/103387527
Here are the main steps you will go through:
1. Look at the big picture.
2. Get the data.
housing['income_cat'] = pd.cut(housing['median_income'],
bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
labels=[1,2,3,4,5]
)
housing['income_cat'].value_counts()
housing["income_cat"].hist()
Now you are ready to do stratified sampling based on the income category. For this you can use Scikit-Learn’s StratifiedShuffleSplit class:
from sklearn.model_selection import StratifiedShuffleSplit
#n_splits: n groups of train/test pair
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) #one group of train/test pair
for train_index, test_index in split.split(housing, housing['income_cat']):
strat_train_set = housing.loc[train_index]
strat_test_set = housing.loc[test_index]
You can now measure the income category proportions in the test set and compare them with the proportions in the full housing dataset:
strat_test_set["income_cat"].value_counts()/len(strat_test_set)
Sampling bias comparison of stratified versus purely random sampling
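A sketch of how this comparison can be computed, along the lines of the book's companion notebook (the income_cat_proportions and compare_props names are illustrative helpers):
from sklearn.model_selection import train_test_split

def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

#purely random split, for comparison with the stratified one
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100
compare_props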
Now you should remove the income_cat attribute so the data is back to its original state (use a name like set_ for the loop variable to avoid shadowing Python's built-in set):
for set_ in (strat_train_set, strat_test_set):
    set_.drop(["income_cat"], axis=1, inplace=True)
3. Discover and visualize the data to gain insights (plots).
First, make sure you have put the test set aside and you are only exploring the training set. Also, if the training set is very large, you may want to sample an exploration set, to make manipulations easy and fast. In our case, the set is quite small so you can just work directly on the full set. Let’s create a copy so you can play with it without harming the training set:
housing = strat_train_set.copy() #df.copy(deep=True) #https://blog.csdn.net/weixin_37275456/article/details/83033528
#https://blog.csdn.net/u010712012/article/details/79754132
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1,
title="A better visualization highlighting high-density areas")
Looking for Correlations: since the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes using the corr() method:
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
The most promising attribute for predicting the median house value is the median income.
Since there are now 11 numerical attributes, you would get 11^2 = 121 plots, which would not fit on a page, so let’s just focus on a few promising attributes that seem most correlated with the median housing value (Figure):
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12,8))
plt.show()
Experimenting with Attribute Combinations: the new bedrooms_per_room attribute is much more correlated with the median house value than the total number of rooms or bedrooms, as the sketch below shows.
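A sketch of creating the combined attributes and re-checking the correlations, using the column names of this housing dataset:
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)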
4. Prepare the data for Machine Learning algorithms.
housing = strat_train_set.drop('median_house_value', axis=1) #return a dataframe without the dropped item
housing_label = strat_train_set["median_house_value"].copy()
housing.keys()
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse
class CategoricalEncoder(BaseEstimator, TransformerMixin):
def __init__(self, encoding='onehot', categories='auto', dtype=np.float64, handle_unknown='error'):
self.encoding = encoding
self.categories = categories
self.dtype = dtype
self.handle_unknown = handle_unknown
def fit(self, X, y=None):
"""Fit the CategoricalEncoder to X.
Parameters
----------
X : array-like, shape [n_samples, n_feature]
The data to determine the categories of each feature.
Returns
-------
self
"""
        if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
            template = ("encoding should be either 'onehot', 'onehot-dense' or 'ordinal', got %s")
            raise ValueError(template % self.encoding) #report the invalid encoding value, not handle_unknown
if self.handle_unknown not in ['error', 'ignore']:
template = ("handle_unknown should be either 'error' or 'ignore', got %s")
raise ValueError(template % self.handle_unknown)
if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
raise ValueError("handle_unknown='ignore' is not supported for encoding='ordinal'")
#check_array: By default, the input(here is X) is converted to an at least 2D numpy array.
#If the dtype of the array is object, attempt converting to float, raising on failure.
#csc_matrix: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html
X = check_array(X, dtype=np.object, accept_sparse='csc', copy=True)
n_samples, n_features = X.shape #(16512, 1)
#the prefix underscore: private variable,
#the trailing underscore is used by convention to avoid naming conflicts
self._label_encoders_ = [LabelEncoder() for _ in range(n_features)] #[LabelEncoder, ...]
for i in range(n_features):
le = self._label_encoders_[i]
Xi = X[:, i]
if self.categories == 'auto':
le.fit(Xi)
else:
#np.in1d(ar1,ar2): Returns a boolean array the same length as ar1 that
#is True where an element of ar1 is in ar2
#and False otherwise.
valid_mask = np.in1d(Xi, self.categories[i])
if not np.all(valid_mask):
if self.handle_unknown == 'error':
diff = np.unique(Xi[~valid_mask])
msg = ("Found unknown categories {0} in column {1} during fit".format(diff, i))
raise ValueError(msg)
le.classes_ = np.array(np.sort(self.categories[i]))
#for examples,here is ['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN']
#encoder.classes_
self.categories_ = [le.classes_ for le in self._label_encoders_]
return self
def transform(self, X):
"""Transform X using one-hot encoding.
Parameters
----------
X : array-like, shape [n_samples, n_features]
The data to encode.
Returns
-------
X_out : sparse matrix or a 2-d array
Transformed input.
"""
#check_array: By default, the input(here is X) is converted to an at least 2D numpy array.
#If the dtype of the array is object, attempt converting to float, raising on failure.
#csc_matrix: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html
X = check_array(X, accept_sparse='csc', dtype=np.object, copy=True)
n_samples, n_features = X.shape
X_int = np.zeros_like(X, dtype=np.int)
X_mask = np.ones_like(X, dtype=np.bool) #dtype:Overrides the data type of the result
        #convert the 1s to all True
for i in range(n_features):
#Returns a boolean array the same length as ar1 that is True where an element of ar1 is in ar2
#and False otherwise.
valid_mask = np.in1d(X[:, i], self.categories_[i]) #[ True True True ... True True True]
if not np.all(valid_mask):
if self.handle_unknown == 'error':
diff = np.unique(X[~valid_mask, i])
msg = ("Found unknown categories {0} in column {1} during transform".format(diff, i))
raise ValueError(msg)
else:
                    # Set the problematic rows to an acceptable value and
                    # continue. The rows are marked in `X_mask` and will be
                    # removed later.
X_mask[:, i] = valid_mask
X[:, i][~valid_mask] = self.categories_[i][0]
X_int[:, i] = self._label_encoders_[i].transform(X[:, i]) #[0 0 4 ... 1 0 3] here only one column #len(row_indices)
if self.encoding == 'ordinal':
return X_int.astype(self.dtype, copy=False)
mask = X_mask.ravel() #[ True True True ... True True True]
#self.categories_: [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype=object)]
#cats.shape[0]: 5
n_values = [cats.shape[0] for cats in self.categories_] #[[5]]
n_values = np.array([0] + n_values) #np.array( [[0] [5]] )
indices = np.cumsum(n_values) #[0 5] #start with 0, size=5 columns
#X_int: 2D numpy array, here only one column
#indices[:-1] : [0]
#X_int + indices[:-1]: #matrix plus#[ [0 0 4 ... 1 0 3]^T ]
column_indices = (X_int + indices[:-1]).ravel()[mask] #extraction: [0 0 4 ... 1 0 3]
row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),n_features)[mask]
data = np.ones(n_samples * n_features)[mask]
#csc_matrix: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html
#position #len(row_indices)==len(column_indices)
out = sparse.csc_matrix((data, (row_indices, column_indices)),
shape=(n_samples, indices[-1]), #(16512,5)
dtype=self.dtype
).tocsr()
if self.encoding == 'onehot-dense':
return out.toarray()
else:
return out
#from sklearn.preprocessing import CategoricalEncoder # planned for a future Scikit-Learn release; in 0.20+ this functionality shipped as OneHotEncoder/OrdinalEncoder instead
cat_encoder = CategoricalEncoder()
housing_cat = housing['ocean_proximity'] #the categorical attribute
housing_cat_reshaped = housing_cat.values.reshape(-1,1)
from sklearn.pipeline import FeatureUnion #Scikit-Learn <0.20
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
housing_num = housing.drop('ocean_proximity', axis=1)#return a dataframe without the dropped column
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
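The pipelines below rely on a DataFrameSelector transformer and on Imputer (from sklearn.preprocessing, renamed SimpleImputer in Scikit-Learn 0.20+), neither of which is defined in this section. A minimal sketch of DataFrameSelector, matching the version used in the book's notebook:
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self #nothing to learn
    def transform(self, X):
        return X[self.attribute_names].values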
num_pipeline=Pipeline([
('selector', DataFrameSelector(num_attribs)),
('imputer', Imputer(strategy='median')),
('attribs_adder', CombinedAttributesAdder()),
('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('label_binarizer', CategoricalEncoder(encoding="onehot-dense"))
])
full_pipeline = FeatureUnion(n_jobs=1, #default 1
transformer_list=[('num_pipeline', num_pipeline),
('cat_pipeline', cat_pipeline),
]
)
housing_prepared = full_pipeline.fit_transform(housing)
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system
#################################TypeError: fit_transform() takes 2 positional arguments but 3 were given
from sklearn.pipeline import FeatureUnion #Scikit-Learn <0.20
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, LabelBinarizer
housing_num = housing.drop('ocean_proximity', axis=1)#return a dataframe without the dropped column
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
num_pipeline=Pipeline([
('selector', DataFrameSelector(num_attribs)),
('imputer', Imputer(strategy='median')),
('attribs_adder', CombinedAttributesAdder()),
('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)), #
('label_binarizer',LabelBinarizer()) #CategoricalEncoder(encoding="onehot-dense")
])
full_pipeline = FeatureUnion(n_jobs=1, #default 1
transformer_list=[('num_pipeline', num_pipeline),
('cat_pipeline', cat_pipeline),
]
)
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
~\AppData\Roaming\Python\Python36\site-packages\sklearn\pipeline.py in fit_transform(self, X, y, **fit_params)
281 Xt, fit_params = self._fit(X, y, **fit_params)
282 if hasattr(last_step, 'fit_transform'):
--> 283 return last_step.fit_transform(Xt, y, **fit_params)
284 elif last_step is None:
285 return Xt
TypeError: fit_transform() takes 2 positional arguments but 3 were given
Reason: the Pipeline calls the last step's fit_transform as fit_transform(X, y), i.e. with three positional arguments (self, X, y), but LabelBinarizer's fit_transform is defined to take only two (self, X); this is a version/compatibility issue.
Solution: wrap LabelBinarizer in a small transformer whose fit and transform accept the extra y argument:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion #Scikit-Learn <0.20
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, LabelBinarizer
class MyLabelBinarizer(BaseEstimator, TransformerMixin): #BaseEstimator as a base Estimator
def __init__(self):#you don't need *args and **kwargs #def __init__(self, *args, **kwargs):
self.encoder = LabelBinarizer()#you don't need *args and **kwargs
#self.encoder = LabelBinarizer(*args, **kwargs)
def fit(self, x, y=0):
self.encoder.fit(x)
return self
def transform(self, x, y=0):
return self.encoder.transform(x)
num_pipeline=Pipeline([
('selector', DataFrameSelector(num_attribs)),
('imputer', Imputer(strategy='median')),
('attribs_adder', CombinedAttributesAdder()),
('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)), #
('label_binarizer',MyLabelBinarizer()) #CategoricalEncoder(encoding="onehot-dense")
])
full_pipeline = FeatureUnion(n_jobs=1, #default 1
transformer_list=[('num_pipeline', num_pipeline),
('cat_pipeline', cat_pipeline),
]
)
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
#################################
from sklearn.base import BaseEstimator, TransformerMixin
rooms_ix, bedrooms_ix, population_ix, household_ix = 3,4,5,6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
def __init__(self, add_bedrooms_per_room = True): #no *args or **kargs
self.add_bedrooms_per_room = add_bedrooms_per_room
def fit(self, X, y=None):
return self #nothing else to do
def transform(self, X, y=None):
rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
population_per_household = X[:, population_ix] / X[:, household_ix]
if self.add_bedrooms_per_room:
bedrooms_per_room = X[:, bedrooms_ix] / X[:,rooms_ix]
#Translates slice objects to concatenation along the second axis.
return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
else:
return np.c_[X, rooms_per_household, population_per_household]
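A quick usage sketch of this transformer on the training data (the hard-coded column indices above assume the attribute order of the housing DataFrame):
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values) #NumPy array with the two new columns appended
housing_extra_attribs[:2]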
At last! You framed the problem, you got the data and explored it, you sampled a training set and a test set, and you wrote transformation pipelines to clean up and prepare your data for Machine Learning algorithms automatically. You are now ready to select and train a Machine Learning model.
The good news is that thanks to all these previous steps, things are now going to be much simpler than you might think. Let’s first train a Linear Regression model, like we did in the previous chapter:
# housing = strat_train_set.drop('median_house_value', axis=1) #return a dataframe without the dropped column
# housing_label = strat_train_set["median_house_value"].copy()
strat_train_set.head()
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_label) #housing_label
Done! You now have a working Linear Regression model. Let’s try it out on a few instances from the training set:
some_data = housing.iloc[:5]
some_labels = housing_label.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
some_data_prepared
print( "Predictions:\t", lin_reg.predict(some_data_prepared) ) #predict median_house_value
print("Labels:\t\t", list(some_labels)) #median_house_value
It works, although the predictions are not exactly accurate (e.g., the first prediction is off by close to 40%!). Let’s measure this regression model’s RMSE on the whole training set using Scikit-Learn’s mean_squared_error function:
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_label, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
Okay, this is better than nothing but clearly not a great score: most districts’ median_housing_values range between $120,000 and $264,000, so a typical prediction error of $68,628 is not very satisfying. This is an example of a model underfitting the training data. When this happens it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough. As we saw in the previous chapter, the main ways to fix underfitting are to select a more powerful model, to feed the training algorithm with better features, or to reduce the constraints on the model. This model is not regularized, so this rules out the last option. You could try to add more features (e.g., the log of the population), but first let’s try a more complex model to see how it does.
Let’s train a DecisionTreeRegressor. This is a powerful model, capable of finding complex nonlinear relationships in the data (Decision Trees are presented in more detail in Chapter 6). The code should look familiar by now:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_label)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_label,housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
Wait, what!? No error at all? Could this model really be absolutely perfect? Of course, it is much more likely that the model has badly overfit the data. How can you be sure? As we saw earlier, you don’t want to touch the test set until you are ready to launch a model you are confident about, so you need to use part of the training set for training, and part for model validation.
One way to evaluate the Decision Tree model would be to use the train_test_split function to split the training set into a smaller training set and a validation set, then train your models against the smaller training set and evaluate them against the validation set. It’s a bit of work, but nothing too difficult and it would work fairly well.
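A sketch of that simpler train/validation split approach (the variable names here are illustrative):
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

#hold out 20% of the *training* data as a validation set
X_train, X_val, y_train, y_val = train_test_split(housing_prepared, housing_label,
                                                  test_size=0.2, random_state=42)
tree_val = DecisionTreeRegressor(random_state=42)
tree_val.fit(X_train, y_train)
val_rmse = np.sqrt(mean_squared_error(y_val, tree_val.predict(X_val)))
val_rmse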
A great alternative is to use Scikit-Learn’s cross-validation feature. The following code performs K-fold cross-validation: it randomly splits the training set into 10 distinct subsets called folds, then it trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds.
The result is an array containing the 10 evaluation scores:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_label, scoring="neg_mean_squared_error",cv=10)
tree_rmse_scores = np.sqrt(-scores)
################
WARNING
Scikit-Learn cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the preceding code computes -scores before calculating the square root.
################
def display_scores(scores):
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard deviation:", scores.std())
display_scores(tree_rmse_scores)
Now the Decision Tree doesn't look as good as it did earlier. In fact, it seems to perform worse (mean RMSE = 70644.94463282847) than the Linear Regression model (training RMSE = 68628.19819848922 from the earlier result; mean cross-validation RMSE = 69052.46136345083, see the result below)! Notice that cross-validation allows you to get not only an estimate of the performance of your model, but also a measure of how precise this estimate is (i.e., its standard deviation). The Decision Tree has a score of approximately 70645, generally ±2939. You would not have this information if you just used a single validation set. But cross-validation comes at the cost of training the model several times, so it is not always possible.
Let’s compute the same scores for the Linear Regression model just to be sure:
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_label, scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
That’s right: the Decision Tree model is overfitting so badly that it performs worse than the Linear Regression model.
Let’s try one last model now: the RandomForestRegressor. As we will see in Chapter 7, Random Forests work by training many Decision Trees on random subsets of the features, then averaging out their predictions. Building a model on top of many other models is called Ensemble Learning, and it is often a great way to push ML algorithms even further. We will skip most of the code since it is essentially the same as for the other models:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(housing_prepared, housing_label)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_label, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse #RMSE measured on the training set itself (not a validation score)
#seems better, but check the cross-validation scores below
from sklearn.model_selection import cross_val_score
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_label, \
scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
Wow, this is much better: Random Forests look very promising. However, note that the score on the training set (18604.03726255338) is still much lower than on the validation sets (mean 50182.027794261485), meaning that the model is still overfitting the training set. Possible solutions for overfitting are to simplify the model, constrain it (i.e., regularize it), or get a lot more training data. However, before you dive much deeper in Random Forests, you should try out many other models from various categories of Machine Learning algorithms (several Support Vector Machines with different kernels, possibly a neural network, etc.), without spending too much time tweaking the hyperparameters. The goal is to shortlist a few (two to five) promising models.
from sklearn.svm import SVR
svm_reg = SVR(kernel='linear')
svm_reg.fit(housing_prepared, housing_label)
housing_predictions = svm_reg.predict(housing_prepared)
svm_mse = mean_squared_error(housing_label, housing_predictions)
svm_rmse = np.sqrt(svm_mse)
svm_rmse
################
TIP
You should save every model you experiment with, so you can come back easily to any model you want. Make sure you save both the hyperparameters and the trained parameters, as well as the cross-validation scores and perhaps the actual predictions as well. This will allow you to easily compare scores across model types, and compare the types of errors they make. You can easily save Scikit-Learn models by using Python's pickle module, or using sklearn.externals.joblib, which is more efficient at serializing large NumPy arrays:
from sklearn.externals import joblib
joblib.dump(my_model, "my_model.pkl")
# and later...
my_model_loaded = joblib.load("my_model.pkl")
Alternatively, the cross-validation scores can be summarized with a pandas Series instead of the display_scores helper:
scores = cross_val_score(lin_reg, housing_prepared, housing_label, scoring='neg_mean_squared_error', cv=10)
pd.Series(np.sqrt(-scores)).describe()
################
One way to do that would be to fiddle with the hyperparameters manually, until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to explore many combinations.
Instead you should get Scikit-Learn’s GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation. For example, the following code searches for the best combination of hyperparameter values for the RandomForestRegressor:
#######################
Tip:
When you have no idea what value a hyperparameter should have, a simple approach is to try out consecutive powers of 10 (or a smaller number if you want a more fine-grained search, as shown in this example with the n_estimators hyperparameter).
#######################
from sklearn.model_selection import GridSearchCV
#All in all, the grid search will explore 12 + 6 = 18 combinations of RandomForestRegressor hyperparameter values,
param_grid = [
#This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12
#combinations of n_estimators and max_features hyperparameter values specified in the first dict
{'n_estimators': [3,10,30], 'max_features':[2,4,6,8]},
#then try all 2 × 3 = 6 combinations of hyperparameter values in the second dict,
#but this time with the bootstrap hyperparameter set to False instead of
#True (which is the default value for this hyperparameter).
{'bootstrap': [False], 'n_estimators':[3,10], 'max_features':[2,3,4]}
]
forest_reg = RandomForestRegressor()
#it will train each model five times (since we are using five-fold cross validation).
#In other words, all in all, there will be 18 × 5 = 90 rounds of training!
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_label)
grid_search.best_params_
Since 30 is the maximum value of n_estimators that was evaluated, you should probably evaluate higher values as well, since the score may continue to improve.
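For example, the search could be extended with larger values (a sketch; these particular values are illustrative, and the extra fit is computationally expensive):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

param_grid_larger = [
    {'n_estimators': [30, 100, 300], 'max_features': [4, 6, 8]},
]
grid_search_larger = GridSearchCV(RandomForestRegressor(random_state=42), param_grid_larger,
                                  cv=5, scoring='neg_mean_squared_error')
#grid_search_larger.fit(housing_prepared, housing_label) #uncomment to run: 9 combinations x 5 folds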
grid_search.best_estimator_
##############################
NOTE:
If GridSearchCV is initialized with refit=True (which is the default), then once it finds the best estimator using cross-validation, it retrains it on the whole training set. This is usually a good idea since feeding it more data will likely improve its performance.
##############################
And of course the evaluation scores are also available:
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres['mean_test_score'], cvres['params']):
print(np.sqrt(-mean_score), params)
In this example, we obtain the best solution by setting the max_features hyperparameter to 6 and the n_estimators hyperparameter to 30. The RMSE score for this combination is 50,010, which is slightly better than the score you got earlier using the default hyperparameter values without param_grid (which was about 50,182; see the cross-validation result repeated below):
#################
from sklearn.model_selection import cross_val_score
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_label, \
scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
#################
Congratulations, you have successfully fine-tuned your best model!
############################
TIP
Don’t forget that you can treat some of the data preparation steps as hyperparameters. For example, the grid search will automatically find out whether or not to add a feature you were not sure about (e.g., using the add_bedrooms_per_room hyperparameter of your CombinedAttributesAdder transformer). It may similarly be used to automatically find the best way to handle outliers, missing features, feature selection, and more. A sketch follows after this tip.
############################
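A sketch of treating a preparation option as a hyperparameter, reusing the pipelines defined earlier (the nested parameter path is an assumption based on how Pipeline and FeatureUnion expose the parameters of their named steps):
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

#chain the preparation pipeline and the model so that preparation options become searchable
prepare_and_predict = Pipeline([
    ('preparation', full_pipeline),
    ('forest', RandomForestRegressor(random_state=42)),
])
param_grid_prep = [{
    'preparation__num_pipeline__attribs_adder__add_bedrooms_per_room': [True, False], #assumed parameter path
    'forest__n_estimators': [10, 30],
}]
grid_search_prep = GridSearchCV(prepare_and_predict, param_grid_prep, cv=5,
                                scoring='neg_mean_squared_error')
grid_search_prep.fit(housing, housing_label) #housing here is the raw training DataFrame (labels dropped)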
The grid search approach is fine when you are exploring relatively few combinations, like in the previous example, but when the hyperparameter search space is large, it is often preferable to use RandomizedSearchCV instead. This class can be used in much the same way as the GridSearchCV class, but instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration. This approach has two main benefits: if you let the randomized search run for, say, 1,000 iterations, it will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the grid search approach), and you have more control over the computing budget you want to allocate to the search, simply by setting the number of iterations.
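A sketch of RandomizedSearchCV applied to the same model (the particular distributions and n_iter are illustrative):
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}
forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error',
                                random_state=42)
rnd_search.fit(housing_prepared, housing_label)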
Another way to fine-tune your system is to try to combine the models that perform best. The group (or “ensemble”) will often perform better than the best individual model (just like Random Forests perform better than the individual Decision Trees they rely on), especially if the individual models make very different types of errors.
You will often gain good insights on the problem by inspecting the best models. For example, the RandomForestRegressor can indicate the relative importance of each attribute for making accurate predictions:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
Let’s display these importance scores next to their corresponding attribute names:
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.classes_) #encoder: the categorical encoder fitted on ocean_proximity earlier (not shown in this section); with the CategoricalEncoder above, the equivalent is cat_encoder.categories_[0]
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
With this information, you may want to try dropping some of the less useful features (e.g., apparently only one ocean_proximity category is really useful, so you could try dropping the others). You should also look at the specific errors that your system makes, then try to understand why it makes them and what could fix the problem (adding extra features or, on the contrary, getting rid of uninformative ones, cleaning up outliers, etc.).
Evaluate Your System on the Test Set
After tweaking your models for a while, you eventually have a system that performs sufficiently well. Now is the time to evaluate the final model on the test set. There is nothing special about this process; just get the predictors and the labels from your test set, run your full_pipeline to transform the data (call transform(), not fit_transform()!), and evaluate the final model on the test set:
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
The performance will usually be slightly worse than what you measured using cross-validation if you did a lot of hyperparameter tuning (because your system ends up fine-tuned to perform well on the validation data, and will likely not perform as well on unknown datasets). It is not the case in this example, but when this happens you must resist the temptation to tweak the hyperparameters to make the numbers look good on the test set; the improvements would be unlikely to generalize to new data.
We can compute a 95% confidence interval for the test RMSE. Recall what the Root Mean Square Error (RMSE) measures: the standard deviation of the errors the system makes in its predictions. For example, an RMSE equal to 50,000 means that about 68% of the system's predictions fall within $50,000 of the actual value, and about 95% of the predictions fall within $100,000 of the actual value. Equation 2-1 shows the mathematical formula to compute the RMSE.
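Equation 2-1, for reference (m is the number of instances, x^(i) and y^(i) are the features and label of the i-th instance, and h is the prediction function):
$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(\mathbf{x}^{(i)}) - y^{(i)}\right)^{2}}$$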
When a feature has a bell-shaped normal distribution (also called a Gaussian distribution), which is very common, the “68-95-99.7” rule applies: about 68% of the values fall within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ.
from scipy import stats
confidence = 0.95
#residual=yi-y_hat_i
squared_errors = (final_predictions - y_test)**2
np.sqrt( stats.t.interval(confidence,
len(squared_errors)-1, #df=degrees of freedom=number of samples -1
loc = squared_errors.mean(), #the sample means
scale = stats.sem(squared_errors) #the standard error of the mean of the samples
)
)
##########################
stats.sem(squared_errors) #standard error of the mean of the squared errors; the manual equivalent follows
length = len(squared_errors)
squared_errors_mean = squared_errors.mean()
squared_errors_standard_deviation = np.sqrt(
    np.sum((np.array(squared_errors).reshape(length, 1) -
            np.tile([squared_errors_mean], [1, length]).T)**2
          ) / (length - 1) #sample standard deviation (ddof=1), as used by stats.sem
)
standard_error_of_mean = squared_errors_standard_deviation / np.sqrt(length)
standard_error_of_mean #for the squared errors; matches stats.sem(squared_errors)

confidence = 0.95
#each tail of the interval has probability (1 - confidence) / 2 = 0.025
#for large samples the 97.5% t critical value is approximately 1.96 (the familiar z-value),
#so 1.96 is actually an approximation of the exact value used by stats.t.interval
##########################
Now comes the project prelaunch phase: you need to present your solution (highlighting what you have learned, what worked and what did not, what assumptions were made, and what your system’s limitations are), document everything, and create nice presentations with clear visualizations and easy-to-remember statements (e.g., “the median income is the number one predictor of housing prices”).
Perfect, you got approval to launch! You need to get your solution ready for production, in particular by plugging the production input data sources into your system and writing tests.
You also need to write monitoring code to check your system’s live performance at regular intervals and trigger alerts when it drops. This is important to catch not only sudden breakage, but also performance degradation. This is quite common because models tend to “rot” as data evolves over time, unless the models are regularly trained on fresh data.
Evaluating your system’s performance will require sampling the system’s predictions and evaluating them. This will generally require a human analysis. These analysts may be field experts, or workers on a crowdsourcing platform (such as Amazon Mechanical Turk or CrowdFlower). Either way, you need to plug the human evaluation pipeline into your system.
You should also make sure you evaluate the system’s input data quality. Sometimes performance will degrade slightly because of a poor quality signal (e.g., a malfunctioning sensor sending random values, or another team’s output becoming stale), but it may take a while before your system’s performance degrades enough to trigger an alert. If you monitor your system’s inputs, you may catch this earlier. Monitoring the inputs is particularly important for online learning systems.
Finally, you will generally want to train your models on a regular basis using fresh data. You should automate this process as much as possible. If you don’t, you are very likely to refresh your model only every six months (at best), and your system’s performance may fluctuate severely over time. If your system is an online learning system, you should make sure you save snapshots of its state at regular intervals so you can easily roll back to a previously working state.
https://blog.csdn.net/Linli522362242/article/details/103646927