CS231n Lecture 2: Image Classification pipeline

Image Classification pipeline

  • Assignment
  • Image Classification pipeline
  • Code Implementation
    • Working locally
    • Download Dataset
    • Start IPython
    • Code Implementation
      • import modules
      • Load data
      • KNN
      • result
      • optimization
        • Vectorization
    • Cross-validation

Assignment

Image Classification pipeline

  • Nearest Neighbor classification
  • K-Nearest Neighbors(KNN)
  • Linear classification

Code Implementation

Working locally

  • Installing Anaconda
  • Anaconda Virtual environment

Once you have Anaconda installed, it makes sense to create a virtual environment for the course. If you choose not to use a virtual environment, it is up to you to make sure that all dependencies for the code are installed globally on your machine. To set up a virtual environment, run (in a terminal)

conda create -n cs231n python=3.7 anaconda

to create an environment called cs231n.
Then you can find Anaconda Prompt (cs231n) and Jupyter Notebook (cs231n) in your Anaconda folder (e.g. in the Windows Start menu), or activate the environment from a terminal with conda activate cs231n.
You may refer to this page for more detailed instructions on managing virtual environments with Anaconda.

Download Dataset

Once you have the starter code, you will need to download the CIFAR-10 dataset. Run the following from the assignment1 directory:

cd cs231n/datasets
./get_datasets.sh

On Windows, download the dataset from http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz and extract it into the cs231n/datasets directory.

Start IPython

After you have the CIFAR-10 data, you should start the IPython notebook server from the assignment1 directory, with the jupyter notebook command.

Code Implementation

import modules

# Run some setup code for this notebook.

import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt

# This is a bit of magic to make matplotlib figures appear inline in the notebook
# rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

load_CIFAR10 in data_utils.py

import os
import numpy as np
# load_pickle is a small helper defined earlier in data_utils.py that wraps
# pickle.load so it works under both Python 2 and 3.

def load_CIFAR10(ROOT):
    """ load all of cifar """
    xs = []
    ys = []
    for b in range(1, 6):
        f = os.path.join(ROOT, 'data_batch_%d' % (b, ))  # path to one of the five training batch files
        X, Y = load_CIFAR_batch(f)                        # load that batch
        xs.append(X)
        ys.append(Y)
    Xtr = np.concatenate(xs)   # stack the five training batches into one array
    Ytr = np.concatenate(ys)
    del X, Y
    Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch'))
    return Xtr, Ytr, Xte, Yte

def load_CIFAR_batch(filename):
    """ load single batch of cifar """
    with open(filename, 'rb') as f:
        datadict = load_pickle(f)   # unpickle the batch file into a dict
        X = datadict['data']
        Y = datadict['labels']
        # Each row of X holds one image as 3072 values (1024 red, then 1024 green, then 1024 blue).
        # reshape to (10000, 3, 32, 32) = (num, channel, height, width), then transpose to
        # (10000, 32, 32, 3) = (num, height, width, channel), and convert to float.
        X = X.reshape(10000, 3, 32, 32).transpose(0, 2, 3, 1).astype("float")
        Y = np.array(Y)
        return X, Y
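
The reshape/transpose above converts each flat 3072-value row (all red values, then all green, then all blue) into a height x width x channel image. A minimal sketch on a tiny made-up batch (assumption: toy array, not real CIFAR data) shows the resulting channel ordering:

import numpy as np

# Fake "batch": 2 images, 3 channels of 4x4 pixels, flattened the CIFAR way (all R, then G, then B).
fake = np.arange(2 * 3 * 4 * 4).reshape(2, 3 * 4 * 4)
imgs = fake.reshape(2, 3, 4, 4).transpose(0, 2, 3, 1).astype("float")

print(imgs.shape)        # (2, 4, 4, 3) -> (num, height, width, channel)
print(imgs[0, 0, 0, :])  # first pixel of image 0 as [R, G, B] = [0. 16. 32.]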

Load data

# Load the raw CIFAR-10 data.
# On Windows this forward-slash path may fail; either change '/' to '\\'
# or build the path with os.path.join as shown below.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
# import os
# cifar10_dir = os.path.join("cs231n", "datasets", "cifar-10-batches-py")


# Cleaning up variables to prevent loading data multiple times (which may cause memory issue)
try:
   del X_train, y_train
   del X_test, y_test
   print('Clear previously loaded data.')
except:
   pass

X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
out:
Training data shape:  (50000, 32, 32, 3)
Training labels shape:  (50000,)
Test data shape:  (10000, 32, 32, 3)
Test labels shape:  (10000,)
# Visualize some examples from the dataset.
# We show a few examples of training images from each class.
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()

[Figure: a grid of sample training images, seven from each of the ten CIFAR-10 classes]

# Subsample the data for more efficient code execution in this exercise
num_training = 5000
mask = list(range(num_training))
X_train = X_train[mask]
y_train = y_train[mask]

num_test = 500
mask = list(range(num_test))
X_test = X_test[mask]
y_test = y_test[mask]

# Reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
print(X_train.shape, X_test.shape)
out:
(5000, 3072) (500, 3072)

KNN

from cs231n.classifiers import KNearestNeighbor

# Create a kNN classifier instance. 
# Remember that training a kNN classifier is a noop: 
# the Classifier simply remembers the data and does no further processing 
classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
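
In other words, "training" here is nothing more than memorizing the data. A minimal sketch of a classifier in that spirit (an illustration only, not the actual starter-code class, which also supports k > 1 and several distance routines):

import numpy as np

class ToyNearestNeighbor(object):
    """Minimal sketch: 'training' just stores the data."""

    def train(self, X, y):
        self.X_train = X   # nothing is fit or optimized
        self.y_train = y

    def predict_1nn(self, X):
        # Predict the label of the single closest training example (L2 distance).
        preds = np.zeros(X.shape[0], dtype=self.y_train.dtype)
        for i in range(X.shape[0]):
            d = np.sqrt(np.sum((self.X_train - X[i]) ** 2, axis=1))
            preds[i] = self.y_train[np.argmin(d)]
        return preds

# Usage would mirror the cell above, e.g.:
# clf = ToyNearestNeighbor(); clf.train(X_train, y_train); clf.predict_1nn(X_test[:5])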

We would now like to classify the test data with the kNN classifier. Recall that we can break down this process into two steps:

1. First we must compute the distances between all test examples and all train examples.
2. Given these distances, for each test example we find the k nearest examples and have them vote for the label.

Let's begin with computing the distance matrix between all training and test examples. For example, if there are Ntr training examples and Nte test examples, this stage should result in an Nte x Ntr matrix where each element (i, j) is the distance between the i-th test and j-th train example.

Note: For the three distance computations that we require you to implement in this notebook, you may not use the np.linalg.norm() function that numpy provides.

First, open cs231n/classifiers/k_nearest_neighbor.py and implement the function compute_distances_two_loops that uses a (very inefficient) double loop over all pairs of (test, train) examples and computes the distance matrix one element at a time.
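
Before filling in the function, it can help to sanity-check what a single entry of dists should look like. The snippet below computes the plain L2 distance between two made-up vectors and compares it against np.linalg.norm, which is fine to use here since the restriction only applies inside the functions you implement (the vectors are assumptions for illustration, not CIFAR data):

import numpy as np

a = np.array([1.0, 2.0, 3.0])   # pretend "test" example
b = np.array([4.0, 6.0, 3.0])   # pretend "train" example

d_manual = np.sqrt(np.sum(np.square(a - b)))   # what dists[i, j] should compute
d_ref = np.linalg.norm(a - b)                  # reference value

print(d_manual, d_ref)  # both 5.0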

# Open cs231n/classifiers/k_nearest_neighbor.py and implement
# compute_distances_two_loops.

# Test your implementation:
dists = classifier.compute_distances_two_loops(X_test)
print(dists.shape)

k_nearest_neighbor.py

from builtins import range
from builtins import object
import numpy as np
from collections import Counter  # used for the majority vote in predict_labels below
from past.builtins import xrange  # if "No module named 'past'" occurs, install the 'future' package with "pip install future"
def compute_distances_two_loops(self, X):
        """
        Compute the distance between each test point in X and each training point
        in self.X_train using a nested loop over both the training data and the
        test data.

        Inputs:
        - X: A numpy array of shape (num_test, D) containing test data.

        Returns:
        - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
          is the Euclidean(L2) distance between the ith test point and the jth training
          point.
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            for j in range(num_train):
                #####################################################################
                # TODO:                                                             #
                # Compute the l2 distance between the ith test point and the jth    #
                # training point, and store the result in dists[i, j]. You should   #
                # not use a loop over dimension, nor use np.linalg.norm().          #
                #####################################################################
                # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
                dists[i, j] = np.sqrt(np.sum(np.square(X[i, :] - self.X_train[j, :])))

                # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        return dists
# We can visualize the distance matrix: each row is a single test example and
# its distances to training examples
plt.imshow(dists, interpolation='none')
plt.show()

[Figure: the distance matrix visualized as an image (rows: test examples, columns: training examples)]

Inline Question 1

Notice the structured patterns in the distance matrix, where some rows or columns are visibly brighter. (Note that with the default color scheme black indicates low distances while white indicates high distances.)

What in the data is the cause behind the distinctly bright rows?
What causes the columns?
Your answer:
- Each row shows a single test example's distances to all training examples: darker pixels mean the test example is similar to those training examples, brighter pixels mean it is dissimilar.
- A distinctly bright row therefore corresponds to a test example that is far from essentially all of the training data, for instance because of an unusual background, lighting, or noise, which relates to the robustness of the algorithm. By the same reasoning, a bright column corresponds to a training example that is far from most of the test examples.

result

k_nearest_neighbor.py

def predict_labels(self, dists, k=1):
  		 """
        Given a matrix of distances between test points and training points,
        predict a label for each test point.

        Inputs:
        - dists: A numpy array of shape (num_test, num_train) where dists[i, j]
          gives the distance between the ith test point and the jth training point.

        Returns:
        - y: A numpy array of shape (num_test,) containing predicted labels for the
          test data, where y[i] is the predicted label for the test point X[i].
        """
        num_test = dists.shape[0]
        y_pred = np.zeros(num_test)
        for i in range(num_test):
            # A list of length k storing the labels of the k nearest neighbors to
            # the ith test point.
            closest_y = []
            #########################################################################
            # TODO:                                                                 #
            # Use the distance matrix to find the k nearest neighbors of the ith    #
            # testing point, and use self.y_train to find the labels of these       #
            # neighbors. Store these labels in closest_y.                           #
            # Hint: Look up the function numpy.argsort.                              #
            # (argsort returns the indices that would sort an array, i.e. the        #
            #  positions of the elements in ascending order of distance.)            #
            #########################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            index_rank = np.argsort(dists[i, :])   # training indices sorted by distance, ascending
            index_k = index_rank[0:k]               # indices of the k nearest training points
            closest_y = self.y_train[index_k]       # labels of those k nearest neighbors

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            #########################################################################
            # TODO:                                                                 #
            # Now that you have found the labels of the k nearest neighbors, you    #
            # need to find the most common label in the list closest_y of labels.   #
            # Store this label in y_pred[i]. Break ties by choosing the smaller     #
            # label.                                                                #
            #########################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            # Count the votes; sorting by count (descending) and then by label (ascending)
            # breaks ties toward the smaller label. Counter is imported at the top of the file.
            sort = sorted(Counter(closest_y).items(), key=lambda item: (-item[1], item[0]))
            y_pred[i] = sort[0][0]

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

        return y_pred
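
A quick check of the voting logic on a made-up set of neighbor labels (assumption: toy labels only) shows how the sort key picks the most common label and, on a tie, the smaller one:

import numpy as np
from collections import Counter

closest_y = np.array([2, 5, 2, 5, 7])   # toy nearest-neighbor labels; 2 and 5 are tied with two votes each

sort = sorted(Counter(closest_y).items(), key=lambda item: (-item[1], item[0]))
print(sort[0][0])  # 2 -- the tie is broken toward the smaller label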

# Now implement the function predict_labels and run the code below:
# We use k = 1 (which is Nearest Neighbor).
y_test_pred = classifier.predict_labels(dists, k=1)

# Compute and print the fraction of correctly predicted examples
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
out: Got 137 / 500 correct => accuracy: 0.274000

You should expect to see approximately 27% accuracy. Now let's try out a larger k, say k = 5:

y_test_pred = classifier.predict_labels(dists, k=5)
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
out: Got 145 / 500 correct => accuracy: 0.290000

You should expect to see a slightly better performance than with k = 1.
[Figure: screenshot of an answer the author recommends]

optimization

Vectorization

# Now lets speed up distance matrix computation by using partial vectorization
# with one loop. Implement the function compute_distances_one_loop and run the
# code below:
dists_one = classifier.compute_distances_one_loop(X_test)

# To ensure that our vectorized implementation is correct, we make sure that it
# agrees with the naive implementation. There are many ways to decide whether
# two matrices are similar; one of the simplest is the Frobenius norm. In case
# you haven't seen it before, the Frobenius norm of two matrices is the square
# root of the squared sum of differences of all elements; in other words, reshape
# the matrices into vectors and compute the Euclidean distance between them.
difference = np.linalg.norm(dists - dists_one, ord='fro')
print('One loop difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')
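
If the Frobenius norm is new to you, the following quick check (on small random matrices, purely for illustration) confirms that it is just the Euclidean distance between the flattened matrices:

import numpy as np

A = np.random.rand(3, 4)
B = np.random.rand(3, 4)

fro = np.linalg.norm(A - B, ord='fro')   # Frobenius norm of the difference
flat = np.sqrt(np.sum((A - B) ** 2))     # Euclidean distance between the flattened matrices

print(np.isclose(fro, flat))  # True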

k_nearest_neighbor.py

def compute_distances_one_loop(self, X):
        """
        Compute the distance between each test point in X and each training point
        in self.X_train using a single loop over the test data.

        Input / Output: Same as compute_distances_two_loops
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        for i in range(num_test):
            #######################################################################
            # TODO:                                                               #
            # Compute the l2 distance between the ith test point and all training #
            # points, and store the result in dists[i, :].                        #
            # Do not use np.linalg.norm().                                        #
            #######################################################################
            # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
            dists[i, :] = np.sqrt(np.sum(np.square(X[i, :] - self.X_train), axis=1))
            # Broadcasting subtracts X[i, :] from every row of self.X_train;
            # axis=1 sums over the feature dimension, giving one distance per training example.

            # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        return dists
out:
One loop difference was: 0.000000
Good! The distance matrices are the same
# Now implement the fully vectorized version inside compute_distances_no_loops
# and run the code
dists_two = classifier.compute_distances_no_loops(X_test)

# check that the distance matrix agrees with the one we computed before:
difference = np.linalg.norm(dists - dists_two, ord='fro')
print('No loop difference was: %f' % (difference, ))
if difference < 0.001:
    print('Good! The distance matrices are the same')
else:
    print('Uh-oh! The distance matrices are different')
out:
No loop difference was: 0.000000
Good! The distance matrices are the same

Here are a few notes on broadcasting, which the no-loop version below relies on (see also the sketch after the function).

def compute_distances_no_loops(self, X):
        """
        Compute the distance between each test point in X and each training point
        in self.X_train using no explicit loops.

        Input / Output: Same as compute_distances_two_loops
        """
        num_test = X.shape[0]
        num_train = self.X_train.shape[0]
        dists = np.zeros((num_test, num_train))
        #########################################################################
        # TODO:                                                                 #
        # Compute the l2 distance between all test points and all training      #
        # points without using any explicit loops, and store the result in      #
        # dists.                                                                #
        #                                                                       #
        # You should implement this function using only basic array operations; #
        # in particular you should not use functions from scipy,                #
        # nor use np.linalg.norm().                                             #
        #                                                                       #
        # HINT: Try to formulate the l2 distance using matrix multiplication    #
        #       and two broadcast sums.                                         #
        #########################################################################
        # *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, computed with two broadcast sums
        # and one matrix multiplication.
        dists += np.sum(self.X_train ** 2, axis=1).reshape(1, num_train)  # (1, num_train): squared train norms, broadcast down the rows
        dists += np.sum(X ** 2, axis=1).reshape(num_test, 1)              # (num_test, 1): squared test norms, broadcast across the columns
        dists -= 2 * np.dot(X, self.X_train.T)                            # (num_test, num_train) cross terms
        dists = np.sqrt(dists)

        # *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
        return dists
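
The vectorized version works because ||x - y||^2 expands into ||x||^2 + ||y||^2 - 2 x.y, and NumPy broadcasting adds a (1, num_train) row vector and a (num_test, 1) column vector into a full (num_test, num_train) matrix. A minimal check on toy data (assumption: small random arrays, for illustration only):

import numpy as np

X_te = np.random.rand(5, 8)   # 5 toy "test" examples with 8 features
X_tr = np.random.rand(7, 8)   # 7 toy "train" examples

# Vectorized distances via the expansion and two broadcast sums.
sq = (np.sum(X_tr ** 2, axis=1).reshape(1, 7)     # (1, 7), broadcast over rows
      + np.sum(X_te ** 2, axis=1).reshape(5, 1)   # (5, 1), broadcast over columns
      - 2 * X_te.dot(X_tr.T))                     # (5, 7) cross terms
dists_fast = np.sqrt(np.maximum(sq, 0))           # clamp tiny negatives caused by round-off

# Reference: explicit broadcast over a third axis, equivalent to the double loop.
dists_ref = np.sqrt(((X_te[:, None, :] - X_tr[None, :, :]) ** 2).sum(axis=2))

print(np.allclose(dists_fast, dists_ref))  # True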

More details:

# Let's compare how fast the implementations are
def time_function(f, *args):
    """
    Call a function f with args and return the time (in seconds) that it took to execute.
    """
    import time
    tic = time.time()
    f(*args)
    toc = time.time()
    return toc - tic

two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)
print('Two loop version took %f seconds' % two_loop_time)

one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)
print('One loop version took %f seconds' % one_loop_time)

no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)
print('No loop version took %f seconds' % no_loop_time)

# You should see significantly faster performance with the fully vectorized implementation!

# NOTE: depending on what machine you're using, 
# you might not see a speedup when you go from two loops to one loop, 
# and might even see a slow-down.
out:
Two loop version took 2330.480827 seconds
One loop version took 129.882184 seconds
No loop version took 0.886452 seconds

Cross-validation

We have implemented the k-Nearest Neighbor classifier but we set the value k = 5 arbitrarily. We will now determine the best value of this hyperparameter with cross-validation.
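
As the hint in the cell below suggests, np.array_split does the fold splitting (unlike np.split or np.vsplit, it also works when the number of examples is not an exact multiple of num_folds). A minimal sketch of the leave-one-fold-out pattern on made-up data (assumption: toy arrays only):

import numpy as np

X_toy = np.arange(20).reshape(10, 2)   # 10 toy examples with 2 features
y_toy = np.arange(10)
num_folds = 5

X_folds = np.array_split(X_toy, num_folds)   # list of num_folds arrays
y_folds = np.array_split(y_toy, num_folds)

for i in range(num_folds):
    X_val, y_val = X_folds[i], y_folds[i]             # fold i is held out for validation
    X_tr = np.vstack(X_folds[:i] + X_folds[i + 1:])   # the remaining folds form the training set
    y_tr = np.hstack(y_folds[:i] + y_folds[i + 1:])
    print(i, X_tr.shape, X_val.shape)  # (8, 2) train vs (2, 2) validation in every iteration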

num_folds = 5
k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]

X_train_folds = []
y_train_folds = []
################################################################################
# TODO:                                                                        #
# Split up the training data into folds. After splitting, X_train_folds and    #
# y_train_folds should each be lists of length num_folds, where                #
# y_train_folds[i] is the label vector for the points in X_train_folds[i].     #
# Hint: Look up the numpy array_split function.                                #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
X_train_folds = np.vsplit(X_train, num_folds)   # split the rows into num_folds equal chunks
y_train_folds = np.split(y_train, num_folds)    # np.array_split would also work and handles uneven sizes

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# A dictionary holding the accuracies for different values of k that we find
# when running cross-validation. After running cross-validation,
# k_to_accuracies[k] should be a list of length num_folds giving the different
# accuracy values that we found when using that value of k.
k_to_accuracies = {}


################################################################################
# TODO:                                                                        #
# Perform k-fold cross validation to find the best value of k. For each        #
# possible value of k, run the k-nearest-neighbor algorithm num_folds times,   #
# where in each case you use all but one of the folds as training data and the #
# last fold as a validation set. Store the accuracies for all fold and all     #
# values of k in the k_to_accuracies dictionary.                               #
################################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
for k in k_choices:
    accuracies = []
    for i in range(num_folds):

        x_train_val = X_train_folds[i]   # fold i is held out as the validation set
        y_train_val = y_train_folds[i]
        # The remaining folds are stacked back together to form the training set.
        xtrain_list = np.vstack(X_train_folds[0:i] + X_train_folds[i+1:])
        ytrain_list = np.hstack(y_train_folds[0:i] + y_train_folds[i+1:])

        classifier.train(xtrain_list,ytrain_list)
        dists = classifier.compute_distances_no_loops(x_train_val)
        y_test_pred = classifier.predict_labels(dists, k)
        num_correct = np.sum(y_test_pred == y_train_val)
        accuracy = float(num_correct) /  y_train_val.shape[0]
        accuracies.append(accuracy)
    k_to_accuracies[k] = accuracies
pass

# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****

# Print out the computed accuracies
for k in sorted(k_to_accuracies):
    for accuracy in k_to_accuracies[k]:
        print('k = %d, accuracy = %f' % (k, accuracy))
# plot the raw observations
for k in k_choices:
    accuracies = k_to_accuracies[k]
    plt.scatter([k] * len(accuracies), accuracies)

# plot the trend line with error bars that correspond to standard deviation
accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])
accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])
plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)
plt.title('Cross-validation on k')
plt.xlabel('k')
plt.ylabel('Cross-validation accuracy')
plt.show()

[Figure: cross-validation accuracy for each k, shown as scatter points with a mean-and-standard-deviation trend line]

# Based on the cross-validation results above, choose the best value for k,   
# retrain the classifier using all the training data, and test it on the test
# data. You should be able to get above 28% accuracy on the test data.

# Pick the k whose mean cross-validation accuracy is highest.
temp = accuracies_mean.tolist()
best_kth = temp.index(max(temp))   # index of the best mean accuracy (np.argmax would also work)
best_k = k_choices[best_kth]

print(best_k)


classifier = KNearestNeighbor()
classifier.train(X_train, y_train)
y_test_pred = classifier.predict(X_test, k=best_k)

# Compute and display the accuracy
num_correct = np.sum(y_test_pred == y_test)
accuracy = float(num_correct) / num_test
print('Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy))
out:
10
Got 144 / 500 correct => accuracy: 0.288000

Inline Question 3

Which of the following statements about k-Nearest Neighbor (k-NN) are true in a classification setting, and for all k? Select all that apply.

1. The decision boundary of the k-NN classifier is linear.
2. The training error of a 1-NN will always be lower than that of 5-NN.
3. The test error of a 1-NN will always be lower than that of a 5-NN.
4. The time needed to classify a test example with the k-NN classifier grows with the size of the training set.
5. None of the above.
Your answer: 4
Your explanation:
1. The k-NN decision boundary follows the training data and is generally irregular, not linear.
2. A 1-NN classifier fits the training set exactly (zero training error), so its training error is never higher than 5-NN's; but the two can be equal, so "always lower" does not hold.
3. Test error depends on the data; 1-NN is not guaranteed to beat 5-NN.
4. True: classifying a test example means comparing it against every stored training example, so prediction time grows with the size of the training set.
