Udacity Deep Learning Course: Assignment 1

Udacity's Deep Learning is an online course offered by Google, built around hands-on TensorFlow tasks. It is short and focused: 4 chapters (intro to ML/DL, DNN, CNN, RNN), 6 small assignments (delivered as ipynb notebooks, which is very convenient), and 1 final project (building a real-time camera application).

If you already have an ML/DL background, you can get through the videos quickly, so the real value of the course lies in its hands-on projects, which are quite interesting. As a course from Google, it is arguably one of the more authoritative TensorFlow tutorials.

Course link: here
Assignment link: here

Below is my code for Assignment 1 of the course.

Problem 1

Use IPython.display to visualize a few sample images:

from IPython.display import display, Image
import os
import numpy as np

def visualize(folders):
    # Display one randomly chosen image from each class folder.
    for folder_path in folders:
        fnames = os.listdir(folder_path)
        fname = fnames[np.random.randint(len(fnames))]
        display(Image(filename=os.path.join(folder_path, fname)))

print("train_folders")
visualize(train_folders)
print("test_folders")
visualize(test_folders)
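
For reference, train_folders and test_folders are produced earlier in the course notebook by the download/extract step; they are lists of per-class image directories. A minimal sketch of how they might be built, assuming the standard notMNIST layout (the directory names here are assumptions):

import os

# Assumed locations of the extracted notMNIST archives.
train_root = 'notMNIST_large'
test_root = 'notMNIST_small'

# One sub-directory per class, A through J.
train_folders = sorted(
    os.path.join(train_root, d) for d in os.listdir(train_root)
    if os.path.isdir(os.path.join(train_root, d)))
test_folders = sorted(
    os.path.join(test_root, d) for d in os.listdir(test_root)
    if os.path.isdir(os.path.join(test_root, d)))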

Problem 2

Use matplotlib.pyplot to visualize a sample from each pickled dataset:

import pickle
import numpy as np
import matplotlib.pyplot as plt

def visualize_datasets(datasets):
    # Display one randomly chosen 28x28 image from each per-class pickle.
    for dataset in datasets:
        with open(dataset, 'rb') as f:
            letter = pickle.load(f)
        sample_idx = np.random.randint(len(letter))
        plt.figure()
        plt.imshow(letter[sample_idx, :, :])

visualize_datasets(train_datasets)
visualize_datasets(test_datasets)
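
Each per-class pickle holds a 3-D array of shape (num_images, 28, 28). If you want a quick sanity check alongside the visualization, something like this works (the file path is an assumption):

import pickle

with open('notMNIST_large/A.pickle', 'rb') as f:  # assumed path
    letter = pickle.load(f)
print(letter.shape)                 # expected: (num_images, 28, 28)
print(letter.mean(), letter.std())  # pixel statistics after normalization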

Problem 3

Check whether the classes are balanced (each class has roughly the same number of samples):

def check_dataset_is_balanced(datasets, notation=None):
    # Print the sample count of each per-class pickle; the counts
    # should be roughly equal for a balanced dataset.
    print(notation)
    for pickle_file in datasets:
        with open(pickle_file, 'rb') as f:
            ds = pickle.load(f)
        print("label {} has {} samples".format(pickle_file, len(ds)))

check_dataset_is_balanced(train_datasets, "training set")
check_dataset_is_balanced(test_datasets, "test set")
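
To make "roughly the same number" concrete, you can compare the per-class counts directly; a small sketch over the same pickles:

import pickle
import numpy as np

def class_counts(datasets):
    counts = []
    for pickle_file in datasets:
        with open(pickle_file, 'rb') as f:
            counts.append(len(pickle.load(f)))
    return counts

counts = class_counts(train_datasets)
# A balanced dataset has a small spread relative to the mean count.
print(min(counts), max(counts), np.mean(counts), np.std(counts))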

Problem 5

Count the samples duplicated between the training, test, and validation sets:

import hashlib
import pickle

def count_duplicates(dataset1, dataset2):
    # Hash each image's raw bytes; identical digests mean identical images.
    # A set gives O(1) membership tests.
    hashes = set(hashlib.sha1(x.tobytes()).hexdigest() for x in dataset1)
    return sum(1 for x in dataset2
               if hashlib.sha1(x.tobytes()).hexdigest() in hashes)

with open('notMNIST.pickle', 'rb') as f:
    data = pickle.load(f)

print(count_duplicates(data['test_dataset'], data['valid_dataset']))
print(count_duplicates(data['valid_dataset'], data['train_dataset']))
print(count_duplicates(data['test_dataset'], data['train_dataset']))
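
The same hashing idea extends to the optional part of Problem 5: building sanitized validation/test sets by dropping every sample that also appears in the training set. A minimal sketch, assuming the arrays loaded from notMNIST.pickle above:

import hashlib
import numpy as np

def sanitize(dataset, labels, reference):
    # Keep only samples whose hash does not appear in the reference set.
    ref_hashes = set(hashlib.sha1(x.tobytes()).hexdigest() for x in reference)
    keep = [i for i, x in enumerate(dataset)
            if hashlib.sha1(x.tobytes()).hexdigest() not in ref_hashes]
    return dataset[keep], labels[keep]

valid_clean, valid_labels_clean = sanitize(
    data['valid_dataset'], data['valid_labels'], data['train_dataset'])
print(len(valid_clean))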

Problem 6

Train an off-the-shelf model on 50, 100, 1000, and 5000 training samples, and then on the full training set. LogisticRegression from sklearn.linear_model works out of the box.

from sklearn.linear_model import LogisticRegression

def train_and_predict(X_train, y_train, X_test, y_test):
    # Flatten the 28x28 images into 784-dimensional feature vectors.
    lr = LogisticRegression()
    lr.fit(X_train.reshape(X_train.shape[0], 28 * 28), y_train)
    print(lr.score(X_test.reshape(X_test.shape[0], 28 * 28), y_test))

def main():
    X_train = data["train_dataset"]
    y_train = data["train_labels"]
    X_test = data["test_dataset"]
    y_test = data["test_labels"]
    # Vary only the training size; always evaluate on the full test set.
    # size=None slices the whole array, i.e. uses all training samples.
    for size in [50, 100, 1000, 5000, None]:
        train_and_predict(X_train[:size], y_train[:size], X_test, y_test)

main()
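
Beyond a single accuracy number, a confusion matrix shows which letters the model mixes up. This is not part of the assignment, but here is a short sketch with sklearn.metrics, reusing data from Problem 5 and an assumed training size of 1000:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

X_train = data['train_dataset'][:1000].reshape(-1, 28 * 28)
X_test = data['test_dataset'].reshape(-1, 28 * 28)

lr = LogisticRegression().fit(X_train, data['train_labels'][:1000])
# Rows are true classes (A..J), columns are predicted classes.
print(confusion_matrix(data['test_labels'], lr.predict(X_test)))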
