Udacity's Deep Learning is an online course by Google built around hands-on TensorFlow tasks. It is short and focused, consisting of 4 chapters (intro to ML/DL, DNNs, CNNs, RNNs), 6 small assignments (delivered as ipynb notebooks, which is very convenient), and 1 final project (building a real-time camera application).
If you already have an ML/DL background, you can get through the videos quickly, so the real value of the course lies in its hands-on projects, which are a lot of fun. Coming from Google, it is about as authoritative a TensorFlow tutorial as you will find.
Course link: here
Assignment link: here
Below is my code for Assignment 1 of the course.
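All snippets below assume the setup cells from the official assignment notebook have already been run: the download/extract steps there define train_folders, test_folders, train_datasets, and test_datasets, and write notMNIST.pickle. A minimal sketch of the shared imports the snippets rely on (the variable names come from that notebook, not from here):

import os
import pickle
import numpy as np
import matplotlib.pyplot as plt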
Use IPython.display to visualize a few sample images:
from IPython.display import display, Image

def visualize(folders):
    # Show one randomly chosen image from each class folder.
    for folder_path in folders:
        fnames = os.listdir(folder_path)
        random_index = np.random.randint(len(fnames))
        fname = fnames[random_index]
        display(Image(filename=os.path.join(folder_path, fname)))

print("train_folders")
visualize(train_folders)
print("test_folders")
visualize(test_folders)
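Note that display(Image(...)) renders the images inline only inside a Jupyter/IPython notebook; run as a plain Python script, nothing will be shown.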
Use matplotlib.pyplot to visualize samples from the pickled datasets:
def visualize_datasets(datasets):
    # Plot one random sample from the first letter pickle in the list.
    for dataset in datasets:
        with open(dataset, 'rb') as f:
            letter = pickle.load(f)
        sample_idx = np.random.randint(len(letter))
        sample_image = letter[sample_idx, :, :]
        fig = plt.figure()
        plt.imshow(sample_image)
        break  # only the first dataset; drop this line to see every letter

visualize_datasets(train_datasets)
visualize_datasets(test_datasets)
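Because of the break above, only the first letter is shown. If you want one random sample from every letter in a single figure, a small variation works (a sketch; visualize_all is my own name, assuming the same list of pickle paths):

def visualize_all(datasets):
    # One random sample per letter pickle, laid out in a single row.
    fig, axes = plt.subplots(1, len(datasets), figsize=(2 * len(datasets), 2))
    for ax, dataset in zip(axes, datasets):
        with open(dataset, 'rb') as f:
            letter = pickle.load(f)
        ax.imshow(letter[np.random.randint(len(letter))], cmap='gray')
        ax.axis('off')
    plt.show()

visualize_all(train_datasets)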
Check that the dataset is balanced (each class has roughly the same number of samples):
def check_dataset_is_balanced(datasets, notation=None):
    print(notation)
    # Each entry in `datasets` is the path to one letter's pickled array.
    for label in datasets:
        with open(label, 'rb') as f:
            ds = pickle.load(f)
        print("label {} has {} samples".format(label, len(ds)))

check_dataset_is_balanced(train_datasets, "training set")
check_dataset_is_balanced(test_datasets, "test set")
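To turn "roughly the same" into a number, you can compare the smallest and largest class counts (a quick sketch over the same pickles):

counts = []
for dataset in train_datasets:
    with open(dataset, 'rb') as f:
        counts.append(len(pickle.load(f)))
# A small relative spread between min and max means the classes are balanced.
print("min {}, max {}, relative spread {:.1%}".format(
    min(counts), max(counts), (max(counts) - min(counts)) / max(counts)))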
Count the samples that are duplicated between the training, test, and validation sets:
import hashlib

def count_duplicates(dataset1, dataset2):
    # Hash every image in dataset1, then count how many images in
    # dataset2 share a hash. A set makes each membership test O(1).
    hashes = set(hashlib.sha1(x).hexdigest() for x in dataset1)
    dup_indices = [i for i in range(len(dataset2))
                   if hashlib.sha1(dataset2[i]).hexdigest() in hashes]
    return len(dup_indices)

with open('notMNIST.pickle', 'rb') as f:
    data = pickle.load(f)
print(count_duplicates(data['test_dataset'], data['valid_dataset']))
print(count_duplicates(data['valid_dataset'], data['train_dataset']))
print(count_duplicates(data['test_dataset'], data['train_dataset']))
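Knowing how many overlaps there are, a natural next step is to build sanitized validation and test sets with the overlapping images removed (a sketch reusing the same hashing idea; sanitize and the *_clean names are mine):

def sanitize(dataset, labels, reference):
    # Keep only rows whose SHA-1 does not appear in the reference set.
    ref_hashes = set(hashlib.sha1(x).hexdigest() for x in reference)
    keep = [i for i, x in enumerate(dataset)
            if hashlib.sha1(x).hexdigest() not in ref_hashes]
    return dataset[keep], labels[keep]

valid_clean, valid_labels_clean = sanitize(
    data['valid_dataset'], data['valid_labels'], data['train_dataset'])
test_clean, test_labels_clean = sanitize(
    data['test_dataset'], data['test_labels'], data['train_dataset'])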
Train an off-the-shelf model on 50, 100, 1000, 5000, and finally all of the training samples, using LogisticRegression from sklearn.linear_model:
from sklearn.linear_model import LogisticRegression

def train_and_predict(X_train, y_train, X_test, y_test):
    lr = LogisticRegression()
    # Flatten the 28x28 images into 784-dimensional feature vectors.
    X_train = X_train.reshape(X_train.shape[0], 28 * 28)
    lr.fit(X_train, y_train)
    X_test = X_test.reshape(X_test.shape[0], 28 * 28)
    print(lr.score(X_test, y_test))

def main():
    X_train = data["train_dataset"]
    y_train = data["train_labels"]
    X_test = data["test_dataset"]
    y_test = data["test_labels"]
    # size=None slices the whole array, i.e. trains on all samples.
    # Always evaluate on the full test set, not a truncated slice of it.
    for size in [50, 100, 1000, 5000, None]:
        train_and_predict(X_train[:size], y_train[:size], X_test, y_test)

main()
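One caveat: on recent scikit-learn versions the default lbfgs solver may warn that it hit its iteration limit on the larger subsets; if that happens, raising max_iter (an assumption about your scikit-learn version, not part of the original code) lets the runs converge:

lr = LogisticRegression(max_iter=1000)  # assumption: newer scikit-learn defaults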