Classification Tasks: Starting from MNIST (Part 1) — An Introduction to the MNIST Dataset


I've put my source code here. It is written as a Jupyter notebook and contains all the code for this series; feel free to grab it if you need it!

Attachment: click to download

The "Hello World" of computer vision: a simple implementation of MNIST handwritten digit recognition

About the MNIST dataset

The MNIST dataset consists of 70,000 images of digits handwritten by US high-school students and Census Bureau employees, and each image is labeled with the digit it represents. Whenever someone comes up with a new classification algorithm, MNIST is usually the first dataset they test it on.

Downloading MNIST

Downloading MNIST is very easy: you can fetch it directly through Scikit-Learn's API.

# Import packages, create the output directory, and define a figure-saving helper
import sklearn
import numpy as np
import os
np.random.seed(42)
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "classification"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    # Every figure produced in this post is automatically saved to the images/classification folder next to the code.
    plt.savefig(path, format=fig_extension, dpi=resolution)


# Downloading MNIST takes just these two lines
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

Datasets loaded by Scikit-Learn generally have a dict-like structure that includes (you can list the keys yourself, as sketched right after this list):

  • a DESCR key describing the dataset
  • a data key containing an array with one row per instance and one column per feature
  • a target key containing an array of labels
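
Since the returned object behaves like a dict, a quick way to check which keys are actually available is the one-liner below (my own addition, not part of the original notebook; the exact set of keys depends on the Scikit-Learn version):

>>>mnist.keys()

At a minimum, data, target, and DESCR should be among them.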

Let's take a look at DESCR first:

>>>mnist['DESCR']

Output:

"Author: Yann LeCun, Corinna Cortes, Christopher J.C. Burges \nSource: MNIST Website - Date unknown \nPlease cite: \n\nThe MNIST database of handwritten digits with 784 features, raw data available at: http://yann.lecun.com/exdb/mnist/. It can be split in a training set of the first 60,000 examples, and a test set of 10,000 examples \n\nIt is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal efforts on preprocessing and formatting. The original black and white (bilevel) images from NIST were size normalized to fit in a 20x20 pixel box while preserving their aspect ratio. The resulting images contain grey levels as a result of the anti-aliasing technique used by the normalization algorithm. the images were centered in a 28x28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28x28 field. \n\nWith some classification methods (particularly template-based methods, such as SVM and K-nearest neighbors), the error rate improves when the digits are centered by bounding box rather than center of mass. If you do this kind of pre-processing, you should report it in your publications. The MNIST database was constructed from NIST's NIST originally designated SD-3 as their training set and SD-1 as their test set. However, SD-3 is much cleaner and easier to recognize than SD-1. The reason for this can be found on the fact that SD-3 was collected among Census Bureau employees, while SD-1 was collected among high-school students. Drawing sensible conclusions from learning experiments requires that the result be independent of the choice of training set and test among the complete set of samples. Therefore it was necessary to build a new database by mixing NIST's datasets. \n\nThe MNIST training set is composed of 30,000 patterns from SD-3 and 30,000 patterns from SD-1. Our test set was composed of 5,000 patterns from SD-3 and 5,000 patterns from SD-1. The 60,000 pattern training set contained examples from approximately 250 writers. We made sure that the sets of writers of the training set and test set were disjoint. SD-1 contains 58,527 digit images written by 500 different writers. In contrast to SD-3, where blocks of data from each writer appeared in sequence, the data in SD-1 is scrambled. Writer identities for SD-1 is available and we used this information to unscramble the writers. We then split SD-1 in two: characters written by the first 250 writers went into our new training set. The remaining 250 writers were placed in our test set. Thus we had two sets with nearly 30,000 examples each. The new training set was completed with enough examples from SD-3, starting at pattern # 0, to make a full set of 60,000 training patterns. Similarly, the new test set was completed with SD-3 examples starting at pattern # 35,000 to make a full set with 60,000 test patterns. Only a subset of 10,000 test images (5,000 from SD-1 and 5,000 from SD-3) is available on this site. The full 60,000 sample training set is available.\n\nDownloaded from openml.org."

To sum up the description: each image has 784 features because it is $28\times28$ pixels; the original NIST digits were size-normalized into a 20x20 box and then centered in a 28x28 image by their center of mass; and the database was built by mixing NIST's SD-3 (written by Census Bureau employees, which is cleaner) with SD-1 (written by high-school students), so that the 60,000-image training set and the 10,000-image test set come from disjoint sets of writers.

Now let's look at data:

>>>type(mnist['data'])

Output:

numpy.ndarray

So it is a NumPy array. Let's check its shape:

>>>mnist['data'].shape
(70000, 784)

This means there are 70,000 images in total, each with 784 features, because every image is $28\times28$ pixels and each feature represents one pixel.
Viewing an image is also easy: just take one image, reshape it to two dimensions, and display it with imshow.
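
Before plotting anything, it can be worth confirming the value range of these features: each pixel is a grayscale intensity running from 0 (blank background) to 255 (darkest ink). A minimal check, which is my own addition rather than part of the original notebook:

>>>mnist['data'].min(), mnist['data'].max()

Both 0 and 255 should show up here (the exact dtype of the array depends on the Scikit-Learn version).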

Take the first image as an example:

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

X, y = mnist["data"], mnist["target"]

some_digit = X[0]     # take the first image
some_digit_image = some_digit.reshape(28, 28)  # reshape it to 2D
plt.imshow(some_digit_image, cmap=mpl.cm.binary)
plt.axis("off")

save_fig("first digit")
plt.show()

The output image is as follows:



Looks like a 5, doesn't it(?)

Let's print its label to confirm:

>>>y[0]
'5'

The label is indeed 5.
There is one small issue, though: the labels here are strings, and we usually prefer working with numbers, so let's convert all the labels in y to integers:

y = y.astype(np.uint8)
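
As the DESCR text above mentions, MNIST is conventionally split into a training set made of the first 60,000 images and a test set made of the remaining 10,000. A minimal sketch of that split (my own addition here; later posts in this series are assumed to start from something like it):

# Conventional MNIST split: first 60,000 images for training, last 10,000 for testing
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]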

Let's display a few more images:

# A helper function that stitches many digits into one big image
def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    # This is equivalent to n_rows = ceil(len(instances) / images_per_row):
    n_rows = (len(instances) - 1) // images_per_row + 1

    # We can append blank images to pad out the end of the grid:
    n_empty = n_rows * images_per_row - len(instances)
    padded_instances = np.concatenate([instances, np.zeros((n_empty, size * size))], axis=0)

    # Reshape the array so it is organized as a grid of 28×28 images:
    image_grid = padded_instances.reshape((n_rows, images_per_row, size, size))

    # Merge axes 0 and 2 (the vertical grid axis and the vertical image axis) and axes 1 and 3 (the horizontal ones).
    # First move the axes to be combined next to each other with transpose(), then reshape:
    big_image = image_grid.transpose(0, 2, 1, 3).reshape(n_rows * size,
                                                         images_per_row * size)
    # The big image is assembled; display it:
    plt.imshow(big_image, cmap = mpl.cm.binary, **options)
    plt.axis("off")
    
    
plt.figure(figsize=(9,9))  # create a 9×9 inch figure
example_images = X[:100]   # take the first 100 images
plot_digits(example_images, images_per_row=10)
save_fig("many_digits_plot")
plt.show()

The output image is as follows:


