Python Machine Learning and Practice, Basics: Classic Unsupervised Learning Models (Feature Dimensionality Reduction)

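Feature dimensionality reduction not only reconstructs effective low-dimensional feature vectors, it also makes it possible to visualize the data. Among dimensionality reduction methods, Principal Component Analysis (PCA) is the most classic and practical technique, and it performs particularly well as a supporting step in image recognition.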

1. Principal Component Analysis

Example: computing the rank of a linearly dependent matrix

import numpy as np

# Initialize a 2x2 matrix whose rows are linearly dependent
M = np.array([[1, 2], [2, 4]])
# Compute the rank of this 2x2 matrix (prints 1)
print(np.linalg.matrix_rank(M, tol=None))
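
Why the rank matters for dimensionality reduction can be seen from the singular values of the same matrix: only one of them is non-zero, so a single direction is enough to describe both rows. The following is a minimal sketch that is not part of the original example; it assumes only NumPy, and the names `singular_values` and `rank_from_svd` are illustrative.

```python
import numpy as np

# Same linearly dependent 2x2 matrix as above: the second row is twice the first.
M = np.array([[1, 2], [2, 4]], dtype=float)

# Singular values, returned in descending order.
singular_values = np.linalg.svd(M, compute_uv=False)

# Count the singular values that are clearly non-zero; this count equals the rank.
rank_from_svd = int(np.sum(singular_values > 1e-10))

print(singular_values)
print(rank_from_svd)
```

The threshold `1e-10` is an arbitrary choice for this sketch; `np.linalg.matrix_rank` picks its own tolerance when `tol=None`.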

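The idea behind PCA is to first map the original feature space so that the features in the new, mapped space are mutually orthogonal. In this way, principal component analysis retains, as far as possible, the low-dimensional features that carry discriminative information.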
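
The mapping described above can be sketched by hand with an eigen-decomposition of the covariance matrix. This is a minimal illustration under stated assumptions, not the implementation used by scikit-learn: the toy data, the mixing matrix and the choice `k = 2` are all made up for the example.

```python
import numpy as np

# Toy data (an assumption for illustration): 200 samples, 3 correlated features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 1.0],
                   [0.0, 1.0, -1.0]])
X = latent @ mixing + 0.05 * rng.normal(size=(200, 3))

# 1. Center the data so that each feature has zero mean.
mean = X.mean(axis=0)
X_centered = X - mean

# 2. Eigen-decompose the covariance matrix (it is symmetric, so eigh applies).
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort the directions by decreasing variance and keep the top k.
order = np.argsort(eigenvalues)[::-1]
k = 2
components = eigenvectors[:, order[:k]]  # shape (3, k); columns are directions

# 4. Project the centered data onto the kept directions.
X_projected = X_centered @ components    # shape (200, k)

# The kept directions are orthonormal ...
print(np.round(components.T @ components, 6))
# ... and the projected features are uncorrelated (diagonal covariance).
print(np.round(np.cov(X_projected, rowvar=False), 6))
```

The first printed matrix should be close to the identity and the second close to diagonal, which is the sense in which the new features are orthogonal to each other.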

Case study: handwritten digit image recognition

Displaying the two-dimensional distribution of handwritten digit images after PCA compression

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
@File  : PCAdigits.py
@Author: Xinzhe.Pang
@Date  : 2019/7/23 20:05
@Desc  : 
"""
import pandas as pd
import numpy as np

# Read the training and test data for the handwritten digit recognition task from the internet
digits_train = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tra',
                           header=None)
digits_test = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/optdigits.tes',
                          header=None)
# Split the training data into feature vectors (the 64 pixel columns) and labels
X_digits = digits_train[np.arange(64)]
y_digits = digits_train[64]

# Import PCA from sklearn.decomposition
from sklearn.decomposition import PCA

# Initialize a PCA that compresses the high-dimensional (64-D) feature vectors down to 2 dimensions
estimator = PCA(n_components=2)
X_pca = estimator.fit_transform(X_digits)

# Show the 2-D distribution of the 10 classes of handwritten digit images after PCA compression
from matplotlib import pyplot as plt


def plot_pca_scatter():
    colors = ['black', 'blue', 'purple', 'yellow', 'white', 'red', 'lime', 'cyan', 'orange', 'gray']

    for i in range(len(colors)):
        # Series.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
        px = X_pca[:, 0][y_digits.to_numpy() == i]
        py = X_pca[:, 1][y_digits.to_numpy() == i]
        plt.scatter(px, py, c=colors[i])

    plt.legend(np.arange(0, 10).astype(str))
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
    plt.show()


plot_pca_scatter()

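Next, the original pixel features and the low-dimensional features reconstructed by PCA compression are each used for image recognition with identically configured support vector machine (classification) models.

# Split the training and test data into feature vectors (image pixels) and classification targets.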
X_train = digits_train[np.arange(64)]
y_train = digits_train[64]
X_test = digits_test[np.arange(64)]
y_test = digits_test[64]

# Import the support vector machine classifier with a linear kernel
from sklearn.svm import LinearSVC

# Train a LinearSVC with default parameters on the original 64-D pixel features and predict on the test data
svc = LinearSVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

# Use PCA to compress the 64-D image data to 20 dimensions
estimator = PCA(n_components=20)

# Use the training features to determine (fit) the directions of the 20 orthogonal dimensions, then transform the training features
pca_X_train = estimator.fit_transform(X_train)

# Apply the same transformation, fitted on the training data, to the test features
pca_X_test = estimator.transform(X_test)

# Train a LinearSVC with default parameters on the compressed 20-D training features and predict on the test data
pca_svc = LinearSVC()
pca_svc.fit(pca_X_train, y_train)
pca_y_pred = pca_svc.predict(pca_X_test)

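Difference in recognition performance between the original pixel features and the low-dimensional features reconstructed by PCA compression, on identically configured support vector machine (classification) models.

# Import classification_report from sklearn.metrics for a more detailed analysis of classification performance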
from sklearn.metrics import classification_report

# Evaluate the SVM classifier trained on the original high-dimensional pixel features
print(svc.score(X_test, y_test))
print(classification_report(y_test, y_pred, target_names=np.arange(10).astype(str)))

# Evaluate the SVM classifier trained on the low-dimensional features reconstructed by PCA compression
print(pca_svc.score(pca_X_test, y_test))
print(classification_report(y_test, pca_y_pred, target_names=np.arange(10).astype(str)))

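Dimensionality reduction, or compression, selects representative features of the data: while preserving the data's diversity (variance), it avoids a large amount of feature redundancy and noise. The process may, however, also lose some useful pattern information.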
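
The trade-off between retained variance and lost information can be made concrete by measuring both as the number of kept components grows. The sketch below is an illustration on synthetic data rather than the digits set; the data shape, the noise level and the values of `k` are assumptions chosen for the example.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 300 samples, 10 features,
# most of whose variance lives in 3 underlying directions plus small noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 10))

# Center the data and take its SVD; the rows of Vt are the principal directions.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Fraction of the total variance captured by each component, and its running sum.
explained_ratio = S**2 / np.sum(S**2)
cumulative = np.cumsum(explained_ratio)

for k in (1, 3, 10):
    # Project onto the first k directions, then map back to the original space.
    X_reconstructed = (X_centered @ Vt[:k].T) @ Vt[:k]
    error = np.mean((X_centered - X_reconstructed) ** 2)
    print(k, round(cumulative[k - 1], 4), round(error, 6))
```

Each printed row lists the number of components, the cumulative share of variance kept, and the mean squared reconstruction error. Whatever variance is not kept is exactly what the reconstruction loses, which is where useful pattern information can disappear.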
