E資格_Machine Learning

Introduction

This is the machine learning report for the Rabbit Challenge (ラビットチャレンジ).

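Linear regression model

A linear regression model is a model in which the objective (dependent) variable is expressed as a linear function, or a close approximation of one, of the explanatory (independent) variables.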
image.png

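For example, with advertising spend as the explanatory variable and sales as the objective variable, assuming that the relationship between the two is linear lets us use a linear regression model to predict sales from advertising spend.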

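Linear regression is a regression (prediction) task in supervised learning, and its parameters are usually estimated by the least squares method.
$y = β_{0} + β_{1} x + ϵ$
Here, $y$ is the response variable, $x$ is the explanatory variable, $β_{0}$ is the intercept, $β_{1}$ is the slope, and $ϵ$ is the error term.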

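The sum of the squared errors $ϵ_{i}$ over all data points is called the residual sum of squares (SSE), and the least squares method chooses the parameters that minimize it.
The residual sum of squares, which measures the gap between the predicted and observed values, is
$SSE = \sum_{i = 1}^{n} ϵ_{i}^{2} = \sum_{i = 1}^{n} (y_{i} - β_{0} - β_{1} x_{i})^{2}$
Because SSE is a quadratic function of $β_{0}$ and $β_{1}$, taking the partial derivative with respect to each, setting it to 0, and solving the resulting simultaneous equations gives their estimates.
$\frac{\partial SSE}{\partial β_{0}} = 0$

$\frac{\partial SSE}{\partial β_{1}} = 0$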
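
As a minimal sketch of what solving these two equations yields: the estimates have the closed form $β_{1} = \sum (x_{i} - \bar{x})(y_{i} - \bar{y}) / \sum (x_{i} - \bar{x})^{2}$ and $β_{0} = \bar{y} - β_{1} \bar{x}$. The synthetic data and variable names below are assumptions made for illustration and are not part of the original report.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 2 + 3x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=200)

# Solving dSSE/dβ0 = 0 and dSSE/dβ1 = 0 gives these closed-form estimates
x_mean, y_mean = x.mean(), y.mean()
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Residual sum of squares at the estimates
sse = np.sum((y - beta0 - beta1 * x) ** 2)
print(beta0, beta1, sse)
```

The estimates should land close to the intercept 2 and slope 3 used to generate the data, and they agree with what a library least-squares fit such as `np.polyfit(x, y, 1)` returns.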

Linear Regression_Report Assignment

What is the predicted price of a property with a crime rate of 0.3 and 4 rooms?

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Import the required libraries
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Load the Boston housing price data (whitespace-separated, no header row)
boston = pd.read_csv("/content/drive/MyDrive/ラビットチャレンジ/機械学習/housing.data", sep=r"\s+", header=None)
# Select the features and the objective variable
X = boston.iloc[:, [0, 5]]  # crime rate and number of rooms
print(X)
y = boston.iloc[:, -1]      # price
print(y)
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict the price of a property with a crime rate of 0.3 and 4 rooms
model.predict([[0.3, 4]])

Execution result

                 0      5
0    0.00632  6.575
1    0.02731  6.421
2    0.02729  7.185
3    0.03237  6.998
4    0.06905  7.147
..       ...    ...
501  0.06263  6.593
502  0.04527  6.120
503  0.06076  6.976
504  0.10959  6.794
505  0.04741  6.030

[506 rows x 2 columns]
0      24.0
1      21.6
2      34.7
3      33.4
4      36.2
       ... 
501    22.4
502    20.6
503    23.9
504    22.0
505    11.9
Name: 13, Length: 506, dtype: float64
array([3.51998865])

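From the above, the price of a property with a crime rate of 0.3 and 4 rooms is predicted to be about 3,520 USD (the price column is expressed in units of 1,000 USD).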

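Nonlinear regression model

What is a nonlinear regression model?

A nonlinear regression model is a type of regression analysis used when the relationship between the explanatory variables and the objective variable is not linear, that is, when the dependent variable cannot be described as a straight-line function of the independent variables.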

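For one-dimensional input, nonlinear regression based on basis functions uses, for example, polynomial bases (degree 1 to 9) or Gaussian bases.
For two-dimensional input, two-dimensional Gaussian basis functions can be used.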
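
The following is a minimal sketch of basis-function regression for one-dimensional input: the input is expanded with a polynomial or Gaussian basis, and the weights are then fitted by ordinary least squares, because the model is still linear in its weights. The helper names (`polynomial_basis`, `gaussian_basis`, `fit_and_score`), the synthetic sine data, and the chosen degree, centers, and width are all assumptions made for illustration.

```python
import numpy as np

def polynomial_basis(x, degree):
    """Design matrix with columns 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def gaussian_basis(x, centers, width):
    """Design matrix with a bias column plus one Gaussian bump per center."""
    bumps = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), bumps])

def fit_and_score(Phi, y):
    """Least-squares weights for the design matrix Phi and the resulting R^2."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    residual = y - Phi @ w
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)
    return w, r2

# Synthetic 1-D data (an assumption for illustration)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, size=100))
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=100)

_, r2_linear = fit_and_score(polynomial_basis(x, 1), y)
_, r2_poly = fit_and_score(polynomial_basis(x, 5), y)
_, r2_gauss = fit_and_score(gaussian_basis(x, np.linspace(0.0, 1.0, 8), 0.15), y)
print(r2_linear, r2_poly, r2_gauss)
```

On this data the degree-1 fit should score clearly lower than the degree-5 polynomial and Gaussian fits, which is the same pattern the exercise later in this report shows with scikit-learn.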

Neural networks, which are used in deep learning, are also a kind of nonlinear regression model.

Parameter estimation

In nonlinear regression, the optimal parameters are obtained by maximizing the likelihood function.
$\hat{θ}_{MLE} = \arg\max_{θ} L(θ | x)$
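
As a hedged illustration of how maximum likelihood connects to least squares, suppose (this assumption is not stated in the original) that the observations follow $y_{i} = f(x_{i}; θ) + ϵ_{i}$ with independent Gaussian noise $ϵ_{i} \sim N(0, σ^{2})$. The symbols $f$ and $σ$ are introduced here only for this sketch. Under that assumption, maximizing the likelihood is equivalent to minimizing the residual sum of squares:

```latex
\begin{align}
L(θ \mid x) &= \prod_{i=1}^{n} \frac{1}{\sqrt{2πσ^{2}}}
  \exp\left(-\frac{(y_{i} - f(x_{i}; θ))^{2}}{2σ^{2}}\right) \\
\log L(θ \mid x) &= -\frac{n}{2} \log(2πσ^{2})
  - \frac{1}{2σ^{2}} \sum_{i=1}^{n} (y_{i} - f(x_{i}; θ))^{2} \\
\hat{θ}_{MLE} &= \arg\max_{θ} \log L(θ \mid x)
  = \arg\min_{θ} \sum_{i=1}^{n} (y_{i} - f(x_{i}; θ))^{2}
\end{align}
```

The first term of the log-likelihood does not depend on $θ$, so only the squared-error term matters for the maximization.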

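Overfitting and how to suppress it

Overfitting is the phenomenon in which a machine learning model fits the training data too closely and its performance on test data degrades. It occurs when the model is too complex relative to the training data, or when it fits the noise and outliers contained in the training data.

One way to suppress overfitting is regularization; representative methods are L1 regularization and L2 regularization.
・L1 regularization adds a penalty equal to the sum of the absolute values of the weights, which tends to produce sparse solutions (many weights become exactly 0).
$L_{1}(w) = Loss(w) + λ \|w\|_{1}$
・L2 regularization adds a penalty equal to the sum of the squares of the parameters.
$L_{2}(w) = Loss(w) + \frac{λ}{2} \|w\|_{2}^{2}$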
image.png
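
The difference between the two penalties can be sketched with scikit-learn's `Lasso` (L1) and `Ridge` (L2). The synthetic data, in which only the first 2 of 10 features actually influence the target, and the `alpha` values (which play the role of $λ$, although scikit-learn scales the loss term slightly differently from the formulas above) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): only the first 2 of 10 features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# alpha plays the role of λ; the values are arbitrary choices for this sketch
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty

print("L1 nonzero weights:", np.count_nonzero(lasso.coef_))
print("L2 nonzero weights:", np.count_nonzero(ridge.coef_))
```

With the L1 penalty, the weights of the irrelevant features are typically driven to exactly 0, while the L2 penalty only shrinks them toward 0 without eliminating them.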

Nonlinear Regression_Exercise Report

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# seaborn settings
sns.set()
# Change the background style
sns.set_style("darkgrid", {'grid.linestyle': '--'})
# Change the size (scale)
sns.set_context("paper")

n = 100

def true_func(x):
    z = 1-48*x+218*x**2-315*x**3+145*x**4
    return z

def linear_func(x):
    z = x
    return z


# Generate data from the true function
data = np.random.rand(n).astype(np.float32)
data = np.sort(data)
target = true_func(data)

# Add noise
noise = 0.5 * np.random.randn(n)
target = target + noise

# Plot the noisy data (the label is needed so that plt.legend has an entry to show)

plt.scatter(data, target, label='data')

plt.title('NonLinear Regression')
plt.legend(loc=2)

from sklearn.linear_model import LinearRegression

clf = LinearRegression()
data = data.reshape(-1,1)
target = target.reshape(-1,1)
clf.fit(data, target)

p_lin = clf.predict(data)

plt.scatter(data, target, label='data')
plt.plot(data, p_lin, color='darkorange', marker='', linestyle='-', linewidth=1, markersize=6, label='linear regression')
plt.legend()
print(clf.score(data, target))
>0.3824904075958734 # the coefficient of determination is 0.38

image.png

The coefficient of determination of 0.38 and the plotted figure show that linear regression does not explain this data well.
Let's try nonlinear regression.

from sklearn.kernel_ridge import KernelRidge

clf = KernelRidge(alpha=0.0002, kernel='rbf')
clf.fit(data, target)

p_kridge = clf.predict(data)

plt.scatter(data, target, color='blue', label='data')

plt.plot(data, p_kridge, color='orange', linestyle='-', linewidth=3, markersize=6, label='kernel ridge')
plt.legend()
print(clf.score(data, target))
>0.8456228232350901

image.png

The coefficient of determination of about 0.85 and the plotted figure show that nonlinear regression explains the data.
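
The `score` method of scikit-learn regressors, used in both fits above, returns the coefficient of determination $R^{2} = 1 - SS_{res} / SS_{tot}$. A minimal sketch of that calculation follows; the function name `r2_score_manual` and the tiny example values are assumptions made for illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Tiny hand-made example (values are illustrative assumptions)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2_score_manual(y_true, y_true)                       # perfect prediction
baseline = r2_score_manual(y_true, np.full(4, y_true.mean()))   # always predict the mean
print(perfect, baseline)
```

A perfect prediction gives 1 and always predicting the mean gives 0, so a value of 0.38 means the straight line explains only a small part of the variation in the data.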

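Logistic regression model

What is logistic regression?

Logistic regression is a type of classification model that is applied when the objective variable is binary (0 or 1).
The output of the sigmoid function is interpreted as a probability between 0 and 1; an input is classified as 1 if this probability is greater than 0.5 and as 0 otherwise.
$f (x) = \frac{1}{1 + e^{- x}}$

For multiclass classification with $K$ classes, the sigmoid function is generalized to the softmax function:
$P (Y = k | X) = \frac{e^{β_{k} X}}{\sum_{j = 1}^{K} e^{β_{j} X}}$
The parameters of logistic regression are obtained by maximum likelihood estimation, and metrics for evaluating classification performance include accuracy, precision, recall, and the F1 score.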
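
The two functions and the 0.5 decision rule can be sketched in a few lines of NumPy. The weights `w`, bias `b`, and input `x` below are hypothetical values chosen only for illustration; they are not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Turn K scores into K probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical weights and input, chosen only for illustration
w, b = np.array([0.8, -1.2]), 0.1
x = np.array([1.0, 0.5])

p = sigmoid(w @ x + b)   # P(Y = 1 | x)
label = int(p > 0.5)     # decision rule with threshold 0.5
print(p, label)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())
```

With two classes, softmax reduces to the sigmoid: `softmax(np.array([z, 0.0]))[0]` equals `sigmoid(z)`.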

Logistic Regression_Report Assignment

Would a 30-year-old male passenger on the Titanic survive?

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# Load the titanic data csv file
titanic_df = pd.read_csv('/content/drive/MyDrive/ラビットチャレンジ/study_ai_ml_google/data/titanic_train.csv')

# Display the first rows to check the dataset
titanic_df.head(5)

image.png

# Drop the columns considered unnecessary for the prediction (everything except sex and age)
titanic_df.drop(['PassengerId','Pclass', 'Name', 'SibSp','Parch','Ticket','Fare','Cabin','Embarked'], axis=1, inplace=True)

# Fill the missing values in the Age column with the mean
titanic_df['AgeFill'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())

# Set Gender to 0 for female and 1 for male
titanic_df['Gender'] = titanic_df['Sex'].map({'female': 0, 'male': 1}).astype(int)

# Check the data
titanic_df.head(6)

image.png

# Build an array containing only age and gender
data2 = titanic_df.loc[:, ["AgeFill", "Gender"]].values

# Build a 1-D array containing only the survival flag
# (a 1-D label array avoids scikit-learn's column-vector warning in fit)
label2 = titanic_df["Survived"].values

# Implement the survival prediction model with logistic regression
from sklearn.linear_model import LogisticRegression
model2 = LogisticRegression()
model2.fit(data2, label2)

# Predict the survival of a 30-year-old man: class label, then class probabilities
print(model2.predict([[30,1]]))
print(model2.predict_proba([[30,1]]))

Output

      [0]
[[0.80668102 0.19331898]]

Since the predicted survival probability is about 19.3%, a 30-year-old man aboard the Titanic is predicted not to survive.

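Principal component analysis

What is principal component analysis?

Principal component analysis (PCA) is an unsupervised learning method that makes data easier to understand by summarizing many variables with a small number of new variables. In PCA, the data is often replaced by one to three variables (principal components). A principal component is a factor that represents a characteristic of the data, and the components are referred to as the first principal component, the second principal component, and so on.

PCA is used for dimensionality reduction, noise removal, and data visualization.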

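How to perform principal component analysis

① Compute the covariance matrix from the original data, then find its eigenvectors (the directions of the new axes) and eigenvalues (the variance of the data along each of those directions).
② Take the eigenvectors in descending order of their eigenvalues.
③ Project the data onto these eigenvectors as the new coordinate axes to extract the principal components.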
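
The three steps above can be sketched directly with NumPy. The synthetic three-dimensional data and all variable names are assumptions made for illustration; in practice `sklearn.decomposition.PCA` performs the equivalent computation.

```python
import numpy as np

# Synthetic correlated 3-D data (an assumption for illustration)
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = np.array([[2.0, 0.5, 0.1],
                   [0.0, 1.0, 0.3]])
X = latent @ mixing + 0.05 * rng.normal(size=(300, 3))

# Step 1: covariance matrix of the centered data, then its eigendecomposition
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh is for symmetric matrices; ascending order

# Step 2: sort the eigenvectors by descending eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 3: project the data onto the top-2 eigenvectors
components = eigvecs[:, :2]
X_pca = X_centered @ components

explained_ratio = eigvals[:2] / eigvals.sum()
print(X_pca.shape, explained_ratio)
```

The variance of each projected column equals the corresponding eigenvalue, which is why the eigenvalues divided by their total give the explained variance ratio.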

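Principal Component Analysis_Report Assignment

Check the classification accuracy when the 32-column data (30 of the columns are explanatory variables) is compressed to 2 dimensions.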

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
%matplotlib inline

data_breast_cancer = load_breast_cancer()

# Display the data with pandas
df_target = pd.DataFrame(data_breast_cancer["target"], columns=["target"])
df_data = pd.DataFrame(data_breast_cancer["data"], columns=data_breast_cancer["feature_names"])
cancer_df = pd.concat([df_target, df_data], axis=1)
print('cancer df shape: {}'.format(cancer_df.shape))

      cancer df shape: (569, 32)
id    diagnosis    radius_mean    texture_mean    perimeter_mean    area_mean    smoothness_mean    compactness_mean    concavity_mean    concave points_mean    ...    radius_worst    texture_worst    perimeter_worst    area_worst    smoothness_worst    compactness_worst    concavity_worst    concave points_worst    symmetry_worst    fractal_dimension_worst
0    842302    M    17.99    10.38    122.80    1001.0    0.11840    0.27760    0.30010    0.14710    ...    25.380    17.33    184.60    2019.0    0.16220    0.66560    0.7119    0.2654    0.4601    0.11890
1    842517    M    20.57    17.77    132.90    1326.0    0.08474    0.07864    0.08690    0.07017    ...    24.990    23.41    158.80    1956.0    0.12380    0.18660    0.2416    0.1860    0.2750    0.08902
2    84300903    M    19.69    21.25    130.00    1203.0    0.10960    0.15990    0.19740    0.12790    ...    23.570    25.53    152.50    1709.0    0.14440    0.42450    0.4504    0.2430    0.3613    0.08758
3    84348301    M    11.42    20.38    77.58    386.1    0.14250    0.28390    0.24140    0.10520    ...    14.910    26.50    98.87    567.7    0.20980    0.86630    0.6869    0.2575    0.6638    0.17300
4    84358402    M    20.29    14.34    135.10    1297.0    0.10030    0.13280    0.19800    0.10430    ...    22.540    16.67    152.20    1575.0    0.13740    0.20500    0.4000    0.1625    0.2364    0.07678
...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
564    926424    M    21.56    22.39    142.00    1479.0    0.11100    0.11590    0.24390    0.13890    ...    25.450    26.40    166.10    2027.0    0.14100    0.21130    0.4107    0.2216    0.2060    0.07115
565    926682    M    20.13    28.25    131.20    1261.0    0.09780    0.10340    0.14400    0.09791    ...    23.690    38.25    155.00    1731.0    0.11660    0.19220    0.3215    0.1628    0.2572    0.06637
566    926954    M    16.60    28.08    108.30    858.1    0.08455    0.10230    0.09251    0.05302    ...    18.980    34.12    126.70    1124.0    0.11390    0.30940    0.3403    0.1418    0.2218    0.07820
567    927241    M    20.60    29.33    140.10    1265.0    0.11780    0.27700    0.35140    0.15200    ...    25.740    39.42    184.60    1821.0    0.16500    0.86810    0.9387    0.2650    0.4087    0.12400
568    92751    B    7.76    24.54    47.92    181.0    0.05263    0.04362    0.00000    0.00000    ...    9.456    30.37    59.16    268.6    0.08996    0.06444    0.0000    0.0000    0.2871    0.07039

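# Import the classes used below
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix

# Extract the objective variable (1 = malignant, 0 = benign).
# cancer_df built from load_breast_cancer has a 'target' column
# (0 = malignant, 1 = benign) instead of the CSV 'diagnosis' column.
y = (cancer_df['target'] == 0).astype(int)

# Extract the explanatory variables (the 30 feature columns)
X = cancer_df.loc[:, 'mean radius':]

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train with logistic regression
logistic = LogisticRegressionCV(cv=10, random_state=0)
logistic.fit(X_train_scaled, y_train)

# Evaluate
print('Train score: {:.3f}'.format(logistic.score(X_train_scaled, y_train)))
print('Test score: {:.3f}'.format(logistic.score(X_test_scaled, y_test)))
print('Confustion matrix:\n{}'.format(confusion_matrix(y_true=y_test, y_pred=logistic.predict(X_test_scaled))))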

>Train score: 0.988
>Test score: 0.972
>Confustion matrix:
>[[89  1]
> [ 3 50]]
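
As a small sketch, the evaluation metrics can be derived from the confusion matrix reported above. It assumes scikit-learn's convention (rows are true classes, columns are predicted classes) and treats class 1 (malignant) as the positive class; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class
cm = np.array([[89, 1],
               [3, 50]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the cases predicted malignant, how many are malignant
recall = tp / (tp + fn)      # of the truly malignant cases, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The accuracy works out to 139/143, about 0.972, which matches the test score printed above. Recall (50/53) is the lowest of the values because 3 malignant cases were classified as benign.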

from sklearn.decomposition import PCA

# Compress to 2 dimensions
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
# Project the test data with the PCA fitted on the training data (do not refit on the test set).
# The output recorded below was produced with fit_transform on the test set,
# so rerunning this code may give slightly different numbers.
X_test_pca = pca.transform(X_test_scaled)
print('X_train_pca shape: {}'.format(X_train_pca.shape))
# X_train_pca shape: (426, 2)

# Explained variance ratio
print('explained variance ratio: {}'.format(pca.explained_variance_ratio_))
# explained variance ratio: [ 0.43315126  0.19586506]

# Train with logistic regression
logistic = LogisticRegressionCV(cv=10, random_state=0)
logistic.fit(X_train_pca, y_train)

# Evaluate
print('Train score: {:.3f}'.format(logistic.score(X_train_pca, y_train)))
print('Test score: {:.3f}'.format(logistic.score(X_test_pca, y_test)))
print('Confustion matrix:\n{}'.format(confusion_matrix(y_true=y_test, y_pred=logistic.predict(X_test_pca))))

>X_train_pca shape: (426, 2)
>explained variance ratio: [0.46662306 0.17180794]
>Train score: 0.965
>Test score: 0.916
>Confustion matrix:
>[[83  7]
> [ 5 48]]

# Draw a scatter plot
temp = pd.DataFrame(X_train_pca)
temp['Outcome'] = y_train.values
b = temp[temp['Outcome'] == 0]
m = temp[temp['Outcome'] == 1]
plt.scatter(x=b[0], y=b[1], marker='o') # benign cases are marked with circles
plt.scatter(x=m[0], y=m[1], marker='^') # malignant cases are marked with triangles
plt.xlabel('PC 1') # first principal component on the x-axis
plt.ylabel('PC 2') # second principal component on the y-axis

image.png

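Even after compressing to 2 dimensions, the accuracy remains above 90% (test score 0.916).
In the scatter plot, the round blue points and the triangular orange points are slightly mixed in places, but the two classes are distributed so that a boundary line could be drawn between them.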

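Support vector machine

What is a support vector machine?

A support vector machine (SVM) is a supervised machine learning method used for classification. It classifies data by separating the data points with a hyperplane, positioned with respect to the data points closest to it (the support vectors).
The distance between the decision boundary and the nearest data points is called the margin, and the parameters are estimated by maximizing the margin.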
image.png

A margin that allows some misclassification in advance in order to improve generalization performance is called a soft margin.

〇 Margin maximization
image.png
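
A minimal sketch of margin maximization with scikit-learn's linear `SVC` follows. The two well-separated synthetic clusters and the value of `C` are assumptions for illustration: a large `C` approximates a hard margin, while a smaller `C` tolerates more violations (a softer margin). For a linear SVM with decision function $w \cdot x + b$, the width of the margin is $2 / \|w\|$.

```python
import numpy as np
from sklearn import svm

# Two linearly separable point clouds (synthetic data, an assumption for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-3.0, -3.0], scale=0.5, size=(50, 2)),
               rng.normal(loc=[3.0, 3.0], scale=0.5, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# A large C approximates a hard margin; a small C gives a softer margin
model = svm.SVC(kernel="linear", C=1000.0)
model.fit(X, y)

w = model.coef_[0]                        # normal vector of the separating hyperplane
b = model.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)    # distance between the two margin boundaries

print("number of support vectors:", len(model.support_vectors_))
print("margin width:", margin_width)
```

Only the few points lying on the margin boundaries become support vectors; moving any other point slightly would leave the fitted boundary unchanged.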

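Kernel trick

When the data cannot be separated linearly, as in the image below, but can be classified after raising the dimension, the original features are passed through a function Φ that maps them into a higher-dimensional space, and the SVM is applied there, which makes classification possible.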
image.png

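Computing the high-dimensional features explicitly would be very expensive, so the kernel trick is used to reduce the computation: a kernel function evaluates the inner products in the high-dimensional space directly from the original features, without computing Φ explicitly.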
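
As a small worked example of this idea, the degree-2 polynomial kernel $k(x, z) = (x \cdot z)^{2}$ on two-dimensional input gives the same value as the inner product of the explicit three-dimensional features $Φ(x) = (x_{1}^{2}, \sqrt{2} x_{1} x_{2}, x_{2}^{2})$. The function names `phi` and `poly_kernel` and the sample points are assumptions chosen for illustration.

```python
import numpy as np

def phi(x):
    """Explicit feature map for 2-D input whose inner product equals (x . z)^2."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel, evaluated in the original 2-D space."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))   # inner product in the 3-D feature space
via_kernel = poly_kernel(x, z)             # same value without computing phi
print(explicit, via_kernel)
```

Both values come out to 1 here, up to floating-point rounding. For the RBF kernel used later in this report, the corresponding feature space is infinite-dimensional, so evaluating the kernel directly is the only practical option.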

Support Vector Machine_Report Exercise

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Define a function that draws the data and the model's predictions on a scatter plot
def plot_boundary(model, X, Y, target, xlabel, ylabel):
    cmap_dots = ListedColormap([ "#1f77b4", "#ff7f0e", "#2ca02c"])
    cmap_fills = ListedColormap([ "#c6dcec", "#ffdec2", "#cae7ca"])
    plt.figure(figsize=(5, 5))
    if model:
        XX, YY = np.meshgrid(
            np.linspace(X.min()-1, X.max()+1, 200),
            np.linspace(Y.min()-1, Y.max()+1, 200))
        pred = model.predict(np.c_[XX.ravel(), YY.ravel()]).reshape(XX.shape)
        plt.pcolormesh(XX, YY, pred, cmap=cmap_fills, shading="auto")
        plt.contour(XX, YY, pred, colors="gray") 
    plt.scatter(X, Y, c=target, cmap=cmap_dots)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.show()

Linear SVM model

# Create the dataset used for training and testing
X, y = make_blobs(
    random_state=4,
    n_features=2,
    centers=3,
    cluster_std=2,
    n_samples=500)

# Split into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a model with a linear SVM
model = svm.SVC(kernel="linear")
model.fit(X_train, y_train)

# Check the accuracy
pred = model.predict(X_test)
score = accuracy_score(y_test, pred)
print("Accuracy:", score*100, "%")

# Draw how this trained model classifies the data
df = pd.DataFrame(X_test)
plot_boundary(model, df[0], df[1], y_test, "df [0]", "df [1]")

image.png

Nonlinear SVM model (RBF kernel)

# Train a model with a Gaussian (RBF) kernel SVM (on the training data)
model = svm.SVC(kernel="rbf", gamma="scale")
model.fit(X_train, y_train)

# Check the accuracy (on the test data)
pred = model.predict(X_test)
score = accuracy_score(y_test, pred)
print("Accuracy:", score*100, "%")

# Draw how this trained model classifies the data (on the test data)
df = pd.DataFrame(X_test)
plot_boundary(model, df[0], df[1], y_test, "df [0]", "df [1]")