线性回归
y = ax + b
advertising 数据分析
数据集
数据的读取
手写读取数据
f = file(path) x = [] y = [] for i, d in enumerate(f): if i == 0: continue d = d.strip() if not d: continue d = map(float, d.split(',')) x.append(d[1:-1]) y.append(d[-1]) print x print y x = np.array(x) y = np.array(y)
使用 Python自带库
import csv f = file(path, 'rb') print f d = csv.reader(f) for line in d: print line f.close()
使用numpy:
p = np.loadtxt(path, delimiter=',', skiprows=1) print p
完整读取代码
import numpy as np import matplotlib.pyplot as plt import pandas as pd path = 'advertising.csv' data = pd.read_csv(path) # TV、Radio、Newspaper、Sales x = data[['TV', 'Radio', 'Newspaper']] y = data['Sales']
数据显示
绘制1
plt.plot(data['TV'], y, 'ro', label='TV') plt.plot(data['Radio'], y, 'g^', label='Radio') plt.plot(data['Newspaper'], y, 'mv', label='Newspaer') plt.legend(loc='lower right') plt.grid() plt.show()
绘制2
plt.figure(figsize=(9,12)) plt.subplot(311) # 分成 3*1 占用第一行 plt.plot(data['TV'], y, 'ro') plt.title('TV') plt.grid() plt.subplot(312) plt.plot(data['Radio'], y, 'g^') plt.title('Radio') plt.grid() plt.subplot(313) plt.plot(data['Newspaper'], y, 'b*') plt.title('Newspaper') plt.grid() plt.tight_layout() plt.show()
数据拆分
样本数据只有一份,所以需要一部分作为训练数据,剩下的部分作为测试数据。
x_train,y_train 为训练数据。 x_test, y_test 为测试数据。
这里使用 sklearn 库
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1) # print x_train, y_train
还可以指定训练数据大小
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=1) # print x_train, y_train
为了避免每次 python 都随机选择训练数据和测试数据,导致每次结果都不一样,这里 random_state 指定为1, 以便于测试。train_size 设置训练数据和测试数据的比例,可以不设置。
线性模型
linreg = LinearRegression()
利用线性模型做拟合,得到模型:
model = linreg.fit(x_train, y_train) print model
得到系数:
print linreg.coef_ [ 0.04647758 0.18534601 -0.00231602]
因为 y= ax1 + bx2 + cx3 + b
所以是三个系数
得到截距:
print linreg.intercept_
利用得到的模型预测:
y_hat = linreg.predict(np.array(x_test))
求 均方误差(Mean Squared Error, MSE):
mse = np.average((y_hat - np.array(y_test)) ** 2) # Mean Squared Error rmse = np.sqrt(mse) # Root Mean Squared Error print mse, rmse 1.97304562023 1.40465142303
将预测值和真实值画到一张图上,看一下预测效果:
t = np.arange(len(x_test)) plt.plot(t, y_test, 'r-', linewidth=2, label='Test') plt.plot(t, y_hat, 'g-', linewidth=2, label='Predict') plt.legend(loc='upper right') plt.grid() plt.show()
红色线为测试数据, 绿色线为根据模型到预测数据
完整代码
import numpy as np import matplotlib.pyplot as plt import pandas as pd from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression if __name__ == "__main__": path = 'advertising.csv' data = pd.read_csv(path) # TV、Radio、Newspaper、Sales x = data[['TV', 'Radio', 'Newspaper']] y = data['Sales'] x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1) # print x_train, y_train linreg = LinearRegression() model = linreg.fit(x_train, y_train) y_hat = linreg.predict(np.array(x_test)) mse = np.average((y_hat - np.array(y_test)) ** 2) # Mean Squared Error rmse = np.sqrt(mse) # Root Mean Squared Error print mse, rmse t = np.arange(len(x_test)) plt.plot(t, y_test, 'r-', linewidth=2, label='Test') plt.plot(t, y_hat, 'g-', linewidth=2, label='Predict') plt.legend(loc='upper right') plt.grid() plt.show()
分析
我们发现 newspaper 和 y 似乎没有关系, 所以直接去掉这一列重新分析,代码如下:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
if __name__ == "__main__":
path = 'advertising.csv'
data = pd.read_csv(path) # TV、Radio、Newspaper、Sales
x = data[['TV', 'Radio']]
y = data['Sales']
#print x
#print y
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.9, random_state=1)
linreg = LinearRegression()
model = linreg.fit(x_train, y_train)
y_hat = linreg.predict(np.array(x_test))
mse = np.average((y_hat - np.array(y_test)) ** 2) # Mean Squared Error
rmse = np.sqrt(mse) # Root Mean Squared Error
print mse, rmse
1.92627604187 1.38790346994
问题
train_size 设置 0.9, 结果反而不好了
2.4823957548 1.5755620441
你的训练数据越多,那么这个直线就会去尝试拟合更多的点。
这个直线就会摆动。摆动,就导致了w和b的变化。
两点就能确定直线了。点越多,直线拟合效果越差。
- mse, rmse 如何解释?
均方差,均方根