Machine Learning: Pipelines in sklearn

Using a pipeline, we can cut down on boilerplate code while making the machine-learning workflow much easier to follow.

For example, suppose we need the operations below. It is easy to see that the training and test phases repeat the same code:

vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()

# Fit every step on the training set
vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
clf.fit(tfidfX, ytrain)
predicted = clf.predict(tfidfX)

# Now evaluate all steps on the test set
# (transform/predict only -- never refit on test data)
vX = vect.transform(Xtest)
tfidfX = tfidf.transform(vX)
predicted = clf.predict(tfidfX)

With a pipeline, the code above collapses to:

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

predicted = pipeline.fit(Xtrain, ytrain).predict(Xtrain)

# Now evaluate all steps on the test set
predicted = pipeline.predict(Xtest)

Note that predict()-style methods only pass through the pipeline when its final step supports them: we can call fit_predict() on a pipeline only if the last step has a predict() method, and likewise fit_transform() on the whole pipeline only works if the last step has a transform() method.
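As a quick illustration of this pass-through rule (a minimal sketch, not from the original post; the data here is random):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(10, 5)

# The final step is a transformer, so the pipeline exposes fit_transform()...
transform_only = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=2)),
])
X2 = transform_only.fit_transform(X)

# ...but not predict(): accessing it raises AttributeError,
# because PCA has no predict() method.
# transform_only.predict(X)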

Cross validation with a pipeline

Consider the following example: we first reduce the handwritten-digits data with PCA, then predict the labels with logistic regression, and we cross-validate both the PCA dimensionality n_components and the logistic-regression regularization strength C through the pipeline. The main steps (all visible in the code below) are:

  1. Instantiate each component, e.g. pca = decomposition.PCA()
  2. Assemble the pipeline from (name, object) tuples, e.g. pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
  3. Define the candidate parameter values, e.g. n_components = [20, 40, 64]
  4. Instantiate the CV object, e.g. estimator = GridSearchCV(pipe, dict(pca__n_components=n_components, logistic__C=Cs))
import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

# Candidate parameter values for the grid search
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)

# Fit a standalone PCA so we can plot its explained-variance spectrum
pca.fit(X_digits)

estimator = GridSearchCV(pipe,
                         dict(pca__n_components=n_components,
                              logistic__C=Cs))
estimator.fit(X_digits, y_digits)

# Plot the PCA spectrum and mark the n_components chosen by CV
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')
plt.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')
plt.legend(prop=dict(size=12))
plt.show()
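Note the double-underscore convention used in the grid above: a step's parameter is addressed as <step_name>__<parameter_name>. The same naming works outside of GridSearchCV too, e.g. with set_params (continuing the example above; the value 30 is arbitrary):

# Set a step's parameter through the pipeline, then inspect the step
pipe.set_params(pca__n_components=30)
pipe.fit(X_digits, y_digits)
print(pipe.named_steps['pca'].n_components)   # -> 30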

Custom transformers

We can define our own transformer by inheriting from TransformerMixin and implementing fit() and transform(); a generic skeleton looks like this:

from sklearn.base import TransformerMixin

class CustomTransformer(TransformerMixin):

    def __init__(self, some_param=None):
        # hyperparameters are captured in __init__
        self.some_param = some_param

    def fit(self, X, y=None, **fit_params):
        # learn any state from the training data, then return self
        return self

    def transform(self, X, **transform_params):
        # return the transformed version of X
        return X
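Such a transformer drops into a Pipeline like any built-in step. A minimal sketch (CustomTransformer is the skeleton above; the classifier choice is arbitrary):

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('custom', CustomTransformer()),
    ('clf', LogisticRegression()),
])
# pipe.fit(Xtrain, ytrain) and pipe.predict(Xtest) then work as usual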

We can also process each feature separately, as in the larger pipeline below: every feature gets its own sub-pipeline, a FeatureUnion merges the resulting feature blocks, a second FeatureUnion stacks the predictions of several wrapped models, and a final estimator combines them. The custom transformers it refers to (ColumnExtractor, DenseTransformer, and so on) are user-defined; two of them, HourOfDayTransformer and ModelTransformer, are shown after the pipeline:

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('continuous', Pipeline([
            ('extract', ColumnExtractor(CONTINUOUS_FIELDS)),
            ('scale', Normalizer())
        ])),
        ('factors', Pipeline([
            ('extract', ColumnExtractor(FACTOR_FIELDS)),
            ('one_hot', OneHotEncoder(n_values=5)),
            ('to_dense', DenseTransformer())
        ])),
        ('weekday', Pipeline([
            ('extract', DayOfWeekTransformer()),
            ('one_hot', OneHotEncoder()),
            ('to_dense', DenseTransformer())
        ])),
        ('hour_of_day', HourOfDayTransformer()),
        ('month', Pipeline([
            ('extract', ColumnExtractor(['datetime'])),
            ('to_month', DateTransformer()),
            ('one_hot', OneHotEncoder()),
            ('to_dense', DenseTransformer())
        ])),
        ('growth', Pipeline([
            ('datetime', ColumnExtractor(['datetime'])),
            ('to_numeric', MatrixConversion(int)),
            ('regression', ModelTransformer(LinearRegression()))
        ]))
    ])),
    ('models', FeatureUnion([
        ('gbr', ModelTransformer(GradientBoostingRegressor())),
        ('dtr', ModelTransformer(DecisionTreeRegressor())),
        ('etr', ModelTransformer(ExtraTreesRegressor())),
        ('rfr', ModelTransformer(RandomForestRegressor())),
        ('par', ModelTransformer(PassiveAggressiveRegressor())),
        ('en', ModelTransformer(ElasticNet())),
        ('lasso', ModelTransformer(Lasso())),
        ('cluster', ModelTransformer(KMeans(n_clusters=2)))
    ])),
    ('estimator', LinearRegression())
])
The HourOfDayTransformer extracts the hour of day from a datetime column:

class HourOfDayTransformer(TransformerMixin):

    def transform(self, X, **transform_params):
        # pull the hour out of the 'datetime' column
        hours = DataFrame(X['datetime'].apply(lambda x: x.hour))
        return hours

    def fit(self, X, y=None, **fit_params):
        # stateless transformer: nothing to learn
        return self
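To see what it produces, here is a quick check on a toy DataFrame (a sketch, not from the original post):

import pandas as pd
from pandas import DataFrame

df = DataFrame({'datetime': pd.to_datetime(['2018-01-01 08:30:00',
                                            '2018-01-01 17:45:00'])})
print(HourOfDayTransformer().fit(df).transform(df))
# -> a single-column DataFrame containing 8 and 17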
The ModelTransformer wraps an arbitrary estimator so that its predictions can be used as features, e.g. inside a FeatureUnion:

class ModelTransformer(TransformerMixin):

    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        # fit the wrapped model, then return self as a transformer should
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        # the wrapped model's predictions become the transformed features
        return DataFrame(self.model.predict(X))
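For instance (a hypothetical sketch), wrapping several models turns their predictions into parallel feature columns, which is how the 'models' union in the large pipeline above works:

from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion

model_features = FeatureUnion([
    ('regression', ModelTransformer(LinearRegression())),
    ('cluster', ModelTransformer(KMeans(n_clusters=2))),
])
# model_features.fit(Xtrain, ytrain).transform(Xtest)
# -> one column of regression predictions, one of cluster labels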

FeatureUnion

sklearn.pipeline.FeatureUnion — scikit-learn 0.19.1 documentation

A FeatureUnion applies a list of transformers to the same input data in parallel and concatenates their outputs, so hand-crafted features can be combined with learned ones. In the sketch below, the custom transformers (LengthTransformer, MispellingCountTransformer, DenseTransformer) are assumed to be user-defined:

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('ngram_tf_idf', Pipeline([
            ('counts', CountVectorizer()),
            ('tf_idf', TfidfTransformer()),
            ('to_dense', DenseTransformer())
        ])),
        ('essay_length', LengthTransformer()),
        ('misspellings', MispellingCountTransformer())
    ])),
    ('classifier', MultinomialNB())
])

The whole 'features' union evaluates each of its component transformers on the same input and concatenates the resulting feature columns before handing them to the classifier.
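As a quick, self-contained illustration of that concatenation (a minimal sketch, not part of the original post): on the 4-feature iris data, a PCA keeping 2 components plus a SelectKBest keeping 1 column yields 3 output columns:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)
union = FeatureUnion([('pca', PCA(n_components=2)),
                      ('kbest', SelectKBest(k=1))])
print(union.fit(X, y).transform(X).shape)   # (150, 3): 2 PCA cols + 1 selected col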

In the example below, a FeatureUnion combines PCA-reduced features with a few of the original features chosen by univariate selection, feeds the combined features to an SVM classifier, and finally uses grid search to tune the PCA n_components, the SelectKBest k, and the SVM C:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris()

X, y = iris.data, iris.target

print(X.shape)   # (150, 4)

# This dataset is way too high-dimensional. Better do PCA:
pca = PCA()

# Maybe some original features were good, too?
selection = SelectKBest()

# Build estimator from PCA and Univariate selection:

svm = SVC(kernel='linear')

# Do grid search over k, n_components and C:

pipeline = Pipeline([('features', FeatureUnion([('pca', pca),
                                                ('univ_select', selection)])),
                     ('svm', svm)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)

print(grid_search.best_estimator_)
print(grid_search.best_params_)
print(grid_search.best_score_)
