As shown in the figure, a pipeline lets us cut down on repetitive code while making the machine-learning workflow more intuitive. For example, suppose we need the following steps; it is easy to see that the same code is repeated for the training and the test set:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = SGDClassifier()

# Fit every step on the training set
vX = vect.fit_transform(Xtrain)
tfidfX = tfidf.fit_transform(vX)
predicted = clf.fit(tfidfX, ytrain).predict(tfidfX)

# Now evaluate all steps on test set
vX = vect.transform(Xtest)
tfidfX = tfidf.transform(vX)
predicted = clf.predict(tfidfX)
With a pipeline, the code above can be condensed to:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
predicted = pipeline.fit(Xtrain, ytrain).predict(Xtrain)

# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)
Note that a pipeline only exposes predict() (and fit_predict()) when its final step provides the corresponding method; likewise, it only exposes transform()/fit_transform() when the final step has a transform() method.
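To make this concrete, here is a minimal sketch (the estimators and step names are arbitrary choices for illustration, not taken from the example above): a pipeline ending in a classifier gains predict(), while one ending in a transformer gains fit_transform() but no predict().

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Final step is a classifier, so the pipeline exposes predict()
clf_pipe = Pipeline([('scale', StandardScaler()), ('clf', LogisticRegression())])
print(clf_pipe.fit(X, y).predict(X)[:5])

# Final step is a transformer, so the pipeline exposes fit_transform() but not predict()
feat_pipe = Pipeline([('scale', StandardScaler()), ('pca', PCA(n_components=2))])
print(feat_pipe.fit_transform(X).shape)   # (150, 2)
print(hasattr(feat_pipe, 'predict'))      # False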
Using a pipeline for cross-validation
Consider the following example: the handwritten-digits data is first reduced with PCA, and the labels are then predicted with logistic regression. Through the pipeline we cross-validate both the PCA dimensionality n_components and the logistic-regression regularization parameter C. The main steps are:
Instantiate each component in turn (here the PCA and LogisticRegression objects);
assemble the pipeline from (name, object) tuples;
define the parameters to cross-validate (here n_components and Cs);
instantiate the CV object (here GridSearchCV) with the pipeline and the parameter grid.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

# Prediction
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)

# Fit PCA once so we can plot its explained-variance spectrum
pca.fit(X_digits)

estimator = GridSearchCV(pipe,
                         dict(pca__n_components=n_components,
                              logistic__C=Cs))
estimator.fit(X_digits, y_digits)

plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')
plt.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
            linestyle=':',
            label='n_components chosen')
plt.legend(prop=dict(size=12))
plt.show()
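Note how the grid-search parameters reach into the pipeline with the <step name>__<parameter name> convention (pca__n_components, logistic__C). The same convention works with set_params(); a tiny sketch using the pipe defined above:

# <step name>__<parameter name> addresses a parameter of that step
pipe.set_params(pca__n_components=30, logistic__C=1.0)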
We can also define our own transformer, as follows (from …):
from sklearn.base import TransformerMixin

# Skeleton of a user-defined transformer; the names below are placeholders.
class CustomTransformer(TransformerMixin):

    def __init__(self):
        self.some_state = None                 # parameters / fitted state go here

    def transform(self, X, **transform_params):
        return X                               # replace with the actual transformation

    def fit(self, X, y=None, **fit_params):
        return self                            # fit() returns self so it can be chained
We can also handle each feature separately, as in the fairly large pipeline below (from …):
# ColumnExtractor, DenseTransformer, DayOfWeekTransformer, DateTransformer and
# MatrixConversion are user-defined transformers from the original post;
# HourOfDayTransformer and ModelTransformer are defined below.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('continuous', Pipeline([
            ('extract', ColumnExtractor(CONTINUOUS_FIELDS)),
            ('scale', Normalizer())
        ])),
        ('factors', Pipeline([
            ('extract', ColumnExtractor(FACTOR_FIELDS)),
            ('one_hot', OneHotEncoder(n_values=5)),
            ('to_dense', DenseTransformer())
        ])),
        ('weekday', Pipeline([
            ('extract', DayOfWeekTransformer()),
            ('one_hot', OneHotEncoder()),
            ('to_dense', DenseTransformer())
        ])),
        ('hour_of_day', HourOfDayTransformer()),
        ('month', Pipeline([
            ('extract', ColumnExtractor(['datetime'])),
            ('to_month', DateTransformer()),
            ('one_hot', OneHotEncoder()),
            ('to_dense', DenseTransformer())
        ])),
        ('growth', Pipeline([
            ('datetime', ColumnExtractor(['datetime'])),
            ('to_numeric', MatrixConversion(int)),
            ('regression', ModelTransformer(LinearRegression()))
        ]))
    ])),
    ('ensemble', FeatureUnion([
        ('knn', ModelTransformer(KNeighborsRegressor(n_neighbors=5))),
        ('gbr', ModelTransformer(GradientBoostingRegressor())),
        ('dtr', ModelTransformer(DecisionTreeRegressor())),
        ('etr', ModelTransformer(ExtraTreesRegressor())),
        ('rfr', ModelTransformer(RandomForestRegressor())),
        ('par', ModelTransformer(PassiveAggressiveRegressor())),
        ('en', ModelTransformer(ElasticNet())),
        ('cluster', ModelTransformer(KMeans(n_clusters=2)))
    ])),
    ('estimator', KNeighborsRegressor(n_neighbors=5))
])
For example, the HourOfDayTransformer used above simply pulls the hour out of a datetime column:

class HourOfDayTransformer(TransformerMixin):

    def transform(self, X, **transform_params):
        hours = DataFrame(X['datetime'].apply(lambda x: x.hour))
        return hours

    def fit(self, X, y=None, **fit_params):
        return self
The ModelTransformer wraps an arbitrary model so that its predictions can be used as features further down the pipeline:

class ModelTransformer(TransformerMixin):

    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        return DataFrame(self.model.predict(X))
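As a rough illustration (the wrapped regressors are my own choice, not from the original post), several ModelTransformer instances can be placed side by side in a FeatureUnion so that each model's predictions become one feature column:

from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Each wrapped model is fitted on (X, y); its predictions become one column of features.
prediction_features = FeatureUnion([
    ('lr', ModelTransformer(LinearRegression())),
    ('knn', ModelTransformer(KNeighborsRegressor(n_neighbors=5))),
])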
FeatureUnion
sklearn.pipeline.FeatureUnion — scikit-learn 0.19.1 documentation
# LengthTransformer and MispellingCountTransformer are custom transformers.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('ngram_tf_idf', Pipeline([
            ('counts', CountVectorizer()),
            ('tf_idf', TfidfTransformer())
        ])),
        ('essay_length', LengthTransformer()),
        ('misspellings', MispellingCountTransformer())
    ])),
    ('classifier', MultinomialNB())
])
A FeatureUnion applies each of its transformers to the same input data in parallel and concatenates their outputs into a single feature matrix.
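A minimal sketch of that behaviour (the dataset and components here are my own illustration): PCA contributes two columns and SelectKBest keeps one original column, so the union produces three features per sample.

from sklearn.datasets import load_iris
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

X, y = load_iris(return_X_y=True)

# Both transformers see the same X; their outputs are concatenated column-wise.
union = FeatureUnion([('pca', PCA(n_components=2)),
                      ('kbest', SelectKBest(k=1))])
print(union.fit_transform(X, y).shape)   # (150, 3)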
In the example below, FeatureUnion combines the PCA-reduced features with a few of the original features picked by univariate selection; the combined features are fed to an SVM classifier, and grid search is then used to tune the PCA n_components, the SelectKBest k, and the SVM C:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest

iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)

# This dataset is way too high-dimensional. Better do PCA:
pca = PCA()

# Maybe some original features were good, too?
selection = SelectKBest()

# Build estimator from PCA and Univariate selection:
svm = SVC(kernel="linear")

# Do grid search over k, n_components and C:
pipeline = Pipeline([("features", FeatureUnion([("pca", pca),
                                                ("univ_select", selection)])),
                     ("svm", svm)])

param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__univ_select__k=[1, 2],
                  svm__C=[0.1, 1, 10])

grid_search = GridSearchCV(pipeline, param_grid=param_grid, verbose=10)
grid_search.fit(X, y)

print(grid_search.best_estimator_)
print(grid_search.best_params_)
print(grid_search.best_score_)