python - Why is ShuffleSplit more/less random than train_test_split (with random_state=None)?

Consider the following two options:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# sklearn.__version__ 0.17.1
# python --version 3.5.2, Anaconda 4.1.1 (64-bit)
# Note: under this sklearn version the commented-out model_selection imports
# raise "TypeError: __init__() got an unexpected keyword argument 'n_splits'",
# so the older cross_validation module is used instead.
import numpy as np
from sklearn.datasets import load_boston
#from sklearn.model_selection import train_test_split, cross_val_score
#from sklearn.model_selection import ShuffleSplit
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.cross_validation import ShuffleSplit
from sklearn.ensemble import GradientBoostingRegressor

# Define feature matrix and target variable
X, y = load_boston().data, load_boston().target

# Create algorithm object (gradient boosting)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)

#====================================================
# Option B
#====================================================
#shuffle = ShuffleSplit(n_splits=10, train_size=0.75, random_state=0)
shuffle = ShuffleSplit(n=X.shape[0], n_iter=10, train_size=0.75, random_state=0)
cross_val = cross_val_score(gbr, X, y, cv=shuffle)
print('------------------------------------------')
print('Individual performance: ', cross_val)
print('===============================================')
print('Option B: average performance: ', cross_val.mean())
print('===============================================')
# --> Different performance in every iteration because of different
#     training and test sets.

#====================================================
# Option C
#====================================================
individual_results = []
iterations = np.arange(1, 11)
for i in iterations:
    # Randomly split the data into train and test sets
    xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25,
                                                    random_state=None)
    # Train the GBR 10x, each time on a new split
    gbr.fit(xtrain, ytrain)
    score = gbr.score(xtrain, ytrain)
    individual_results.append(score)

avg_score = sum(individual_results) / len(iterations)
print('------------------------------------------')
print(individual_results)
print('===============================================')
print('Option C: average performance: ', avg_score)
print('===============================================')
Here is a copy of the output:

Individual performance:  [ 0.77535372  0.81760604  0.87146377  0.94041114  0.92648961
  0.87761488  0.82843891  0.81833855  0.90167889  0.90014986]
===============================================
Option B: average performance:  0.865754537049
===============================================
------------------------------------------
[0.98094508160609573, 0.97773541952198795, 0.98076500920740906, 0.98313150025465956, 0.98097867267357952, 0.97918425360465322, 0.97923641784508919, 0.9785058355467865, 0.98173521302711486, 0.97866493105257402]
===============================================
Option C: average performance:  0.980088233434
===============================================
Can anyone explain why the ShuffleSplit function in Option B produces more random results than the train_test_split function (with random_state=None) in Option C?
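As background on what random_state controls: passing an integer seeds the shuffler, so the same splits are produced on every run, while None draws fresh randomness on each call. A minimal sketch of this behaviour using numpy's RandomState directly (the generator sklearn uses internally); the array size 506 is borrowed from the Boston data in the question:

```python
import numpy as np

n = 506  # number of samples in the Boston housing data

# Fixed seed: the permutation (and hence any split built from it)
# is identical on every run.
a = np.random.RandomState(0).permutation(n)
b = np.random.RandomState(0).permutation(n)
print(np.array_equal(a, b))  # True

# Seed None: a freshly (OS-)seeded generator each call, so the
# permutations differ from call to call.
c = np.random.RandomState(None).permutation(n)
d = np.random.RandomState(None).permutation(n)
print(np.array_equal(c, d))  # almost certainly False

# A 75/25 split like train_test_split(test_size=0.25) would take:
train_idx, test_idx = a[:int(0.75 * n)], a[int(0.75 * n):]
```

So with random_state=None, Option C really does draw a different split on every loop iteration; the splits are no less random than Option B's.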
The score is calculated on xtrain instead of xtest in Option C. With

score = gbr.score(xtest, ytest)

the scores now become

[0.806, 0.906, 0.903, 0.836, 0.871, 0.920, 0.902, 0.901, 0.914, 0.916]

which vary just as much as the ShuffleSplit results in Option B. The near-constant ~0.98 values were training-set scores, not a sign that the splits were any less random.
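The effect can be reproduced without sklearn: a model that memorizes its training data looks near-perfect when scored on xtrain but not on xtest, regardless of how the split was drawn. A minimal sketch with a hypothetical 1-nearest-neighbour regressor in plain numpy (the r2_score function below computes the same R² that gbr.score returns):

```python
import numpy as np

def r2_score(y_true, y_pred):
    # Coefficient of determination, the metric .score() reports for regressors
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def knn1_predict(xtrain, ytrain, xquery):
    # 1-nearest-neighbour regressor: pure memorization of the training set
    dists = np.abs(xquery[:, None] - xtrain[None, :])
    return ytrain[np.argmin(dists, axis=1)]

rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.5, size=200)  # noisy target

# Random 75/25 split, as in the question
idx = rng.permutation(200)
tr, te = idx[:150], idx[150:]

train_score = r2_score(y[tr], knn1_predict(x[tr], y[tr], x[tr]))
test_score = r2_score(y[te], knn1_predict(x[tr], y[tr], x[te]))

print(train_score)  # exactly 1.0: each training point is its own nearest neighbour
print(test_score)   # noticeably lower: the noise was memorized, not learned
```

The training score is stable and flattering no matter which random split was drawn, which is exactly why Option C's ~0.98 values looked "less random" than Option B's cross-validated test scores.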