python - Why is ShuffleSplit more/less random than train_test_split (with random_state=None)?

Consider the following two options:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# sklearn.__version__ 0.17.1
# python --version 3.5.2, Anaconda 4.1.1 (64-bit)
# Note: under this sklearn version the commented-out model_selection imports
# raise "TypeError: __init__() got an unexpected keyword argument 'n_splits'",
# so the older cross_validation module is used instead.
import numpy as np
from sklearn.datasets import load_boston
#from sklearn.model_selection import train_test_split, cross_val_score
#from sklearn.model_selection import ShuffleSplit
from sklearn.cross_validation import train_test_split, cross_val_score
from sklearn.cross_validation import ShuffleSplit
from sklearn.ensemble import GradientBoostingRegressor

# Define feature matrix and target variable
X, y = load_boston().data, load_boston().target

# Create algorithm object (gradient boosting)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)

#====================================================
# Option B
#====================================================
#shuffle = ShuffleSplit(n_splits=10, train_size=0.75, random_state=0)
shuffle = ShuffleSplit(n=X.shape[0], n_iter=10, train_size=0.75, random_state=0)
cross_val = cross_val_score(gbr, X, y, cv=shuffle)
print('------------------------------------------')
print('Individual performance: ', cross_val)
print('===============================================')
print('Option B: average performance: ', cross_val.mean())
print('===============================================')
# --> Different performance in every iteration because of different
#     training and test sets.

#====================================================
# Option C
#====================================================
individual_results = []
iterations = np.arange(1, 11)
for i in iterations:
    # Randomly split the data into train and test sets
    xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25,
                                                    random_state=None)
    # Train the GBR 10x, each time on a new split
    gbr.fit(xtrain, ytrain)
    score = gbr.score(xtrain, ytrain)
    individual_results.append(score)

avg_score = sum(individual_results) / len(iterations)
print('------------------------------------------')
print(individual_results)
print('===============================================')
print('Option C: average performance: ', avg_score)
print('===============================================')
Here is a copy of the output:

Individual performance:  [ 0.77535372  0.81760604  0.87146377  0.94041114  0.92648961
  0.87761488  0.82843891  0.81833855  0.90167889  0.90014986]
===============================================
Option B: average performance:  0.865754537049
===============================================
------------------------------------------
[0.98094508160609573, 0.97773541952198795, 0.98076500920740906, 0.98313150025465956, 0.98097867267357952, 0.97918425360465322, 0.97923641784508919, 0.9785058355467865, 0.98173521302711486, 0.97866493105257402]
===============================================
Option C: average performance:  0.980088233434
===============================================
Can anyone explain why the ShuffleSplit function in Option B produces more random results than the train_test_split function (with random_state=None) in Option C?
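As background on what random_state controls: passing an integer seeds the shuffler, so the same splits are produced on every run, while None draws fresh randomness on each call. A minimal sketch of this behaviour using numpy's RandomState directly (the generator sklearn uses internally); the array size 506 is borrowed from the Boston data in the question:

```python
import numpy as np

n = 506  # number of samples in the Boston housing data

# Fixed seed: the permutation (and hence any split built from it)
# is identical on every run.
a = np.random.RandomState(0).permutation(n)
b = np.random.RandomState(0).permutation(n)
print(np.array_equal(a, b))  # True

# Seed None: a freshly (OS-)seeded generator each call, so the
# permutations differ from call to call.
c = np.random.RandomState(None).permutation(n)
d = np.random.RandomState(None).permutation(n)
print(np.array_equal(c, d))  # almost certainly False

# A 75/25 split like train_test_split(test_size=0.25) would take:
train_idx, test_idx = a[:int(0.75 * n)], a[int(0.75 * n):]
```

So with random_state=None, Option C really does draw a different split on every loop iteration; the splits are no less random than Option B's.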
The score is calculated on xtrain instead of xtest in Option C. With

score = gbr.score(xtest, ytest)

the scores now become

[0.806, 0.906, 0.903, 0.836, 0.871, 0.920, 0.902, 0.901, 0.914, 0.916]

which vary just as much as the ShuffleSplit results in Option B. The near-constant ~0.98 values were training-set scores, not a sign that the splits were any less random.
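The effect can be reproduced without sklearn: a model that memorizes its training data looks near-perfect when scored on xtrain but not on xtest, regardless of how the split was drawn. A minimal sketch with a hypothetical 1-nearest-neighbour regressor in plain numpy (the r2_score function below computes the same R² that gbr.score returns):

```python
import numpy as np

def r2_score(y_true, y_pred):
    # Coefficient of determination, the metric .score() reports for regressors
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def knn1_predict(xtrain, ytrain, xquery):
    # 1-nearest-neighbour regressor: pure memorization of the training set
    dists = np.abs(xquery[:, None] - xtrain[None, :])
    return ytrain[np.argmin(dists, axis=1)]

rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.5, size=200)  # noisy target

# Random 75/25 split, as in the question
idx = rng.permutation(200)
tr, te = idx[:150], idx[150:]

train_score = r2_score(y[tr], knn1_predict(x[tr], y[tr], x[tr]))
test_score = r2_score(y[te], knn1_predict(x[tr], y[tr], x[te]))

print(train_score)  # exactly 1.0: each training point is its own nearest neighbour
print(test_score)   # noticeably lower: the noise was memorized, not learned
```

The training score is stable and flattering no matter which random split was drawn, which is exactly why Option C's ~0.98 values looked "less random" than Option B's cross-validated test scores.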