r - Creating a function to replace NAs in the column that calculate the session -


i have data frame, sample dataframe below:

#sample data frame    clientid actual_time           session 1         2016-11-01 00:00:00   1              2         2016-11-01 00:05:00   1 3         2016-11-01 00:35:01   2 4         2016-11-01 00:40:00   na 5         2016-11-01 01:10:01   na          6  b        2016-11-01 01:00:00   1 7  b        2016-11-01 01:05:00   1 8  b        2016-11-01 01:30:00   1 9  b        2016-11-01 01:40:00   1 10 b        2016-11-01 01:50:00   na 11 c        2016-11-01 02:00:00   na 12 c        2016-11-01 02:35:00   na 13 c        2016-11-01 04:35:00   na 

i fill nas in column ’session’ values logic defined as:

  • for same “clientid”, if time difference between 2 subsequent row >= 30 minutes, newer row in new session (which equals session of older row plus 1); if time difference between 2 subsequent row < 30 minutes, both row in same session same session number.
  • the session number cumulative number starting 1, i.e., new clientid, session number starts 1.

after na filled up, data frame like:

#sample data frame (result)    clientid actual_time           session 1         2016-11-01 00:00:00   1              2         2016-11-01 00:05:00   1 3         2016-11-01 00:35:01   2 4         2016-11-01 00:40:00   2 5         2016-11-01 01:10:00   3          6  b        2016-11-01 01:00:00   1 7  b        2016-11-01 01:05:00   1 8  b        2016-11-01 01:30:00   1 9  b        2016-11-01 01:40:00   1 10 b        2016-11-01 01:50:00   1 11 c        2016-11-01 02:00:00   1 12 c        2016-11-01 02:35:00   2 13 c        2016-11-01 04:35:00   3 

i have tried:

df<-data.frame(clientid=c(rep('a',5),rep('b',5),rep('c',3)),        actual_time=as.posixct(c("2016-11-01 00:00:00","2016-11-01 00:05:00","2016-11-01 00:35:01","2016-11-01 00:40:00","2016-11-01 01:10:01",                        "2016-11-01 01:00:00","2016-11-01 01:05:00","2016-11-01 01:30:00","2016-11-01 01:40:00","2016-11-01 01:50:00",                        "2016-11-01 02:00:00","2016-11-01 02:35:00","2016-11-01 04:35:00")),        session=c(1,1,2,na,na,1,1,1,1,na,na,na,na))    my_session<- function(df){    (i in 2:(dim(df)[1])){     if(is.na(df$session[i])){       if (df$clientid[i]==df$clientid[i-1]){         if(as.numeric(difftime(df$actual_time[i],                                 df$actual_time[i-1], asia/taipei,units =     "mins"))>30){           df$session[i]<- df$session[i-1]+1         }else{df$session[i]<- df$session[i-1]}       }else{df$session[i]<- 1}     }   }    return(df) }  df2<-my_session(df) 

the function did work. however, slow since actual dataframe 8 million rows(a 4g csv file).

i think time consumed when running loop. there way write function fill in nas without loop?

i'll propose data.table approach should scale lot better existing function.

library(data.table) dt <- as.data.table(df) # or setdt(df) dt[, session := cumsum(difftime(actual_time, shift(actual_time,                 fill = min(actual_time)), units = "mins") > 30) +1l,      = clientid] 

what does: counts, group of clientid, cumulative number of times 2 actual_times differ more 30 minutes. of course have make sure data ordered actual time.

the resulting table looks this:

 #   clientid         actual_time session  #1:        2016-11-01 00:00:00       1  #2:        2016-11-01 00:05:00       1  #3:        2016-11-01 00:35:01       2  #4:        2016-11-01 00:40:00       2  #5:        2016-11-01 01:10:01       3  #6:        b 2016-11-01 01:00:00       1  #7:        b 2016-11-01 01:05:00       1  #8:        b 2016-11-01 01:30:00       1  #9:        b 2016-11-01 01:40:00       1 #10:        b 2016-11-01 01:50:00       1 #11:        c 2016-11-01 02:00:00       1 #12:        c 2016-11-01 02:35:00       2 #13:        c 2016-11-01 04:35:00       3 

Comments

Popular posts from this blog

php - How to add and update images or image url in Volusion using Volusion API -

javascript - jQuery UI Splitter/Resizable for unlimited amount of columns -

javascript - IE9 error '$'is not defined -