r - Creating a function to replace NAs in the column that calculate the session -
i have data frame, sample dataframe below:
#sample data frame clientid actual_time session 1 2016-11-01 00:00:00 1 2 2016-11-01 00:05:00 1 3 2016-11-01 00:35:01 2 4 2016-11-01 00:40:00 na 5 2016-11-01 01:10:01 na 6 b 2016-11-01 01:00:00 1 7 b 2016-11-01 01:05:00 1 8 b 2016-11-01 01:30:00 1 9 b 2016-11-01 01:40:00 1 10 b 2016-11-01 01:50:00 na 11 c 2016-11-01 02:00:00 na 12 c 2016-11-01 02:35:00 na 13 c 2016-11-01 04:35:00 na
i fill nas in column ’session’ values logic defined as:
- for same “clientid”, if time difference between 2 subsequent row >= 30 minutes, newer row in new session (which equals session of older row plus 1); if time difference between 2 subsequent row < 30 minutes, both row in same session same session number.
- the session number cumulative number starting 1, i.e., new clientid, session number starts 1.
after na filled up, data frame like:
#sample data frame (result) clientid actual_time session 1 2016-11-01 00:00:00 1 2 2016-11-01 00:05:00 1 3 2016-11-01 00:35:01 2 4 2016-11-01 00:40:00 2 5 2016-11-01 01:10:00 3 6 b 2016-11-01 01:00:00 1 7 b 2016-11-01 01:05:00 1 8 b 2016-11-01 01:30:00 1 9 b 2016-11-01 01:40:00 1 10 b 2016-11-01 01:50:00 1 11 c 2016-11-01 02:00:00 1 12 c 2016-11-01 02:35:00 2 13 c 2016-11-01 04:35:00 3
i have tried:
df<-data.frame(clientid=c(rep('a',5),rep('b',5),rep('c',3)), actual_time=as.posixct(c("2016-11-01 00:00:00","2016-11-01 00:05:00","2016-11-01 00:35:01","2016-11-01 00:40:00","2016-11-01 01:10:01", "2016-11-01 01:00:00","2016-11-01 01:05:00","2016-11-01 01:30:00","2016-11-01 01:40:00","2016-11-01 01:50:00", "2016-11-01 02:00:00","2016-11-01 02:35:00","2016-11-01 04:35:00")), session=c(1,1,2,na,na,1,1,1,1,na,na,na,na)) my_session<- function(df){ (i in 2:(dim(df)[1])){ if(is.na(df$session[i])){ if (df$clientid[i]==df$clientid[i-1]){ if(as.numeric(difftime(df$actual_time[i], df$actual_time[i-1], asia/taipei,units = "mins"))>30){ df$session[i]<- df$session[i-1]+1 }else{df$session[i]<- df$session[i-1]} }else{df$session[i]<- 1} } } return(df) } df2<-my_session(df)
the function did work. however, slow since actual dataframe 8 million rows(a 4g csv file).
i think time consumed when running loop. there way write function fill in nas without loop?
i'll propose data.table
approach should scale lot better existing function.
library(data.table) dt <- as.data.table(df) # or setdt(df) dt[, session := cumsum(difftime(actual_time, shift(actual_time, fill = min(actual_time)), units = "mins") > 30) +1l, = clientid]
what does: counts, group of clientid, cumulative number of times 2 actual_time
s differ more 30 minutes. of course have make sure data ordered actual time.
the resulting table looks this:
# clientid actual_time session #1: 2016-11-01 00:00:00 1 #2: 2016-11-01 00:05:00 1 #3: 2016-11-01 00:35:01 2 #4: 2016-11-01 00:40:00 2 #5: 2016-11-01 01:10:01 3 #6: b 2016-11-01 01:00:00 1 #7: b 2016-11-01 01:05:00 1 #8: b 2016-11-01 01:30:00 1 #9: b 2016-11-01 01:40:00 1 #10: b 2016-11-01 01:50:00 1 #11: c 2016-11-01 02:00:00 1 #12: c 2016-11-01 02:35:00 2 #13: c 2016-11-01 04:35:00 3
Comments
Post a Comment