r - Grouping and summarising when the columns returned are not known in advance -
i have dataframe --say x -- feeds function returns subset depending on value of column x$id.
this subset y includes column y$room contains different mix of values depending on x$id value.
the subset spread tidyr , values of y$room become columns.
resulting extended df --say ext_y-- must grouped column y_ext$visit , summary statistics should calculated remaining columns special function.
the obvious problem these columns not known in advance , therefore can not defined names within function.
the alternative of using indexes of columns instead of names not seem work dplyr, when group_by involved.
do have ideas how problem tackled?
the dataframe has many thousands rows, give glimpse:
> tail(y) id visit room value 11940 14 2 living room 19 11941 14 2 living room 16 11942 14 2 living room 15 11943 14 2 living room 22 11944 14 2 living room 25 11945 14 2 living room 20 > unique(x$id) [1] 14 20 41 44 46 54 64 74 104 106 > unique(x$visit) [1] 0 1 2 > unique(x$room) [1] "bedroom" "living room" "family room" "study room" "den" [6] "tv room" "office" "hall" "kitchen" "dining room" > summary(x$value) min. 1st qu. median mean 3rd qu. max. 2.000 2.750 7.875 17.410 16.000 1775.000
for given id spread() of tidyr returns subset of room values in x. e.g. id = 54:
> y<- out > y$row <- 1 : nrow(y) > y_ext <- spread(y, room, value) > head(y_ext) id visit row bedroom family room living room 1 14 0 1 6.00 na na 2 14 0 2 6.00 na na 3 14 0 3 2.75 na na 4 14 0 4 2.75 na na 5 14 0 5 2.75 na na 6 14 0 6 2.75 na na
now, must compose function groups result visit , summarises columns returned each group in following form:
visit bedroom family room living room 1 0 na 2.79 3.25 2 1 na na 4.53 3 2 4.19 3.77 na
as mentioned above, not know in advance columns returned given id , complicates problem. of course short cut check , find out each id columns returned , create if structure directs each id appropriate code, not elegant, afraid.
hope helped give better picture.
alright, interesting enough me made sample data myself:
nsamples <- 50 allrooms <- c("living", "dining", "bedroom", "master", "family", "garage", "office") set.seed(8675309) df <- data_frame( id = sample(1:5, nsamples, true) , visit = sample(1:3, nsamples, true) , room = sample(allrooms, nsamples, true) , value = round(rnorm(nsamples, 20, 5)) )
the way see it, there 3 approaches, in ascending order of reasonabality. first option, follow basic layout. here, splitting df
id
, spreading instructed, using summarise_all
summation, removing need identify room names explicitly.
df %>% split(.$id) %>% lapply(function(x){ x %>% select(-id) %>% mutate(row = 1:n()) %>% spread(room, value) %>% select(-row) %>% group_by(visit) %>% summarise_all(sum, na.rm = true) })
this returns following (note unique columns):
$`1` # tibble: 3 × 6 visit bedroom dining garage master office <int> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1 0 27 27 0 0 2 2 22 19 0 20 23 3 3 0 0 0 27 0 $`2` # tibble: 3 × 6 visit bedroom dining family living office <int> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1 15 0 0 0 17 2 2 0 14 42 30 0 3 3 15 13 18 0 20 $`3` # tibble: 3 × 6 visit bedroom dining living master office <int> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1 24 0 36 0 28 2 2 0 0 15 30 0 3 3 0 25 21 0 15 $`4` # tibble: 3 × 7 visit bedroom dining garage living master office <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1 0 0 23 20 0 24 2 2 0 28 22 0 0 0 3 3 24 0 36 0 16 0 $`5` # tibble: 3 × 8 visit bedroom dining family garage living master office <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1 23 0 0 21 0 16 0 2 2 44 14 41 0 26 0 18 3 3 21 19 0 0 25 19 0
however, because had add row in spread
work (without it, there not unique entries), spread
doesn't help. can same thing lot more if summarising first, so:
df %>% split(.$id) %>% lapply(function(x){ x %>% select(-id) %>% group_by(visit, room) %>% summarise(sum = sum(value)) %>% spread(room, sum, 0) })
note gives 0
rooms no visits because of last 0
fill
argument. if rather returns na
, can leave default.
finally, unclear why want separately in first place. may make far more sense in 1 big group_by
, handle missings needed after fact. wit, here lot less code same summaries.
df %>% group_by(id, visit, room) %>% summarise(sum = sum(value)) %>% spread(room, sum)
gives
id visit bedroom dining family garage living master office * <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1 1 na 27 na 27 na na na 2 1 2 22 19 na na na 20 23 3 1 3 na na na na na 27 na 4 2 1 15 na na na na na 17 5 2 2 na 14 42 na 30 na na 6 2 3 15 13 18 na na na 20 7 3 1 24 na na na 36 na 28 8 3 2 na na na na 15 30 na 9 3 3 na 25 na na 21 na 15 10 4 1 na na na 23 20 na 24 11 4 2 na 28 na 22 na na na 12 4 3 24 na na 36 na 16 na 13 5 1 23 na na 21 na 16 na 14 5 2 44 14 41 na 26 na 18 15 5 3 21 19 na na 25 19 na
if want filter down 1 id
, use filter
after fact, remove columns na
entries. (note, save output once, pass through last 2 lines once each id of interest, e.g., when printing)
df %>% group_by(id, visit, room) %>% summarise(sum = sum(value)) %>% spread(room, sum) %>% filter(id == 1) %>% select_if(function(col) mean(is.na(col)) != 1)
gives
id visit bedroom dining garage master office <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1 1 na 27 27 na na 2 1 2 22 19 na 20 23 3 1 3 na na na 27 na
Comments
Post a Comment