r - Grouping and summarising when the columns returned are not known in advance -

i have dataframe --say x -- feeds function returns subset depending on value of column x$id.

this subset y includes column y$room contains different mix of values depending on x$id value.

the subset spread tidyr , values of y$room become columns.
resulting extended df --say ext_y-- must grouped column y_ext$visit , summary statistics should calculated remaining columns special function.

the obvious problem these columns not known in advance , therefore can not defined names within function.

the alternative of using indexes of columns instead of names not seem work dplyr, when group_by involved.

do have ideas how problem tackled?

the dataframe has many thousands rows, give glimpse:

       > tail(y)            id visit        room value      11940 14     2 living room    19      11941 14     2 living room    16      11942 14     2 living room    15      11943 14     2 living room    22      11944 14     2 living room    25      11945 14     2 living room    20       > unique(x$id)     [1]  14  20  41  44  46  54  64  74 104 106      > unique(x$visit)     [1] 0 1 2      > unique(x$room)      [1] "bedroom"      "living room"  "family  room" "study room"   "den"               [6] "tv room"      "office"       "hall"         "kitchen"      "dining room"       > summary(x$value)          min.  1st qu.   median     mean  3rd qu.     max.          2.000    2.750    7.875   17.410   16.000 1775.000

for given id spread() of tidyr returns subset of room values in x. e.g. id = 54:

  > y<- out   > y$row <- 1 : nrow(y)   > y_ext <- spread(y, room, value)   > head(y_ext)        id visit row bedroom family  room living room      1 14     0   1    6.00           na          na      2 14     0   2    6.00           na          na      3 14     0   3    2.75           na          na      4 14     0   4    2.75           na          na      5 14     0   5    2.75           na          na      6 14     0   6    2.75           na          na

now, must compose function groups result visit , summarises columns returned each group in following form:

         visit    bedroom    family room   living room       1   0         na            2.79         3.25       2   1         na             na          4.53       3   2         4.19           3.77        na

as mentioned above, not know in advance columns returned given id , complicates problem. of course short cut check , find out each id columns returned , create if structure directs each id appropriate code, not elegant, afraid.

hope helped give better picture.

alright, interesting enough me made sample data myself:

nsamples <- 50  allrooms <-   c("living", "dining", "bedroom", "master", "family", "garage", "office")  set.seed(8675309)  df <-   data_frame(     id = sample(1:5, nsamples, true)     , visit = sample(1:3, nsamples, true)     , room = sample(allrooms, nsamples, true)     , value = round(rnorm(nsamples, 20, 5))   )

the way see it, there 3 approaches, in ascending order of reasonabality. first option, follow basic layout. here, splitting df id, spreading instructed, using summarise_all summation, removing need identify room names explicitly.

df %>%   split(.$id) %>%   lapply(function(x){     x %>%       select(-id) %>%       mutate(row = 1:n()) %>%       spread(room, value) %>%       select(-row) %>%       group_by(visit) %>%       summarise_all(sum, na.rm = true)   })

this returns following (note unique columns):

$`1` # tibble: 3 × 6   visit bedroom dining garage master office   <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 1     1       0     27     27      0      0 2     2      22     19      0     20     23 3     3       0      0      0     27      0  $`2` # tibble: 3 × 6   visit bedroom dining family living office   <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 1     1      15      0      0      0     17 2     2       0     14     42     30      0 3     3      15     13     18      0     20  $`3` # tibble: 3 × 6   visit bedroom dining living master office   <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 1     1      24      0     36      0     28 2     2       0      0     15     30      0 3     3       0     25     21      0     15  $`4` # tibble: 3 × 7   visit bedroom dining garage living master office   <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 1     1       0      0     23     20      0     24 2     2       0     28     22      0      0      0 3     3      24      0     36      0     16      0  $`5` # tibble: 3 × 8   visit bedroom dining family garage living master office   <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 1     1      23      0      0     21      0     16      0 2     2      44     14     41      0     26      0     18 3     3      21     19      0      0     25     19      0

however, because had add row in spread work (without it, there not unique entries), spread doesn't help. can same thing lot more if summarising first, so:

df %>%   split(.$id) %>%   lapply(function(x){     x %>%       select(-id) %>%       group_by(visit, room) %>%       summarise(sum = sum(value)) %>%       spread(room, sum, 0)   })

note gives 0 rooms no visits because of last 0 fill argument. if rather returns na, can leave default.

finally, unclear why want separately in first place. may make far more sense in 1 big group_by , handle missings needed after fact. wit, here lot less code same summaries.

df %>%   group_by(id, visit, room) %>%   summarise(sum = sum(value)) %>%   spread(room, sum)

gives

      id visit bedroom dining family garage living master office *  <int> <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 1      1     1      na     27     na     27     na     na     na 2      1     2      22     19     na     na     na     20     23 3      1     3      na     na     na     na     na     27     na 4      2     1      15     na     na     na     na     na     17 5      2     2      na     14     42     na     30     na     na 6      2     3      15     13     18     na     na     na     20 7      3     1      24     na     na     na     36     na     28 8      3     2      na     na     na     na     15     30     na 9      3     3      na     25     na     na     21     na     15 10     4     1      na     na     na     23     20     na     24 11     4     2      na     28     na     22     na     na     na 12     4     3      24     na     na     36     na     16     na 13     5     1      23     na     na     21     na     16     na 14     5     2      44     14     41     na     26     na     18 15     5     3      21     19     na     na     25     19     na

if want filter down 1 id, use filter after fact, remove columns na entries. (note, save output once, pass through last 2 lines once each id of interest, e.g., when printing)

df %>%   group_by(id, visit, room) %>%   summarise(sum = sum(value)) %>%   spread(room, sum) %>%   filter(id == 1) %>%   select_if(function(col) mean(is.na(col)) != 1)

gives

     id visit bedroom dining garage master office   <int> <int>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl> 1     1     1      na     27     27     na     na 2     1     2      22     19     na     20     23 3     1     3      na     na     na     27     na

Search This Blog

Alcombright

r - Grouping and summarising when the columns returned are not known in advance -

Comments

Post a Comment

Popular posts from this blog

c# SetCompatibleTextRenderingDefault must be called before the first -

c++ - Fill runtime data at compile time with templates -

C#.NET Oracle.ManagedDataAccess ConfigSchema.xsd -