Monday, 25 September 2017

r - Using dplyr window functions to calculate percentiles



I have a working solution but am looking for a cleaner, more readable solution that perhaps takes advantage of some of the newer dplyr window functions.



Using the mtcars dataset, if I want to look at the 25th, 50th, 75th percentiles and the mean and count of miles per gallon ("mpg") by the number of cylinders ("cyl"), I use the following code:



library(dplyr)
library(tidyr)

# load data
data("mtcars")

# Percentiles used in calculation
p <- c(.25, .5, .75)

# old dplyr solution
mtcars %>%
  group_by(cyl) %>%
  do(data.frame(p = p, stats = quantile(.$mpg, probs = p),
                n = length(.$mpg), avg = mean(.$mpg))) %>%
  spread(p, stats) %>%
  select(1, 4:6, 3, 2)

# note: the select and spread statements are just to get the data into
# the format in which I'd like to see it, but are not critical


Is there a way I can do this more cleanly with dplyr using some of the window functions (ntile, percent_rank, etc.)? By "cleanly" I mean without the do() call.




Thank you


Answer



If you're up for using purrr::map, you can do it like this!



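library(tidyverse)

mtcars %>%
  tbl_df() %>%
  # one row per cyl; the remaining columns go into a list-column named data
  nest(-cyl) %>%
  # quantile() defaults to the 0%, 25%, 50%, 75% and 100% quantiles
  mutate(Quantiles = map(data, ~ quantile(.$mpg)),
         # turn each named vector into a long key/value tibble
         Quantiles = map(Quantiles, ~ bind_rows(.) %>% gather())) %>%
  unnest(Quantiles)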

#> # A tibble: 15 x 3
#>      cyl key   value
#>    <dbl> <chr> <dbl>
#>  1     6 0%     17.8
#>  2     6 25%    18.6
#>  3     6 50%    19.7
#>  4     6 75%    21
#>  5     6 100%   21.4
#>  6     4 0%     21.4
#>  7     4 25%    22.8
#>  8     4 50%    26
#>  9     4 75%    30.4
#> 10     4 100%   33.9
#> 11     8 0%     10.4
#> 12     8 25%    14.4
#> 13     8 50%    15.2
#> 14     8 75%    16.2
#> 15     8 100%   19.2


Created on 2018-11-10 by the reprex package (v0.2.1)



One nice thing about this approach is the output is tidy, one observation per row.
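
If the wide layout from the question is wanted instead (one row per cyl, with the mean and count alongside the percentiles), plain summarise() may be enough. The following is a minimal sketch and not part of the original answer: it calls quantile() once per percentile, and the column names and the mpg_summary variable are assumptions chosen for illustration.

```r
library(dplyr)

# One row per cylinder group: three percentiles, then mean and count.
# names = FALSE drops the "25%"-style name that quantile() attaches to its result.
mpg_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    `25%` = quantile(mpg, probs = 0.25, names = FALSE),
    `50%` = quantile(mpg, probs = 0.50, names = FALSE),
    `75%` = quantile(mpg, probs = 0.75, names = FALSE),
    avg   = mean(mpg),
    n     = n()
  )

mpg_summary
```

This avoids both do() and the spread()/select() reshaping, at the cost of repeating the quantile() call for each percentile.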

