Update: Most of the data frame functions in purrr have been
deprecated in favour of a new family of functions in dplyr. The intent
is to better separate the responsibilities of packages in the
tidyverse. First, map() now always returns a list; it no longer
preserves the data frame type. Second, all slice- and rows-based
functions are now deprecated. Mapping over columns is now handled by
the colwise family of dplyr functions, e.g. dplyr::mutate_all(),
dplyr::summarise_if(), etc. Unlike the _each() variants, which only
accept expressions wrapped in funs(), the new colwise family accepts
regular functions as well as additional arguments to be passed on.
The syntax is thus pretty close to purrr's:
mtcars %>% group_by(cyl) %>% mutate_all(scale, center = FALSE)
/Update
purrr was finally released on CRAN last week. This package is focused
on working with lists (and data frames by the same token). However, it
is not a DSL for lists in the way dplyr is a DSL for data frames. It
aims at creating a "better standard lib" focused on functional
programming. Purrr should feel like R programming and bring out the
elegance of the language. That said, purrr can be a nice companion to
your dplyr pipelines, especially when you need to apply a function to
many columns. In this post I show how purrr's functional tools can be
applied to a dplyr workflow.
dplyr provides mutate_each() and summarise_each() for the purpose of
mapping functions, but I find that they are not as easy to use as the
rest of the interface. This is mostly because there is no easy way to
map a function to parts of your data frame. It's all columns or
nothing. Also, they introduce a custom notation for lambda functions
that can be a bit cumbersome. These are two areas where purrr shines in
comparison. And since the interface has been designed with pipes in
mind, purrr's functions integrate with dplyr pipelines quite well.
Mapping to columns conditionally
One of my favourite functions in purrr is map_if(). It accepts a
predicate function or a logical vector that specifies which columns
should be mapped with a function. This makes it easy to apply a
function conditionally, as in the following snippet where we convert
all factors to character vectors:
library("purrr")
library("dplyr")
data(diamonds, package = "ggplot2")
diamonds %>% map_if(is.factor, as.character) %>% str()
#> Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
#> $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
#> $ cut : chr "Ideal" "Premium" "Good" "Premium" ...
#> $ color : chr "E" "E" "E" "I" ...
#> $ clarity: chr "SI2" "SI1" "VS1" "VS2" ...
#> $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
#> $ table : num 55 61 65 58 58 57 57 55 61 61 ...
#> $ price : int 326 326 327 334 335 336 336 337 337 338 ...
#> $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
#> $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
#> $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
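With a current purrr, where map_if() returns a plain list (as the update at the top explains), a rough equivalent can be sketched with dplyr::mutate_if(), one of the colwise functions mentioned there. This is only an illustration: it assumes the built-in iris data in place of diamonds so that it runs without ggplot2.

```r
library("dplyr")

# iris ships with R; Species is its only factor column
iris_chr <- iris %>% mutate_if(is.factor, as.character)

# The result is still a data frame, and Species is now a character vector
str(iris_chr)
```

Because mutate_if() is a dplyr verb, the data frame type is preserved and the call can sit in the middle of a longer pipeline.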
Mapping to specific columns
While cleaning a dataset, it is common to apply the same
transformation to many variables, for example reversing a scale or
shifting it to zero. Instead of writing a long mutate() call with
those transformations, I prefer to do it in one go.
This can be done with map_at(), which takes a vector of column
positions or column names. For example, let's assume you have written
two functions, reverse_scale() and shift_to_zero(), that should be
applied to specific variables. You record those variables in character
vectors just before starting the dplyr/purrr pipeline, and then add
the relevant map_at() calls.
to_reverse_vars <- c(
  "cyl", "am", "vs",
  "gear", "carb"
)
to_zero_vars <- c(
  "cyl", "gear", "carb"
)
mtcars %>%
  select(-disp) %>%
  map_at(to_reverse_vars, reverse_scale) %>%
  map_at(to_zero_vars, shift_to_zero)
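The pipeline above assumes that reverse_scale() and shift_to_zero() already exist. As a minimal sketch, here is one possible pair of definitions. Both are hypothetical helpers written for illustration, not functions from purrr or dplyr. The final as.data.frame() call is also an assumption: it is only needed with a recent purrr, where map_at() returns a list.

```r
library("purrr")
library("dplyr")

# Hypothetical helper: flip a scale around its observed range,
# so that the minimum becomes the maximum and vice versa
reverse_scale <- function(x) {
  max(x, na.rm = TRUE) + min(x, na.rm = TRUE) - x
}

# Hypothetical helper: shift a variable so its observed minimum is zero
shift_to_zero <- function(x) {
  x - min(x, na.rm = TRUE)
}

to_reverse_vars <- c("cyl", "am", "vs", "gear", "carb")
to_zero_vars <- c("cyl", "gear", "carb")

cleaned <- mtcars %>%
  select(-disp) %>%
  map_at(to_reverse_vars, reverse_scale) %>%
  map_at(to_zero_vars, shift_to_zero) %>%
  as.data.frame()  # rebuild a data frame from the list of columns
```

Note that rebuilding the data frame this way drops the row names of mtcars. If they matter, store them in a column before the pipeline starts.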
Expanding one column to many with lmap()
lmap()'s story starts with the mysterious tweet and the gist that show
up when you google "hadley monads". While I'm not sure I really
understand how it is monadic, lmap() is quite useful to extend a data
frame without having to deal with binds, merges or having to define
new column names.
Let's say you have a numeric variable that you want to discretise for data exploration or modelling (for example, to use as a facetting variable in a ggplot). There are several ways to cut a vector into pieces. Ideally, the cutpoints should be derived from theory, but it's often not possible or too time-consuming to do so. In this case, I like to create different categorisations and check if the results are consistent (and investigate when they are not). Let's define two cutting functions: one tries to create categories with equal sample sizes, while the other just uses equal ranges to determine cutpoints.
cut_equal_sizes <- function(x, n = 3, ...) {
  ggplot2::cut_number(x, n, ...)
}
cut_equal_ranges <- function(x, n = 3, ...) {
  cut(x, n, include.lowest = TRUE, ...)
}
It'd be nice to "grow" the data frame at specific numeric columns in
such a way that two new discretised variables appear just next to
them with appropriate column names. lmap() is well suited to this
because instead of applying a function to the vectors contained in a
data frame, it applies it to subsets of size 1 of that data
frame. This has several advantages:
You get the name of the vector as an attribute of the enclosing data frame.
The usual mapping tools work on columns, so when you return a list or a data frame of vectors, they'll try to stick these inside a list-column, which is not what we want in this case. By comparison, lmap() gives a data frame to a function and expects a data frame in return, and has no problem dealing with it when it has more than one column.
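The difference between a bare column and a "subset of size 1" can be seen with base R subsetting alone. The sketch below is only an illustration of the two kinds of input, with made-up variable names; it does not call lmap() itself.

```r
# A size-1 subset made with `[` is still a data frame and keeps the column name
one_col <- mtcars["mpg"]
class(one_col)  # "data.frame"
names(one_col)  # "mpg"

# Extracting with `[[` gives the bare vector: the column name is gone
bare <- mtcars[["mpg"]]
class(bare)     # "numeric"
names(bare)     # NULL
```

A function that receives one_col can read the column name and add sibling columns next to it, which is not possible when it only sees bare.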
Let's write a function to be mapped in such a way. This function doesn't work with vectors but with vectors enclosed in a data frame. It takes and returns a data frame.
cut_categories <- function(x, n = 3) {
  # Record the name of the enclosed vector
  name <- names(x)
  # Create the new columns
  x$cat_n <- cut_equal_sizes(x[[1]], n)
  x$cat_r <- cut_equal_ranges(x[[1]], n)
  # Adjust the names of the new columns
  names(x)[2:3] <- paste0(name, "_", n, names(x)[2:3])
  x
}
Then we just add an lmap_at() call to our data cleaning pipeline:
to_discretise_vars <- c(
  "mpg", "disp", "drat",
  "wt", "qsec"
)
mtcars %>% lmap_at(to_discretise_vars, cut_categories) %>% str()
#> Classes 'tbl_df', 'tbl' and 'data.frame': 32 obs. of 21 variables:
#> $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#> $ mpg_3cat_n : Factor w/ 3 levels "[10.4,16.7]",..: 2 2 3 2 2 2 1 3 3 2 ...
#> $ mpg_3cat_r : Factor w/ 3 levels "[10.4,18.2]",..: 2 2 2 2 2 1 1 2 2 2 ...
#> $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
#> $ disp : num 160 160 108 258 360 ...
#> $ disp_3cat_n: Factor w/ 3 levels "[71.1,146]","(146,293]",..: 2 2 1 2 3 2 3 2 1 2 ...
#> $ disp_3cat_r: Factor w/ 3 levels "[70.7,205]","(205,338]",..: 1 1 1 2 3 2 3 1 1 1 ...
#> $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
#> $ drat : num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
#> $ drat_3cat_n: Factor w/ 3 levels "[2.76,3.17]",..: 2 2 2 1 1 1 2 2 3 3 ...
#> $ drat_3cat_r: Factor w/ 3 levels "[2.76,3.48]",..: 2 2 2 1 1 1 1 2 2 2 ...
#> $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
#> $ wt_3cat_n : Factor w/ 3 levels "[1.51,2.81]",..: 1 2 1 2 2 2 3 2 2 2 ...
#> $ wt_3cat_r : Factor w/ 3 levels "[1.51,2.82]",..: 1 2 1 2 2 2 2 2 2 2 ...
#> $ qsec : num 16.5 17 18.6 19.4 17 ...
#> $ qsec_3cat_n: Factor w/ 3 levels "[14.5,17]","(17,18.6]",..: 1 1 3 3 1 3 1 3 3 2 ...
#> $ qsec_3cat_r: Factor w/ 3 levels "[14.5,17.3]",..: 1 1 2 2 1 3 1 2 3 2 ...
#> $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
#> $ am : num 1 1 1 0 0 0 0 0 0 0 ...
#> $ gear : num 4 4 4 3 3 3 3 4 4 4 ...
#> $ carb : num 4 4 1 1 2 1 4 2 2 4 ...
The data frame comes out of the pipeline with the new discretised variables nicely arranged and named.
Mapping a function within groups
purrr is also able to deal with dplyr groupings. The groups can be
defined with either dplyr::group_by() or purrr::slice_rows(). To
apply a function to all columns within groups, just combine a mapping
function with the by_slice() adverb:
mtcars %>%
  slice_rows("cyl") %>%
  by_slice(map, ~ .x / sum(.x))
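Since slice_rows() and by_slice() are among the functions deprecated in the update at the top of this post, here is a hedged sketch of the same computation using dplyr's colwise family instead. It assumes a dplyr version in which mutate_all() accepts purrr-style formulas, and the name shares is made up for the example.

```r
library("dplyr")

shares <- mtcars %>%
  group_by(cyl) %>%
  mutate_all(~ .x / sum(.x))

# Within each cyl group, the mpg shares add up to 1
shares %>% summarise(total_mpg = sum(mpg))
```

The grouping column cyl is left untouched, and any column whose sum is zero within a group would come out as NaN, so it may be worth restricting the call to the columns of interest.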