The future of R syntax?

15 February 2016

Following Romain François's example, I spent last week playing with the definition of the R grammar. I focused on four changes that I think would improve existing R idioms: creating lists with bare square brackets; a compact lambda notation; labelled blocks of code; and of course natively implementing the pipe operator. While none of these changes is strictly necessary, they make the language more comfortable to use and nicer to look at. I provide working implementations for all of them in the brackets, brackets-lambda, labelled and pipe branches at https://github.com/lionel-/r-source.

Bare Square Brackets

Advanced treatments of R programming stress that R is a functional language. This essentially means that functions are first-class citizens and that you can pass them as arguments to other functions. This makes it possible to have the apply family of functions in base R or the map family in purrr. By the same token, this makes lists extremely useful in R. They can contain any kind of object and you can use functional programming techniques to manipulate them with expressive idioms. In addition, since list elements can be associated with names, they can directly map to the arguments of a function call via do.call() or purrr::invoke(), another key idiom of functional programming in R.
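
As a small illustration of that last point in current R, a named list can be spliced into a function call with do.call(). The values below are made up for the example:

```r
# A named list whose names match the arguments of mean()
args <- list(x = c(1, 5, NA), na.rm = TRUE)

# Equivalent to mean(x = c(1, 5, NA), na.rm = TRUE)
spliced <- do.call(mean, args)
spliced
#> [1] 3

# Functions are ordinary values, so they can be stored in a list too
summaries <- list(min = min, max = max)
results <- lapply(summaries, function(f) f(c(2, 7, 4)))
```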

Despite their importance in the R language, lists do not benefit from as much syntax sugar as in other languages. Hence my first change to the R syntax: creating lists with bare square brackets:

[3, 4, letters]
#> [[1]]
#> [1] 3
#>
#> [[2]]
#> [1] 4
#>
#> [[3]]
#>  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
#> [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"

[3, 4, letters] %>% map_lgl(is.double)
#> [1]  TRUE  TRUE FALSE

This can greatly improve code clarity. Compare dense nested list constructs such as

list(
  list(1, 2),
  list(3, 4)
)

to the much lighter and cleaner

[[1, 2], [3, 4]]

An important use case that would also benefit from this syntax is when a function needs some additional arguments in the form of a list. Think of the contrasts argument of lm() or the args argument of ggplot2's stat_function(). They both involve passing a list of arguments, which bloats the calls and makes scripts heavier to read. The bare brackets notation is a bit lighter:

mtcars$cyl <- as.factor(mtcars$cyl)

# Specific contrast for the predictor `cyl`
lm(disp ~ cyl + am, data = mtcars, contrasts = [cyl = contr.sum])
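
For comparison, here is how the same call has to be written in current R, with an explicit list(). This is only a sketch: it works on a copy of mtcars called cars_df (a name chosen for this example) so that the built-in dataset is left untouched.

```r
# Work on a copy so that the built-in mtcars is not modified
cars_df <- transform(mtcars, cyl = as.factor(cyl))

# Current R: the contrasts have to be wrapped in list()
fit <- lm(disp ~ cyl + am, data = cars_df, contrasts = list(cyl = contr.sum))

# With sum contrasts the factor coefficients are numbered (cyl1, cyl2)
# rather than named after the factor levels
coef_names <- names(coef(fit))
```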

I also have a feeling that bare brackets may be useful to come up with clean, creative syntax in DSLs. Like any syntax construct in R, the square brackets are backed by an ordinary function that can be called by name. For example, instead of mtcars[["cyl"]], you can write `[[`(mtcars, "cyl"). The function name for bare brackets is `[]`, and you can redefine its behaviour as follows:

`[]` <- function(...) "hello"

[3, 4]
#> [1] "hello"

By the same token, DSLs could capture bare brackets and give them some specific meaning.
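
The same mechanism can be tried in current R with constructs that already exist. The sketch below first calls `[[` in prefix form, then evaluates an expression in an environment where `(` has been given a new meaning. The helper with_angle_parens() is hypothetical and only meant to show how a DSL could capture a syntax construct:

```r
# Prefix form of an existing syntax construct
identical(`[[`(mtcars, "cyl"), mtcars[["cyl"]])
#> [1] TRUE

# Hypothetical helper: evaluate `expr` with a custom meaning for `(`
with_angle_parens <- function(expr) {
  env <- new.env(parent = parent.frame())
  env$`(` <- function(x) paste0("<", x, ">")
  eval(substitute(expr), env)
}

with_angle_parens((1 + 2))
#> [1] "<3>"
```

Outside of the helper, parentheses keep their usual meaning because the redefinition only lives in the temporary environment.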

Finally, an additional syntax rule could allow for list comprehensions by looking up the for keyword inside bare brackets. This would enable this kind of Python-style code:

# List comprehension:
[sum(x)^2 for x in mtcars]

# Equivalent to the following map:
mtcars %>% map(function(x) sum(x)^2)

However, I think that's going a step too far, as the functional version is much more R-like.

Lambda Notation

In R, functions can be created, given names and passed around. But a common idiom involves creating anonymous functions (lambda functions) on the fly. As the full syntax for defining a function can be cumbersome in those situations, many languages such as Scala, Haskell, F#, Python, and even C++ support a compact notation for creating lambdas. Given the importance of lambda functions in R (as in the apply family of functions), it would be particularly nice to provide an elegant notation for creating them. The second syntax update relies on the bare square brackets notation for that purpose.

The notation is based on the rightward assignment ->, an operator that is barely used in practice because it's a bit confusing. Bare square brackets followed by -> followed by any R expression create a function in place:

[x] -> 3 * x
#> [x] -> 3 * x

([x] -> 3 * x)(5)
#> [1] 15

lapply(cars, [col] -> max(col / sum(col)))
#> $speed
#> [1] 0.03246753
#>
#> $dist
#> [1] 0.05583993
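
For reference, these are the current R spellings that the lambda notation would abbreviate. The names triple and shares are only used for this example:

```r
# The same anonymous function written with the `function` keyword
triple <- function(x) 3 * x
triple(5)
#> [1] 15

# Called in place, without ever being given a name
(function(x) 3 * x)(5)
#> [1] 15

# Passed directly to lapply()
shares <- lapply(cars, function(col) max(col / sum(col)))
```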

This notation supports variadic lambdas by supplying dots:

variadic <- [...] -> {
  sum <- ..1 + ..2
  sum * 3
}

variadic(3, 4)
#> [1] 21


variadic2 <- [x, y, ...] -> length(list(...))

variadic2("a", "b", 1, 2, 3)
#> [1] 3
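
The dots behave here as they do in ordinary function definitions. As a point of comparison, the following sketch writes the two variadic functions in current R (the names variadic_base and count_extra are made up for the example):

```r
# ..1 and ..2 refer to the first and second elements of the dots
variadic_base <- function(...) {
  total <- ..1 + ..2
  total * 3
}

variadic_base(3, 4)
#> [1] 21

# Named parameters are matched first; the dots collect the rest
count_extra <- function(x, y, ...) length(list(...))

count_extra("a", "b", 1, 2, 3)
#> [1] 3
```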

Thanks to operator precedence and the left associativity of ->, usual R rules for assignment apply. The following snippet assigns the lambda first to byproduct, then to fun.

fun <- [x] -> x -> byproduct
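
The interplay between the two assignment operators can already be observed in current R, where -> binds more tightly than <-. In this sketch the function is wrapped in parentheses, because otherwise the body of a regular function definition would swallow the rightward assignment:

```r
# The rightward assignment is performed first, then the leftward one
a <- 5 -> b
c(a, b)
#> [1] 5 5

# Same pattern with a function: `byproduct` is assigned first, then `fun`
fun <- (function(x) x) -> byproduct
identical(fun, byproduct)
#> [1] TRUE
```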

Labelled Blocks

In R, code is data. When a function is called, its arguments are usually evaluated and assigned to the parameter. But functions can also request to see the code used to compute that value in the form of a quoted expression. This capacity to capture code is invaluable to creating intuitive sublanguages like dplyr or ggplot2. The third change that I introduce to R's syntax focuses on the subset of DSLs that manipulate blocks of code, such as the great testthat package.
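
A minimal sketch of this capture mechanism in current R uses substitute(). The function show_code() is a hypothetical helper written for this example:

```r
show_code <- function(x) {
  code <- substitute(x)  # the unevaluated expression supplied by the caller
  list(code = deparse(code), value = x)
}

result <- show_code(1 + 2)
result$code
#> [1] "1 + 2"
result$value
#> [1] 3
```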

Currently, blocks of code are passed to a function via curly brackets:

test_that("my code works", {
  ...
})

Wouldn't it be nicer to have the same syntax as function definitions, for loops and if-else branches? That's the purpose of this third syntax change. It allows you to write:

test_that("my code works") {
  ...
}

That's a fairly cosmetic change and admittedly not earth-shattering. However, it makes the language a bit nicer and more aesthetically pleasing. This syntax would be particularly nice for alternative ways of defining functions. For example, the type-checked functions of the ensurer package would look a bit more natural:

type_checked <- function_(a ~ integer, b ~ character) {
  some_call(a)
  other_call(b)
}

To make this work in the most R-like possible way, I decided to let the function call be any expression. This mirrors the syntax of regular function calls which may be embedded in arbitrary ways. In the following snippet, russian_dolls() returns a list whose first element is a function that returns a function that returns 3:

russian_dolls()[[1]]()()
#> [1] 3

These kinds of constructs are also possible with labelled blocks:

my_block[[1]]()() {
  code
}

The only requirement is that the end result of the expression be a function that accepts at least one argument (the block of code). This means that test_that() would be implemented in this way:

test_that <- function(desc) {
  force(desc)

  function(code) {
    test_code(desc, substitute(code), env = parent.frame())
    invisible()
  }
}

Then,

test_that("my code works") {
  check_equal(A, B)
  check_identical(C, D)
}

# Is actually equivalent to
(function(code) {
    test_code(desc, substitute(code), env = parent.frame())
    invisible()
 })({
   check_equal(A, B)
   check_identical(C, D)
 })
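
The curried pattern can be tried in current R by writing the second call explicitly. The sketch below does not use testthat: checked_block() is a hypothetical stand-in that simply records whether the block ran without error.

```r
# Hypothetical stand-in for a block-taking function such as test_that()
checked_block <- function(desc) {
  force(desc)

  function(code) {
    expr <- substitute(code)   # capture the block without evaluating it
    env <- parent.frame()      # evaluate it where the block was written
    passed <- tryCatch({
      eval(expr, env)
      TRUE
    }, error = function(e) FALSE)
    list(desc = desc, passed = passed)
  }
}

ok <- checked_block("addition works")({
  stopifnot(1 + 1 == 2)
})

bad <- checked_block("this one fails")({
  stopifnot(1 + 1 == 3)
})
```

With labelled blocks, the parentheses around the curly braces would simply disappear.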

In addition to expressions, simple labels are of course allowed:

label {
  line1
  line2
}

For instance, this would fit well with the NIMBLE DSL for specifying BUGS models. Simple labels work a bit differently than expressions though. Here, instead of looking for a function named label, the parser will look for label{}. This makes it possible to use the same identifier for a regular function call and a labelled block:

label <- function() 3
`label{}` <- function(code) 4

label()
#> [1] 3

label {
  anything
}
#> [1] 4
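
Both definitions are already legal in current R, since backticks allow any string as a function name. Only the labelled call itself needs the new parser. A sketch of the explicit call that the labelled block would presumably stand for:

```r
label <- function() 3
`label{}` <- function(code) 4

label()
#> [1] 3

# Explicit prefix call; `anything` is never evaluated because the
# function does not touch its `code` argument
`label{}`({
  anything
})
#> [1] 4
```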

Finally, note that contrary to other labelled blocks such as function definitions, the opening curly bracket must be on the same line as its identifier. Otherwise it would be ambiguous whether we have a labelled block or two expressions separated by a newline:

label
{code}

This slight inconsistency is the price to pay for that syntax extension.
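
The ambiguity is easy to check with the current parser, which reads the two-line form as two separate expressions, whereas a function definition may continue on the next line:

```r
# An identifier followed by a block on the next line: two expressions
two <- parse(text = "label\n{code}")
length(two)
#> [1] 2

# A function definition with its body on the next line: one expression
one <- parse(text = "function(x)\n{x}")
length(one)
#> [1] 1
```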

Piping Operator

This is of course the syntax update that many people in the R community are waiting for: a native piping operator. Some of the most popular R packages are based on piped interfaces: dplyr, which popularised magrittr, but also ggplot2. The latter uses a custom non-functional pipeline by overloading the + operator, but its sequel ggvis does rely on functional piping. For testing purposes, I provide two versions of a native pipe operator, |> and >>.

Given the popularity of the pipe, having native support for it in R's syntax would be huge progress. Besides the obvious aesthetic concern (though you do get accustomed to %>% with time), native handling of the pipe would improve error recovery. Here is what a traceback currently looks like with magrittr's pipe:

fail <- function(...) stop("fail")
mtcars %>% lapply(fail) %>% unlist()
#> Error in FUN(X[[i]], ...) (from #1) : fail

traceback()
#> 15: stop("fail") at #1
#> 14: FUN(X[[i]], ...)
#> 13: lapply(., fail)
#> 12: function_list[[1L]](value)
#> 11: unlist(.)
#> 10: function_list[[1L]](value)
#> 9: withVisible(function_list[[1L]](value))
#> 8: freduce(value, `_function_list`)
#> 7: Recall(function_list[[1L]](value), function_list[-1L])
#> 6: freduce(value, `_function_list`)
#> 5: `_fseq`(`_lhs`)
#> 4: eval(expr, envir, enclos)
#> 3: eval(quote(`_fseq`(`_lhs`)), env, env)
#> 2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
#> 1: mtcars %>% lapply(fail) %>% unlist()

This ugly traceback includes all the steps where magrittr manipulates the unevaluated code. Here is the same traceback with native support:

mtcars |> lapply(fail) |> unlist()
#> Error in FUN(X[[i]], ...) : fail

traceback()
#> 4: stop("fail") at #1
#> 3: FUN(X[[i]], ...)
#> 2: lapply(mtcars, fail)
#> 1: unlist(mtcars |> lapply(fail))
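
The extra frames come from the pipe being an ordinary function. Even the simplest possible function-based pipe adds one level to the call stack, as this sketch shows. The operator %then% and the helper frame_depth() are hypothetical, and far simpler than magrittr's implementation:

```r
# A deliberately naive pipe implemented as a regular function
`%then%` <- function(lhs, rhs) rhs(lhs)

# Reports how deep in the call stack it is being evaluated
frame_depth <- function(x) sys.nframe()

direct <- frame_depth(mtcars)
piped <- mtcars %then% frame_depth

# The piped call sits one frame deeper than the direct call
piped - direct
#> [1] 1
```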

The _ character is also legalised so it can become the placeholder in pipelines. The same rules as with magrittr's placeholder apply:

mtcars |>
  list(_, _) |>
  identical(list(mtcars, mtcars))
#> [1] TRUE

mtcars |>
  list(list(_, _)) |>
  identical(list(mtcars, list(mtcars, mtcars)))
#> [1] TRUE
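
These placeholder rules can be sketched as a rewrite of quoted code in current R. This is not the code from the branches: pipe_rewrite() is a hypothetical helper, and it uses . as a stand-in for _ because a bare _ does not parse in current R.

```r
# Hypothetical helper: rewrite `lhs |> rhs` into a plain call
pipe_rewrite <- function(lhs, rhs) {
  args <- as.list(rhs)[-1]
  # Does the placeholder appear directly among the arguments of the call?
  is_direct <- vapply(args, identical, logical(1), quote(.))
  # Replace every placeholder, nested or not, with the LHS expression
  filled <- do.call(substitute, list(rhs, list(. = lhs)))
  if (any(is_direct)) {
    filled
  } else {
    # No direct placeholder: the LHS also becomes the first argument
    as.call(c(list(filled[[1]], lhs), as.list(filled)[-1]))
  }
}

pipe_rewrite(quote(mtcars), quote(list(., .)))
#> list(mtcars, mtcars)

pipe_rewrite(quote(mtcars), quote(list(list(., .))))
#> list(mtcars, list(mtcars, mtcars))
```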

I actually provide two implementations of the pipe. The first creates a classic binary operator, |>, that calls a special primitive function. Special primitives are a class of core R functions that do not evaluate their arguments, which allows them to manipulate quoted code before evaluation.

The second implementation, invoked by the >> operator, directly manipulates the parse tree. This means that you cannot redefine >>. R will always transform the expression object >> call() to call(object) and you'll never get a chance to call the operator manually with prefix notation. Such syntax transformations apply to a few operators in R, like the rightward assignment operator -> or the double-starred exponentiation **. By contrast, the first operator |> can be redefined and called with prefix notation.
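
The difference between the two kinds of operators can be observed in current R. The parser rewrites -> and ** before any function is looked up, whereas a user-defined infix operator is a regular function. The operator %to% below is hypothetical and only serves as an example:

```r
# Parser-level transformations: the original operator never reaches the AST
identical(quote(1 -> x), quote(x <- 1))
#> [1] TRUE
identical(quote(2 ** 3), quote(2 ^ 3))
#> [1] TRUE

# A regular binary operator can be called with prefix notation...
`%to%` <- function(lhs, rhs) rhs(lhs)
`%to%`(4, sqrt)
#> [1] 2

# ...and redefined at will
`%to%` <- function(lhs, rhs) "redefined"
4 %to% sqrt
#> [1] "redefined"
```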

I think the first implementation is more natural in the R language and consistent with most operators. On the other hand, manipulating the parse tree ensures that the placeholder _ will always act consistently as a shortcut for the LHS. This would avoid the conflicts that arise with the . placeholder which is currently used for different conflicting purposes in dplyr, magrittr and purrr. Thus there are pros and cons for both approaches.

Could this get into R Core?

R Core has gotten the reputation of being a bit conservative, which is only fair considering the responsibility that weighs on their shoulders.

I think that, contrary to proposals for integrating optional type checking in the syntax, all four of these syntax changes clearly fit the spirit of R as a dynamic, functional language. When it makes sense, they can be manipulated like first-class citizens through prefix notation like other language constructs. They shouldn't disturb any existing R code and they improve currently used R idioms rather than invent new ones. So I think there is a chance that R Core could consider some of them.

More testing is needed to assess the consequences in terms of performance and backward compatibility, though I didn't find any problems in my limited testing. One point of contention might be that the bare brackets and labelled blocks increase the number of shift-reduce conflicts during parser generation. I guess many of those could be fixed by refactoring the grammar a bit, or by adding precedence and associativity directives to some production rules. But core members will probably feel a bit nervous about applying non-trivial changes to such a fundamental part of the R code, which has basically not changed since the first available revision in 1997. It's probably OK to ignore these conflicts, however. There are currently 81 of them and Bison, the parser generator, seems to be doing a very good job of automatically resolving the ambiguities.

My plan is to get community feedback on Twitter before proposing the changes to R Core. In case they are interested in some of them, I'll run a comprehensive test on CRAN packages to make sure that the new syntax doesn't break anything.

So, could R 4.0 look like this?

 test_that("new syntax works") {

   data <- list(mtcars, 1, 2, list(3, mtcars, 4))
   expected <- lapply(data, function(x) is.list(x) || is.double(x))

   mtcars |>
     [1, 2, [3, _, 4]] |>
     map([x] -> is.list(x) || is.double(x)) |>
     check_equal(expected)

 }
Tags: rstats, syntax