The goal of this package is to
offer an alternative way of expressing common
operations with data.table
without sacrificing the performance optimizations that it offers. The
foundation for the data manipulation verbs is the dplyr
package, which also advocates the piping operator from the magrittr
package. The rlang
package powers most of this package’s functionality, which means that tidy
evaluation can also be supported. There are other resources that
provide comprehensive descriptions of these packages, so they will not
be explained here.
Even though using data manipulation verbs can improve expressiveness
in some cases, this is not always true, so using the traditional
data.table
syntax might still be preferable in many
situations. Unfortunately, it is not entirely straightforward to use the
verbs without considerable knowledge of the syntax understood by
data.table
, as well as its optimizations
and special
symbols. We will return to the importance of this at the end.
All verbs defined in this package essentially parse the input given
to them and build expressions that data.table
understands.
These expressions are used in a data.table
frame,
i.e. in a call similar to DT[i, j, by, ...]
where
DT
is a data.table
. The frame’s
main elements are the 3 shown: i
, j
, and
by
. Let us first list the available single-table verbs and
their effects in the frame (two-table joining verbs have their
own vignette):
i
:
arrange
/order_by
filter
filter_on
filter_sd
where
max_by
min_by
j
:
select
mutate
mutate_sd
transmute
transmute_sd
group_by
to specify by
key_by
to specify keyby
distinct
specifies j
and
by
The where
verb is equivalent to dplyr
’s
filter
, but is always lazy, i.e. it expects another verb to
be used afterwards. The order_by
verb is simply an alias
for arrange
.
Some verbs can/must be combined with others, which means that some
expressions must be built lazily. It is possible to work in an
entirely lazy manner (see below),
but, as of version 0.3.0, it is no longer necessary (do note that this
makes table.express
conflict with dtplyr
).
Nevertheless, in order to enable some of the expressions to be built
lazily, the order in which the verbs are used becomes
important:
group_by
/key_by
should be used
first.i
should be used next, usually
specifying whether they should evaluate the frame or not.j
should be used last.The examples here explain this in more detail with the
mtcars
data:
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> ---
#> 31: 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8
#> 32: 21.4 4 121 109 4.11 2.780 18.60 1 1 4 2
The foundation for this package is building expressions that are
almost entirely delegated to data.table
. In order to
explicitly show when an expression is being built and subsequently
evaluated, we can use 3 delimiters:
start_expr
end_expr
chain
These also serve as visual reminders that we are not dealing directly
with data.table
s during the process. We capture the input
data.table
and start the process with
start_expr
, and evaluate the final expression with
end_expr
. Using chain
is equivalent to calling
end_expr
immediately followed by
start_expr
.
In order to print more information regarding the expression-building
process, we can either set
options(table.express.verbose = TRUE)
, or pass
.verbose = TRUE
to start_expr
or
chain
.
Let’s start with a lazy example to get an idea of how the built
expressions look like. The arrange
/order_by
verbs add an expression with order
to the frame, and let
data.table
handle it as usual:
#> .DT_[order(mpg, -cyl)]
We see here that the built expression includes a .DT_
pronoun. When the expression is evaluated, the captured
data.table
is assigned to the evaluation environment as
said pronoun. When evaluating eagerly, this whole process happens in the
background:
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
#> 2: 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
#> ---
#> 31: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 32: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Even though selecting a subset of columns is a common operation, it
may be undesirable to do so when working with data.table
s
because it always leads to deep data copies.
Given that data.table
supports modification by reference,
these copies are necessary. For example, the following would require
copying all columns between mpg
and disp
,
since any of them could be modified by reference afterwards.
Therefore, it might be preferable to use mutation verbs to remove columns that are not of interest.
With that said, the select
verb can be used as
usual:
#> mpg am
#> <num> <num>
#> 1: 21.0 1
#> 2: 21.0 1
#> ---
#> 31: 15.0 1
#> 32: 21.4 1
To maintain consistency, even single columns are kept as
data.table
s:
#> mpg
#> <num>
#> 1: 21.0
#> 2: 21.0
#> ---
#> 31: 15.0
#> 32: 21.4
Calls to tidyselect
’s helpers or to :
are
handled specially internally:
#> mpg cyl disp gear carb
#> <num> <num> <num> <num> <num>
#> 1: 21.0 6 160 4 4
#> 2: 21.0 6 160 4 4
#> ---
#> 31: 15.0 8 301 5 8
#> 32: 21.4 4 121 4 2
#> mpg am
#> <num> <num>
#> 1: 21.0 1
#> 2: 21.0 1
#> ---
#> 31: 15.0 1
#> 32: 21.4 1
Tidy evaluation and the .parse
argument can also aid in
cases where certain parts of the frame were computed
programmatically:
#> mpg am
#> <num> <num>
#> 1: 21.0 1
#> 2: 21.0 1
#> ---
#> 31: 15.0 1
#> 32: 21.4 1
The transmute
verb simply wraps everything in its
ellipsis in a call to list
and assigns the expression to
the frame’s j
.
#> foo bar
#> <num> <num>
#> 1: 42.0 403.42879
#> 2: 42.0 403.42879
#> ---
#> 31: 30.0 2980.95799
#> 32: 42.8 54.59815
Even though select
can be used for the same (given the
way data.table
handles j
), the simplicity of
transmute
’s internals makes it preferable when no
tidyselect
helpers are needed, since it avoids extra
function calls and can reduce overhead significantly in special
circumstances.
Because of the way data.table
handles j
,
transmute
could be used to summarize columns. However, the
summarize
/summarise
verbs can be more
expressive, and they check whether the summary value actually has length
1.
The mutate
verb builds an expression with
:=
in order to perform assignment by reference by
default.
#> .DT_[, `:=`(mpg = mpg/2, hp = log(hp))]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 10.5 6 160 4.700480 3.90 2.620 16.46 0 1 4 4
#> 2: 10.5 6 160 4.700480 3.90 2.875 17.02 0 1 4 4
#> ---
#> 31: 7.5 8 301 5.814131 3.54 3.570 14.60 0 1 5 8
#> 32: 10.7 4 121 4.691348 4.11 2.780 18.60 1 1 4 2
It can also support expressions that already contain
:=
:
new_vars <- c("x", "y")
DT %>%
copy %>%
mutate(!!new_vars := .(1, 2), .unquote_names = FALSE) %>% {
invisible(print(.))
}
#> mpg cyl disp hp drat wt qsec vs am gear carb x
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 1
#> 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 1
#> ---
#> 31: 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8 1
#> 32: 21.4 4 121 109 4.11 2.780 18.60 1 1 4 2 1
#> y
#> <num>
#> 1: 2
#> 2: 2
#> ---
#> 31: 2
#> 32: 2
The where
/filter
verbs work with the
i
part of the frame:
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
#> 2: 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
#> mean_mpg
#> <num>
#> 1: 18.5
As mentioned in the beginning, where
can be used as a
lazy version of filter
, letting us add other clauses to the
same frame.
The helper verb filter_sd
can be used to apply the same
conditions to many columns, and it can use a special pronoun
.COL
while specifying the expression, lambdas as formulas,
as well as tidyselect
helpers to choose
.SDcols
:
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 2: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
#> 2: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
#> .DT_[mpg > 0 & am > 0]
#> .DT_[mpg > 0 & cyl > 0 & disp > 0 & hp > 0 & drat > 0 & wt >
#> 0 & qsec > 0 & vs > 0 & am > 0 & gear > 0 & carb > 0]
In order to enable combination of filtering helpers with other verbs,
the .expr
parameter can be used to indicate that the
expression should be kept, delaying evaluation until a verb that sets
j
is used:
DT %>%
copy %>%
filter_sd(c("vs", "am"), .COL > 0, .expr = TRUE) %>%
mutate(some_flag = TRUE) %>% {
invisible(print(.))
}
#> mpg cyl disp hp drat wt qsec vs am gear carb some_flag
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <lgcl>
#> 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 NA
#> 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 NA
#> ---
#> 31: 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8 NA
#> 32: 21.4 4 121 109 4.11 2.780 18.60 1 1 4 2 TRUE
The filter_on
verb can be used to build an expression
that specifies the on
argument of the frame. It accepts
key-value pairs where each key is a column in the data, and each value
is the corresponding value that the column should have to match:
#> .DT_[list(6, 0), on = c("cyl", "am")]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
#> 2: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
#> 3: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
#> 4: 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
In order to support functionality similar to
data.table
’s
DT[, lapply(.SD, fun), .SDcols = c("...")]
syntax, 2 verbs
are provided: mutate_sd
and transmute_sd
.
Starting with mutate_sd
, it modifies columns in
.SDcols
by reference, and columns that are not part of
.SDcols
are kept:
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 1318815734 403.42879 160 110 3.90 2.620 16.46 0 1 4 4
#> 2: 1318815734 403.42879 160 110 3.90 2.875 17.02 0 1 4 4
#> ---
#> 31: 3269017 2980.95799 301 335 3.54 3.570 14.60 0 1 5 8
#> 32: 1967441884 54.59815 121 109 4.11 2.780 18.60 1 1 4 2
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
#> 2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
#> ---
#> 31: 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8
#> 32: 21.4 4 121 109 4.11 2.780 18.60 1 1 4 2
It would also be possible to remove several columns with
mutate_sd
:
#> disp hp drat wt qsec vs am gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 160 110 3.90 2.620 16.46 0 1 4 4
#> 2: 160 110 3.90 2.875 17.02 0 1 4 4
#> ---
#> 31: 301 335 3.54 3.570 14.60 0 1 5 8
#> 32: 121 109 4.11 2.780 18.60 1 1 4 2
Additionally, mutate_sd
supports the special
.COL
pronoun that symbolizes the column that should be
modified, as well as lambdas as formulas:
#> .DT_[, `:=`(c("cyl", "hp", "vs", "am", "gear", "carb"), lapply(.SD,
#> as.integer)), .SDcols = c("cyl", "hp", "vs", "am", "gear",
#> "carb")]
#> .DT_[, `:=`(c("cyl", "hp", "vs", "am", "gear", "carb"), lapply(.SD,
#> as.integer)), .SDcols = c("cyl", "hp", "vs", "am", "gear",
#> "carb")]
On the other hand, transmute_sd
never modifies by
reference, and supports special expressions to “build” what is chosen as
.SDcols
. These expressions can use tidyselect
helpers, as well as another special pronoun: .COLNAME
:
#> disp drat
#> <num> <num>
#> 1: 320 7.80
#> 2: 320 7.80
#> ---
#> 31: 602 7.08
#> 32: 242 8.22
#> disp drat
#> <num> <num>
#> 1: 320 7.80
#> 2: 320 7.80
#> ---
#> 31: 602 7.08
#> 32: 242 8.22
#> .DT_[, .transmute_matching(.SD, .which = rlang::quo(~is.numeric(.x)),
#> .hows = rlang::quos(~.x * 2))]
Both verbs also support a list of functions for multiple
transformations, but mutate_sd
performs pairwise
transformations by default, whereas transmute_sd
performs
all combinations:
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 21 6 160 110 3.90 2.620 16.46 0 -1 4 4
#> 2: 21 6 160 110 3.90 2.875 17.02 0 -1 4 4
#> ---
#> 31: 15 8 301 335 3.54 3.570 14.60 0 -1 5 8
#> 32: 21 4 121 109 4.11 2.780 18.60 1 -1 4 2
#> min.mpg min.am max.mpg max.am mean.mpg mean.am
#> <num> <num> <num> <num> <num> <num>
#> 1: 10.4 0 33.9 1 20.09062 0.40625
Since data.table
already supports this by means of its
by
parameter, the group_by
verb simply parses
its input and assigns it accordingly:
#> gear N
#> <num> <int>
#> 1: 4 12
#> 2: 3 15
#> 3: 5 5
The key_by
verb does the same but sets the key of the
input in order to sort:
#> Key: <gear>
#> gear N
#> <num> <int>
#> 1: 3 15
#> 2: 4 12
#> 3: 5 5
Using group_by
or key_by
by itself would
not be useful. Therefore, they are entirely lazy, and simply return the
expression builder that is used in the background. Using other verbs
will normally lead to the evaluation of the expression, though some
verbs can delay evaluation, e.g. where
and other verbs that
set i
when their respective .expr
parameters
are set to TRUE
.
The distinct
verb is a shortcut for:
select(.SD[1])
with a by
clause if
.keep = TRUE
..SDcols
if
.keep
is a character vector.unique(.SD)
with specific .SDcols
if
.keep = FALSE
.#> vs am mpg cyl disp hp drat wt qsec gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0 1 21.0 6 160 110 3.90 2.620 16.46 4 4
#> 2: 1 1 22.8 4 108 93 3.85 2.320 18.61 4 1
#> 3: 1 0 21.4 6 258 110 3.08 3.215 19.44 3 1
#> 4: 0 0 18.7 8 360 175 3.15 3.440 17.02 3 2
#> vsam mpg cyl disp hp drat wt qsec vs am gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 1 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4
#> 2: 2 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
#> 3: 0 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
#> vs am
#> <num> <num>
#> 1: 0 1
#> 2: 1 1
#> 3: 1 0
#> 4: 0 0
When working lazily, i.e. with
start_expr
/chain
/end_expr
, there
are special considerations that need to be taken into account.
Some data.table
idioms benefit from nesting expressions,
for instance this
idiom to find, by groups, rows where a column has its extrema. The
helper function nest_expr
can help in these cases. When
used inside a supported verb, it eagerly
builds a functional chain with %>%
, where the currently
captured data.table
enters as .
For example, the linked idiom could be expressed as:
DT %>%
start_expr(.verbose = TRUE) %>%
where(nest_expr(.start = FALSE, {
.[, .I[mpg == max(mpg)], by = vs]$V1
})) %>%
end_expr
#> Nesting the result of evaluating the following functional chain:
#> .DT_ %>% {
#> .[, .I[mpg == max(mpg)], by = vs]$V1
#> }
#> Expression after where.ExprBuilder(...):
#> .DT_[.NEST_0_]
#> Using captured data.table for evaluation.
#> Evaluating:
#> .DT_[.NEST_0_]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
#> 2: 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
Nevertheless, this specific logic has been encapsulated in the
max_by
/min_by
verbs.
By default, the verbs defined in this package automatically start a
new frame whenever they want to define one of
i
/j
/by
, but the current
expression’s frame has already specified it; otherwise they add to the
current frame. More complex expressions are thus supported by
automatically chaining data.table
frames:
DT %>%
start_expr %>%
select(mean_mpg = mean(mpg)) %>%
where(hp > 50L) %>%
group_by(vs, am, gear) %>%
order_by(gear, -vs, am) %>%
filter(mean_mpg > 20) %T>%
print %>%
end_expr %>% {
invisible(print(., nrows = 10L))
}
#> .DT_[hp > 50L, .select_matching(, mean_mpg = mean(mpg), .negate = FALSE),
#> by = list(vs, am, gear)][order(gear, -vs, am)][mean_mpg >
#> 20]
#> vs am gear mean_mpg
#> <num> <num> <num> <num>
#> 1: 1 0 3 20.33333
#> 2: 1 0 4 21.05000
#> 3: 1 1 4 28.03333
#> 4: 0 1 4 21.00000
#> 5: 1 1 5 30.40000
If we wanted to be explicit about chaining whenever possible (see below), we could set
options(table.express.chain = FALSE)
, which would lead to a
warning being shown whenever a part of the query is replaced.
The automatic chaining mentioned above is enough in most situations. For example the following chains lead to the same result, and therefore have the same semantics:
#> [1] 42.0 42.0 45.6 42.8 48.8 45.6 64.8 60.8 67.8 43.0 54.6 52.0 60.8 42.8
#> [1] 42.0 42.0 45.6 42.8 48.8 45.6 64.8 60.8 67.8 43.0 54.6 52.0 60.8 42.8
However, these two chains have different semantics:
#> mpg
#> <num>
#> 1: 42.0
#> 2: 42.0
#> ---
#> 13: 60.8
#> 14: 42.8
#> Empty data.table (0 rows and 1 cols): mpg
As mentioned above,
chain
can be used to chain expressions by evaluating the
current one with end_expr
, and immediately capturing the
resulting data.table
to start building a new expression.
This can be helpful in situations where automatic chaining (or lack
thereof) can lead to a change in the expression’s semantics:
#> .DT_[mpg > 40, list(mpg = mpg * 2)]
#> Empty data.table (0 rows and 1 cols): mpg
#> mpg
#> <num>
#> 1: 42.0
#> 2: 42.0
#> ---
#> 13: 60.8
#> 14: 42.8
In the following cases, the mentioned verbs use the captured
data.table
eagerly during expression building, even when
working lazily:
filter_sd
when .SDcols
is a
tidyselect
call or a predicate using
.COL
.mutate_sd
when .SDcols
is a
tidyselect
call or a predicate using
.COL
.nest_expr
in:
anti_join
’s y
.inner_join
’s y
.right_join
’s y
.filter
or where
.max_by
and min_by
, since they are based on
nest_expr
.This can lead to unexpected results if we don’t keep in mind the expression that is built:
#> mpg cyl disp
#> <num> <num> <num>
#> 1: 21.0 6 160
#> 2: 21.0 6 160
#> ---
#> 31: 15.0 8 301
#> 32: 21.4 4 121
DT %>%
start_expr %>%
select(mpg:disp) %>%
transmute(ans = sqrt(mpg)) %>%
filter_sd(contains("m"), .COL > 0)
#> .DT_[, .select_matching(.SD, mpg:disp, .negate = FALSE)][mpg >
#> 0 & am > 0, list(ans = sqrt(mpg))]
The select
gets rid of am
, but
filter_sd
sees the columns of DT
before any
expression has been evaluated. Explicit chaining can help in these
cases, capturing intermediate results:
DT %>%
start_expr %>%
select(mpg:disp) %>%
chain %>%
transmute(ans = sqrt(mpg)) %>%
filter_sd(contains("m"), .COL > 0)
#> .DT_[mpg > 0, list(ans = sqrt(mpg))]
Many of the verbs in this package try to help make code more concise
whenever possible. However, some of the added convenience requires
helper functions that aid with non-standard evaluation, which doesn’t
always play nicely with the data.table
optimizations
mentioned in the beginning. Therefore, it is worth keeping the built
expressions as close to traditional data.table
syntax as
possible. The verbs try to stick to this principle, but in order to
build a simple output expression, they also require a simple input.
Let us illustrate this point with transmute_sd
, which
can be used to compute summaries:
DT %>%
start_expr %>%
group_by(am, vs) %>%
transmute_sd(c("mpg", "disp"), mean) %>%
frame_append(verbose = TRUE) %T>%
print %>%
end_expr
#> .DT_[, c(lapply(.SD, mean)), by = list(am, vs), .SDcols = c(c("mpg",
#> "disp")), verbose = TRUE]
#> Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
#> forder.c received 32 rows and 2 columns
#> forderReuseSorting: opt=0, took 0.000s
#> 0.000s elapsed (0.001s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
#> forder.c received a vector type 'integer' length 4
#> forderReuseSorting: opt=0, took 0.000s
#> 0.000s elapsed (0.000s cpu)
#> lapply optimization changed j from 'c(lapply(.SD, mean))' to 'list(mean(mpg), mean(disp))'
#> GForce optimized j to 'list(gmean(mpg), gmean(disp))' (see ?GForce)
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.000
#> This gmean took (narm=FALSE) ... gather took ... 0.000s
#> 0.000s
#> This gmean took (narm=FALSE) ... gather took ... 0.000s
#> 0.000s
#> gforce eval took 0.000
#> 0.000s elapsed (0.000s cpu)
#> am vs mpg disp
#> <num> <num> <num> <num>
#> 1: 1 0 19.75000 206.2167
#> 2: 1 1 28.37143 89.8000
#> 3: 0 1 20.74286 175.1143
#> 4: 0 0 15.05000 357.6167
We can see, for example,
lapply optimization changed j ...
. Changing the expressions
just a bit, without affecting its semantics, still hinders
data.table
’s ability to recognize cases it can
optimize:
DT %>%
start_expr %>%
group_by(am, vs) %>%
transmute_sd(c("mpg", "disp"), mean(.COL)) %>%
frame_append(verbose = TRUE) %T>%
print %>%
end_expr
#> .DT_[, .transmute_matching(.SD, .which = rlang::quo(c("mpg",
#> "disp")), .hows = rlang::quos(mean(.COL))), by = list(am,
#> vs), verbose = TRUE]
#> Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
#> forder.c received 32 rows and 2 columns
#> forderReuseSorting: opt=0, took 0.000s
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
#> forder.c received a vector type 'integer' length 4
#> forderReuseSorting: opt=0, took 0.000s
#> 0.000s elapsed (0.000s cpu)
#> lapply optimization is on, j unchanged as '.transmute_matching(.SD, .which = rlang::quo(c("mpg", "disp")), .hows = rlang::quos(mean(.COL)))'
#> GForce is on, but not activated for this query; left j unchanged (see ?GForce)
#> Old mean optimization is on, left j unchanged.
#> Making each group and running j (GForce FALSE) ... The result of j is a named list. It's very inefficient to create the same names over and over again for each group. When j=list(...), any names are detected, removed and put back after grouping has completed, for efficiency. Using j=transform(), for example, prevents that speedup (consider changing to :=). This message may be upgraded to warning in future.
#>
#> collecting discontiguous groups took 0.000s for 4 groups
#> eval(j) took 0.001s for 4 calls
#> 0.001s elapsed (0.002s cpu)
#> am vs mpg disp
#> <num> <num> <num> <num>
#> 1: 1 0 19.75000 206.2167
#> 2: 1 1 28.37143 89.8000
#> 3: 0 1 20.74286 175.1143
#> 4: 0 0 15.05000 357.6167
Corroborated by the statements j unchanged
,
GForce FALSE
, etc. Nevertheless, a different syntax can
enable additional functionality, for instance column renaming:
#> am vs avg.mpg avg.disp min.mpg min.disp
#> <num> <num> <num> <num> <num> <num>
#> 1: 1 0 19.75000 206.2167 15.0 120.3
#> 2: 1 1 28.37143 89.8000 21.4 71.1
#> 3: 0 1 20.74286 175.1143 17.8 120.1
#> 4: 0 0 15.05000 357.6167 10.4 275.8
Thus, in order to decide which syntax to use, we must be aware of
what can be optimized by data.table
, but as a general rule
of thumb, the simpler the better.
DT %>%
start_expr %>%
group_by(am, vs) %>%
transmute_sd(c("mpg", "disp"), .(min, max)) %>%
frame_append(verbose = TRUE) %T>%
print %>%
end_expr
#> .DT_[, c(min = lapply(.SD, min), max = lapply(.SD, max)), by = list(am,
#> vs), .SDcols = c(c("mpg", "disp")), verbose = TRUE]
#> Finding groups using forderv ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
#> forder.c received 32 rows and 2 columns
#> forderReuseSorting: opt=0, took 0.000s
#> 0.000s elapsed (0.000s cpu)
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
#> Getting back original order ... forderReuseSorting: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
#> forder.c received a vector type 'integer' length 4
#> forderReuseSorting: opt=0, took 0.000s
#> 0.000s elapsed (0.000s cpu)
#> lapply optimization changed j from 'c(min = lapply(.SD, min), max = lapply(.SD, max))' to 'list(min(mpg), min(disp), max(mpg), max(disp))'
#> GForce optimized j to 'list(gmin(mpg), gmin(disp), gmax(mpg), gmax(disp))' (see ?GForce)
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.000
#> gforce assign high and low took 0.000
#> gforce eval took 0.000
#> 0.001s elapsed (0.001s cpu)
#> am vs mpg disp mpg disp
#> <num> <num> <num> <num> <num> <num>
#> 1: 1 0 15.0 120.3 26.0 351
#> 2: 1 1 21.4 71.1 33.9 121
#> 3: 0 1 17.8 120.1 24.4 258
#> 4: 0 0 10.4 275.8 19.2 472