Following on parts 1, 2 & 3—yes, a series—we arrive to part 4 revisiting Base R. See part 1 for the rationale, in case you’re wondering Whyyyy?
A typical question going back to Base
from the tidyverse
: How do I join datasets? What do I use instead of bind_rows()
and bind_cols()
? Easy, rbind() and cbind(), yes, r for rows and c for cols, because base is concise.
By rows
If we have a couple of data frames with the same variables (columns), then using rbind()
binds/glues/stitches the data frames one after the other.
example_df1 <- data.frame(record = 1:24, treatment = rep(LETTERS[1:3], each = 8)) example_df2 <- data.frame(record = 25:48, treatment = rep(LETTERS[4:6], each = 8)) example_df3 <- data.frame(record = 49:72) # This one works example_bound <- rbind(example_df1, example_df2) # This one doesn't as they don't have the same variables example_bound <- rbind(example_df1, example_df3) # If we redefine the data frame we can join more than two data frames example_df3 <- data.frame(record = 49:72, treatment = rep(LETTERS[7:9], each = 8)) example_bound <- rbind(example_df1, example_df2, example_df3)
Of course we can use pipes too:
example_df1 |> rbind(example_df2) -> example_bound2
By columns
If we have a couple of data frames with the same number of rows (cases), then using cbind()
binds/glues/stitches the data frames side by side.
example_df4 <- data.frame(record = 1:24, treat1 = rep(LETTERS[1:3], each = 8)) example_df5 <- data.frame(treat2 = rep(LETTERS[4:5], 12), meas = rnorm(24)) example_cbound <- cbind(example_df4, example_df5) example_cbound record treat1 treat2 meas 1 1 A D -2.1158479 2 2 A E 0.7784022 3 3 A D -0.0112054 4 4 A E -0.1986594 ...
When you are working with data frames you get pretty much what you’d expect in dplyr. However, if you are not working with data frames but, instead, you’re dealing with vectors you end up with matrices, in which all elements have the same type. Coercing different types may produce unexpected results
# Binding columns x <- 1:26 y <- sqrt(x) example_1 <- cbind(x, y) # What do we get? is.matrix(example_1) [1] TRUE example_1 x y [1,] 1 1.000000 [2,] 2 1.414214 [3,] 3 1.732051 [4,] 4 2.000000 ... # Perhaps unexpected result. Variable x # was coerced to character example_2 <- cbind(x, letters) example_2 x letters [1,] "1" "a" [2,] "2" "b" [3,] "3" "c" [4,] "4" "d" ...
By one or more indices
When you have data frames with one or more variables “in common” the function to use is merge()
, which may work like left_join()
and right_join()
in dplyr
.
merge(x, y, by =) # which you can read as merge(left, right, by = )
Think of x
as left and y
as right. Using all.x = TRUE
extra rows will be added to the output, one for each row in x
that has no matching row in y
. Using all.y = TRUE
extra rows will be added to the output, one for each row in y
that has no matching row in x
.
As an example, I have two data frames with a tree id (ids
) and a derived variable (first tree ring to achieve a technical threshold for microfibril angle and modulus of elasticity). I would like to join them by ids:
head(firstmfa) ids assess 1 DM001 3 2 DM002 5 3 DM003 4 4 DM004 6 5 DM005 5 6 DM006 7 head(firstmoe) ids ring 1 DM001 8 2 DM002 8 3 DM003 8 4 DM004 8 5 DM005 9 6 DM006 12 # Merging keeping all observations gendata <- merge(firstmfa, firstmoe, by = 'ids', all = TRUE)
Another example using more than one joining variable. Actual wood density (in kg/m3) and microfibril angle (in degrees) assessments per tree ring, joined by tree code and ring number
> head(densdataT) ids ring density 1 DM001 1 NA 2 DM001 2 NA 3 DM001 3 327.96 4 DM001 4 325.37 5 DM001 5 336.59 6 DM001 6 360.82 ... > head(mfadataT) ids ring mfa 1 DM001 1 NA 2 DM001 2 NA 3 DM001 3 31.93 4 DM001 4 31.70 5 DM001 5 33.21 6 DM001 6 27.98 assess <- merge(densdataT, mfadataT, by = c('tree', 'ring'), all = TRUE)