General use

The basics

Notice too that joyn::merge masks base::merge, which is ideal.

library(joyn)
library(data.table)
x1 = data.table(id = c(1L, 1L, 2L, 3L, NA_integer_),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = 11:15)

y1 = data.table(id = c(1,2, 4),
                y  = c(11L, 15L, 16))


x2 = data.table(id = c(1, 4, 2, 3, NA),
                t  = c(1L, 2L, 1L, 2L, NA_integer_),
                x  = c(16, 12, NA, NA, 15))


y2 = data.table(id = c(1, 2, 5, 6, 3),
                yd = c(1, 2, 5, 6, 3),
                y  = c(11L, 15L, 20L, 13L, 10L),
                x  = c(16:20))


y3 <- data.table(id = c("c","b", "c", "a"),
                 y  = c(11L, 15L, 18L, 20L))

x3 <- data.table(id=c("c","b", "d"),
                 v=8:10,
                 foo=c(4,2, 7))

x4 = data.table(id1 = c(1, 1, 2, 3, 3),
                id2 = c(1, 1, 2, 3, 4),
                t   = c(1L, 2L, 1L, 2L, NA_integer_),
                x   = c(16, 12, NA, NA, 15))


y4 = data.table(id  = c(1, 2, 5, 6, 3),
                id2 = c(1, 1, 2, 3, 4),
                y   = c(11L, 15L, 20L, 13L, 10L),
                x   = c(16:20))

Summary table

You can see in the examples below that by default joyn displays a small summary table.

# Natural join. i.e., using as key all the variables that appear in both tables.
# without display the resulting table.
merge(x1, y1)
#> > removing key variables `id` from yvars
#> 
#> -- JOYn Report --
#> 
#>    report n percent
#> 1:      x 2   33.3%
#> 2:  x & y 3     50%
#> 3:      y 1   16.7%
#> 4:  total 6    100%
#> ---------------------------------------------------------- End of JOYn report --
#>    id  t  x  y report
#> 1:  1  1 11 11  x & y
#> 2:  1  2 12 11  x & y
#> 3:  2  1 13 15  x & y
#> 4:  3  2 14 NA      x
#> 5:  4 NA NA 16      y
#> 6: NA NA 15 NA      x

You can use the reporting format of the report variables, by setting the argument reporttype to numeric, similar to Stata’s output.

code report meaning
1 x Only available in x table
2 y Only available in y table
3 x & y Matching obs available in both tables
4 NA updated NAs in x updated with actual values in variables with same names in y
5 value updated Actual values and NAs in x updated with actual values in variables with same names in y
6 not updated Actual values and NAs in x are NOT updated with actual values in y

Key variables

Since the object returned is a data.table, we display the results in the console by concatenating a pair of squared brackets at the end of the call. That is merge(x1, y1)[] instead of merge(x1, y1).

# Natural join.  display the results.
merge(x1, y1)[]
#> > removing key variables `id` from yvars
#> 
#> -- JOYn Report --
#> 
#>    report n percent
#> 1:      x 2   33.3%
#> 2:  x & y 3     50%
#> 3:      y 1   16.7%
#> 4:  total 6    100%
#> ---------------------------------------------------------- End of JOYn report --
#>    id  t  x  y report
#> 1:  1  1 11 11  x & y
#> 2:  1  2 12 11  x & y
#> 3:  2  1 13 15  x & y
#> 4:  3  2 14 NA      x
#> 5:  4 NA NA 16      y
#> 6: NA NA 15 NA      x

# Making explicit the key variables (strongly suggested)
merge(x1, y1, by = "id")[]
#> > removing key variables `id` from yvars
#> -- JOYn Report --
#> 
#>    report n percent
#> 1:      x 2   33.3%
#> 2:  x & y 3     50%
#> 3:      y 1   16.7%
#> 4:  total 6    100%
#> ---------------------------------------------------------- End of JOYn report --
#>    id  t  x  y report
#> 1:  1  1 11 11  x & y
#> 2:  1  2 12 11  x & y
#> 3:  2  1 13 15  x & y
#> 4:  3  2 14 NA      x
#> 5:  4 NA NA 16      y
#> 6: NA NA 15 NA      x

If the key variables have different names in both tables, you can create the equivalency as an element of by vector.

# joining by id1 in x and id in y
merge(x4, y4, by = c("id1 = id"))
#> > removing key variables `keyby1` from yvars
#> i variables `id2` and `x` in table y are ignored because arguments `update_NAs`
#> and `update_values` are FALSE.
#> 
#> -- JOYn Report --
#> 
#>    report n percent
#> 1:  x & y 5   71.4%
#> 2:      y 2   28.6%
#> 3:  total 7    100%
#> ---------------------------------------------------------- End of JOYn report --
#>    id1 id2  t  x  y report
#> 1:   1   1  1 16 11  x & y
#> 2:   1   1  2 12 11  x & y
#> 3:   2   2  1 NA 15  x & y
#> 4:   3   3  2 NA 10  x & y
#> 5:   3   4 NA 15 10  x & y
#> 6:   5  NA NA NA 20      y
#> 7:   6  NA NA NA 13      y

# joining by id1 in x and id in y, and id2  in both
merge(x4, y4, by = c("id1 = id", "id2"))
#> > removing key variables `keyby1` and `id2` from yvars
#> i variable `x` in table y is ignored because arguments `update_NAs` and
#> `update_values` are FALSE.
#> 
#> -- JOYn Report --
#> 
#>    report n percent
#> 1:      x 2     25%
#> 2:  x & y 3   37.5%
#> 3:      y 3   37.5%
#> 4:  total 8    100%
#> ---------------------------------------------------------- End of JOYn report --
#>    id1 id2  t  x  y report
#> 1:   1   1  1 16 11  x & y
#> 2:   1   1  2 12 11  x & y
#> 3:   2   2  1 NA NA      x
#> 4:   2   1 NA NA 15      y
#> 5:   3   3  2 NA NA      x
#> 6:   3   4 NA 15 10  x & y
#> 7:   5   2 NA NA 20      y
#> 8:   6   3 NA NA 13      y

Match type

Match type refers to the correspondence that exists between the observations of the joining tables. Following Stata’s convention, we could have four different match types, one-to-one (1:1) , one-to-many (1:m), many-to-one (m:1), and many-to-many (m:m) joins. The default is m:m, following general R’s practice, but its use is highly discouraged. We recommend you always specify the match type when joining tables to ensure the output is correct. Indeed, if you don’t care about match types or you don’t think it is necessary to use them for your particular needs, you might be fine without joyn.

Using the same wording of the Stata manual

Suppose you think your data is uniquely identified by variable id, it is not . If you used match_type = "1:1" you will get and error, letting you know that something is not right.

# Merging correctly but getting error because something is not right in the data
merge(x3, y3, by = "id", match_type = "1:1")
#> x table y is not uniquely identified by `id`
#> Error: match type inconsistency
#> i you could use `return_report = TRUE` in `joyn::is_id()`
#>     to see where the problem is

# Merging wrongly but getting NO errors because  you did not use match_type
merge(x3, y3, by = "id")
#> > removing key variables `id` from yvars
#> 
#> -- JOYn Report --
#> 
#>    report n percent
#> 1:      x 1     20%
#> 2:  x & y 3     60%
#> 3:      y 1     20%
#> 4:  total 5    100%
#> ---------------------------------------------------------- End of JOYn report --
#>    id  v foo  y report
#> 1:  a NA  NA 20      y
#> 2:  b  9   2 15  x & y
#> 3:  c  8   4 11  x & y
#> 4:  c  8   4 18  x & y
#> 5:  d 10   7 NA      x

Join type

Join type refers to observations that are kept after the join. To avoid confusion with argument match_type, I decided not to name this argument as join_type but as keep. At the end, you want to specify which observations you want to keep. This argument plays the role of mimicking the behavior of dplyr’s functions left_join, right_join, inner_join, and full_join, the default.