| Title: | Exploratory Analysis of Relationships Between Variables |
| Version: | 0.0.1 |
| Maintainer: | Braylin Alexander Jiménez Reynoso <braylinjr1511@gmail.com> |
| Description: | Provides tools to explore and summarize relationship patterns between variables across one or multiple datasets. The package relies on efficient sampling strategies to estimate pairwise associations and supports quick exploratory data analysis for large or heterogeneous data sources. |
| License: | GPL (≥ 3) |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Suggests: | testthat (≥ 3.0.0), tibble |
| Config/testthat/edition: | 3 |
| NeedsCompilation: | no |
| Packaged: | 2025-12-07 22:37:54 UTC; Braylin Jimenez |
| Author: | Braylin Alexander Jiménez Reynoso [aut, cre] |
| Repository: | CRAN |
| Date/Publication: | 2025-12-11 19:30:02 UTC |
Infer relationship types between variables in two datasets using sampling
Description
This function compares selected variables from two data frames and infers their relational structure (e.g., one-to-one, many-to-one). It draws random samples, with sample sizes either computed automatically or set by the user, to estimate match behavior, uniqueness patterns, and missingness characteristics. The goal is to help diagnose potential join keys or detect unrelated fields without performing full-table comparisons.
Usage
joinless(
  x,
  y,
  x_vars = NULL,
  y_vars = NULL,
  conf = 0.95,
  error = 0.05,
  n_x = NULL,
  n_y = NULL,
  max_vars = 20,
  ignore = character(0),
  missingness_tol = 0.1,
  type_coerce = TRUE,
  seed = NULL,
  verbose = FALSE,
  info = FALSE
)
Arguments
| x, y | Data frames. Input datasets to be compared. |
| x_vars, y_vars | Character vectors specifying the column names to compare. If |
| conf | Numeric. Confidence level used to compute automatic sample sizes (default: 0.95). |
| error | Numeric. Margin of error used in sample size calculation (default: 0.05). |
| n_x, n_y | Optional fixed sample sizes for |
| max_vars | Integer. Maximum number of variables to compare per dataset. Defaults to 20. |
| ignore | Character vector of relation types to exclude from the output. By default, no types are excluded. |
| missingness_tol | Numeric. Maximum tolerated proportion of missing/problematic values within a variable (default: 0.1). |
| type_coerce | Logical. If |
| seed | Optional integer. Random seed to make the sampling reproducible. |
| verbose | Logical. If |
| info | Logical. If |
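The conf and error arguments drive the automatic sample sizes, but this page does not state which formula is used. As a rough illustration only, the sketch below assumes the classic sample size formula for estimating a proportion (worst case p = 0.5) with a finite population correction. The helper name sample_size and its p argument are hypothetical and are not part of the package.

```r
# Hypothetical helper, NOT the package's implementation: sample size for
# estimating a proportion, assuming a normal approximation and a finite
# population correction.
sample_size <- function(N, conf = 0.95, error = 0.05, p = 0.5) {
  z  <- stats::qnorm(1 - (1 - conf) / 2)  # two-sided critical value
  n0 <- z^2 * p * (1 - p) / error^2       # size for an infinite population
  n  <- n0 / (1 + (n0 - 1) / N)           # finite population correction
  min(N, ceiling(n))                      # never ask for more rows than exist
}

sample_size(N = 1e6)                          # large table, default settings
sample_size(N = 1e6, conf = 0.99, error = 0.02)  # stricter settings need more rows
sample_size(N = 200)                          # small table
```

Under this assumption, tightening error or raising conf increases the sample, while n_x and n_y would bypass the calculation entirely.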
Details
Relationship inference is determined using:
- Match rate: proportion of keys in x found in y
- Key uniqueness: frequency distribution of non-missing values
Based on these, relationships are classified as:
- "one-one"
- "many-one"
- "one-many"
- "many-many"
- "unrelated" (very low or zero match rate)
- "null" (missingness above tolerance)
- "error_type" (incompatible types and coercion disabled)
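To make the classification concrete, here is a minimal base R sketch of how match rate and key uniqueness could map to these labels. It is an illustration under stated assumptions, not the package's code: the function classify_pair, the min_match cutoff for "unrelated", and the convention that the first word of the label describes x are all hypothetical. The "error_type" case is left out.

```r
# Hypothetical sketch, NOT the package's implementation.
# kx, ky: non-empty vectors of key values (already sampled) from x and y.
classify_pair <- function(kx, ky, missingness_tol = 0.1, min_match = 0.05) {
  stopifnot(length(kx) > 0, length(ky) > 0)

  # "null": missingness above tolerance on either side
  if (mean(is.na(kx)) > missingness_tol || mean(is.na(ky)) > missingness_tol) {
    return("null")
  }
  kx <- kx[!is.na(kx)]
  ky <- ky[!is.na(ky)]

  # Match rate: proportion of keys in x found in y
  match_rate <- mean(kx %in% ky)
  if (match_rate <= min_match) {
    return("unrelated")  # min_match is an assumed cutoff
  }

  # Key uniqueness: does any non-missing value repeat on each side?
  x_side <- if (anyDuplicated(kx) > 0) "many" else "one"
  y_side <- if (anyDuplicated(ky) > 0) "many" else "one"
  paste(x_side, y_side, sep = "-")
}

classify_pair(1:5, 3:7)               # unique keys on both sides, partial overlap
classify_pair(c(1, 1, 2, 3), 1:3)     # repeated keys in x only
classify_pair(1:5, 11:15)             # no overlap
```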
Value
A data frame summarizing the inferred relationship between every
variable pair.
If info = FALSE, the output contains:
- x_var: variable name in x
- y_var: variable name in y
- relation_type: inferred relationship
If info = TRUE, additional columns include:
- n_used: sample size used
- match_rate: proportion of sampled values from x found in y
- null_rate_x, null_rate_y: missingness/problematic rates
- type_x, type_y: underlying storage types
- notes: diagnostic messages
Examples
df1 <- data.frame(
  id = 1:5,
  value = 1:5
)
df2 <- data.frame(
  id = 3:7,
  value = 3:7
)
joinless(df1, df2, x_vars = "id", y_vars = "id")
joinless(df1, df2, conf = 0.99, error = 0.02, info = TRUE)
joinless(df1, df2, ignore = "unrelated")
Infer relationship types between one dataset and multiple counterparts
Description
This function is a convenience wrapper around joinless() that compares
a single dataset x against multiple datasets supplied in a list ys.
Internally it calls joinless() once per dataset and row-binds the results,
adding an extra column that identifies the target dataset.
Usage
joinless_multiple(
  x,
  ys,
  x_vars = NULL,
  y_vars = NULL,
  dataset_names = NULL,
  ...
)
Arguments
| x | A data frame. The reference dataset to compare from. |
| ys | A named or unnamed list of data frames. Each element is treated as a separate target dataset to compare |
| x_vars | Optional character vector of column names in |
| y_vars | Optional character vector of column names to use in each target dataset. If not |
| dataset_names | Optional character vector with labels for each dataset in |
| ... | Additional arguments passed on to joinless(). |
Details
For each dataset in ys, the function:
- optionally restricts the variables in x via x_vars,
- optionally restricts the variables in that dataset via y_vars,
- calls joinless() with the provided settings,
- tags the output with a dataset name.
When y_vars is not NULL, the function intersects y_vars with the
column names of each dataset in ys. This means that:
- Variables listed in y_vars but missing in a given dataset are silently dropped for that dataset.
- If none of the variables in y_vars exist in a particular dataset, that dataset is skipped and a warning is emitted.
This behavior avoids producing "error_type" rows solely due to missing
columns in some of the target datasets.
If all datasets are skipped (e.g., because none contain the requested
y_vars), the function returns an empty data frame.
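The intersection rule described above can be sketched in base R as follows. This is an illustration only, not the package's code: the helper resolve_y_vars and the wording of its warning are hypothetical, and the sketch assumes ys is a named list.

```r
# Hypothetical helper illustrating the y_vars intersection rule;
# NOT the package's implementation. Assumes `ys` is a named list.
resolve_y_vars <- function(ys, y_vars) {
  out <- list()
  for (nm in names(ys)) {
    # Keep only the requested variables that exist in this dataset
    keep <- intersect(y_vars, names(ys[[nm]]))
    if (length(keep) == 0) {
      warning("Skipping dataset '", nm, "': none of y_vars are present")
      next
    }
    out[[nm]] <- keep
  }
  out  # an empty list if every dataset was skipped
}

ys <- list(
  a = data.frame(id = 3:7, value = 3:7),
  b = data.frame(id_alt = 1:5, value = 11:15),
  c = data.frame(other = 1:3)
)

# Emits a warning for dataset "c", which has neither "id" nor "id_alt"
resolved <- resolve_y_vars(ys, y_vars = c("id", "id_alt"))
```

Here dataset a keeps only "id", dataset b keeps only "id_alt", and dataset c is dropped with a warning, so no comparison is attempted on columns that do not exist.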
Value
A data frame that row-binds the result of joinless() for all
target datasets that were processed. It contains all columns returned by
joinless() plus an additional column:
- dataset: identifier of the target dataset (one per element of ys).
If all datasets are skipped, an empty data frame is returned.
Examples
df_base <- data.frame(id = 1:5, value = 1:5)
df_a <- data.frame(id = 3:7, value = 3:7)
df_b <- data.frame(id_alt = 1:5, value = 11:15)
# Compare the same key from df_base against multiple datasets
res <- joinless_multiple(
  x = df_base,
  ys = list(a = df_a, b = df_b),
  x_vars = "id",
  y_vars = c("id", "id_alt"),
  info = TRUE
)