dtaudit provides diagnostic tools for data analysis
workflows built on data.table. Unlike assertion-based
packages that test whether data meets specific rules, dtaudit focuses on
interactive development and understanding what
happens to your data during joins, filters, and
transformations.
Suppose you have two tables: one with customer orders and another with product information. These are typical inputs to a join operation.
orders <- data.table(
order_id = 1:8,
customer = c("Alice", "Bob", "Alice", "Carol", "Bob", "Alice", "Dave", "Eve"),
product_id = c(101L, 102L, 101L, 103L, 104L, 102L, 105L, 106L),
amount = c(50, 30, 50, 75, 20, 35, 60, 45)
)
products <- data.table(
product_id = c(101L, 102L, 103L, 104L, 107L),
category = c("Electronics", "Books", "Clothing", "Books", "Food"),
price = c(25.0, 15.0, 37.5, 10.0, 8.0)
)Before joining, check that your join keys actually identify rows uniquely in each table.
validate_primary_keys(orders, "order_id")
#>
#> ============== Primary Key Validation ==============
#> Table: orders
#> Key column(s): order_id
#> -----------------------------------------------------
#> Total rows: 8
#> Unique key combinations: 8
#> Duplicate key combos: 0
#> -----------------------------------------------------
#> Result: YES - Keys uniquely identify all rows.
validate_primary_keys(products, "product_id")
#>
#> ============== Primary Key Validation ==============
#> Table: products
#> Key column(s): product_id
#> -----------------------------------------------------
#> Total rows: 5
#> Unique key combinations: 5
#> Duplicate key combos: 0
#> -----------------------------------------------------
#> Result: YES - Keys uniquely identify all rows.Use validate_join() to understand the relationship
between the two tables before joining. This function
doesn’t run the full outer join on the original tables, making it
typically lighter and faster than ad-hoc solutions. This tells you the
join type, match rates, and which keys are unmatched.
validate_join(orders, products, by = "product_id")
#>
#> ============== Join Validation Summary ==============
#> Tables: orders <--> products
#> Keys in orders: product_id
#> Keys in products: product_id
#> Relationship : many-to-one
#> Key(s) in orders [product_id] : (1 col)
#> Key(s) in products [product_id] : (1 col)
#> Rows in orders : 8
#> Distinct key combos in orders : 6
#> Rows in products : 5
#> Distinct key combos in products : 5
#> Overlapping distinct key combos : 4
#> Matched row pairs (cartesian) : 6
#> Match rate from orders : 75.00%
#> Match rate from products : 80.00%
#> Rows only in orders (no match in products) : 2
#> Rows only in products (no match in orders) : 1
#> -----------------------------------
#> Duplicates: orders=yes products=noThe output shows a many-to-one relationship (multiple orders per product), with some product IDs appearing only in one table. This is expected, but now you know exactly what the merge will look like.
You can also track numeric columns through the join with
stat.x and stat.y (or stat when
both tables share the same column name). This shows how much of a metric
falls in matched vs. unmatched rows — often more revealing than row
counts alone:
validate_join(orders, products, by = "product_id",
stat.x = "amount", stat.y = "price")
#>
#> ============== Join Validation Summary ==============
#> Tables: orders <--> products
#> Keys in orders: product_id
#> Keys in products: product_id
#> Relationship : many-to-one
#> Key(s) in orders [product_id] : (1 col)
#> Key(s) in products [product_id] : (1 col)
#> Rows in orders : 8
#> Distinct key combos in orders : 6
#> Rows in products : 5
#> Distinct key combos in products : 5
#> Overlapping distinct key combos : 4
#> Matched row pairs (cartesian) : 6
#> Match rate from orders : 75.00%
#> Match rate from products : 80.00%
#> Rows only in orders (no match in products) : 2
#> Rows only in products (no match in orders) : 1
#>
#> --- Stat: amount (orders), price (products) ---
#> Total amount in orders : 365
#> Matched amount in orders : 260 (71.23%)
#> Unmatched amount in orders : 105 (28.77%)
#> Total price in products : 95.5
#> Matched price in products : 87.5 (91.62%)
#> Unmatched price in products : 8 (8.38%)
#> -----------------------------------
#> Duplicates: orders=yes products=noHere 75% of order rows match, but only 71% of total order
amount — the unmatched orders carry disproportionately high
amounts. On the product side, 80% of rows match but 92% of
price, meaning the unmatched product is relatively cheap.
These insights help you decide whether unmatched keys are safe to
ignore.
After merging, you may want to filter your data.
filter_keep() and filter_drop() work like
standard subsetting but report how many rows (and optionally how much of
a statistic) you’re removing.
merged <- merge(orders, products, by = "product_id", all.x = TRUE)
# Keep only Electronics and Books, report dropped amount
result <- filter_keep(merged, category %in% c("Electronics", "Books"), stat = amount)
#> filter_keep(merged, category %in% c("Electronics", "Books"))
#> Dropped 3 of 8 rows (37.50%).
#> Dropped 180 of 365 for amount (49.32%).compare_datatables() gives you a structural comparison
between two tables — useful for checking what changed after a
transformation.
compare_datatables(orders, result)
#> 1. Number of rows
#> orders: 8 rows
#> result: 5 rows
#> Difference (orders - result): 3
#>
#> 2. Column names
#> Matching column names : 4
#> Only in orders: 0
#> Only in result: 2 (category, price)
#> Type mismatches : 0
#>
#> 3. Key columns used for matching
#> Key columns: order_id, customer, product_id (auto-detected)
#> Distinct key combinations in orders: 8
#> Distinct key combinations in result: 5
#> Matching key combinations: 5
#> Only in orders: 3
#> Only in result: 0
#>
#> 4. Numeric column discrepancies (absolute differences)
#> Comparing numeric columns after merging on keys.
#> Rows matched on keys: 5
#> column n min q25 median q75 max
#> <char> <int> <num> <num> <num> <num> <num>
#> 1: amount 5 0 0 0 0 0diagnose_strings() checks for common string issues:
missing values, empty strings, whitespace problems, non-ASCII
characters, and case inconsistencies.
diagnose_strings(orders$customer)
#>
#> =============== String Column Diagnosis ===============
#> Variable: orders$customer
#> --------------------------------------------------------
#> Total elements: 8
#> --------------------------------------------------------
#> Missing & Empty:
#> NA values: 0 (0.0%)
#> Empty strings: 0 (0.0%)
#> Whitespace-only: 0 (0.0%)
#> --------------------------------------------------------
#> Whitespace Issues:
#> Leading whitespace: 0
#> Trailing whitespace: 0
#> --------------------------------------------------------
#> Encoding:
#> Non-ASCII chars: 0
#> --------------------------------------------------------
#> Case Inconsistencies:
#> Variant groups: 0
#> Total variants: 0If your data should cover a continuous time range,
check_date_coverage() identifies gaps.
get_summary_table() produces a one-row-per-column
overview of your data, including types, missing counts, and descriptive
statistics.
get_summary_table(orders)
#> variable type n_unique missing most_frequent mean sd
#> <char> <char> <char> <char> <char> <char> <char>
#> 1: order_id numeric 8 0 <NA> 4.5 2.44948974278318
#> 2: customer character 5 0 Alice <NA> <NA>
#> 3: product_id numeric 6 0 <NA> 103 1.8516401995451
#> 4: amount numeric 7 0 <NA> 45.625 17.4104853464802
#> min q25 q50 q75 max example1 example2 example3
#> <char> <char> <char> <char> <char> <char> <char> <char>
#> 1: 1 2.75 4.5 6.25 8 1 2 3
#> 2: Alice <NA> <NA> <NA> Eve Alice Bob Alice
#> 3: 101 101.75 102.5 104.25 106 101 102 101
#> 4: 20 33.75 47.5 52.5 75 50 30 50When you apply a cleaning function, audit_clean()
reports exactly what changed:
firms <- c("Apple Inc.", "MICROSOFT CORP", "Alphabet LLC", "apple", NA)
audit_clean(firms, clean_firm_name)
#>
#> =============== String Cleaning Audit ===============
#> Variable: firms
#> Function: clean_firm_name
#> -----------------------------------------------------
#> Total elements: 5
#> NA values: 1
#> Changed: 4 (100.0% of non-NA)
#> Unchanged: 1
#> -----------------------------------------------------
#> Examples of changes (showing 4 of 4):
#> before after
#> <char> <char>
#> Apple Inc. APPLE
#> MICROSOFT CORP MICROSOFT
#> Alphabet LLC ALPHABET
#> apple APPLE
#>
#> Access cleaned vector with: result$cleanedSee the function reference for complete documentation:
validate_join(), validate_primary_keys(),
validate_var_relationship()compare_datatables()filter_keep(),
filter_drop()diagnose_nas(),
diagnose_strings(), audit_clean(),
get_summary_table(), summarize_vector(),
check_date_coverage()clean_var_names(),
clean_firm_name()embed_into_cartesian()