Introduction to dtaudit

dtaudit provides diagnostic tools for data analysis workflows built on data.table. Unlike assertion-based packages that test whether data meets specific rules, dtaudit is aimed at interactive development: understanding what actually happens to your data during joins, filters, and transformations.

library(dtaudit)
library(data.table)

Example data

Suppose you have two tables: one with customer orders and another with product information. These are typical inputs to a join operation.

orders <- data.table(
  order_id   = 1:8,
  customer   = c("Alice", "Bob", "Alice", "Carol", "Bob", "Alice", "Dave", "Eve"),
  product_id = c(101L, 102L, 101L, 103L, 104L, 102L, 105L, 106L),
  amount     = c(50, 30, 50, 75, 20, 35, 60, 45)
)

products <- data.table(
  product_id = c(101L, 102L, 103L, 104L, 107L),
  category   = c("Electronics", "Books", "Clothing", "Books", "Food"),
  price      = c(25.0, 15.0, 37.5, 10.0, 8.0)
)

Step 1: Validate primary keys

Before joining, check that your join keys actually identify rows uniquely in each table.

validate_primary_keys(orders, "order_id")
#> 
#> ============== Primary Key Validation ==============
#> Table: orders
#> Key column(s): order_id
#> -----------------------------------------------------
#>   Total rows:              8
#>   Unique key combinations: 8
#>   Duplicate key combos:    0
#> -----------------------------------------------------
#> Result: YES - Keys uniquely identify all rows.
validate_primary_keys(products, "product_id")
#> 
#> ============== Primary Key Validation ==============
#> Table: products
#> Key column(s): product_id
#> -----------------------------------------------------
#>   Total rows:              5
#>   Unique key combinations: 5
#>   Duplicate key combos:    0
#> -----------------------------------------------------
#> Result: YES - Keys uniquely identify all rows.
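For comparison, the underlying condition is a one-liner: a key is a primary key exactly when the number of distinct key combinations equals the number of rows. Using data.table's uniqueN() on the tables defined above:

```r
library(data.table)

# A key uniquely identifies rows when distinct key combos == row count
uniqueN(orders, by = "order_id") == nrow(orders)        # TRUE: 8 == 8
uniqueN(products, by = "product_id") == nrow(products)  # TRUE: 5 == 5
```

validate_primary_keys() reports the same counts, plus the duplicate breakdown when the check fails.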

Step 2: Validate the join

Use validate_join() to understand the relationship between two tables before joining them. Rather than materializing a full outer join of the original tables, it summarizes the key relationship directly, which typically makes it lighter and faster than ad-hoc checks. The report covers the relationship type, match rates, and which keys are unmatched.

validate_join(orders, products, by = "product_id")
#> 
#> ============== Join Validation Summary ==============
#> Tables: orders <--> products
#> Keys in orders: product_id
#> Keys in products: product_id
#>   Relationship                               : many-to-one
#>   Key(s) in orders   [product_id]            : (1 col)
#>   Key(s) in products   [product_id]          : (1 col)
#>   Rows in orders                             : 8
#>   Distinct key combos in orders              : 6
#>   Rows in products                           : 5
#>   Distinct key combos in products            : 5
#>   Overlapping distinct key combos            : 4
#>   Matched row pairs (cartesian)              : 6
#>   Match rate from orders                     : 75.00%
#>   Match rate from products                   : 80.00%
#>   Rows only in orders (no match in products) : 2
#>   Rows only in products (no match in orders) : 1
#> -----------------------------------
#> Duplicates: orders=yes  products=no

The output shows a many-to-one relationship (multiple orders per product), with some product IDs appearing only in one table. This is expected, but now you know exactly what the merge will look like.
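The headline numbers are straightforward to reproduce by hand (using the orders and products tables from above), which is a useful sanity check the first time you read this report:

```r
# Which rows of each table have a partner on product_id?
ok <- orders$product_id %in% products$product_id
pk <- products$product_id %in% orders$product_id

length(intersect(unique(orders$product_id),
                 unique(products$product_id)))  # overlapping key combos: 4
mean(ok)  # match rate from orders:   6/8 = 0.75
mean(pk)  # match rate from products: 4/5 = 0.80
```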

You can also track numeric columns through the join with stat.x and stat.y (or stat when both tables share the same column name). This shows how much of a metric falls in matched vs. unmatched rows — often more revealing than row counts alone:

validate_join(orders, products, by = "product_id",
              stat.x = "amount", stat.y = "price")
#> 
#> ============== Join Validation Summary ==============
#> Tables: orders <--> products
#> Keys in orders: product_id
#> Keys in products: product_id
#>   Relationship                               : many-to-one
#>   Key(s) in orders   [product_id]            : (1 col)
#>   Key(s) in products   [product_id]          : (1 col)
#>   Rows in orders                             : 8
#>   Distinct key combos in orders              : 6
#>   Rows in products                           : 5
#>   Distinct key combos in products            : 5
#>   Overlapping distinct key combos            : 4
#>   Matched row pairs (cartesian)              : 6
#>   Match rate from orders                     : 75.00%
#>   Match rate from products                   : 80.00%
#>   Rows only in orders (no match in products) : 2
#>   Rows only in products (no match in orders) : 1
#> 
#>   --- Stat: amount (orders), price (products) ---
#>   Total amount in orders      : 365
#>   Matched amount in orders    : 260  (71.23%)
#>   Unmatched amount in orders  : 105  (28.77%)
#>   Total price in products     : 95.5
#>   Matched price in products   : 87.5  (91.62%)
#>   Unmatched price in products : 8  (8.38%)
#> -----------------------------------
#> Duplicates: orders=yes  products=no

Here 75% of order rows match, but only 71% of total order amount — the unmatched orders carry disproportionately high amounts. On the product side, 80% of rows match but 92% of price, meaning the unmatched product is relatively cheap. These insights help you decide whether unmatched keys are safe to ignore.

Step 3: Filter with diagnostics

After merging, you may want to filter your data. filter_keep() and filter_drop() work like standard subsetting but report how many rows (and optionally how much of a statistic) you’re removing.

merged <- merge(orders, products, by = "product_id", all.x = TRUE)

# Keep only Electronics and Books, report dropped amount
result <- filter_keep(merged, category %in% c("Electronics", "Books"), stat = amount)
#> filter_keep(merged, category %in% c("Electronics", "Books"))
#>   Dropped 3 of 8 rows (37.50%).
#>   Dropped 180 of 365 for amount (49.32%).
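filter_drop() is the complement. Assuming it mirrors filter_keep()'s interface (a condition plus an optional stat column), the call below should drop the same three rows by negating the condition. Note that !x %in% set is TRUE when x is NA, so the NA categories are removed either way:

```r
# Hypothetical mirror of the filter_keep() call above:
# drop everything that is NOT Electronics or Books
result2 <- filter_drop(merged,
                       !category %in% c("Electronics", "Books"),
                       stat = amount)
```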

Step 4: Compare before and after

compare_datatables() gives you a structural comparison between two tables — useful for checking what changed after a transformation.

compare_datatables(orders, result)
#> 1. Number of rows
#>    orders: 8 rows
#>    result: 5 rows
#>    Difference (orders - result): 3
#> 
#> 2. Column names
#>    Matching column names : 4 
#>    Only in orders: 0
#>    Only in result: 2 (category, price)
#>    Type mismatches  : 0 
#> 
#> 3. Key columns used for matching
#>    Key columns: order_id, customer, product_id (auto-detected)
#>    Distinct key combinations in orders: 8
#>    Distinct key combinations in result: 5
#>    Matching key combinations: 5
#>    Only in orders: 3
#>    Only in result: 0
#> 
#> 4. Numeric column discrepancies (absolute differences)
#>    Comparing numeric columns after merging on keys.
#>    Rows matched on keys: 5 
#>    column     n   min   q25 median   q75   max
#>    <char> <int> <num> <num>  <num> <num> <num>
#> 1: amount     5     0     0      0     0     0

Step 5: Check data quality

Missing values

# Introduce some NAs to demonstrate
merged_with_na <- copy(merged)
merged_with_na[sample(.N, 2), category := NA]

diagnose_nas(merged_with_na)
#> 2 of 6 columns have missing values
#>   variable                           n_na   pct_na
#>   category                              4    50.0%
#>   price                                 2    25.0%
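You can cross-check this report with a plain per-column NA count (the exact category figure depends on which rows the random sample() draw above happened to hit):

```r
# NA count per column, as a one-row data.table
merged_with_na[, lapply(.SD, function(x) sum(is.na(x)))]
```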

String quality

diagnose_strings() checks for common string issues: missing values, empty strings, whitespace problems, non-ASCII characters, and case inconsistencies.

diagnose_strings(orders$customer)
#> 
#> =============== String Column Diagnosis ===============
#> Variable: orders$customer
#> --------------------------------------------------------
#> Total elements:        8
#> --------------------------------------------------------
#> Missing & Empty:
#>   NA values:           0 (0.0%)
#>   Empty strings:       0 (0.0%)
#>   Whitespace-only:     0 (0.0%)
#> --------------------------------------------------------
#> Whitespace Issues:
#>   Leading whitespace:  0
#>   Trailing whitespace: 0
#> --------------------------------------------------------
#> Encoding:
#>   Non-ASCII chars:     0
#> --------------------------------------------------------
#> Case Inconsistencies:
#>   Variant groups:      0
#>   Total variants:      0

Date coverage

If your data should cover a continuous time range, check_date_coverage() identifies gaps.

dates <- as.IDate(c("2024-01-15", "2024-02-20", "2024-04-10", "2024-05-05"))
check_date_coverage(dates, "2024-01-01", "2024-06-30")
#> Checking dates between 2024-01-01 and 2024-06-30 (by month)
#> There are 2 month periods missing. These are:
#> Mar-2024, Jun-2024
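For monthly gaps, a base-R equivalent compares year-month labels via format() (using as.Date rather than data.table's as.IDate so the snippet is self-contained):

```r
dates <- as.Date(c("2024-01-15", "2024-02-20", "2024-04-10", "2024-05-05"))

# Every month the range 2024-01-01 .. 2024-06-30 should cover
expected <- format(seq(as.Date("2024-01-01"), as.Date("2024-06-30"),
                       by = "month"), "%Y-%m")
observed <- format(dates, "%Y-%m")

setdiff(expected, observed)  # the two missing months
#> [1] "2024-03" "2024-06"
```

check_date_coverage() generalizes this to other granularities and prints a readable summary instead of a bare character vector.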

Step 6: Summarize columns

get_summary_table() produces a one-row-per-column overview of your data, including types, missing counts, and descriptive statistics.

get_summary_table(orders)
#>      variable      type n_unique missing most_frequent   mean               sd
#>        <char>    <char>   <char>  <char>        <char> <char>           <char>
#> 1:   order_id   numeric        8       0          <NA>    4.5 2.44948974278318
#> 2:   customer character        5       0         Alice   <NA>             <NA>
#> 3: product_id   numeric        6       0          <NA>    103  1.8516401995451
#> 4:     amount   numeric        7       0          <NA> 45.625 17.4104853464802
#>       min    q25    q50    q75    max example1 example2 example3
#>    <char> <char> <char> <char> <char>   <char>   <char>   <char>
#> 1:      1   2.75    4.5   6.25      8        1        2        3
#> 2:  Alice   <NA>   <NA>   <NA>    Eve    Alice      Bob    Alice
#> 3:    101 101.75  102.5 104.25    106      101      102      101
#> 4:     20  33.75   47.5   52.5     75       50       30       50

Auditing cleaning operations

When you apply a cleaning function, audit_clean() reports exactly what changed:

firms <- c("Apple Inc.", "MICROSOFT CORP", "Alphabet LLC", "apple", NA)
audit_clean(firms, clean_firm_name)
#> 
#> =============== String Cleaning Audit ===============
#> Variable: firms
#> Function: clean_firm_name
#> -----------------------------------------------------
#> Total elements:  5
#>   NA values:     1
#>   Changed:       4 (100.0% of non-NA)
#>   Unchanged:     1
#> -----------------------------------------------------
#> Examples of changes (showing 4 of 4):
#>          before     after
#>          <char>    <char>
#>      Apple Inc.     APPLE
#>  MICROSOFT CORP MICROSOFT
#>    Alphabet LLC  ALPHABET
#>           apple     APPLE
#> 
#> Access cleaned vector with: result$cleaned

Further reading

See the function reference for complete documentation: