Getting started with dqcheckrGUI

dqcheckrGUI is a point-and-click Shiny interface for dqcheckr. It lets you configure dataset quality checks, run them against incoming file deliveries, and browse historical results — without writing any R code.

The app runs entirely on your local machine. No internet connection, no server, and no cloud services are required.

Prerequisites

Install dqcheckr before installing dqcheckrGUI:

install.packages("dqcheckr")
# install_github() comes from devtools; run install.packages("devtools") first if needed
devtools::install_github("mickmioduszewski/dqcheckrGUI")

Launching the app

dqcheckrGUI::run_app()

The app opens in your default browser at http://localhost:4321. It reads and writes files relative to the working directory you launched from, so set that to your project folder before calling run_app(). The terminal window must remain open while the app is running — closing it stops the app.

Alternatively, double-click launch.command (macOS), launch.sh (Linux), or launch.bat (Windows) from the app directory.

First-run setup

On first launch, if no dqcheckr.yml global config file exists in the working directory, the app opens directly to Global Config and prompts you to set two paths before doing anything else: the snapshot database path and the report output directory.

Click Save global config once these are set, then proceed to configure your first dataset.

Example configuration files

Example YAML configuration files using the Star Wars dataset are included with the package. Copy them to get started quickly:

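# file.copy() only copies a folder recursively into an existing directory,
# so create the target folder first
dir.create("config", showWarnings = FALSE)
file.copy(
  system.file("extdata/example_config", package = "dqcheckrGUI"),
  "config",
  recursive = TRUE
)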

The interface

The app has a fixed sidebar on the left and a main panel that changes based on your selection.

┌──────────────────┬────────────────────────────────┐
│  SIDEBAR         │  MAIN PANEL                    │
│                  │                                │
│  Datasets        │  [content changes here]        │
│    customers ✓   │                                │
│    suppliers ⚠   │                                │
│    [+ New]       │                                │
│                  │                                │
│  ▶ Run           │                                │
│  ⏱ History       │                                │
│  ⚙ Global Config │                                │
└──────────────────┴────────────────────────────────┘

Status badges

Status badges appear in the sidebar, dataset panel, and history table:

Badge        Meaning
✓ PASS       All checks passed
⚠ WARN       One or more warnings; no failures
✗ FAIL       One or more checks failed
● RUNNING    A check is currently in progress
— not run    No runs recorded yet
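
As a rough illustration of how the badge meanings relate to check counts, the following R sketch derives a badge label from the number of failures and warnings. The function `status_badge()` and its arguments are hypothetical and not part of dqcheckr or dqcheckrGUI; the sketch only mirrors the table.

```r
# Hypothetical helper: map run counts to the badge labels in the table.
# Not part of dqcheckr; the package may derive the status differently.
status_badge <- function(n_fail, n_warn, running = FALSE, has_run = TRUE) {
  if (running) return("RUNNING")     # a check is in progress
  if (!has_run) return("not run")    # no runs recorded yet
  if (n_fail > 0) return("FAIL")     # any failure wins over warnings
  if (n_warn > 0) return("WARN")     # warnings, but no failures
  "PASS"
}

status_badge(n_fail = 0, n_warn = 2)
#> [1] "WARN"
```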

Dataset panel

Clicking a dataset name in the sidebar opens its Dataset panel in the main area:

customer_accounts
──────────────────────────────────────────
Format: CSV   Location: data/incoming/
Config: config/customer_accounts.yml

[Edit config]          [Run check ▶]

Recent runs (last 5):
  2026-05-30   20260530.csv   ✓ PASS   0 failures
  2026-05-23   20260523.csv   ⚠ WARN   0 failures
  2026-05-16   20260516.csv   ✓ PASS   0 failures

[View all in History →]

Compare drift:  ☐ run 1  ☐ run 2   [Compare ▶]

Setting up a dataset — the 8-step wizard

Click + New dataset in the sidebar to open the configuration wizard. Navigate between steps using the Back and Next buttons, or click the numbered breadcrumb bar at the top. All values are preserved as you move back and forth. If you navigate away from the wizard with unsaved changes, the app will warn you before discarding them.

Step 1 — Dataset identity

Enter a short machine-readable name for the dataset. Names must start with a letter and contain only letters, numbers, and underscores (e.g. customer_accounts). This name is used as the config filename and passed directly to dqcheckr::run_dq_check().

An optional free-text description can also be added.
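
If you want to pre-check candidate names outside the app, the naming rule in this step can be expressed as a regular expression. This is a minimal sketch: `is_valid_dataset_name()` is a hypothetical helper, and it assumes "letters" means ASCII letters, which the app may or may not enforce.

```r
# Hypothetical validator for the naming rule described in this step.
is_valid_dataset_name <- function(name) {
  # ^[A-Za-z]        first character must be a letter
  # [A-Za-z0-9_]*$   then only letters, digits, and underscores
  grepl("^[A-Za-z][A-Za-z0-9_]*$", name)
}

is_valid_dataset_name(c("customer_accounts", "2024_sales", "my-data"))
#> [1]  TRUE FALSE FALSE
```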

Step 2 — File location

Choose how files are identified for each run:

Folder scan (recommended) — point to a folder; the app picks the two most recently modified files automatically on each run. Use this for delivery processes that drop files into a fixed directory.

A Preview most recent file button shows the names and sizes of the two most recent files in the selected folder: “Current: 20260530.csv (2.4 MB) | Previous: 20260523.csv (2.3 MB)”.

Explicit file paths — list a current file and optionally a previous file by path. Useful when files are versioned by name. If no previous file is given, comparison checks (CP series) are skipped.
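
To see which files a folder scan would be likely to pick, you can approximate the "two most recently modified files" rule in base R. This is a sketch under stated assumptions: `two_most_recent()` is a hypothetical helper, and the app's own selection logic (for example, how it treats sub-folders or ties) may differ.

```r
# Hypothetical sketch: pick the two most recently modified files in a folder.
two_most_recent <- function(dir) {
  none <- list(current = NA_character_, previous = NA_character_)
  paths <- list.files(dir, full.names = TRUE)
  if (length(paths) == 0) return(none)

  info <- file.info(paths)                      # row names are the file paths
  info <- info[!info$isdir, , drop = FALSE]     # ignore sub-folders
  info <- info[order(info$mtime, decreasing = TRUE), , drop = FALSE]

  list(
    current  = if (nrow(info) >= 1) rownames(info)[1] else NA_character_,
    previous = if (nrow(info) >= 2) rownames(info)[2] else NA_character_
  )
}
```

If `previous` comes back as `NA`, only one file is available, which corresponds to the case where comparison checks are skipped.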

Step 3 — Format and structure

This step identifies the file’s format and column layout. The top of the screen shows a raw text preview of the first 50 lines of the file — the full file is never loaded.

Auto-detection: when a file is loaded, the app uses readr to detect the delimiter, encoding, quote character, and whether the first row is a header. The results appear in an editable confirmation panel. You always make the final call; nothing is silently committed.
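
To give a feel for what delimiter detection involves, here is a deliberately simple base-R sketch. It is not the detection that readr or the app performs: it ignores quoting and encoding, and `guess_delimiter()` is a hypothetical name used only for this illustration.

```r
# Illustrative only: guess a delimiter from the first lines of a file.
# This is NOT how readr or dqcheckrGUI detect delimiters.
guess_delimiter <- function(lines, candidates = c(",", "\t", ";", "|")) {
  counts <- vapply(candidates, function(d) {
    per_line <- lengths(regmatches(lines, gregexpr(d, lines, fixed = TRUE)))
    # A plausible delimiter occurs the same number of times on every line
    if (length(unique(per_line)) == 1) per_line[1] else 0L
  }, integer(1))
  if (all(counts == 0)) return(NA_character_)
  candidates[which.max(counts)]
}

guess_delimiter(c("id;name;city", "1;Ann;Oslo", "2;Bob;Rome"))
#> [1] ";"
```

Because a sketch like this can be fooled by delimiters inside quoted fields, an editable confirmation panel is a sensible safeguard.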

CSV files

Confirm or adjust:

  • Format (CSV or Fixed-Width)
  • Delimiter — comma, tab, semicolon, pipe, space, colon, or custom
  • Encoding — detected candidates are shown with confidence percentages
  • Quote character
  • Header row — whether the first row contains column names

A parsed preview updates live as you change these settings. If there is no header row, a column-naming panel appears alongside the preview where you can enter names for each column. Names must be valid R identifiers; a suggestion is offered if you enter something invalid.

Fixed-width (FWF) files

When FWF is selected, a visual ruler activates above the text preview.

  • Click anywhere in the text to place a column boundary at that character position.
  • Drag a boundary line left or right to adjust it — it snaps to character boundaries.
  • Double-click a boundary line to remove it.
  • An Auto-detect button attempts automatic boundary detection using readr::fwf_empty().

Below the ruler, a table shows the resulting column definitions (start position, width, name, type). Edit column names and types here. A validation badge confirms whether the column widths account for the full record length:

  • ✅ green — widths match the line length exactly
  • ⚠ amber — widths cover fewer characters than the line length (trailing columns unaccounted for)
  • ✗ red — widths exceed the line length (Next is disabled until fixed)
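
The badge logic amounts to comparing the sum of the column widths with the record length. A minimal sketch, assuming a single fixed line length; `fwf_width_status()` is a hypothetical helper, not a function exported by the app:

```r
# Hypothetical sketch of the width validation behind the badge.
fwf_width_status <- function(widths, line_length) {
  total <- sum(widths)
  if (total == line_length) {
    "green"   # widths match the line length exactly
  } else if (total < line_length) {
    "amber"   # trailing characters are not covered by any column
  } else {
    "red"     # widths exceed the line length
  }
}

fwf_width_status(c(10, 20, 8), line_length = 40)
#> [1] "amber"
```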

If the file has header rows to skip before the data begins, set Header rows to skip. Column names are pre-populated from the skipped header row if one is present.

Step 4 — Column classification

Review every detected column. For each column you can:

Step 5 — Column rules

Each column is shown as a collapsible card. Expand a column to add optional per-column validation rules.

Standard rules (always visible):

Rule             Applies to           Check
Allowed values   character columns    QC-09: flag any value not in the list
Min value        numeric columns      QC-10: flag values below the minimum
Max value        numeric columns      QC-10: flag values above the maximum

Advanced rules (click Advanced ▼ to reveal):

Rule                      Description
Regex pattern             Flag values that do not match the pattern (QC-13). Click Test against sample to verify the pattern against the actual file before saving.
Max missing rate          Override the dataset-level missing rate threshold for this column only
Max non-numeric rate      Override the dataset-level non-numeric rate threshold for this column (numeric columns only)
Max missing rate change   Override the comparison threshold for this column
Max mean shift            Override the mean shift threshold for this column

A regex syntax error disables Next; a pattern that fails against sample values shows a warning but does not block you from proceeding.
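
The distinction between a pattern that cannot be compiled and a valid pattern that some sample values fail can be sketched in base R as follows. `check_regex_rule()` is a hypothetical helper, and the sketch assumes R's default (extended) regex flavour; the flavour used by dqcheckr for QC-13 is not stated here.

```r
# Hypothetical sketch: separate "invalid pattern" from
# "valid pattern, but some sample values do not match".
check_regex_rule <- function(pattern, sample_values) {
  compiles <- tryCatch({
    suppressWarnings(grepl(pattern, ""))   # errors if the pattern is invalid
    TRUE
  }, error = function(e) FALSE)

  if (!compiles) {
    return(list(status = "syntax_error", non_matching = character(0)))
  }

  sample_values <- sample_values[!is.na(sample_values)]
  bad <- sample_values[!grepl(pattern, sample_values)]
  list(status = if (length(bad) > 0) "warning" else "ok", non_matching = bad)
}

check_regex_rule("^[A-Z]{2}[0-9]+$", c("AB12", "xy9"))$status
#> [1] "warning"
```

A `"syntax_error"` result corresponds to the blocking case, while `"warning"` corresponds to the case where you can still proceed.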

Step 6 — Rule overrides

Override the global default thresholds for this dataset only. Each field is pre-filled with the current global default. Only fields whose value differs from the global default are written to the dataset config; fields left at their default are omitted so that a later change to the global config is automatically inherited.

Threshold                  Default   What it controls
Max missing rate           0.05      Flag a column if > 5 % of values are blank
Max non-numeric rate       0.01      Flag a numeric column if > 1 % of values cannot be parsed
Min row count              0 (off)   Fail if the delivery has fewer rows than this
Max row count change       10 %      Warn if row count changes by > 10 % vs previous delivery
Max mean shift             20 %      Warn if a numeric column mean shifts by > 20 %
Max missing change         2 pp      Warn if missing rate changes by > 2 percentage points
Max non-numeric change     1 pp      Warn if non-numeric rate changes by > 1 percentage point
Type inference threshold   0.90      A column is typed numeric if ≥ 90 % of values parse as numbers
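
To make a few of these thresholds concrete, the sketch below computes the corresponding rates for a small sample column and compares them with the defaults. The calculations are illustrative only. Two assumptions are made that dqcheckr may not share: "blank" means NA or a whitespace-only string, and the non-numeric rate is computed over non-missing values.

```r
# Illustrative calculations only; dqcheckr's own definitions may differ.
values <- c("12", "7.5", "", "n/a", "3", NA, "40", "8", "19", "2")

# Assumption: "blank" means NA or an empty/whitespace-only string
missing <- is.na(values) | trimws(values) == ""
missing_rate <- mean(missing)                       # 2 of 10 values

# Assumption: the non-numeric rate is computed over non-missing values
parsed <- suppressWarnings(as.numeric(values[!missing]))
non_numeric_rate <- mean(is.na(parsed))             # 1 of 8 values ("n/a")

# Relative row count change versus the previous delivery (1000 -> 1080 rows)
row_count_change <- abs(1080 - 1000) / 1000

c(
  missing_flag     = missing_rate > 0.05,           # default max missing rate
  non_numeric_flag = non_numeric_rate > 0.01,       # default max non-numeric rate
  row_count_warn   = row_count_change > 0.10,       # default max row count change
  typed_numeric    = mean(!is.na(parsed)) >= 0.90   # type inference threshold
)
```

With these sample values both rate thresholds are exceeded, the 8 % row count change stays under the 10 % default, and the column falls short of the 0.90 type inference threshold.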

Schema change flags control whether warnings are raised when columns are added, dropped, change type, or change order between deliveries.
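
The "write only what differs" rule described at the start of this step can be sketched as a comparison between the values shown in the form and the global defaults. The key names below (`max_missing_rate` and so on) and the helper `overrides_to_write()` are hypothetical; the actual YAML keys are documented in the dqcheckr vignette.

```r
# Hypothetical sketch: keep only thresholds that differ from the global defaults.
# Key names are illustrative, not dqcheckr's actual config keys.
overrides_to_write <- function(dataset_values, global_defaults) {
  differs <- vapply(names(dataset_values), function(key) {
    !isTRUE(all.equal(dataset_values[[key]], global_defaults[[key]]))
  }, logical(1))
  dataset_values[differs]
}

global_defaults <- list(max_missing_rate = 0.05,
                        max_non_numeric_rate = 0.01,
                        min_row_count = 0)
dataset_values  <- list(max_missing_rate = 0.10,   # changed by the user
                        max_non_numeric_rate = 0.01,
                        min_row_count = 0)

names(overrides_to_write(dataset_values, global_defaults))
#> [1] "max_missing_rate"
```

Because the unchanged fields are not written, raising a global default later would take effect for this dataset automatically.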

Step 7 — Custom checks (optional)

Point to a plain R file that defines a custom_checks(df) function. The app validates the file immediately: it checks that the file exists, parses without syntax errors, and defines the expected function. A green badge confirms a valid file; a red badge shows the specific problem.
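
You can run a similar three-part validation yourself before pointing the wizard at a file. The sketch below is an approximation, not the app's code: `validate_custom_checks_file()` is a hypothetical helper, and it evaluates the file in a fresh environment, which means any top-level code in the file is executed.

```r
# Hypothetical sketch of the three validation steps: exists, parses, defines function.
validate_custom_checks_file <- function(path) {
  if (!file.exists(path)) return("file does not exist")

  parsed <- tryCatch(parse(file = path), error = function(e) e)
  if (inherits(parsed, "error")) {
    return(paste("syntax error:", conditionMessage(parsed)))
  }

  # Evaluate in a separate environment (this runs top-level code in the file)
  env <- new.env()
  loaded <- tryCatch(sys.source(path, envir = env), error = function(e) e)
  if (inherits(loaded, "error")) {
    return(paste("error while loading:", conditionMessage(loaded)))
  }

  if (!is.function(env$custom_checks)) return("custom_checks() is not defined")
  "ok"
}
```

A return value of `"ok"` corresponds to the green badge; any other string describes the specific problem.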

Leave this field blank to skip custom checks. See vignette("dqcheckr", package = "dqcheckr") for the custom checks function signature and return value.

Step 8 — Review and save

A summary of all settings is shown alongside a YAML preview of the configuration that will be written. Click Save config ✓ to write the file. A success notification confirms the path. The app then navigates to the dataset panel for the saved dataset.

For analysts who hand-edit YAML: the app preserves any keys you have added to the YAML file outside the wizard. On the next edit, those keys appear in the Step 8 preview under # preserved from original file and are written back unchanged. The wizard never silently drops hand-added config keys.
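
Conceptually, preserving hand-added keys is a merge between the wizard's output and whatever the wizard does not manage. The sketch below works on plain R lists and top-level keys only; the key names, the `wizard_keys` argument, and `merge_preserving_extras()` are all hypothetical, and the app's handling of nested keys may be more involved.

```r
# Hypothetical sketch: merge wizard output with keys the wizard does not manage.
merge_preserving_extras <- function(original, wizard_output, wizard_keys) {
  extras <- original[setdiff(names(original), wizard_keys)]
  c(wizard_output, extras)   # hand-added keys are appended unchanged
}

original      <- list(name = "customer_accounts", format = "csv", owner = "data-team")
wizard_output <- list(name = "customer_accounts", format = "fwf")

merged <- merge_preserving_extras(original, wizard_output,
                                  wizard_keys = c("name", "format"))
names(merged)   # name, format, owner
```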


Running a quality check

Click ▶ Run in the sidebar (or Run check ▶ from a dataset panel).

  1. Select a dataset from the dropdown.
  2. The app validates the config before enabling the run button — it checks that the config file is readable, the data folder or files exist, and the output paths are writable. Any problem is shown as a red badge with the specific issue.
  3. Click ▶ Run check.
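
The pre-run validation in step 2 can be approximated with base R file checks, which is handy when diagnosing a red badge from the console. This is a sketch: `pre_run_problems()` and its argument names are hypothetical, and the app may check more than is shown here.

```r
# Hypothetical pre-run checks; argument names are illustrative.
pre_run_problems <- function(config_file, data_dir, report_dir) {
  problems <- character(0)
  # file.access() returns 0 on success; mode 4 = read, mode 2 = write
  if (file.access(config_file, mode = 4) != 0) {
    problems <- c(problems, "config file is not readable")
  }
  if (!dir.exists(data_dir)) {
    problems <- c(problems, "data folder does not exist")
  }
  if (file.access(report_dir, mode = 2) != 0) {
    problems <- c(problems, "report directory is not writable")
  }
  problems   # an empty vector means no problems were found
}
```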

The check runs in a background process so the UI stays responsive. Progress is streamed to the log area in real time. When complete:

Status: ✓ PASS    [Open report ↗]   [View log]

  0 failures   0 warnings   22 passed
  Report: reports/customer_accounts_20260531_143022.html

Browsing history

Click ⏱ History in the sidebar to see all past runs across all datasets.

The table can be filtered by dataset name, status, or date using the filter row at the top of each column. Click Load more to page through older runs.

To open a report: click the Open link in the Report column to open that run’s HTML report in a new browser tab.

To compare two deliveries:

  1. Tick the checkbox next to exactly two runs from the same dataset.
  2. Click Compare drift ▶.

The drift report opens in a new tab, showing column-by-column changes between the two snapshots. The button is disabled with a tooltip if the selected rows are from different datasets.
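
The selection rule for enabling the button is simple enough to state in one line of R. `can_compare()` is a hypothetical helper that takes the dataset names of the ticked rows:

```r
# Hypothetical sketch of the selection rule for the Compare drift button.
can_compare <- function(selected_datasets) {
  length(selected_datasets) == 2 && length(unique(selected_datasets)) == 1
}

can_compare(c("customers", "customers"))   # TRUE
can_compare(c("customers", "suppliers"))   # FALSE: different datasets
can_compare("customers")                   # FALSE: only one run ticked
```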


Global configuration

Click ⚙ Global Config to set defaults that apply to all datasets unless overridden at the dataset level (Step 6).

Infrastructure paths — set the snapshot database path and report output directory. The snapshot database is created automatically on first run if it does not exist. If either path’s parent directory does not exist, a red validation message is shown — create the directory on disk first, then save.

Default rule thresholds — the same thresholds described in Step 6 above, applied globally. Dataset-level overrides take precedence.

Click Save global config to write changes.

Team use via a shared network drive

Multiple users can share the same configuration and run history by pointing their config_dir and snapshot database paths at a shared network folder. Each user runs the app locally; the shared drive holds the YAML configs and the SQLite snapshot database. Set these paths in Global Config on each user’s machine. SQLite handles concurrent reads safely; two simultaneous writes (unlikely in practice) are serialised by file locking, provided the network file system implements file locks correctly.


Learn more

See vignette("dqcheckr", package = "dqcheckr") for a full description of every quality check (QC-01 to QC-14, SC-01/02, CP-01 to CP-08), per-column configuration options, custom checks, snapshot database schema, and the compare_snapshots() function used for drift reports.