---
title: "Auditing scripts and scoring risk"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Auditing scripts and scoring risk}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment  = "#>"
)
library(reproducr)
```

This vignette covers Tier 1 of the `reproducr` workflow in depth:
`audit_script()` and `risk_score()`. If you are new to the package, read
the [getting started](getting-started.html) vignette first.

## How `audit_script()` detects calls

`audit_script()` reads your R source files line by line and extracts every
*qualified* function call — one that uses the `::` or `:::` namespace operator.

```r
dplyr::filter(df, x > 0)      # detected  — pkg = "dplyr", fn = "filter"
filter(df, x > 0)              # not detected — package ambiguous
dplyr:::internal_fn()          # detected  — pkg = "dplyr", fn = "internal_fn"
```

Unqualified calls are intentionally ignored. From source text alone,
`audit_script()` cannot tell which package a bare `filter()` call belongs to
(it could be `dplyr::filter()` or `stats::filter()`), and guessing would
produce false positives. Explicit namespacing (`pkg::fn`) is itself a
reproducibility best practice, and `audit_script()` rewards it.
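
To make the detection rule concrete, here is a minimal sketch of how qualified calls could be pulled out of source lines with a regular expression. `extract_qualified_calls()` is a hypothetical helper written for this illustration; it is an assumption about the general approach, not the parser that `audit_script()` actually uses.

```r
# Hypothetical helper (not part of reproducr): find pkg::fn and pkg:::fn.
extract_qualified_calls <- function(lines) {
  pattern <- "([A-Za-z][A-Za-z0-9.]*)(:::?)([A-Za-z.][A-Za-z0-9._]*)"
  hits  <- regmatches(lines, gregexpr(pattern, lines, perl = TRUE))
  found <- unlist(hits)
  if (length(found) == 0L) {
    return(data.frame(line = integer(), pkg = character(), fn = character(),
                      stringsAsFactors = FALSE))
  }
  # regexec() returns the full match followed by the three capture groups
  parts <- regmatches(found, regexec(pattern, found, perl = TRUE))
  data.frame(
    line = rep(seq_along(lines), lengths(hits)),   # source line of each hit
    pkg  = vapply(parts, `[`, character(1), 2L),
    fn   = vapply(parts, `[`, character(1), 4L),
    stringsAsFactors = FALSE
  )
}

calls <- extract_qualified_calls(c(
  "dplyr::filter(df, x > 0)",
  "filter(df, x > 0)",
  "dplyr:::internal_fn()"
))
calls
```

The bare `filter()` on the second line produces no row, which mirrors the behaviour described above.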

### What gets skipped

The parser skips two things:

**Pure comment lines** — any line whose first non-whitespace character is `#`:

```r
# dplyr::filter(df, x > 0)    ← skipped entirely
  # also skipped
x <- dplyr::filter(df, x > 0)  # ← detected
```

**Trailing inline comments** — the part of a line after ` #`:

```r
x <- 1  # dplyr::not_this()   ← "not_this" is NOT detected
```
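
The two skipping rules can be sketched in a few lines of base R. `strip_comments()` is a hypothetical helper, and the sketch assumes a simple text-based rule; it would also truncate a `#` that appears inside a string literal, which the real parser may or may not handle.

```r
# Hypothetical helper (not part of reproducr): apply both skipping rules.
strip_comments <- function(lines) {
  # Rule 1: pure comment lines are blanked, so line numbers stay aligned
  is_comment <- grepl("^[[:space:]]*#", lines)
  lines[is_comment] <- ""
  # Rule 2: drop everything from the first whitespace-then-# onwards
  sub("[[:space:]]+#.*$", "", lines)
}

cleaned <- strip_comments(c(
  "# dplyr::filter(df, x > 0)",
  "  # also skipped",
  "x <- dplyr::filter(df, x > 0)",
  "x <- 1  # dplyr::not_this()"
))
cleaned
```

Only the third line still contains a qualified call after cleaning.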

### Single-file vs directory scan

```{r single-vs-dir, eval = FALSE}
# Single file
report <- audit_script("analysis.R")

# All scripts in a directory (recursive)
report <- audit_script("R/")

# Whole project
report <- audit_script(".")
```

When scanning a directory, `reproducr` automatically excludes `renv/`,
`packrat/`, `node_modules/`, and hidden directories (those starting with `.`)
so library source files do not pollute your results.
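
As a rough illustration of the exclusion rule, the sketch below lists `.R` files under a directory and drops any file that sits inside an excluded or hidden directory. `list_project_scripts()` is a hypothetical helper, and restricting the scan to `.R` files is an assumption made for this example.

```r
# Hypothetical helper (not part of reproducr): list scripts, skipping
# vendored library directories and hidden directories.
list_project_scripts <- function(dir,
                                 exclude = c("renv", "packrat", "node_modules")) {
  files <- list.files(dir, pattern = "\\.[Rr]$",
                      recursive = TRUE, all.files = TRUE)
  keep <- vapply(strsplit(files, "/", fixed = TRUE), function(parts) {
    dirs <- parts[-length(parts)]          # directory components only
    !any(dirs %in% exclude | startsWith(dirs, "."))
  }, logical(1))
  files[keep]
}

# Build a throwaway project layout to try it on
root <- file.path(tempdir(), "reproducr-sketch")
for (d in c("R", "renv/library", ".git")) {
  dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
}
invisible(file.create(file.path(
  root, c("main.R", "R/clean.R", "renv/library/pkg.R", ".git/hook.R")
)))

scripts <- list_project_scripts(root)
scripts
```

Only `main.R` and `R/clean.R` survive the filter.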

### Version resolution

`audit_script()` resolves the version of every detected package from one of
two sources, in order of preference:

1. **`renv.lock`** — if an `renv.lock` file exists in the working directory
   and `renv = TRUE` (the default), versions are read from the lockfile. This
   gives stable, reproducible version information in CI environments where the
   installed library may differ from the project's declared versions.

2. **Installed library** — if no `renv.lock` is present (or `renv = FALSE`),
   `audit_script()` calls `installed.packages()` to resolve versions from
   whatever is currently installed.
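
The order of preference can be sketched as a small lookup function. This is an illustration under stated assumptions: `resolve_version()` is a hypothetical helper, `lock` stands for an already-parsed `renv.lock` (a nested list with a `Packages` element, which is how the lockfile's JSON is laid out), and the fallback uses `utils::packageVersion()` rather than the `installed.packages()` call mentioned above.

```r
# Hypothetical helper (not part of reproducr): lockfile first, then library.
resolve_version <- function(pkg, lock = NULL) {
  locked <- lock$Packages[[pkg]]$Version
  if (!is.null(locked)) {
    return(locked)                          # 1. version pinned in renv.lock
  }
  tryCatch(
    as.character(utils::packageVersion(pkg)),  # 2. installed library
    error = function(e) NA_character_          # not installed
  )
}

# A toy parsed lockfile; the version number is made up for illustration
toy_lock <- list(Packages = list(dplyr = list(Version = "1.1.4")))

resolve_version("dplyr", lock = toy_lock)   # taken from the lockfile
resolve_version("stats")                    # taken from the installed library
resolve_version("notARealPackage123")       # NA: not installed
```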

```{r version-resolution}
script <- tempfile(fileext = ".R")
writeLines(c(
  "x <- dplyr::filter(mtcars, cyl == 4)",
  "y <- ggplot2::ggplot(x, ggplot2::aes(mpg, wt))"
), script)

# renv = FALSE — use installed library (no renv.lock in tempdir)
report <- audit_script(script, renv = FALSE, verbose = FALSE)

# Version column — NA means the package is not installed
report$calls[, c("pkg", "fn", "pkg_version")]
```

### The `audit_report` object

The return value is a list of class `"audit_report"`. Its components are:

| Component | Type | Description |
|---|---|---|
| `calls` | `data.frame` | One row per detected call |
| `env` | `list` | R version, platform, OS, locale, timezone |
| `renv_used` | `logical` | Were versions from `renv.lock`? |
| `timestamp` | `POSIXct` | When the audit ran |
| `paths` | `character` | Files that were scanned |

```{r audit-object}
report <- audit_script(script, renv = FALSE, verbose = FALSE)

# Environment fingerprint
report$env

# Files scanned
report$paths

# Programmatic summary
s <- summary(report)
s$n_calls
s$calls_per_pkg
```

---

## How `risk_score()` works

`risk_score()` runs up to three independent checks on the calls detected by
`audit_script()`. Each check is self-contained — they can be run in any
combination.

### Check 1: `"changelog"` — the breaking-changes database

This is the most powerful check. `risk_score()` looks up every detected
`pkg::fn` call in an internal database of known cases where a package update
silently changed a function's behaviour without producing an error or warning.

For each match, it checks whether the installed (or locked) version falls
inside a *risk window* — a half-open interval `(from_ver, to_ver]`:

```
installed version > from_ver  AND  installed version <= to_ver
         ↑                                    ↑
   last "safe" version          first version where the
   (not inclusive)               breaking change applies
```

A version outside the window is not flagged, even if the function is in the
database. This avoids false positives for users on older or newer versions
where the specific change does not apply.
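
The window test itself is a pair of version comparisons. The sketch below uses base R's `package_version()` so that `1.0.10` is treated as newer than `1.0.9`, which a plain string comparison would get wrong. `in_risk_window()` is a hypothetical helper and the version numbers are invented for illustration.

```r
# Hypothetical helper (not part of reproducr): half-open (from_ver, to_ver].
in_risk_window <- function(installed, from_ver, to_ver) {
  if (is.na(installed)) {
    return(NA)                              # version unknown, cannot decide
  }
  v <- package_version(installed)
  v > package_version(from_ver) && v <= package_version(to_ver)
}

in_risk_window("1.1.0",  from_ver = "1.0.10", to_ver = "1.1.0")  # TRUE: upper bound is inclusive
in_risk_window("1.0.10", from_ver = "1.0.10", to_ver = "1.1.0")  # FALSE: lower bound is exclusive
in_risk_window("1.1.1",  from_ver = "1.0.10", to_ver = "1.1.0")  # FALSE: newer than the window
```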

```{r changelog-check}
# Write a script that calls a function with a known breaking change
risky_script <- tempfile(fileext = ".R")
writeLines(c(
  "# dplyr 1.1.0 changed summarise() grouping behaviour",
  "x <- dplyr::group_by(mtcars, cyl)",
  "y <- dplyr::summarise(x, mean_mpg = mean(mpg))",
  "z <- stringr::str_c('a', NA)" # str_c NA-handling changed in 1.5.0
), risky_script)

report <- audit_script(risky_script, renv = FALSE, verbose = FALSE)
risks <- risk_score(report, methods = "changelog")
print(risks)
```

The database currently covers breaking changes in: `dplyr`, `tidyr`,
`ggplot2`, `readr`, `purrr`, `stringr`, `lubridate`, `broom`, `data.table`,
`lme4`, and base R (the R 3.6.0 RNG change and the R 4.0.0 `hclust()`
tie-breaking change). See the
[contributing to the database](contributing-to-the-database.html) vignette to
add new entries.

### Check 2: `"seed_check"` — missing `set.seed()`

This check finds every call to a stochastic function and verifies that a
`set.seed()` call appears within the 50 lines above it in the same file.

Stochastic functions covered:

```
stats::sample    stats::runif     stats::rnorm     stats::rbinom
stats::rpois     stats::rexp      stats::rgamma    stats::rbeta
stats::rcauchy   stats::rchisq    stats::rf        stats::rt
stats::rgeom     stats::rhyper    stats::rnbinom   stats::rweibull
base::sample     base::sample.int
```

```{r seed-check}
seed_script <- tempfile(fileext = ".R")
writeLines(c(
  "# First call — no seed above it",
  "x <- stats::rnorm(100)",
  "",
  "# Second call — seed present within 50 lines",
  "set.seed(237)",
  "y <- stats::rbinom(100, 1, 0.5)",
  "",
  "# Third call — seed is there but 59 lines above (beyond the window)",
  rep("z <- 1", 55),
  "w <- stats::runif(10)"
), seed_script)

report <- audit_script(seed_script, renv = FALSE, verbose = FALSE)
risks <- risk_score(report, methods = "seed_check")
as.data.frame(risks)[, c("line", "call", "risk", "description")]
```

The 50-line window is intentional: a `set.seed()` call at the top of a 500-line
script does not reliably protect a stochastic call at the bottom, because code
is refactored, reordered, and split across files over time.
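
A look-back window like this can be sketched with a simple slice over the lines above the call. `has_seed_above()` is a hypothetical helper; whether the real check counts the boundary line or ignores commented-out `set.seed()` calls is an assumption here.

```r
# Hypothetical helper (not part of reproducr): is there a set.seed() call
# in the `window` lines directly above `call_line`?
has_seed_above <- function(lines, call_line, window = 50L) {
  if (call_line <= 1L) {
    return(FALSE)                           # nothing above the first line
  }
  above <- lines[max(1L, call_line - window):(call_line - 1L)]
  any(grepl("set.seed(", above, fixed = TRUE))
}

lines <- c(
  "x <- stats::rnorm(100)",                 # line 1: nothing above it
  "set.seed(237)",                          # line 2
  "y <- stats::rbinom(100, 1, 0.5)",        # line 3: seed one line above
  rep("z <- 1", 55),                        # lines 4 to 58
  "w <- stats::runif(10)"                   # line 59: seed is 57 lines above
)

has_seed_above(lines, 1L)    # FALSE
has_seed_above(lines, 3L)    # TRUE
has_seed_above(lines, 59L)   # FALSE: the seed is outside the 50-line window
```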

### Check 3: `"locale_check"` — locale-sensitive operations

This check flags functions whose output depends on the system locale:

```
base::sort      base::order     base::format
base::toupper   base::tolower   base::strftime
base::as.Date   base::sprintf
```

```{r locale-check}
locale_script <- tempfile(fileext = ".R")
writeLines(c(
  "x <- base::sort(c('banana', 'apple', 'cherry'))",
  "y <- base::format(3.14159, digits = 3)",
  "z <- base::strftime(Sys.time(), '%B')" # month name is locale-dependent
), locale_script)

report <- audit_script(locale_script, renv = FALSE, verbose = FALSE)
risks <- risk_score(report, methods = "locale_check")
as.data.frame(risks)[, c("call", "risk", "description")]
```

These are rated `"low"` risk because most analyses running on the same OS with
the same locale will produce identical results. The risk materialises when code
is moved to a server in a different country, or when a Docker container has a
different `LC_ALL` setting.

**Scenario — The international collaboration problem**

Your analysis runs correctly on your Brussels workstation. A collaborator in
the US runs the exact same code on a server whose collation locale is `C` and
gets different patient group orderings.

```r
sorted_ids <- base::sort(patient_ids)
# "é" sorts after "z" under LC_COLLATE=C (plain byte order)
# but between "e" and "f" under LC_COLLATE=fr_BE.UTF-8
```

The downstream merge uses `sorted_ids` as a key. The groupings differ. Table 2
in the paper is different in the two labs — with no error thrown anywhere.
`reproducr` flags `base::sort()` as locale-sensitive so you know to pin the
locale explicitly:

```r
# Pin locale for reproducible sorting
Sys.setlocale("LC_COLLATE", "C")
sorted_ids <- base::sort(patient_ids)
```
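
Calling `Sys.setlocale()` at the top level changes collation for everything that runs afterwards. If that is too broad, two narrower options are sketched below. `with_c_collate()` is a hypothetical helper written for this example; `method = "radix"` is a documented argument of `sort()` that orders character vectors as in the C locale. The IDs are toy values.

```r
# Hypothetical helper (not part of reproducr): evaluate `code` under
# LC_COLLATE = "C", then restore the previous setting.
with_c_collate <- function(code) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  force(code)                               # lazy argument is evaluated here
}

ids <- c("b12", "A07", "a03", "B01")

# Option 1: scope the locale change to a single expression
with_c_collate(sort(ids))

# Option 2: ask sort() for C-locale ordering directly
sort(ids, method = "radix")
```

Both calls return the IDs in C-locale order, with upper-case letters ahead of lower-case ones, regardless of the session locale.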

---

## Combining checks and filtering

All three checks run by default. You can select any subset:

```{r combine-checks}
full_script <- tempfile(fileext = ".R")
writeLines(c(
  "x <- dplyr::summarise(mtcars, n = dplyr::n())",
  "y <- stats::rnorm(10)",
  "z <- base::sort(letters)"
), full_script)

report <- audit_script(full_script, renv = FALSE, verbose = FALSE)

# All checks
all_risks <- risk_score(report)

# Changelog only
changelog_risks <- risk_score(report, methods = "changelog")

# Seed and locale only
other_risks <- risk_score(report, methods = c("seed_check", "locale_check"))

nrow(all_risks)
nrow(changelog_risks)
nrow(other_risks)
```

Filter by minimum severity with `min_risk`:

```{r min-risk}
# Only items worth acting on immediately
high_only <- risk_score(report, min_risk = "high")

# Medium and above
medium_up <- risk_score(report, min_risk = "medium")

# Everything (default)
all_items <- risk_score(report, min_risk = "low")

c(high = nrow(high_only), medium_up = nrow(medium_up), all = nrow(all_items))
```
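
Conceptually, `min_risk` is a threshold on an ordered severity scale. The sketch below reproduces that idea on a plain data frame; `filter_min_risk()` and the toy rows are assumptions for illustration, not the package's internals.

```r
# Severity scale, lowest to highest
risk_levels <- c("low", "medium", "high")

# Hypothetical helper (not part of reproducr): keep rows at or above min_risk.
filter_min_risk <- function(df, min_risk = "low") {
  keep <- match(df$risk, risk_levels) >= match(min_risk, risk_levels)
  df[keep, , drop = FALSE]
}

# Toy results, one row per severity level
toy <- data.frame(
  call = c("dplyr::summarise", "stats::rnorm", "base::sort"),
  risk = c("high", "medium", "low"),
  stringsAsFactors = FALSE
)

counts <- c(
  high      = nrow(filter_min_risk(toy, "high")),
  medium_up = nrow(filter_min_risk(toy, "medium")),
  all       = nrow(filter_min_risk(toy, "low"))
)
counts
```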

---

## Working with the results

`risk_score()` returns a `risk_report` object that inherits from `data.frame`,
so all standard data frame operations work directly:

```{r results-as-df}
risks <- risk_score(report)

# Standard subsetting
risks[risks$check == "seed_check", ]

# Count by risk level
table(risks$risk)

# Convert to plain data.frame (drops the extra class)
df <- as.data.frame(risks)
class(df)
```

You can pipe results into any tidy workflow:

```{r results-pipe, eval = FALSE}
# Explicit namespacing, in line with the practice this vignette recommends
risk_score(report) |>
  dplyr::filter(risk == "high") |>
  dplyr::select(call, line, description) |>
  dplyr::arrange(line)
```
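
Because the result behaves like a data frame, it is straightforward to turn it into a pass/fail gate, for example in a CI job. The sketch below assumes only the `risk` and `call` columns used elsewhere in this vignette; `stop_on_high_risk()` is a hypothetical helper and the rows are toy data.

```r
# Hypothetical helper (not part of reproducr): error out if any row is high risk.
stop_on_high_risk <- function(risks) {
  high <- risks[risks$risk == "high", , drop = FALSE]
  if (nrow(high) > 0L) {
    stop(
      sprintf("%d high-risk call(s): %s",
              nrow(high), paste(high$call, collapse = ", ")),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

toy <- data.frame(
  call = c("dplyr::summarise", "stats::rnorm", "base::sort"),
  risk = c("high", "medium", "low"),
  stringsAsFactors = FALSE
)

# Capture the error message instead of halting, to show what the gate reports
result <- tryCatch(stop_on_high_risk(toy), error = function(e) conditionMessage(e))
result
```

In a real pipeline you would call something like `stop_on_high_risk(risk_score(report))` and let the error fail the build.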

---

## Practical interpretation

**High risk** — take action before submitting. These are cases where the
function's output values are known to silently change between versions. At
minimum, pin the package version in your `renv.lock` and document the version
in your methods section.

**Medium risk** — review carefully. An argument may have been renamed or
deprecated, or a stochastic function may lack a seed. Your results may differ
across runs or environments.

**Low risk** — be aware. Locale-sensitive functions are unlikely to behave
differently on your development machine, but they are worth noting if the
analysis will run on a different OS or server.

**No risks detected** — every detected call is either absent from the
breaking-changes database or outside any known risk window, and no stochastic
or locale issues were found. This is a positive signal, not a guarantee: the
database does not cover every package.

```{r cleanup, include = FALSE}
unlink(c(script, risky_script, seed_script, locale_script, full_script))
```
