---
title: "Tracking how a literature changes between retrievals"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Tracking how a literature changes between retrievals}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

```{r setup}
library(scopusflow)
```

A literature is a moving target. Run the same search a few months apart and the
result set will usually have grown, and it may also have lost a record that was
re-indexed in the meantime. This article shows how to see exactly what changed
and how to merge retrievals safely.
It runs offline: the baseline is the bundled `example_records`, and the later
retrieval is built from a synthetic entry list of the same shape the API returns.

## The baseline

```{r}
baseline <- example_records
nrow(baseline)
```

## A later retrieval

Months on, the search is repeated. Here we mimic that second pull: it carries
over three of the original records, leaves out the rest as though they had been
re-indexed, and adds two new papers.

```{r}
later_raw <- list(entry = list(
  # carried over from the baseline
  list(`dc:identifier` = "SCOPUS_ID:85000000001", `prism:doi` = "10.1038/s41586-019-0001-1",
       `dc:title` = "Genome editing with CRISPR-Cas9: principles and applications",
       `prism:coverDate` = "2019-04-12"),
  list(`dc:identifier` = "SCOPUS_ID:85000000002", `prism:doi` = "10.1038/s41586-020-0002-2",
       `dc:title` = "Deep learning for medical image analysis: a review",
       `prism:coverDate` = "2020-02-20"),
  list(`dc:identifier` = "SCOPUS_ID:85000000006", `prism:doi` = "10.1103/PhysRevLett.116.061102",
       `dc:title` = "Observation of gravitational waves from a binary black hole merger",
       `prism:coverDate` = "2016-02-11"),
  # newly indexed since the baseline
  list(`dc:identifier` = "SCOPUS_ID:85000000007", `prism:doi` = "10.1126/science.abc1234",
       `dc:title` = "A room-temperature superconductor candidate",
       `prism:coverDate` = "2023-03-08"),
  list(`dc:identifier` = "SCOPUS_ID:85000000008", `prism:doi` = "10.1038/s41586-023-0008-8",
       `dc:title` = "Large language models for scientific discovery",
       `prism:coverDate` = "2023-06-01")
))
later <- scopus_records(later_raw, query = "illustrative later retrieval")
nrow(later)
```

## What changed

`scopus_diff_dois()` reports which DOIs were added, removed or unchanged between
the two retrievals, and prints the counts in each category.

```{r}
changes <- scopus_diff_dois(old = baseline, new = later)
changes
```

The newly indexed papers come back as `added`, the records present in both
retrievals as `unchanged`, and anything missing from the later pull as `removed`.
To act on one category, subset the table on its `status` column.

```{r}
changes[changes$status == "added", ]
```
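
The three categories correspond to ordinary set operations on DOIs. The chunk
below is a minimal base-R sketch of that idea, not the implementation of
`scopus_diff_dois()`: the DOI vectors, the helper `tag_dois()` and the column
names are all hypothetical, and lower-casing before comparison is an assumption
made here because DOI names are case-insensitive.

```{r}
# Hypothetical DOI vectors standing in for two retrievals.
old_dois <- tolower(c("10.1000/a", "10.1000/b", "10.1000/c"))
new_dois <- tolower(c("10.1000/B", "10.1000/c", "10.1000/d"))

# Hypothetical helper: label a set of DOIs with one status.
# rep() keeps the columns the same length even when a set is empty.
tag_dois <- function(dois, status) {
  data.frame(doi = dois, status = rep(status, length(dois)))
}

sketch <- rbind(
  tag_dois(setdiff(new_dois, old_dois), "added"),       # only in the later pull
  tag_dois(setdiff(old_dois, new_dois), "removed"),     # only in the earlier pull
  tag_dois(intersect(old_dois, new_dois), "unchanged")  # in both pulls
)
sketch
```

Records without a DOI cannot be matched this way, which is one reason a real
comparison may need a second identifier.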

## Merging without duplicates

To keep a cumulative set across retrievals, combine them. `scopus_combine()`
renumbers the records and, with `dedupe = TRUE`, keeps each one once by 'Scopus'
identifier or DOI, so the records the two pulls share are not doubled.

```{r}
combined <- scopus_combine(baseline, later, dedupe = TRUE)
nrow(combined)
```
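
To make "once by identifier or DOI" concrete, here is a rough base-R sketch of
that kind of de-duplication on toy data. It illustrates the rule only and is
not how `scopus_combine()` is implemented; the data frames and the column names
`id` and `doi` are assumptions made for the example.

```{r}
# Hypothetical toy pulls; column names `id` and `doi` are assumptions.
pull_a <- data.frame(
  id  = c("SCOPUS_ID:1", "SCOPUS_ID:2"),
  doi = c("10.1000/a", "10.1000/b")
)
pull_b <- data.frame(
  id  = c("SCOPUS_ID:2", "SCOPUS_ID:3"),
  doi = c("10.1000/b", NA)
)

stacked <- rbind(pull_a, pull_b)

# A row is a duplicate if an earlier row has the same identifier or the same
# DOI. Missing DOIs are excluded, because duplicated() treats NAs as equal.
is_dup <- duplicated(stacked$id) |
  (!is.na(stacked$doi) & duplicated(stacked$doi))

merged <- stacked[!is_dup, ]
merged
```

Keeping the first occurrence means the earlier pull wins when both retrievals
hold the same record.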

## Keeping a record of each pull

Saving each retrieval lets you compare against it next time. The `.rds` form
round-trips exactly.

```{r}
path <- file.path(tempdir(), "baseline.rds")
write_scopus_records(baseline, path)
identical(read_scopus_records(path), baseline)
```
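
If you save every pull, a date-stamped file name keeps successive snapshots
from overwriting each other. The sketch below shows one possible layout using
only base R (`saveRDS()` and `readRDS()`) and a toy data frame; the directory
name, the file-name pattern and the object `toy_pull` are hypothetical.

```{r}
# Toy stand-in for a retrieval; any R object can be saved the same way.
toy_pull <- data.frame(doi = c("10.1000/a", "10.1000/b"))

# Hypothetical layout: one date-stamped file per pull in a snapshot directory.
snapshot_dir <- file.path(tempdir(), "snapshots")
dir.create(snapshot_dir, showWarnings = FALSE)
snapshot_path <- file.path(
  snapshot_dir,
  sprintf("records-%s.rds", format(Sys.Date(), "%Y-%m-%d"))
)
saveRDS(toy_pull, snapshot_path)

# ISO dates sort chronologically as text, so the last file is the newest pull.
snapshots <- sort(list.files(snapshot_dir, pattern = "^records-.*\\.rds$",
                             full.names = TRUE))
latest <- readRDS(snapshots[length(snapshots)])
identical(latest, toy_pull)
```

The same naming scheme should work with `write_scopus_records()` and
`read_scopus_records()` in place of the base functions.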

In a live setting the later retrieval would come from the API rather than a
synthetic list, with everything else unchanged.

```{r eval = FALSE}
later <- scopus_fetch("TITLE-ABS-KEY(CRISPR)", field = "TITLE-ABS-KEY")
scopus_diff_dois(old = read_scopus_records(path), new = later)
```