quanteda.tidy extends the quanteda package with dplyr-style verbs for manipulating corpus objects. These functions operate on document variables (docvars) while preserving the text content and structure of quanteda objects.
Note that quanteda.tidy is very different from tidytext. While tidytext converts text to data frames with one token per row, quanteda.tidy keeps your corpus intact and extends dplyr functions to work directly with quanteda objects.
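To make the contrast concrete, here is a minimal sketch. It assumes quanteda, dplyr, and quanteda.tidy are installed, and that the dplyr generics dispatch to the corpus methods once quanteda.tidy is attached. The object name `recent` is only illustrative. The sketch checks that a filtered result is still a quanteda corpus rather than a data frame:

```r
library(quanteda)
library(dplyr)          # attached explicitly so the generics are available (an assumption about setup)
library(quanteda.tidy)

# Subset documents on a docvar; `recent` is a hypothetical name for this sketch
recent <- filter(data_corpus_inaugural, Year >= 2000)

is.corpus(recent)       # checks that the result is still a corpus
is.data.frame(recent)   # checks whether it was converted to a data frame
names(docvars(recent))  # lists the docvars carried along with the texts
```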
The functions in quanteda.tidy are organized into four categories, following the dplyr documentation:
| Category | Function | Description |
|---|---|---|
| Rows | filter() | Subset documents based on docvar conditions |
| Rows | slice(), slice_head(), slice_tail() | Subset documents by position |
| Rows | slice_sample() | Randomly sample documents |
| Rows | slice_min(), slice_max() | Select documents with min/max docvar values |
| Rows | arrange(), distinct() | Reorder documents; keep unique documents |
| Columns | select() | Keep or drop docvars by name |
| Columns | rename(), rename_with() | Rename docvars |
| Columns | relocate() | Change docvar column order |
| Columns | mutate(), transmute() | Create or modify docvars |
| Columns | pull() | Extract a single docvar as a vector |
| Columns | glimpse() | Get a quick overview of the corpus |
| Groups of rows | add_count() | Add count by group as a docvar |
| Groups of rows | add_tally() | Add total count as a docvar |
| Pairs of data frames | left_join() | Join corpus with external data frame |
These functions subset, reorder, or select documents based on their document variables or positions.
Use filter() to keep documents that match specified conditions:
# Keep only Roosevelt's speeches
data_corpus_inaugural %>%
  filter(President == "Roosevelt") %>%
  summary()
## Corpus consisting of 5 documents, showing 5 documents:
##
## Text Types Tokens Sentences Year President FirstName Party
## 1905-Roosevelt 404 1079 33 1905 Roosevelt Theodore Republican
## 1933-Roosevelt 743 2057 85 1933 Roosevelt Franklin D. Democratic
## 1937-Roosevelt 725 1989 96 1937 Roosevelt Franklin D. Democratic
## 1941-Roosevelt 526 1519 68 1941 Roosevelt Franklin D. Democratic
## 1945-Roosevelt 275 633 27 1945 Roosevelt Franklin D. Democratic
Use slice() and its variants to select documents by position:
# First 3 documents
slice(data_corpus_inaugural, 1:3)
## Corpus consisting of 3 documents and 4 docvars.
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
##
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
##
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."
# First 10%
slice_head(data_corpus_inaugural, prop = 0.10)
## Corpus consisting of 6 documents and 4 docvars.
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
##
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
##
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."
##
## 1801-Jefferson :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
##
## 1805-Jefferson :
## "Proceeding, fellow citizens, to that qualification which the..."
##
## 1809-Madison :
## "Unwilling to depart from examples of the most revered author..."
# Last 3 documents
slice_tail(data_corpus_inaugural, n = 3)
## Corpus consisting of 3 documents and 4 docvars.
## 2017-Trump :
## "Chief Justice Roberts, President Carter, President Clinton, ..."
##
## 2021-Biden :
## "Chief Justice Roberts, Vice President Harris, Speaker Pelosi..."
##
## 2025-Trump :
## "Thank you. Thank you very much, everybody. Wow. Thank you..."
Use slice_sample() to draw a random sample of documents:
set.seed(42)
slice_sample(data_corpus_inaugural, n = 5)
## Corpus consisting of 5 documents and 4 docvars.
## 1981-Reagan :
## "Senator Hatfield, Mr. Chief Justice, Mr. President, Vice Pre..."
##
## 1933-Roosevelt :
## "I am certain that my fellow Americans expect that on my indu..."
##
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
##
## 1885-Cleveland :
## "Fellow citizens, in the presence of this vast assemblage of ..."
##
## 1825-Adams :
## "In compliance with an usage coeval with the existence of our..."
Use slice_min() and slice_max() to select documents by the minimum or maximum values of a docvar:
# Add token counts first
corp <- data_corpus_inaugural %>%
  mutate(n_tokens = ntoken(data_corpus_inaugural))
# Shortest speeches
slice_min(corp, n_tokens, n = 3)
## Corpus consisting of 3 documents and 5 docvars.
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
##
## 1945-Roosevelt :
## "Chief Justice, Mr. Vice President, my friends, you will unde..."
##
## 1865-Lincoln :
## "Fellow-Countrymen: At this second appearing to take the oath..."
# Longest speeches
slice_max(corp, n_tokens, n = 3)
## Corpus consisting of 3 documents and 5 docvars.
## 1841-Harrison :
## "Called from a retirement which I had supposed was to continu..."
##
## 1909-Taft :
## "My fellow citizens: Anyone who has taken the oath I have jus..."
##
## 1845-Polk :
## "Fellow citizens, without solicitation on my part, I have bee..."
Use arrange() to reorder documents:
# Sort alphabetically by president
data_corpus_inaugural[1:5] %>%
  arrange(President)
## Corpus consisting of 5 documents and 4 docvars.
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."
##
## 1801-Jefferson :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
##
## 1805-Jefferson :
## "Proceeding, fellow citizens, to that qualification which the..."
##
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
##
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
# Sort by year descending
data_corpus_inaugural[1:5] %>%
  arrange(desc(Year))
## Corpus consisting of 5 documents and 4 docvars.
## 1805-Jefferson :
## "Proceeding, fellow citizens, to that qualification which the..."
##
## 1801-Jefferson :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
##
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."
##
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
##
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
Use distinct() to keep only unique combinations of docvar values:
# Keep first document for each president
data_corpus_inaugural %>%
  distinct(President, .keep_all = TRUE) %>%
  summary(n = 10)
## Corpus consisting of 36 documents, showing 10 documents:
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington 625 1537 23 1789 Washington George
## 1797-Adams 826 2577 37 1797 Adams John
## 1801-Jefferson 717 1923 41 1801 Jefferson Thomas
## 1809-Madison 535 1261 21 1809 Madison James
## 1817-Monroe 1040 3677 121 1817 Monroe James
## 1829-Jackson 517 1208 25 1829 Jackson Andrew
## 1837-VanBuren 1315 4158 95 1837 Van Buren Martin
## 1841-Harrison 1896 9125 210 1841 Harrison William Henry
## 1845-Polk 1334 5186 153 1845 Polk James Knox
## 1849-Taylor 496 1178 22 1849 Taylor Zachary
## Party
## none
## Federalist
## Democratic-Republican
## Democratic-Republican
## Democratic-Republican
## Democratic
## Democratic
## Whig
## Whig
## Whig
These functions create, modify, rename, reorder, or select document variables.
Use select() to keep or drop docvars:
data_corpus_inaugural %>%
  select(President, Year) %>%
  summary(n = 5)
## Corpus consisting of 60 documents, showing 5 documents:
##
## Text Types Tokens Sentences President Year
## 1789-Washington 625 1537 23 Washington 1789
## 1793-Washington 96 147 4 Washington 1793
## 1797-Adams 826 2577 37 Adams 1797
## 1801-Jefferson 717 1923 41 Jefferson 1801
## 1805-Jefferson 804 2380 45 Jefferson 1805
Use rename() for direct renaming:
data_corpus_inaugural %>%
  rename(LastName = President, Given = FirstName) %>%
  summary(n = 5)
## Corpus consisting of 60 documents, showing 5 documents:
##
## Text Types Tokens Sentences Year LastName Given
## 1789-Washington 625 1537 23 1789 Washington George
## 1793-Washington 96 147 4 1793 Washington George
## 1797-Adams 826 2577 37 1797 Adams John
## 1801-Jefferson 717 1923 41 1801 Jefferson Thomas
## 1805-Jefferson 804 2380 45 1805 Jefferson Thomas
## Party
## none
## none
## Federalist
## Democratic-Republican
## Democratic-Republican
Use rename_with() to rename using a function:
data_corpus_inaugural %>%
  rename_with(toupper) %>%
  summary(n = 5)
## Corpus consisting of 60 documents, showing 5 documents:
##
## Text Types Tokens Sentences YEAR PRESIDENT FIRSTNAME
## 1789-Washington 625 1537 23 1789 Washington George
## 1793-Washington 96 147 4 1793 Washington George
## 1797-Adams 826 2577 37 1797 Adams John
## 1801-Jefferson 717 1923 41 1801 Jefferson Thomas
## 1805-Jefferson 804 2380 45 1805 Jefferson Thomas
## PARTY
## none
## none
## Federalist
## Democratic-Republican
## Democratic-Republican
Use relocate() to change column order:
data_corpus_inaugural %>%
  relocate(Party, President) %>%
  summary(n = 5)
## Corpus consisting of 60 documents, showing 5 documents:
##
## Text Types Tokens Sentences Party President Year
## 1789-Washington 625 1537 23 none Washington 1789
## 1793-Washington 96 147 4 none Washington 1793
## 1797-Adams 826 2577 37 Federalist Adams 1797
## 1801-Jefferson 717 1923 41 Democratic-Republican Jefferson 1801
## 1805-Jefferson 804 2380 45 Democratic-Republican Jefferson 1805
## FirstName
## George
## George
## John
## Thomas
## Thomas
Use mutate() to add new docvars or modify existing ones:
data_corpus_inaugural %>%
  mutate(
    fullname = paste(FirstName, President, sep = " "),
    century = floor(Year / 100) + 1
  ) %>%
  summary(n = 5)
## Corpus consisting of 60 documents, showing 5 documents:
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington 625 1537 23 1789 Washington George
## 1793-Washington 96 147 4 1793 Washington George
## 1797-Adams 826 2577 37 1797 Adams John
## 1801-Jefferson 717 1923 41 1801 Jefferson Thomas
## 1805-Jefferson 804 2380 45 1805 Jefferson Thomas
## Party fullname century
## none George Washington 18
## none George Washington 18
## Federalist John Adams 18
## Democratic-Republican Thomas Jefferson 19
## Democratic-Republican Thomas Jefferson 19
Use transmute() to create new docvars and drop all others:
data_corpus_inaugural %>%
  transmute(
    speech_id = paste(Year, President, sep = "-"),
    party = Party
  ) %>%
  summary(n = 5)
## Corpus consisting of 60 documents, showing 5 documents:
##
## Text Types Tokens Sentences speech_id party
## 1789-Washington 625 1537 23 1789-Washington none
## 1793-Washington 96 147 4 1793-Washington none
## 1797-Adams 826 2577 37 1797-Adams Federalist
## 1801-Jefferson 717 1923 41 1801-Jefferson Democratic-Republican
## 1805-Jefferson 804 2380 45 1805-Jefferson Democratic-Republican
Use pull() to extract a single docvar as a vector:
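A minimal sketch of pull(), assuming it follows dplyr::pull() semantics on the docvars and that quanteda, dplyr, and quanteda.tidy are attached. The name `presidents` is illustrative only:

```r
library(quanteda)
library(dplyr)          # attached explicitly so pull() is available (an assumption about setup)
library(quanteda.tidy)

# Extract the President docvar as a plain vector, one element per document
presidents <- pull(data_corpus_inaugural, President)

head(presidents)
length(presidents) == ndoc(data_corpus_inaugural)
```

Unlike the other verbs, the result here is an ordinary vector rather than a corpus, so it is typically the last step in a chain.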
Use glimpse() (from tibble) to see a compact summary:
glimpse(data_corpus_inaugural)
## Rows: 60
## Columns: 6
## $ doc_id <chr> "1789-Washington", "1793-Washington", "1797-Adams", "1801-Je…
## $ text <chr> "Fellow-Cit…", "Fellow cit…", "When it wa…", "Friends an…", …
## $ Year <int> 1789, 1793, 1797, 1801, 1805, 1809, 1813, 1817, 1821, 1825, …
## $ President <chr> "Washington", "Washington", "Adams", "Jefferson", "Jefferson…
## $ FirstName <chr> "George", "George", "John", "Thomas", "Thomas", "James", "Ja…
## $ Party <fct> none, none, Federalist, Democratic-Republican, Democratic-Re…
These functions compute summaries or add variables based on groups.
Use add_count() to add a count variable by group:
# Count speeches per president
data_corpus_inaugural %>%
  add_count(President, name = "n_speeches") %>%
  filter(n_speeches > 1) %>%
  summary(n = 10)
## Corpus consisting of 44 documents, showing 10 documents:
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington 625 1537 23 1789 Washington George
## 1793-Washington 96 147 4 1793 Washington George
## 1797-Adams 826 2577 37 1797 Adams John
## 1801-Jefferson 717 1923 41 1801 Jefferson Thomas
## 1805-Jefferson 804 2380 45 1805 Jefferson Thomas
## 1809-Madison 535 1261 21 1809 Madison James
## 1813-Madison 541 1302 33 1813 Madison James
## 1817-Monroe 1040 3677 121 1817 Monroe James
## 1821-Monroe 1259 4886 131 1821 Monroe James
## 1825-Adams 1003 3147 74 1825 Adams John Quincy
## Party n_speeches
## none 2
## none 2
## Federalist 2
## Democratic-Republican 2
## Democratic-Republican 2
## Democratic-Republican 2
## Democratic-Republican 2
## Democratic-Republican 2
## Democratic-Republican 2
## Democratic-Republican 2
Use add_tally() to add the total count:
data_corpus_inaugural %>%
  slice(1:5) %>%
  add_tally() %>%
  summary()
## Corpus consisting of 5 documents, showing 5 documents:
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington 625 1537 23 1789 Washington George
## 1793-Washington 96 147 4 1793 Washington George
## 1797-Adams 826 2577 37 1797 Adams John
## 1801-Jefferson 717 1923 41 1801 Jefferson Thomas
## 1805-Jefferson 804 2380 45 1805 Jefferson Thomas
## Party n
## none 5
## none 5
## Federalist 5
## Democratic-Republican 5
## Democratic-Republican 5
These functions combine a corpus with an external data frame.
Use left_join() to add columns from a data frame to your corpus:
# Create some external data
party_colors <- data.frame(
  Party = c("Democratic", "Republican", "none", "Federalist",
            "Democratic-Republican", "Whig"),
  color = c("blue", "red", "gray", "purple", "green", "orange")
)
# Join to corpus
data_corpus_inaugural %>%
  left_join(party_colors, by = "Party") %>%
  summary(n = 10)
## Corpus consisting of 60 documents, showing 10 documents:
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington 625 1537 23 1789 Washington George
## 1793-Washington 96 147 4 1793 Washington George
## 1797-Adams 826 2577 37 1797 Adams John
## 1801-Jefferson 717 1923 41 1801 Jefferson Thomas
## 1805-Jefferson 804 2380 45 1805 Jefferson Thomas
## 1809-Madison 535 1261 21 1809 Madison James
## 1813-Madison 541 1302 33 1813 Madison James
## 1817-Monroe 1040 3677 121 1817 Monroe James
## 1821-Monroe 1259 4886 131 1821 Monroe James
## 1825-Adams 1003 3147 74 1825 Adams John Quincy
## Party color
## none gray
## none gray
## Federalist purple
## Democratic-Republican green
## Democratic-Republican green
## Democratic-Republican green
## Democratic-Republican green
## Democratic-Republican green
## Democratic-Republican green
## Democratic-Republican green
left_join() provides special handling for joining on document names. Use "docname" in the by argument to match on document names even when "docname" is not a docvar:
# Create data with document name as key
doc_metadata <- data.frame(
  docname = c("1789-Washington", "1793-Washington", "1797-Adams"),
  notes = c("First inaugural", "Second inaugural", "First Adams speech")
)
# Join using docname
data_corpus_inaugural[1:5] %>%
  left_join(doc_metadata, by = "docname") %>%
  summary()
## Corpus consisting of 5 documents, showing 5 documents:
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington 625 1537 23 1789 Washington George
## 1793-Washington 96 147 4 1793 Washington George
## 1797-Adams 826 2577 37 1797 Adams John
## 1801-Jefferson 717 1923 41 1801 Jefferson Thomas
## 1805-Jefferson 804 2380 45 1805 Jefferson Thomas
## Party notes
## none First inaugural
## none Second inaugural
## Federalist First Adams speech
## Democratic-Republican <NA>
## Democratic-Republican <NA>
You can also match document names to a differently named column:
doc_metadata2 <- data.frame(
  doc_id = c("1789-Washington", "1793-Washington"),
  rating = c(5, 4)
)
data_corpus_inaugural[1:5] %>%
  left_join(doc_metadata2, by = c("docname" = "doc_id")) %>%
  summary()
## Corpus consisting of 5 documents, showing 5 documents:
##
## Text Types Tokens Sentences Year President FirstName
## 1789-Washington 625 1537 23 1789 Washington George
## 1793-Washington 96 147 4 1793 Washington George
## 1797-Adams 826 2577 37 1797 Adams John
## 1801-Jefferson 717 1923 41 1801 Jefferson Thomas
## 1805-Jefferson 804 2380 45 1805 Jefferson Thomas
## Party rating
## none 5
## none 4
## Federalist NA
## Democratic-Republican NA
## Democratic-Republican NA
All quanteda.tidy functions work with the pipe operator, so you can chain multiple operations:
data_corpus_inaugural %>%
  # Add metadata
  mutate(
    decade = floor(Year / 10) * 10,
    n_tokens = ntoken(data_corpus_inaugural)
  ) %>%
  # Filter to 20th century
  filter(Year >= 1900, Year < 2000) %>%
  # Keep only relevant columns
  select(President, Party, decade, n_tokens) %>%
  # Sort by speech length
  arrange(desc(n_tokens)) %>%
  summary(n = 10)
## Corpus consisting of 25 documents, showing 10 documents:
##
## Text Types Tokens Sentences President Party decade n_tokens
## 1909-Taft 1437 5821 158 Taft Republican 1900 5821
## 1925-Coolidge 1220 4440 196 Coolidge Republican 1920 4440
## 1929-Hoover 1090 3860 158 Hoover Republican 1920 3860
## 1921-Harding 1169 3719 148 Harding Republican 1920 3719
## 1985-Reagan 925 2909 123 Reagan Republican 1980 2909
## 1981-Reagan 902 2781 129 Reagan Republican 1980 2781
## 1953-Eisenhower 900 2743 119 Eisenhower Republican 1950 2743
## 1989-Bush 795 2674 141 Bush Republican 1980 2674
## 1949-Truman 781 2504 116 Truman Democratic 1940 2504
## 1901-McKinley 854 2437 100 McKinley Republican 1900 2437
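One detail in the chain above is that n_tokens is computed from the full data_corpus_inaugural. The counts line up with the documents only because mutate() runs before filter(). If the subset comes first, the counts need to come from the subset itself. A minimal sketch of that ordering follows. The name `corp20` is illustrative, and the sketch reuses the earlier ntoken() pattern rather than any special quanteda.tidy feature:

```r
library(quanteda)
library(dplyr)          # attached explicitly so the generics and desc() are available
library(quanteda.tidy)

# Subset first, then count tokens on the subset so the lengths match
corp20 <- filter(data_corpus_inaugural, Year >= 1900, Year < 2000)
corp20 <- mutate(corp20, n_tokens = ntoken(corp20))

# Longest 20th-century speeches first
summary(arrange(corp20, desc(n_tokens)), n = 3)
```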