Introduction to surveytable

The surveytable package provides short and understandable commands that generate tabulated, formatted, and rounded survey estimates.

Preliminaries

Concepts

There are two important concepts that we need to learn and distinguish:

  1. A data frame is a standard way of storing data in R. A data frame is rectangular data. Variables are in columns, observations are in rows. Example:
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

A data frame, in and of itself, cannot represent a complex survey. Just by looking at a data frame, R does not know what the sampling weights are, what the strata are, and so on. Even if the variables that hold the sampling weights and other design information are part of the data frame, R does not know which variable represents the weights or any other survey design variable.

You can get a data frame into R in many different ways. If your data is currently in a comma-separated values (CSV) file, you can use read.csv(). If it’s in a SAS file, you can use a package like haven or importsurvey. If it’s already in R format, use readRDS(), and so on.
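As a minimal sketch of the CSV route, the following writes a small made-up file to a temporary location and reads it back with read.csv(). The file contents and the column names (AGE, SEX, WT) are hypothetical and serve only as an illustration; they are not part of NAMCS.

```r
# Hypothetical example data; the column names are made up for illustration.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("AGE,SEX,WT",
             "34,Female,1250.5",
             "61,Male,980.0",
             "7,Female,1500.25"), csv_file)

# read.csv() returns a data frame: variables in columns, observations in rows.
mydata <- read.csv(csv_file)
str(mydata)
```

At this point mydata is only a data frame. Nothing in it tells R that WT is meant to be a sampling weight.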

  2. A survey object is an object that describes a survey. It tells R what the sampling weights are, what the strata are, and so on. A data frame can be converted into a survey object using the survey::svydesign() function; if a survey uses replicate weights, the survey::svrepdesign() function should be used.

Generally speaking, you only need to convert a data frame to a survey object once. After it has been converted, you can save it with saveRDS() (or similar). In the future, you can load it with readRDS(). You do not need to re-convert a data frame to a survey object every time.
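The save-once, load-later workflow might look like the sketch below. It assumes the survey package is installed and uses the api example data that ships with that package rather than NAMCS; the object names design, design2, and rds_file are made up for this illustration.

```r
library(survey)

# Example data shipped with the survey package (not NAMCS).
data(api)
design <- svydesign(ids = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# Save the survey object once...
rds_file <- tempfile(fileext = ".rds")
saveRDS(design, rds_file)

# ...and in later sessions, load it directly instead of re-converting.
design2 <- readRDS(rds_file)
class(design2)
```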

NAMCS

Examples in this tutorial use a survey called the National Ambulatory Medical Care Survey (NAMCS) 2019 Public Use File (PUF). NAMCS is “an annual nationally representative sample survey of visits to non-federal office-based patient care physicians, excluding anesthesiologists, radiologists, and pathologists.” Note that the unit of observation is visits, not patients – this distinction is important since a single patient can make multiple visits.

The surveytable package comes with a data frame of selected variables from NAMCS, called namcs2019sv_df (sv = selected variables; df = data frame). The survey object of this survey is called namcs2019sv.

namcs2019sv is the object that we analyze. You really only need namcs2019sv. The reason that the package has namcs2019sv_df is to illustrate how to convert the data frame to the survey object.

More concepts

When importing data from another source, such as SAS or CSV, analysts should be aware of the standard way in which variables are handled in R.

Variables in namcs2019sv_df are already stored correctly: categorical variables as factor, yes / no variables as logical, and numeric variables as numeric. Thus,

library("surveytable")
class(namcs2019sv_df$AGER)
#> [1] "factor"
class(namcs2019sv_df$PAYNOCHG)
#> [1] "logical"
class(namcs2019sv_df$AGE)
#> [1] "numeric"
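When data comes from a CSV or SAS file, the columns often do not arrive in these classes. The following sketch shows one way to convert them in base R. The data frame raw and its columns are hypothetical and are not taken from NAMCS.

```r
# Hypothetical raw import: categories as text, a yes/no flag coded 1/0,
# and a numeric value stored as text.
raw <- data.frame(
  AGEGRP = c("Under 15 years", "15-24 years", "Under 15 years"),
  NOCHARGE = c(1, 0, 0),
  AGE = c("7", "19", "12"),
  stringsAsFactors = FALSE
)

# Categorical variable -> factor, with the levels in a meaningful order.
raw$AGEGRP <- factor(raw$AGEGRP, levels = c("Under 15 years", "15-24 years"))
# Yes / no variable -> logical.
raw$NOCHARGE <- raw$NOCHARGE == 1
# Numbers stored as text -> numeric.
raw$AGE <- as.numeric(raw$AGE)

sapply(raw, class)
```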

Create a survey object

As seen below, tables produced by surveytable are clearer if either the variable names themselves are descriptive, or the variables have a descriptive "label" attribute. In namcs2019sv_df, all variables already have the "label" attribute set. For example, while the variable name AGE itself is not very descriptive, the variable does have a more descriptive "label" attribute:

attr(namcs2019sv_df$AGE, "label")
#> [1] "Patient age in years"
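For your own data, the "label" attribute can be set with base R's attr(). A minimal sketch, using a hypothetical data frame called mydata:

```r
# Hypothetical data frame with a terse variable name.
mydata <- data.frame(AGE = c(34, 61, 7))

# Attach a descriptive label; this does not change the values.
attr(mydata$AGE, "label") <- "Patient age in years"
attr(mydata$AGE, "label")
```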

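Documentation for the NAMCS survey provides the names of the survey design variables. Specifically, in NAMCS, the cluster IDs (also known as primary sampling units) are given in CPSUM, the strata are given in CSTRATM, and the sampling weights are given in PATWT.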

Thus, the namcs2019sv_df data frame can be turned into a survey object as follows:

mysurvey = survey::svydesign(ids = ~ CPSUM
  , strata = ~ CSTRATM
  , weights = ~ PATWT
  , data = namcs2019sv_df)

Tables produced by surveytable are clearer if either the name of the survey object is descriptive, or the object has a descriptive "label" attribute. Let’s set this attribute for mysurvey:

attr(mysurvey, "label") = "NAMCS 2019 PUF"

The mysurvey object should now be identical to namcs2019sv. Let’s verify this:

identical(namcs2019sv, mysurvey)
#> [1] TRUE

We have just successfully created a survey object from a data frame.

Begin analysis

First, specify the survey object that you’d like to analyze.

set_survey(namcs2019sv)
#> * To adjust how counts are rounded, see ?set_count_int
#>                        _                                                                    
#> Survey name            NAMCS 2019 PUF                                                       
#> Number of variables    33                                                                   
#> Number of observations 8250                                                                 
#> Info1                  Stratified 1 - level Cluster Sampling design (with replacement)      
#> Info2                  With (398) clusters.                                                 
#> Info3                  survey::svydesign(ids = ~CPSUM, strata = ~CSTRATM, weights = ~PATWT, 
#> Info4                      data = namcs2019sv_df)

Check the survey label, survey design variables, and the number of observations to verify that it all looks correct.

List variables

The var_list() function lists the variables in the survey. To avoid unintentionally listing all the variables in a survey, which can be many, you specify the starting characters of the variable names. For example, to list the variables that start with the letters age, type:

var_list("age")
Variables beginning with ‘age’ {NAMCS 2019 PUF}
Variable Class Long name
AGE numeric Patient age in years
AGER factor Patient age recode

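The table lists each variable's name, its class (the type of variable), and its long name (the variable's "label" attribute).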

Common classes are factor (categorical variable), logical (yes / no variable), and numeric.

Tabulate categorical and logical variables

The main function of the surveytable package is tab(), which tabulates variables. It operates on categorical and logical variables, and presents both estimated counts, with their standard errors (SEs) and 95% confidence intervals (CIs), and percentages, with their SEs and CIs. For example, to tabulate AGER, type:

tab("AGER")
Patient age recode {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL
Under 15 years 117,917 14,097 93,229 149,142 11.4 1.3 8.9 14.2
15-24 years 64,856 7,018 52,387 80,292 6.3 0.6 5.1 7.5
25-44 years 170,271 13,966 144,925 200,049 16.4 1.1 14.3 18.8
45-64 years 309,506 23,290 266,994 358,787 29.9 1.4 27.2 32.6
65-74 years 206,866 14,366 180,481 237,109 20   1.2 17.6 22.5
75 years and over 167,069 15,179 139,746 199,735 16.1 1.3 13.7 18.8
(Checked presentation standards. Nothing to report.)

The table title shows the variable label (the long variable name) and the survey label.

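For each level of the variable, the table shows the estimated count with its SE and the lower and upper limits (LL and UL) of its 95% CI, followed by the estimated percentage with its SE and the limits of its 95% CI.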

NCHS presentation standards. The tab() function also applies the National Center for Health Statistics (NCHS) presentation standards for counts and percentages, and flags estimates if, according to the standards, they should be suppressed, footnoted, or reviewed by an analyst. The CIs that are displayed are the ones that are used by the NCHS presentation standards. Specifically, for counts, the tables show the log Student’s t 95% CI, with adaptations for complex surveys; for percentages, they show the 95% Korn and Graubard CI.

One does not need to do anything extra to perform presentation standards checking – it is performed automatically. For example, let’s tabulate PAYNOCHG:

tab("PAYNOCHG")
Expected source of payment for visit: No Charge/Charity {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL Flags
FALSE 1,034,338 48,874 942,808 1,134,754 99.8 0.2 99 100
TRUE 2,146 1,919 293 15,703 0.2 0.2 0 1 Cx
Cx: suppress count (and rate)

This table tells us that, according to the NCHS presentation standards, the estimated number of visits in which there was no charge for the visit should be suppressed due to low precision. However, the lack of a percentage flag indicates that the estimated percentage of such visits can be shown.

Drop missing values. Some variables might contain missing values (NA). Consider the following variable, which is not part of the actual survey, but was constructed specifically for this example:

tab("SPECCAT.bad")
Type of specialty (BAD - do not use) {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL
Primary care specialty 422,807 26,382 374,099 477,857 40.8 2.2 36.5 45.2
Surgical care specialty 170,714 23,333 130,514 223,297 16.5 2.3 12.2 21.5
Medical care specialty 235,502 35,527 175,049 316,831 22.7 2.9 17.2 29.1
<N/A> 207,462 12,458 184,378 233,436 20   0.8 18.5 21.6
(Checked presentation standards. Nothing to report.)

To calculate percentages based on the non-missing values only, use the drop_na argument:

tab("SPECCAT.bad", drop_na = TRUE)
Type of specialty (BAD - do not use) (knowns only) {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL
Primary care specialty 422,807 26,382 374,099 477,857 51   2.6 45.7 56.3
Surgical care specialty 170,714 23,333 130,514 223,297 20.6 2.9 15.2 26.9
Medical care specialty 235,502 35,527 175,049 316,831 28.4 3.6 21.5 36.2
(Checked presentation standards. Nothing to report.)

The above table gives percentages based only on the knowns, that is, based only on non-NA values.
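The effect of drop_na on the percentages can be checked by hand. The sketch below is plain arithmetic on the estimated counts printed in the two tables above; it is not how surveytable computes its estimates, and it does not reproduce the SEs or CIs.

```r
# Estimated counts (in thousands) copied from the tables above.
counts <- c("Primary care specialty"  = 422807,
            "Surgical care specialty" = 170714,
            "Medical care specialty"  = 235502,
            "<N/A>"                   = 207462)

# Default: the denominator includes the missing values.
pct_all <- round(100 * counts / sum(counts), 1)

# Knowns only: drop the <N/A> row before computing the denominator.
known <- counts[names(counts) != "<N/A>"]
pct_known <- round(100 * known / sum(known), 1)

pct_all
pct_known
```

The counts are the same in both tables; only the denominator of the percentages changes.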

Multiple tables. Multiple tables can be created with a single command:

tab("MDDO", "SPECCAT", "MSA")
Type of doctor (MD or DO) {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL
M.D. - Doctor of Medicine 980,280 48,388 889,842 1,079,910 94.6 0.7 93.1 95.8
D.O. - Doctor of Osteopathy 56,204 6,602 44,597 70,832 5.4 0.7 4.2 6.9
(Checked presentation standards. Nothing to report.)
Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL
Primary care specialty 521,466 31,136 463,840 586,252 50.3 2.6 45.1 55.5
Surgical care specialty 214,832 31,110 161,661 285,490 20.7 3   15.1 27.3
Medical care specialty 300,186 43,497 225,806 399,067 29   3.6 22.1 36.6
(Checked presentation standards. Nothing to report.)
Metropolitan Statistical Area Status of physician location {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL
MSA (Metropolitan Statistical Area) 973,676 50,515 879,490 1,077,947 93.9 1.7 89.6 96.8
Non-MSA 62,809 17,549 36,249 108,830 6.1 1.7 3.2 10.4
(Checked presentation standards. Nothing to report.)

Entire population

Estimate the total count for the entire population using the total() command:

total()
Total {NAMCS 2019 PUF}
Number (000) SE (000) LL (000) UL (000)
1,036,484 48,836 945,014 1,136,809
(Checked presentation standards. Nothing to report.)

Subsets or interactions

To create a table of AGER for each value of the variable SEX, type:

tab_subset("AGER", "SEX")
Patient age recode (Patient sex = Female) {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL
Under 15 years 59,958 7,206 47,318 75,974 9.9 1.2 7.6 12.6
15-24 years 41,128 4,532 33,066 51,156 6.8 0.7 5.4 8.4
25-44 years 113,708 11,461 93,256 138,646 18.8 1.6 15.8 22.1
45-64 years 175,978 16,009 147,153 210,450 29.1 1.7 25.7 32.6
65-74 years 120,099 11,066 100,171 143,992 19.8 1.5 17   22.9
75 years and over 94,173 11,085 74,682 118,751 15.6 1.5 12.8 18.7
(Checked presentation standards. Nothing to report.)
Patient age recode (Patient sex = Male) {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL
Under 15 years 57,959 7,728 44,570 75,371 13.4 1.7 10.3 17.1
15-24 years 23,728 4,344 16,457 34,210 5.5 0.8 4   7.4
25-44 years 56,562 7,277 43,861 72,942 13.1 1.3 10.7 15.8
45-64 years 133,528 12,956 110,319 161,619 30.9 1.6 27.8 34.3
65-74 years 86,766 6,767 74,409 101,176 20.1 1.5 17.3 23.1
75 years and over 72,896 6,661 60,872 87,296 16.9 1.5 14   20.2
(Checked presentation standards. Nothing to report.)

In addition to giving the long name of the variable being tabulated, the title of each table reflects the value of the subsetting variable (in this case, either Female or Male).

With the tab_subset() command, in each table (that is, in each subset), the percentages add up to 100%.

The tab_cross() function is similar – it crosses or interacts two variables and generates a table using this new variable. Thus, to create a table of the interaction of AGER and SEX, type:

tab_cross("AGER", "SEX")
(Patient age recode) x (Patient sex) {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL
Under 15 years : Female 59,958 7,206 47,318 75,974 5.8 0.7 4.5 7.3
15-24 years : Female 41,128 4,532 33,066 51,156 4   0.4 3.2 4.9
25-44 years : Female 113,708 11,461 93,256 138,646 11   1   9   13.2
45-64 years : Female 175,978 16,009 147,153 210,450 17   1.1 14.8 19.3
65-74 years : Female 120,099 11,066 100,171 143,992 11.6 1   9.7 13.7
75 years and over : Female 94,173 11,085 74,682 118,751 9.1 0.9 7.3 11.1
Under 15 years : Male 57,959 7,728 44,570 75,371 5.6 0.7 4.3 7.2
15-24 years : Male 23,728 4,344 16,457 34,210 2.3 0.4 1.6 3.2
25-44 years : Male 56,562 7,277 43,861 72,942 5.5 0.6 4.3 6.8
45-64 years : Male 133,528 12,956 110,319 161,619 12.9 1   10.9 15.1
65-74 years : Male 86,766 6,767 74,409 101,176 8.4 0.6 7.2 9.7
75 years and over : Male 72,896 6,661 60,872 87,296 7   0.6 5.9 8.3
(Checked presentation standards. Nothing to report.)

While the estimated counts produced by tab_subset() and tab_cross() are the same, the percentages are different.
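The difference comes down to the denominator. As a sketch, the percentages for the Female rows can be recovered from the printed counts with plain arithmetic (this only illustrates the idea and does not reproduce the SEs or CIs):

```r
# Estimated counts (in thousands) copied from the tables above.
female <- c(59958, 41128, 113708, 175978, 120099, 94173)
male   <- c(57959, 23728, 56562, 133528, 86766, 72896)

# tab_subset(): within each subset, the denominator is the subset total.
pct_subset_female <- round(100 * female / sum(female), 1)

# tab_cross(): the denominator is the total over all combinations.
pct_cross_female <- round(100 * female / sum(female, male), 1)

pct_subset_female
pct_cross_female
```

With tab_subset(), the percentages add up to 100% within each subset; with tab_cross(), they add up to 100% across the whole table.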

Tabulate numeric variables

The tab() and tab_subset() functions also work with numeric variables, though with such variables, the output is different. To tabulate NUMMED (number of medications), a numeric variable, type:

tab("NUMMED")
Number of medications coded {NAMCS 2019 PUF}
% known Mean SEM SD
100 3.46 0.268 4.43

As before, the table title shows the variable label (the long variable name) and the survey label.

The table shows the percentage of values that are not missing (not NA), the mean, the standard error of the mean (SEM), and the standard deviation (SD).
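For readers who know the survey package, a design-based mean and its standard error can be obtained with survey::svymean(). The sketch below assumes the survey package is installed and uses its bundled api example data, not NAMCS. It is an analogous calculation and not necessarily what tab() does internally.

```r
library(survey)

# Example data shipped with the survey package (not NAMCS).
data(api)
design <- svydesign(ids = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# Design-based mean of a numeric variable.
m <- svymean(~api00, design)

coef(m)  # the estimated mean
SE(m)    # the standard error of the mean
```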

Subsetting works too:

tab_subset("NUMMED", "AGER")
Number of medications coded (for different levels of Patient age recode) {NAMCS 2019 PUF}
Level % known Mean SEM SD
Under 15 years 100 1.58 0.168 1.75
15-24 years 100 1.64 0.112 1.7 
25-44 years 100 2.15 0.225 2.74
45-64 years 100 3.49 0.303 4.49
65-74 years 100 4.44 0.431 5.03
75 years and over 100 5.53 0.494 5.59

Perform statistical hypothesis testing

The tab_subset() function makes it easy to perform hypothesis testing by using the test argument. When the argument is TRUE, a test of association is performed. In addition, t-tests for all pairs of levels are performed as well.
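The test output below labels the test of association for categorical variables as "Pearson’s X^2: Rao & Scott adjustment". In the survey package, a test of this kind is available as survey::svychisq(). The sketch below assumes the survey package is installed and uses its bundled api example data, not NAMCS. It is shown only as a point of reference; whether surveytable calls this function with these settings is an assumption.

```r
library(survey)

# Example data shipped with the survey package (not NAMCS).
data(api)
design <- svydesign(ids = ~dnum, weights = ~pw, fpc = ~fpc, data = apiclus1)

# Test of association between two categorical variables.
# By default, svychisq() applies a Rao-Scott correction to the
# Pearson chi-squared statistic.
result <- svychisq(~sch.wide + stype, design)
result$p.value
```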

Categorical variables

Consider the relationship between AGER and SPECCAT:

tab_subset("AGER", "SPECCAT", test = TRUE)
Patient age recode (Type of specialty (Primary, Medical, Surgical) = Primary care specialty) {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL
Under 15 years 102,720 14,137 78,373 134,631 19.7 2.5 15   25.2
15-24 years 40,808 4,941 32,127 51,835 7.8 0.8 6.2 9.7
25-44 years 95,305 9,118 78,964 115,027 18.3 1.5 15.3 21.5
45-64 years 124,384 10,371 105,582 146,535 23.9 1.5 20.9 27  
65-74 years 85,504 10,210 67,581 108,182 16.4 1.6 13.4 19.8
75 years and over 72,745 9,886 55,660 95,073 14   1.6 10.9 17.4
(Checked presentation standards. Nothing to report.)
Patient age recode (Type of specialty (Primary, Medical, Surgical) = Surgical care specialty) {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL
Under 15 years 6,201 1,359 4,017 9,571 2.9 0.7 1.6 4.7
15-24 years 8,561 1,622 5,824 12,585 4   0.6 2.9 5.3
25-44 years 35,953 11,539 18,976 68,119 16.7 3.6 10.2 25.2
45-64 years 73,204 12,475 52,307 102,450 34.1 1.6 31   37.3
65-74 years 53,482 6,405 42,227 67,736 24.9 2.8 19.6 30.9
75 years and over 37,431 6,364 26,763 52,352 17.4 2.4 12.8 22.8
(Checked presentation standards. Nothing to report.)
Patient age recode (Type of specialty (Primary, Medical, Surgical) = Medical care specialty) {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL
Under 15 years 8,996 3,158 4,330 18,690 3   1   1.4 5.7
15-24 years 15,487 5,035 7,908 30,326 5.2 1.4 2.8 8.6
25-44 years 39,012 6,892 27,419 55,507 13   1.3 10.5 15.8
45-64 years 111,918 22,754 74,907 167,215 37.3 3.1 31   43.8
65-74 years 67,880 10,945 49,319 93,425 22.6 2.9 17.1 29  
75 years and over 56,894 10,806 39,008 82,980 19   3.3 12.7 26.6
(Checked presentation standards. Nothing to report.)
Association between Patient age recode and Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
p-value Flag
0 *
Pearson’s X^2: Rao & Scott adjustment. *: p-value <= 0.05
Comparison of all possible pairs of Patient age recode (Type of specialty (Primary, Medical, Surgical) = Primary care specialty) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
Under 15 years 15-24 years 0     *
Under 15 years 25-44 years 0.67 
Under 15 years 45-64 years 0.259
Under 15 years 65-74 years 0.334
Under 15 years 75 years and over 0.083
15-24 years 25-44 years 0     *
15-24 years 45-64 years 0     *
15-24 years 65-74 years 0     *
15-24 years 75 years and over 0.002 *
25-44 years 45-64 years 0.007 *
25-44 years 65-74 years 0.461
25-44 years 75 years and over 0.092
45-64 years 65-74 years 0.001 *
45-64 years 75 years and over 0     *
65-74 years 75 years and over 0.194
Design-based t-test. *: p-value <= 0.05
Comparison of all possible pairs of Patient age recode (Type of specialty (Primary, Medical, Surgical) = Surgical care specialty) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
Under 15 years 15-24 years 0.221
Under 15 years 25-44 years 0     *
Under 15 years 45-64 years 0     *
Under 15 years 65-74 years 0     *
Under 15 years 75 years and over 0     *
15-24 years 25-44 years 0     *
15-24 years 45-64 years 0     *
15-24 years 65-74 years 0     *
15-24 years 75 years and over 0     *
25-44 years 45-64 years 0     *
25-44 years 65-74 years 0.196
25-44 years 75 years and over 0.904
45-64 years 65-74 years 0.027 *
45-64 years 75 years and over 0     *
65-74 years 75 years and over 0.007 *
Design-based t-test. *: p-value <= 0.05
Comparison of all possible pairs of Patient age recode (Type of specialty (Primary, Medical, Surgical) = Medical care specialty) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
Under 15 years 15-24 years 0.112
Under 15 years 25-44 years 0     *
Under 15 years 45-64 years 0     *
Under 15 years 65-74 years 0     *
Under 15 years 75 years and over 0     *
15-24 years 25-44 years 0     *
15-24 years 45-64 years 0     *
15-24 years 65-74 years 0     *
15-24 years 75 years and over 0     *
25-44 years 45-64 years 0     *
25-44 years 65-74 years 0     *
25-44 years 75 years and over 0.139
45-64 years 65-74 years 0.009 *
45-64 years 75 years and over 0.003 *
65-74 years 75 years and over 0.4  
Design-based t-test. *: p-value <= 0.05
Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (Patient age recode = Under 15 years) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
Primary care specialty Surgical care specialty 0     *
Primary care specialty Medical care specialty 0     *
Surgical care specialty Medical care specialty 0.364
Design-based t-test. *: p-value <= 0.05
Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (Patient age recode = 15-24 years) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
Primary care specialty Surgical care specialty 0     *
Primary care specialty Medical care specialty 0.002 *
Surgical care specialty Medical care specialty 0.124
Design-based t-test. *: p-value <= 0.05
Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (Patient age recode = 25-44 years) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
Primary care specialty Surgical care specialty 0.001 *
Primary care specialty Medical care specialty 0     *
Surgical care specialty Medical care specialty 0.848
Design-based t-test. *: p-value <= 0.05
Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (Patient age recode = 45-64 years) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
Primary care specialty Surgical care specialty 0.005 *
Primary care specialty Medical care specialty 0.631
Surgical care specialty Medical care specialty 0.163
Design-based t-test. *: p-value <= 0.05
Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (Patient age recode = 65-74 years) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
Primary care specialty Surgical care specialty 0.006 *
Primary care specialty Medical care specialty 0.248
Surgical care specialty Medical care specialty 0.298
Design-based t-test. *: p-value <= 0.05
Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (Patient age recode = 75 years and over) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
Primary care specialty Surgical care specialty 0.003 *
Primary care specialty Medical care specialty 0.291
Surgical care specialty Medical care specialty 0.099
Design-based t-test. *: p-value <= 0.05

According to these tables, there is an association between physician specialty type and patient age. For instance, for patients under 15 years, there is a statistical difference between primary care physician specialty and medical care specialty. But for older patients, such as in the 45-64 age group, there is no statistical difference between the two specialty types.

As another example, consider the relationship between MRI and SPECCAT:

tab_subset("MRI", "SPECCAT", test = TRUE)
MRI (Type of specialty (Primary, Medical, Surgical) = Primary care specialty) {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL Flags
FALSE 515,172 30,724 458,304 579,096 98.8 0.5 97.2 99.6
TRUE 6,295 2,768 2,295 17,268 1.2 0.5 0.4 2.8 Cx
Cx: suppress count (and rate)
MRI (Type of specialty (Primary, Medical, Surgical) = Surgical care specialty) {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL
FALSE 207,915 30,117 156,442 276,323 96.8 0.7 95 98
TRUE 6,917 1,845 3,925 12,191 3.2 0.7 2 5
(Checked presentation standards. Nothing to report.)
MRI (Type of specialty (Primary, Medical, Surgical) = Medical care specialty) {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL Flags
FALSE 291,560 40,805 221,456 383,855 97.1 1.4 92.9 99.2 Pc
TRUE 8,626 4,768 2,451 30,364 2.9 1.4 0.8 7.1 Cx Px
Cx: suppress count (and rate); Px: suppress percent; Pc: footnote percent - complement
Association between MRI and Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
p-value Flag
0.169
Pearson’s X^2: Rao & Scott adjustment. *: p-value <= 0.05
Comparison of all possible pairs of MRI (Type of specialty (Primary, Medical, Surgical) = Primary care specialty) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
FALSE TRUE 0 *
Design-based t-test. *: p-value <= 0.05
Comparison of all possible pairs of MRI (Type of specialty (Primary, Medical, Surgical) = Surgical care specialty) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
FALSE TRUE 0 *
Design-based t-test. *: p-value <= 0.05
Comparison of all possible pairs of MRI (Type of specialty (Primary, Medical, Surgical) = Medical care specialty) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
FALSE TRUE 0 *
Design-based t-test. *: p-value <= 0.05
Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (MRI = FALSE) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
Primary care specialty Surgical care specialty 0     *
Primary care specialty Medical care specialty 0     *
Surgical care specialty Medical care specialty 0.156
Design-based t-test. *: p-value <= 0.05
Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (MRI = TRUE) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
Primary care specialty Surgical care specialty 0.856
Primary care specialty Medical care specialty 0.657
Surgical care specialty Medical care specialty 0.733
Design-based t-test. *: p-value <= 0.05

According to these tables, there is no statistical association between MRI and physician specialty. For each of the 3 specialty types, a minority of visits have MRIs. For the visits with MRIs, there was no statistical difference between specialty types.

As a general rule of thumb, since there is no statistical association between MRI and physician specialty, presenting this tabulation would not be particularly interesting, especially since the subsetting decreases the sample size and therefore also decreases the estimate reliability. Instead, it would generally make more sense to just tabulate MRI without subsetting by SPECCAT.

Numeric variables

The relationship between NUMMED and AGER:

tab_subset("NUMMED", "AGER", test = TRUE)
Number of medications coded (for different levels of Patient age recode) {NAMCS 2019 PUF}
Level % known Mean SEM SD
Under 15 years 100 1.58 0.168 1.75
15-24 years 100 1.64 0.112 1.7 
25-44 years 100 2.15 0.225 2.74
45-64 years 100 3.49 0.303 4.49
65-74 years 100 4.44 0.431 5.03
75 years and over 100 5.53 0.494 5.59
Association between Number of medications coded and Patient age recode {NAMCS 2019 PUF}
p-value Flag
0 *
Wald test. *: p-value <= 0.05
Comparison of Number of medications coded across all possible pairs of Patient age recode {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
Under 15 years 15-24 years 0.739
Under 15 years 25-44 years 0.043 *
Under 15 years 45-64 years 0     *
Under 15 years 65-74 years 0     *
Under 15 years 75 years and over 0     *
15-24 years 25-44 years 0.029 *
15-24 years 45-64 years 0     *
15-24 years 65-74 years 0     *
15-24 years 75 years and over 0     *
25-44 years 45-64 years 0     *
25-44 years 65-74 years 0     *
25-44 years 75 years and over 0     *
45-64 years 65-74 years 0.007 *
45-64 years 75 years and over 0     *
65-74 years 75 years and over 0.002 *
Design-based t-test. *: p-value <= 0.05

According to these tables, there is an association between the number of medications and age category. NUMMED is statistically similar for the “Under 15 years” and “15-24 years” AGER categories. It is statistically different for all other pairs of age categories.

Finally, let’s look at the relationship between NUMMED and SPECCAT:

tab_subset("NUMMED", "SPECCAT", test = TRUE)
Number of medications coded (for different levels of Type of specialty (Primary, Medical, Surgical)) {NAMCS 2019 PUF}
Level % known Mean SEM SD
Primary care specialty 100 3.7  0.309 4.46
Surgical care specialty 100 2.87 0.616 4.59
Medical care specialty 100 3.46 0.637 4.22
Association between Number of medications coded and Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
p-value Flag
0.52
Wald test. *: p-value <= 0.05
Comparison of Number of medications coded across all possible pairs of Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
Primary care specialty Surgical care specialty 0.254
Primary care specialty Medical care specialty 0.738
Surgical care specialty Medical care specialty 0.501
Design-based t-test. *: p-value <= 0.05

According to these tables, there is no association between the number of medications and physician specialty type. NUMMED is statistically similar for all pairs of physician specialties.

As a general rule of thumb, since there is no statistical association between the number of medications and physician specialty, presenting this tabulation would not be particularly interesting, especially since the subsetting decreases the sample size and therefore also decreases the estimate reliability. Instead, it would generally make more sense to just tabulate NUMMED without subsetting by SPECCAT.

Categorical variables (single variable)

To test whether any pair of SPECCAT levels is statistically similar or different, type:

tab("SPECCAT", test = TRUE)
Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
Level Number (000) SE (000) LL (000) UL (000) Percent SE LL UL
Primary care specialty 521,466 31,136 463,840 586,252 50.3 2.6 45.1 55.5
Surgical care specialty 214,832 31,110 161,661 285,490 20.7 3   15.1 27.3
Medical care specialty 300,186 43,497 225,806 399,067 29   3.6 22.1 36.6
(Checked presentation standards. Nothing to report.)
Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
Level 1 Level 2 p-value Flag
Primary care specialty Surgical care specialty 0     *
Primary care specialty Medical care specialty 0     *
Surgical care specialty Medical care specialty 0.168
Design-based t-test. *: p-value <= 0.05

According to this, surgical and medical care specialties are statistically similar, and are statistically different from primary care.

Calculate rates

A rate is a count estimate from the survey in question divided by a population size, which is assumed to be known. For example, the number of physician visits per 100 people in the population is a rate: the number of physician visits is estimated from the namcs2019sv survey, while the number of people in the population comes from another source.

To calculate rates, in addition to the survey, we need a source of information on population size. You would typically use a function such as read.csv() to load the population figures and get them into the correct format. The surveytable package comes with an object called uspop2019 that contains several population figures for use in these examples.

Let’s examine uspop2019:

class(uspop2019)
#> [1] "list"
names(uspop2019)
#> [1] "total"       "MSA"         "AGER"        "Age group"   "SEX"        
#> [6] "AGER x SEX"  "Age group 5"

The overall population size for the country as a whole is:

uspop2019$total
#> [1] 323186697

Once we have the overall population size, the overall rate is:

total_rate(uspop2019$total)
Total (rate per 100 population) {NAMCS 2019 PUF}
Rate SE LL UL
320.7 15.1 292.4 351.7
(Checked presentation standards. Nothing to report.)
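The point estimate of the overall rate can be checked by hand from the figures printed earlier. This is plain arithmetic, not the package's internal calculation, and it does not reproduce the SE or the CI.

```r
# Figures copied from the output above.
visits_thousands <- 1036484   # estimated number of visits, in thousands
population <- 323186697       # uspop2019$total

# Rate per 100 population: estimated count divided by population size, times 100.
rate_per_100 <- 100 * (visits_thousands * 1000) / population
round(rate_per_100, 1)
```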

To calculate the rates for a particular variable, we need to provide a data frame with a column called Level that matches the levels of the variable in the survey, and a column called Population that gives the size of the population for that level.

For example, for AGER, this data frame is as follows:

uspop2019$AGER
#>               Level Population
#> 1    Under 15 years   60526656
#> 2       15-24 years   41718700
#> 3       25-44 years   85599410
#> 4       45-64 years   82562049
#> 5       65-74 years   31260202
#> 6 75 years and over   21519680

Now that we have the appropriate population figures, the rates table is obtained by typing:

tab_rate("AGER", uspop2019$AGER)
Patient age recode (rate per 100 population) {NAMCS 2019 PUF}
Level Rate SE LL UL
Under 15 years 194.8 23.3 154   246.4
15-24 years 155.5 16.8 125.6 192.5
25-44 years 198.9 16.3 169.3 233.7
45-64 years 374.9 28.2 323.4 434.6
65-74 years 661.8 46   577.3 758.5
75 years and over 776.4 70.5 649.4 928.1
(Checked presentation standards. Nothing to report.)
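As a sketch, a population data frame in the required format (columns Level and Population) can be built by hand, and the point estimates in the Rate column can be checked with plain arithmetic. The name mypop is made up for this example; the population figures are copied from uspop2019$AGER and the counts from the tab("AGER") output earlier in this tutorial.

```r
# Population figures in the format that tab_rate() expects:
# a column called Level and a column called Population.
mypop <- data.frame(
  Level = c("Under 15 years", "15-24 years", "25-44 years",
            "45-64 years", "65-74 years", "75 years and over"),
  Population = c(60526656, 41718700, 85599410,
                 82562049, 31260202, 21519680)
)

# Estimated visit counts (in thousands), copied from tab("AGER").
visits_thousands <- c(117917, 64856, 170271, 309506, 206866, 167069)

# Rate per 100 population for each level.
rate_per_100 <- round(100 * (visits_thousands * 1000) / mypop$Population, 1)
data.frame(Level = mypop$Level, Rate = rate_per_100)
```

In practice the Level values must match the levels of the survey variable exactly, as noted above.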

To calculate the rates for one variable (AGER) by another variable (SEX), we need population figures in the following format:

uspop2019$`AGER x SEX`
#>                Level Subset Population
#> 1     Under 15 years Female   29604762
#> 2        15-24 years Female   20730118
#> 3        25-44 years Female   43192143
#> 4        45-64 years Female   42508901
#> 5        65-74 years Female   16673240
#> 6  75 years and over Female   12421444
#> 7     Under 15 years   Male   30921894
#> 8        15-24 years   Male   20988582
#> 9        25-44 years   Male   42407267
#> 10       45-64 years   Male   40053148
#> 11       65-74 years   Male   14586962
#> 12 75 years and over   Male    9098236
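
A two-variable population table follows the same idea, with an extra Subset column. As a minimal base-R sketch (the object names are hypothetical; the figures are copied from the output above), the following builds such a data frame and confirms that summing over Subset recovers the single-variable AGER populations.

```r
ager_levels <- c("Under 15 years", "15-24 years", "25-44 years",
                 "45-64 years", "65-74 years", "75 years and over")

# Hypothetical hand-built table; figures copied from uspop2019$`AGER x SEX` above.
pop_ager_sex <- data.frame(
  Level = rep(ager_levels, times = 2),
  Subset = rep(c("Female", "Male"), each = length(ager_levels)),
  Population = c(29604762, 20730118, 43192143, 42508901, 16673240, 12421444,
                 30921894, 20988582, 42407267, 40053148, 14586962, 9098236)
)

# Sum over Subset within each Level. tapply() sorts the group names,
# so index by ager_levels to restore the original order.
by_level <- tapply(pop_ager_sex$Population, pop_ager_sex$Level, sum)[ager_levels]
by_level
```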

With this data frame, the rates table is obtained by typing:

tab_subset_rate("AGER", "SEX", uspop2019$`AGER x SEX`)
Patient age recode (Patient sex = Female) (rate per 100 population) {NAMCS 2019 PUF}
Level               Rate    SE     LL     UL
Under 15 years     202.5  24.3  159.8  256.6
15-24 years        198.4  21.9  159.5  246.8
25-44 years        263.3  26.5  215.9    321
45-64 years          414  37.7  346.2  495.1
65-74 years        720.3  66.4  600.8  863.6
75 years and over  758.1  89.2  601.2    956
(Checked presentation standards. Nothing to report.)
Patient age recode (Patient sex = Male) (rate per 100 population) {NAMCS 2019 PUF}
Level               Rate    SE     LL     UL
Under 15 years     187.4    25  144.1  243.7
15-24 years        113.1  20.7   78.4    163
25-44 years        133.4  17.2  103.4    172
45-64 years        333.4  32.3  275.4  403.5
65-74 years        594.8  46.4  510.1  693.6
75 years and over  801.2  73.2  669.1  959.5
(Checked presentation standards. Nothing to report.)

Create or modify variables

In some situations, it might be necessary to modify survey variables, or to create new ones. This section describes how to do this.

Convert factor to logical. The variable MAJOR (major reason for this visit) has several levels.

tab("MAJOR")
Major reason for this visit {NAMCS 2019 PUF}
Level                                 Number (000)  SE (000)  LL (000)   UL (000)  Percent    SE    LL    UL
Blank                                       15,887     3,354    10,335     24,419      1.5   0.3     1   2.3
New problem (less than 3 mos. onset)       275,014    19,691   238,955    316,514     26.5   1.5  23.7  29.5
Chronic problem, routine                   380,910    35,080   317,916    456,386     36.8   2.5  31.8  41.9
Chronic problem, flare-up                   74,017     9,329    57,706     94,939      7.1   0.9   5.5   9.1
Pre-surgery                                 12,864     2,151     9,188     18,010      1.2   0.2   0.9   1.7
Post-surgery                                54,170     6,749    42,350     69,289      5.2   0.7     4   6.7
Preventive care                            223,624    18,520   190,068    263,103     21.6   1.7  18.3  25.1
(Checked presentation standards. Nothing to report.)

Notice that one of the levels is called "Preventive care". Suppose an analyst is only interested in whether or not a visit is a preventive care visit – they are not interested in the other visit types. They can create a new variable called Preventive care visits that is TRUE for preventive care visits and FALSE for all other types of visits, as follows:

var_case("Preventive care visits", "MAJOR", "Preventive care")
tab("Preventive care visits")
Preventive care visits {NAMCS 2019 PUF}
Level  Number (000)  SE (000)  LL (000)   UL (000)  Percent    SE    LL    UL
FALSE       812,861    45,220   728,841    906,566     78.4   1.7  74.9  81.7
TRUE        223,624    18,520   190,068    263,103     21.6   1.7  18.3  25.1
(Checked presentation standards. Nothing to report.)

This creates a logical variable that is TRUE for preventive care visits and then tabulates it. When using the var_case() function, specify the name of the new logical variable to be created, an existing factor variable, and one or more levels of the factor variable that should be set to TRUE in the logical variable.

Thus, if an analyst is interested in surgery-related visits, which are indicated by two different levels of MAJOR, they could type:

var_case("Surgery-related visits"
  , "MAJOR"
  , c("Pre-surgery", "Post-surgery"))
tab("Surgery-related visits")
Surgery-related visits {NAMCS 2019 PUF}
Level  Number (000)  SE (000)  LL (000)   UL (000)  Percent    SE    LL    UL
FALSE       969,451    47,976   879,793  1,068,246     93.5   0.8  91.9  94.9
TRUE         67,034     7,810    53,273     84,348      6.5   0.8   5.1   8.1
(Checked presentation standards. Nothing to report.)
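
Conceptually, the result is a logical vector that is TRUE wherever the factor takes one of the listed levels. A rough base-R analogue on a toy factor is sketched below; it is not the package's implementation, and major and surgery_related are made-up names.

```r
# Toy factor standing in for a survey variable such as MAJOR.
major <- factor(c("Pre-surgery", "Preventive care", "Post-surgery",
                  "New problem", "Preventive care"))

# TRUE for any of the listed levels, FALSE otherwise.
surgery_related <- major %in% c("Pre-surgery", "Post-surgery")
surgery_related
#> [1]  TRUE FALSE  TRUE FALSE FALSE

# Unweighted counts of FALSE and TRUE in the toy data.
table(surgery_related)
```

Unlike tab(), this toy tabulation is unweighted, so it says nothing about population totals.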

Collapse levels. The variable PRIMCARE (whether the physician is this patient’s primary care provider) has levels Unknown and Blank, among others.

tab("PRIMCARE")
Are you the patient’s primary care provider? {NAMCS 2019 PUF}
Level    Number (000)  SE (000)  LL (000)   UL (000)  Percent    SE    LL    UL  Flags
Blank           1,150       478       440      3,005      0.1     0     0   0.2  Cx
Unknown        39,519     9,507    24,520     63,692      3.8   0.9   2.3     6
Yes           383,481    28,555   331,362    443,798       37   2.6  31.9  42.3
No            612,335    43,282   533,050    703,413     59.1   2.5  53.9  64.1
Cx: suppress count (and rate)

To collapse Unknown and Blank into a single level, type:

var_collapse("PRIMCARE", "Unknown if PCP", c("Unknown", "Blank"))
tab("PRIMCARE")
Are you the patient’s primary care provider? {NAMCS 2019 PUF}
Level           Number (000)  SE (000)  LL (000)   UL (000)  Percent    SE    LL    UL
Unknown if PCP        40,669     9,479    25,619     64,560      3.9   0.9   2.4   6.1
Yes                  383,481    28,555   331,362    443,798       37   2.6  31.9  42.3
No                   612,335    43,282   533,050    703,413     59.1   2.5  53.9  64.1
(Checked presentation standards. Nothing to report.)
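
For readers who know base R, collapsing levels is similar to giving several factor levels the same name. The following sketch shows that idea on a toy factor; primcare is a made-up vector, and this is an analogy rather than a description of how var_collapse() works internally.

```r
# Toy factor standing in for a survey variable such as PRIMCARE.
primcare <- factor(c("Yes", "Unknown", "No", "Blank", "Yes"),
                   levels = c("Blank", "Unknown", "Yes", "No"))

# Assigning the same name to several levels merges them into one level.
levels(primcare)[levels(primcare) %in% c("Unknown", "Blank")] <- "Unknown if PCP"

levels(primcare)
#> [1] "Unknown if PCP" "Yes"            "No"

# Unweighted counts per level in the toy data.
table(primcare)
```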

Convert numeric to factor. The variable AGE is numeric.

tab("AGE")
Patient age in years {NAMCS 2019 PUF}
% known Mean SEM SD
100 51 1.04 24.3

To create a new variable of age categories based on AGE, type:

var_cut("Age group", "AGE"
        , c(-Inf, 0, 4, 14, 64, Inf)
        , c("Under 1", "1-4", "5-14", "15-64", "65 and over") )
tab("Age group")
Age group {NAMCS 2019 PUF}
Level        Number (000)  SE (000)  LL (000)   UL (000)  Percent    SE    LL    UL
Under 1            31,148     5,282    22,269     43,566        3   0.5   2.1   4.1
1-4                38,240     5,444    28,864     50,662      3.7   0.5   2.7   4.8
5-14               48,529     5,741    38,430     61,282      4.7   0.5   3.7   5.9
15-64             544,632    36,082   478,254    620,223     52.5     2  48.6  56.5
65 and over       373,935    24,523   328,777    425,296     36.1   1.9  32.3    40
(Checked presentation standards. Nothing to report.)
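
Base R's cut() takes cut points and labels in a similar way, and it can be used to preview how a set of cut points divides a numeric variable. The sketch below relies on cut()'s default of right-closed intervals, which is also what the labels in the example above suggest (ages 1 through 4 fall into "1-4"). The vector age is made-up toy data, and nothing here is a claim about how var_cut() is implemented.

```r
# Toy ages chosen to sit on and around the cut points.
age <- c(0, 1, 4, 5, 14, 15, 64, 65, 90)

age_group <- cut(age,
                 breaks = c(-Inf, 0, 4, 14, 64, Inf),
                 labels = c("Under 1", "1-4", "5-14", "15-64", "65 and over"))

# Pair each age with its category to see where the boundaries fall.
data.frame(age, age_group)
```

Checking boundary values such as 4, 14, and 64 this way helps catch off-by-one mistakes in the cut points before tabulating.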

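In the var_cut() command, specify, in this order, the name of the new categorical variable, the name of the existing numeric variable, the cut points, and the category labels. As the labels in this example suggest, each interval includes its right endpoint: ages up to 0 fall into "Under 1", ages 1 through 4 into "1-4", and so on.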

Check whether any variable is true. For a series of logical variables, you can check whether any of them are TRUE using the var_any() command.

A physician visit is considered to be an “imaging services” visit if it had any of a number of imaging services ordered or provided. Imaging services are indicated using logical variables, such as MRI and XRAY. To create the Imaging services variable, type:

var_any("Imaging services"
  , c("ANYIMAGE", "BONEDENS", "CATSCAN", "ECHOCARD", "OTHULTRA"
  , "MAMMO", "MRI", "XRAY", "OTHIMAGE"))
tab("Imaging services")
Imaging services {NAMCS 2019 PUF}
Level  Number (000)  SE (000)  LL (000)   UL (000)  Percent    SE    LL    UL
FALSE       901,115    43,298   820,085    990,151     86.9   1.1  84.6  89.1
TRUE        135,369    13,574   111,134    164,890     13.1   1.1  10.9  15.4
(Checked presentation standards. Nothing to report.)
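
The underlying idea is a row-wise "or" across several logical variables. A small base-R sketch of that idea, using a made-up data frame called visits with three of the imaging variables, might look like this. It is an illustration only and does not show how var_any() is implemented.

```r
# Toy data: one row per visit, one logical column per imaging service.
visits <- data.frame(
  MRI   = c(FALSE, TRUE,  FALSE, FALSE),
  XRAY  = c(FALSE, FALSE, TRUE,  FALSE),
  MAMMO = c(FALSE, TRUE,  FALSE, FALSE)
)

# Element-wise OR across the columns: TRUE if any service is TRUE for that row.
visits$imaging <- Reduce(`|`, visits[c("MRI", "XRAY", "MAMMO")])
visits$imaging
#> [1] FALSE  TRUE  TRUE FALSE
```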

Interact variables. The tab_cross() function creates a table of an interaction of two variables, but it does not save the interacted variable. To create the interacted variable, use the var_cross() command:

var_cross("Age x Sex", "AGER", "SEX")

Specify the name of the new variable as well as names of the two variables to interact.
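
The interacted variable has one level for each combination of the levels of the two inputs. In base R, interaction() builds a comparable factor. The sketch below uses toy vectors and a " : " separator chosen to resemble the level names that surveytable prints for such variables; it is an analogy rather than the package's implementation, and the ordering of levels may differ.

```r
# Toy factors standing in for AGER and SEX.
ager <- factor(c("Under 15 years", "15-24 years", "15-24 years"),
               levels = c("Under 15 years", "15-24 years"))
sex <- factor(c("Female", "Male", "Female"), levels = c("Female", "Male"))

# Combine the two factors into one.
age_x_sex <- interaction(ager, sex, sep = " : ")
as.character(age_x_sex)

# One level per combination: 2 x 2 = 4, even though only 3 occur in the toy data.
nlevels(age_x_sex)
#> [1] 4
```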

Copy a variable. Create a new variable that is a copy of another variable using var_copy(). You can modify the copy, while the original remains unchanged. For example:

var_copy("Age group", "AGER")
#> Warning in var_copy("Age group", "AGER"): Age group: overwriting a variable
#> that already exists.
var_collapse("Age group", "65+", c("65-74 years", "75 years and over"))
var_collapse("Age group", "25-64", c("25-44 years", "45-64 years"))
tab("AGER", "Age group")
Patient age recode {NAMCS 2019 PUF}
Level              Number (000)  SE (000)  LL (000)   UL (000)  Percent    SE    LL    UL
Under 15 years          117,917    14,097    93,229    149,142     11.4   1.3   8.9  14.2
15-24 years              64,856     7,018    52,387     80,292      6.3   0.6   5.1   7.5
25-44 years             170,271    13,966   144,925    200,049     16.4   1.1  14.3  18.8
45-64 years             309,506    23,290   266,994    358,787     29.9   1.4  27.2  32.6
65-74 years             206,866    14,366   180,481    237,109       20   1.2  17.6  22.5
75 years and over       167,069    15,179   139,746    199,735     16.1   1.3  13.7  18.8
(Checked presentation standards. Nothing to report.)
Age group {NAMCS 2019 PUF}
Level           Number (000)  SE (000)  LL (000)   UL (000)  Percent    SE    LL    UL
Under 15 years       117,917    14,097    93,229    149,142     11.4   1.3   8.9  14.2
15-24 years           64,856     7,018    52,387     80,292      6.3   0.6   5.1   7.5
25-64                479,777    32,175   420,624    547,247     46.3   1.8  42.7  49.9
65+                  373,935    24,523   328,777    425,296     36.1   1.9  32.3    40
(Checked presentation standards. Nothing to report.)

Here, the AGER variable remains unchanged, while the Age group variable has fewer categories.

The variables data frame. Recall that survey objects have an element called variables, which is a data frame that contains the survey variables.

class(namcs2019sv$variables)
#> [1] "data.frame"

More advanced users can create or modify variables in the variables data frame directly. After they modify these variables, they must call set_survey() again. For example:

namcs2019sv$variables$`Medicare and Medicaid` = ( 
  namcs2019sv$variables$PAYMCARE & namcs2019sv$variables$PAYMCAID)
set_survey(namcs2019sv)
#> * To adjust how counts are rounded, see ?set_count_int
#>                        _                                                                    
#> Survey name            NAMCS 2019 PUF                                                       
#> Number of variables    34                                                                   
#> Number of observations 8250                                                                 
#> Info1                  Stratified 1 - level Cluster Sampling design (with replacement)      
#> Info2                  With (398) clusters.                                                 
#> Info3                  survey::svydesign(ids = ~CPSUM, strata = ~CSTRATM, weights = ~PATWT, 
#> Info4                      data = namcs2019sv_df)
tab("Medicare and Medicaid")
Medicare and Medicaid {NAMCS 2019 PUF}
Level  Number (000)  SE (000)  LL (000)   UL (000)  Percent    SE    LL    UL
FALSE     1,016,202    47,395   927,389  1,113,520       98   0.5  96.9  98.9
TRUE         20,282     5,177    12,120     33,941        2   0.5   1.1   3.1
(Checked presentation standards. Nothing to report.)

Note, however, that the var_*() functions do not modify the survey object specified in set_survey() directly. Rather, they modify variables inside the following data frame: surveytable:::env$survey$variables. If you use the var_*() functions and then need access to the modified / created variables, that’s where you should look. For example:

var_cross("newvar", "MAJOR", "AGER")
# This should give NULL. The new variable does not exist here:
namcs2019sv$variables$newvar
#> NULL
# Rather, the new variable is here:
str(surveytable:::env$survey$variables$newvar)
#>  Factor w/ 42 levels "Blank : Under 15 years",..: 17 31 24 38 38 24 30 31 24 38 ...
#>  - attr(*, "label")= chr "(Major reason for this visit) x (Patient age recode)"

Save the output

The tab* and total* functions have an argument called csv that specifies the name of a comma-separated values (CSV) file to save the output to. Alternatively, you can name the default CSV output file using the set_output() function. For example, the following code directs surveytable to send all future output to a CSV file, creates some tables, and then turns off sending output to the file:

tmp_file = tempfile(fileext = ".csv")
suppressMessages( set_output(csv = tmp_file) )
tab("MDDO", "SPECCAT", "MSA")
Type of doctor (MD or DO) {NAMCS 2019 PUF}
Level                        Number (000)  SE (000)  LL (000)   UL (000)  Percent    SE    LL    UL
M.D. - Doctor of Medicine         980,280    48,388   889,842  1,079,910     94.6   0.7  93.1  95.8
D.O. - Doctor of Osteopathy        56,204     6,602    44,597     70,832      5.4   0.7   4.2   6.9
(Checked presentation standards. Nothing to report.)
Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
Level                    Number (000)  SE (000)  LL (000)   UL (000)  Percent    SE    LL    UL
Primary care specialty        521,466    31,136   463,840    586,252     50.3   2.6  45.1  55.5
Surgical care specialty       214,832    31,110   161,661    285,490     20.7     3  15.1  27.3
Medical care specialty        300,186    43,497   225,806    399,067       29   3.6  22.1  36.6
(Checked presentation standards. Nothing to report.)
Metropolitan Statistical Area Status of physician location {NAMCS 2019 PUF}
Level                                Number (000)  SE (000)  LL (000)   UL (000)  Percent    SE    LL    UL
MSA (Metropolitan Statistical Area)       973,676    50,515   879,490  1,077,947     93.9   1.7  89.6  96.8
Non-MSA                                    62,809    17,549    36,249    108,830      6.1   1.7   3.2  10.4
(Checked presentation standards. Nothing to report.)
set_output(csv = "")
#> * Turning off CSV output.
#> * ?set_output for other options.

If the tabulation functions are called from within an R Markdown notebook, they produce HTML tables. This makes it easy to incorporate the output of the surveytable package directly into documents, presentations, “shiny” web apps, and other output types.

Finally, the tabulation functions return the tables that they produce. More advanced analysts can use this functionality to integrate surveytable into other programming tasks.