Preliminaries

Concepts

There are two important concepts that we need to learn and distinguish:

A data frame is a standard way of storing data in R. A data frame is rectangular data. Variables are in columns, observations are in rows. Example:

head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

A data frame, in an of itself, cannot represent a complex survey. This is because, just by looking at a data frame, R does not know what the sampling weights are, what the strata are, etc. Even if the variables that represent the sampling weights, etc, are part of the data frame, just by looking at the data frame, R does not know which variable represents the weights or other survey design variables.

You can get a data frame into R in many different ways. If your data is currently in a comma-separated values (CSV) file, you can use read.csv(). If it’s in a SAS file, you can use a package like haven or importsurvey. If it’s already in R format, use readRDS(), and so on.

A survey object is an object that describes a survey. It tells R what the sampling weights are, what the strata are, and so on. A data frame can be converted into a survey object using the survey::svydesign() function; if a survey uses replicate weights, the survey::svrepdesign() function should be used.

Generally speaking, you only need to convert a data frame to a survey object once. After it has been converted, you can save it with saveRDS() (or similar). In the future, you can load it with readRDS(). You do not need to re-convert a data frame to a survey object every time.

NAMCS

Examples in this tutorial use a survey called the National Ambulatory Medical Care Survey (NAMCS) 2019 Public Use File (PUF). NAMCS is “an annual nationally representative sample survey of visits to non-federal office-based patient care physicians, excluding anesthesiologists, radiologists, and pathologists.” Note that the unit of observation is visits, not patients – this distinction is important since a single patient can make multiple visits.

The surveytable package comes with a data frame of selected variables from NAMCS, called namcs2019sv_df (sv = selected variables; df = data frame). The survey object of this survey is called namcs2019sv.

namcs2019sv is the object that we analyze. You really only need namcs2019sv. The reason that the package has namcs2019sv_df is to illustrate how to convert the data frame to the survey object.

More concepts

When importing data from another source, such as SAS or CSV, analysts should be aware of the standard way in which variables are handled in R.

Specifically, categorical variables should be stored as factor.
While true / false variables could be stored as factor as well, some programming tasks are easier if they are stored as logical.
Unknown values should be stored as missing (NA). If a variable contains “special values”, such as a negative value indicating that the age is missing, those “special values” need to be converted to NA.

Variables in namcs2019sv_df are already stored correctly. Thus,

AGER (patient’s age group) is a factor variable;
PAYNOCHG (which indicates whether there was no charge for the physician visit) is a logical variable; and
AGE (patient’s age in years) is a numeric variable.

library("surveytable")

class(namcs2019sv_df$AGER)
#> [1] "factor"
class(namcs2019sv_df$PAYNOCHG)
#> [1] "logical"
class(namcs2019sv_df$AGE)
#> [1] "numeric"

Create a survey object

As seen below, tables produced by surveytable are clearer if either the variable names themselves are descriptive, or if the variables have the "label" attribute that is descriptive. In namcs2019sv_df, all variables already have the "label" attribute set. For example, while the variable name AGE itself is not very descriptive, the variable does have a more descriptive "label" attribute:

attr(namcs2019sv_df$AGE, "label")
#> [1] "Patient age in years"

Documentation for the NAMCS survey provides the names of the survey design variables. Specifically, in NAMCS,

cluster ID’s, also known as primary sampling units (PSU’s), are given in CPSUM;
strata are given in CSTRATM; and
sampling weights are given in PATWT.

Thus, the namcs2019sv_df data frame can be turned into a survey object as follows:

mysurvey = survey::svydesign(ids = ~ CPSUM
  , strata = ~ CSTRATM
  , weights = ~ PATWT
  , data = namcs2019sv_df)

Tables produced by surveytable are clearer if either the name of the survey object is descriptive, or if the object has the "label" attribute that is descriptive. Let’s set this attribute for mysurvey:

attr(mysurvey, "label") = "NAMCS 2019 PUF"

The mysurvey object should now be identical to namcs2019sv. Let’s verify this:

identical(namcs2019sv, mysurvey)
#> [1] TRUE

We have just successfully created a survey object from a data frame.

Begin analysis

First, specify the survey object that you’d like to analyze.

set_survey(namcs2019sv)
#> * To adjust how counts are rounded, see ?set_count_int
#>                        _                                                                    
#> Survey name            NAMCS 2019 PUF                                                       
#> Number of variables    33                                                                   
#> Number of observations 8250                                                                 
#> Info1                  Stratified 1 - level Cluster Sampling design (with replacement)      
#> Info2                  With (398) clusters.                                                 
#> Info3                  survey::svydesign(ids = ~CPSUM, strata = ~CSTRATM, weights = ~PATWT, 
#> Info4                      data = namcs2019sv_df)

Check the survey label, survey design variables, and the number of observations to verify that it all looks correct.

List variables

The var_list() function lists the variables in the survey. To avoid unintentionally listing all the variables in a survey, which can be many, the starting characters of variable names are specified. For example, to list the variables that start with the letters age, type:

var_list("age")

Variables beginning with ‘age’ {NAMCS 2019 PUF}
Variable	Class	Long name
AGE	numeric	Patient age in years
AGER	factor	Patient age recode

The table lists

variable name;
class, which is the type of variable; and
variable label, which is the long name of the variable.

Common classes are factor (categorical variable), logical (yes / no variable), and numeric.

Tabulate categorical and logical variables

The main function of the surveytable package is tab(), which tabulates variables. It operates on categorical and logical variables, and presents both estimated counts, with their standard errors (SEs) and 95% confidence intervals (CIs), and percentages, with their SEs and CIs. For example, to tabulate AGER, type:

tab("AGER")

Patient age recode {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Under 15 years	117,917	14,097	93,229	149,142	11.4	1.3	8.9	14.2
15-24 years	64,856	7,018	52,387	80,292	6.3	0.6	5.1	7.5
25-44 years	170,271	13,966	144,925	200,049	16.4	1.1	14.3	18.8
45-64 years	309,506	23,290	266,994	358,787	29.9	1.4	27.2	32.6
65-74 years	206,866	14,366	180,481	237,109	20	1.2	17.6	22.5
75 years and over	167,069	15,179	139,746	199,735	16.1	1.3	13.7	18.8
(Checked presentation standards. Nothing to report.)

The table title shows the variable label (the long variable name) and the survey label.

For each level of the variable, the table shows:

the estimated count, its standard error, and its 95% confidence interval; and
the estimated percentage, its standard error, and its 95% confidence interval.

NCHS presentation standards. The tab() function also applies the National Center for Health Statistics (NCHS) presentation standards for counts and percentages, and flags estimates if, according to the standards, they should be suppressed, footnoted, or reviewed by an analyst. The CIs that are displayed are the ones that are used by the NCHS presentation standards. Specifically, for counts, the tables show the log Student’s t 95% CI, with adaptations for complex surveys; for percentages, they show the 95% Korn and Graubard CI.

One does not need to do anything extra to perform presentation standards checking – it is performed automatically. For example, let’s tabulate PAYNOCHG:

tab("PAYNOCHG")

Expected source of payment for visit: No Charge/Charity {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL	Flags
FALSE	1,034,338	48,874	942,808	1,134,754	99.8	0.2	99	100
TRUE	2,146	1,919	293	15,703	0.2	0.2	0	1	Cx
Cx: suppress count (and rate)

This table tells us that, according to the NCHS presentation standards, the estimated number of visits in which there was no charge for the visit should be suppressed due to low precision. However, the lack of a percentage flag indicates that the estimated percentage of such visits can be shown.

Drop missing values. Some variables might contain missing values (NA). Consider the following variable, which is not part of the actual survey, but was constructed specifically for this example:

tab("SPECCAT.bad")

Type of specialty (BAD - do not use) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Primary care specialty	422,807	26,382	374,099	477,857	40.8	2.2	36.5	45.2
Surgical care specialty	170,714	23,333	130,514	223,297	16.5	2.3	12.2	21.5
Medical care specialty	235,502	35,527	175,049	316,831	22.7	2.9	17.2	29.1
<N/A>	207,462	12,458	184,378	233,436	20	0.8	18.5	21.6
(Checked presentation standards. Nothing to report.)

To calculate percentages based on the non-missing values only, use the drop_na argument:

tab("SPECCAT.bad", drop_na = TRUE)

Type of specialty (BAD - do not use) (knowns only) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Primary care specialty	422,807	26,382	374,099	477,857	51	2.6	45.7	56.3
Surgical care specialty	170,714	23,333	130,514	223,297	20.6	2.9	15.2	26.9
Medical care specialty	235,502	35,527	175,049	316,831	28.4	3.6	21.5	36.2
(Checked presentation standards. Nothing to report.)

The above table gives percentages based only on the knowns, that is, based only on non-NA values.

Multiple tables. Multiple tables can be created with a single command:

tab("MDDO", "SPECCAT", "MSA")

Type of doctor (MD or DO) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
M.D. - Doctor of Medicine	980,280	48,388	889,842	1,079,910	94.6	0.7	93.1	95.8
D.O. - Doctor of Osteopathy	56,204	6,602	44,597	70,832	5.4	0.7	4.2	6.9
(Checked presentation standards. Nothing to report.)

Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Primary care specialty	521,466	31,136	463,840	586,252	50.3	2.6	45.1	55.5
Surgical care specialty	214,832	31,110	161,661	285,490	20.7	3	15.1	27.3
Medical care specialty	300,186	43,497	225,806	399,067	29	3.6	22.1	36.6
(Checked presentation standards. Nothing to report.)

Metropolitan Statistical Area Status of physician location {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
MSA (Metropolitan Statistical Area)	973,676	50,515	879,490	1,077,947	93.9	1.7	89.6	96.8
Non-MSA	62,809	17,549	36,249	108,830	6.1	1.7	3.2	10.4
(Checked presentation standards. Nothing to report.)

Entire population

Estimate the total count for the entire population using the total() command:

total()

Total {NAMCS 2019 PUF}
Number (000)	SE (000)	LL (000)	UL (000)
1,036,484	48,836	945,014	1,136,809
(Checked presentation standards. Nothing to report.)

Subsets or interactions

To create a table of AGER for each value of the variable SEX, type:

tab_subset("AGER", "SEX")

Patient age recode (Patient sex = Female) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Under 15 years	59,958	7,206	47,318	75,974	9.9	1.2	7.6	12.6
15-24 years	41,128	4,532	33,066	51,156	6.8	0.7	5.4	8.4
25-44 years	113,708	11,461	93,256	138,646	18.8	1.6	15.8	22.1
45-64 years	175,978	16,009	147,153	210,450	29.1	1.7	25.7	32.6
65-74 years	120,099	11,066	100,171	143,992	19.8	1.5	17	22.9
75 years and over	94,173	11,085	74,682	118,751	15.6	1.5	12.8	18.7
(Checked presentation standards. Nothing to report.)

Patient age recode (Patient sex = Male) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Under 15 years	57,959	7,728	44,570	75,371	13.4	1.7	10.3	17.1
15-24 years	23,728	4,344	16,457	34,210	5.5	0.8	4	7.4
25-44 years	56,562	7,277	43,861	72,942	13.1	1.3	10.7	15.8
45-64 years	133,528	12,956	110,319	161,619	30.9	1.6	27.8	34.3
65-74 years	86,766	6,767	74,409	101,176	20.1	1.5	17.3	23.1
75 years and over	72,896	6,661	60,872	87,296	16.9	1.5	14	20.2
(Checked presentation standards. Nothing to report.)

In addition to giving the long name of the variable being tabulated, the title of each table reflects the value of the subsetting variable (in this case, either Female or Male).

With the tab_subset() command, in each table (that is, in each subset), the percentages add up to 100%.

The tab_cross() function is similar – it crosses or interacts two variables and generates a table using this new variable. Thus, to create a table of the interaction of AGER and SEX, type:

tab_cross("AGER", "SEX")

(Patient age recode) x (Patient sex) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Under 15 years : Female	59,958	7,206	47,318	75,974	5.8	0.7	4.5	7.3
15-24 years : Female	41,128	4,532	33,066	51,156	4	0.4	3.2	4.9
25-44 years : Female	113,708	11,461	93,256	138,646	11	1	9	13.2
45-64 years : Female	175,978	16,009	147,153	210,450	17	1.1	14.8	19.3
65-74 years : Female	120,099	11,066	100,171	143,992	11.6	1	9.7	13.7
75 years and over : Female	94,173	11,085	74,682	118,751	9.1	0.9	7.3	11.1
Under 15 years : Male	57,959	7,728	44,570	75,371	5.6	0.7	4.3	7.2
15-24 years : Male	23,728	4,344	16,457	34,210	2.3	0.4	1.6	3.2
25-44 years : Male	56,562	7,277	43,861	72,942	5.5	0.6	4.3	6.8
45-64 years : Male	133,528	12,956	110,319	161,619	12.9	1	10.9	15.1
65-74 years : Male	86,766	6,767	74,409	101,176	8.4	0.6	7.2	9.7
75 years and over : Male	72,896	6,661	60,872	87,296	7	0.6	5.9	8.3
(Checked presentation standards. Nothing to report.)

While the estimated counts produced by tab_subset() and tab_cross() are the same, the percentages are different.

With the tab_subset() command, within each table (that is, within each subset), the percentages add up to 100%.
On the other hand, with tab_cross(), the percentages across the entire population add up to 100%.

Tabulate numeric variables

The tab() and tab_subset() functions also work with numeric variables, though with such variables, the output is different. To tabulate NUMMED (number of medications), a numeric variable, type:

tab("NUMMED")

Number of medications coded {NAMCS 2019 PUF}
% known	Mean	SEM	SD
100	3.46	0.268	4.43

As before, the table title shows the variable label (the long variable name) and the survey label.

The table shows the percentage of values that are not missing (not NA), the mean, the standard error of the mean (SEM), and the standard deviation (SD).

Subsetting works too:

tab_subset("NUMMED", "AGER")

Number of medications coded (for different levels of Patient age recode) {NAMCS 2019 PUF}
Level	% known	Mean	SEM	SD
Under 15 years	100	1.58	0.168	1.75
15-24 years	100	1.64	0.112	1.7
25-44 years	100	2.15	0.225	2.74
45-64 years	100	3.49	0.303	4.49
65-74 years	100	4.44	0.431	5.03
75 years and over	100	5.53	0.494	5.59

Perform statistical hypothesis testing

The tab_subset() function makes it easy to perform hypothesis testing by using the test argument. When the argument is TRUE, a test of association is performed. In addition, t-tests for all pairs of levels are performed as well.

Categorical variables

Consider the relationship between AGER an SPECCAT:

tab_subset("AGER", "SPECCAT", test = TRUE)

Patient age recode (Type of specialty (Primary, Medical, Surgical) = Primary care specialty) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Under 15 years	102,720	14,137	78,373	134,631	19.7	2.5	15	25.2
15-24 years	40,808	4,941	32,127	51,835	7.8	0.8	6.2	9.7
25-44 years	95,305	9,118	78,964	115,027	18.3	1.5	15.3	21.5
45-64 years	124,384	10,371	105,582	146,535	23.9	1.5	20.9	27
65-74 years	85,504	10,210	67,581	108,182	16.4	1.6	13.4	19.8
75 years and over	72,745	9,886	55,660	95,073	14	1.6	10.9	17.4
(Checked presentation standards. Nothing to report.)

Patient age recode (Type of specialty (Primary, Medical, Surgical) = Surgical care specialty) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Under 15 years	6,201	1,359	4,017	9,571	2.9	0.7	1.6	4.7
15-24 years	8,561	1,622	5,824	12,585	4	0.6	2.9	5.3
25-44 years	35,953	11,539	18,976	68,119	16.7	3.6	10.2	25.2
45-64 years	73,204	12,475	52,307	102,450	34.1	1.6	31	37.3
65-74 years	53,482	6,405	42,227	67,736	24.9	2.8	19.6	30.9
75 years and over	37,431	6,364	26,763	52,352	17.4	2.4	12.8	22.8
(Checked presentation standards. Nothing to report.)

Patient age recode (Type of specialty (Primary, Medical, Surgical) = Medical care specialty) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Under 15 years	8,996	3,158	4,330	18,690	3	1	1.4	5.7
15-24 years	15,487	5,035	7,908	30,326	5.2	1.4	2.8	8.6
25-44 years	39,012	6,892	27,419	55,507	13	1.3	10.5	15.8
45-64 years	111,918	22,754	74,907	167,215	37.3	3.1	31	43.8
65-74 years	67,880	10,945	49,319	93,425	22.6	2.9	17.1	29
75 years and over	56,894	10,806	39,008	82,980	19	3.3	12.7	26.6
(Checked presentation standards. Nothing to report.)

Association between Patient age recode and Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
p-value	Flag
0	*
Pearson’s X^2: Rao & Scott adjustment. *: p-value <= 0.05

Comparison of all possible pairs of Patient age recode (Type of specialty (Primary, Medical, Surgical) = Primary care specialty) {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
Under 15 years	15-24 years	0	*
Under 15 years	25-44 years	0.67
Under 15 years	45-64 years	0.259
Under 15 years	65-74 years	0.334
Under 15 years	75 years and over	0.083
15-24 years	25-44 years	0	*
15-24 years	45-64 years	0	*
15-24 years	65-74 years	0	*
15-24 years	75 years and over	0.002	*
25-44 years	45-64 years	0.007	*
25-44 years	65-74 years	0.461
25-44 years	75 years and over	0.092
45-64 years	65-74 years	0.001	*
45-64 years	75 years and over	0	*
65-74 years	75 years and over	0.194
Design-based t-test. *: p-value <= 0.05

Comparison of all possible pairs of Patient age recode (Type of specialty (Primary, Medical, Surgical) = Surgical care specialty) {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
Under 15 years	15-24 years	0.221
Under 15 years	25-44 years	0	*
Under 15 years	45-64 years	0	*
Under 15 years	65-74 years	0	*
Under 15 years	75 years and over	0	*
15-24 years	25-44 years	0	*
15-24 years	45-64 years	0	*
15-24 years	65-74 years	0	*
15-24 years	75 years and over	0	*
25-44 years	45-64 years	0	*
25-44 years	65-74 years	0.196
25-44 years	75 years and over	0.904
45-64 years	65-74 years	0.027	*
45-64 years	75 years and over	0	*
65-74 years	75 years and over	0.007	*
Design-based t-test. *: p-value <= 0.05

Comparison of all possible pairs of Patient age recode (Type of specialty (Primary, Medical, Surgical) = Medical care specialty) {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
Under 15 years	15-24 years	0.112
Under 15 years	25-44 years	0	*
Under 15 years	45-64 years	0	*
Under 15 years	65-74 years	0	*
Under 15 years	75 years and over	0	*
15-24 years	25-44 years	0	*
15-24 years	45-64 years	0	*
15-24 years	65-74 years	0	*
15-24 years	75 years and over	0	*
25-44 years	45-64 years	0	*
25-44 years	65-74 years	0	*
25-44 years	75 years and over	0.139
45-64 years	65-74 years	0.009	*
45-64 years	75 years and over	0.003	*
65-74 years	75 years and over	0.4
Design-based t-test. *: p-value <= 0.05

Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (Patient age recode = Under 15 years) {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
Primary care specialty	Surgical care specialty	0	*
Primary care specialty	Medical care specialty	0	*
Surgical care specialty	Medical care specialty	0.364
Design-based t-test. *: p-value <= 0.05

Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (Patient age recode = 15-24 years) {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
Primary care specialty	Surgical care specialty	0	*
Primary care specialty	Medical care specialty	0.002	*
Surgical care specialty	Medical care specialty	0.124
Design-based t-test. *: p-value <= 0.05

Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (Patient age recode = 25-44 years) {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
Primary care specialty	Surgical care specialty	0.001	*
Primary care specialty	Medical care specialty	0	*
Surgical care specialty	Medical care specialty	0.848
Design-based t-test. *: p-value <= 0.05

Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (Patient age recode = 45-64 years) {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
Primary care specialty	Surgical care specialty	0.005	*
Primary care specialty	Medical care specialty	0.631
Surgical care specialty	Medical care specialty	0.163
Design-based t-test. *: p-value <= 0.05

Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (Patient age recode = 65-74 years) {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
Primary care specialty	Surgical care specialty	0.006	*
Primary care specialty	Medical care specialty	0.248
Surgical care specialty	Medical care specialty	0.298
Design-based t-test. *: p-value <= 0.05

Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (Patient age recode = 75 years and over) {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
Primary care specialty	Surgical care specialty	0.003	*
Primary care specialty	Medical care specialty	0.291
Surgical care specialty	Medical care specialty	0.099
Design-based t-test. *: p-value <= 0.05

According to these tables, there is an association between physician specialty type and patient age. For instance, for patients under 15 years, there is a statistical difference between primary care physician specialty and medical care specialty. But for older patients, such as in the 45-64 age group, there is no statistical difference between the two specialty types.

As another example, consider the relationship between MRI and SPECCAT:

tab_subset("MRI", "SPECCAT", test = TRUE)

MRI (Type of specialty (Primary, Medical, Surgical) = Primary care specialty) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL	Flags
FALSE	515,172	30,724	458,304	579,096	98.8	0.5	97.2	99.6
TRUE	6,295	2,768	2,295	17,268	1.2	0.5	0.4	2.8	Cx
Cx: suppress count (and rate)

MRI (Type of specialty (Primary, Medical, Surgical) = Surgical care specialty) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
FALSE	207,915	30,117	156,442	276,323	96.8	0.7	95	98
TRUE	6,917	1,845	3,925	12,191	3.2	0.7	2	5
(Checked presentation standards. Nothing to report.)

MRI (Type of specialty (Primary, Medical, Surgical) = Medical care specialty) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL	Flags
FALSE	291,560	40,805	221,456	383,855	97.1	1.4	92.9	99.2	Pc
TRUE	8,626	4,768	2,451	30,364	2.9	1.4	0.8	7.1	Cx Px
Cx: suppress count (and rate); Px: suppress percent; Pc: footnote percent - complement

Association between MRI and Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
p-value	Flag
0.169
Pearson’s X^2: Rao & Scott adjustment. *: p-value <= 0.05

Comparison of all possible pairs of MRI (Type of specialty (Primary, Medical, Surgical) = Primary care specialty) {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
FALSE	TRUE	0	*
Design-based t-test. *: p-value <= 0.05

Comparison of all possible pairs of MRI (Type of specialty (Primary, Medical, Surgical) = Surgical care specialty) {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
FALSE	TRUE	0	*
Design-based t-test. *: p-value <= 0.05

Comparison of all possible pairs of MRI (Type of specialty (Primary, Medical, Surgical) = Medical care specialty) {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
FALSE	TRUE	0	*
Design-based t-test. *: p-value <= 0.05

Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (MRI = FALSE) {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
Primary care specialty	Surgical care specialty	0	*
Primary care specialty	Medical care specialty	0	*
Surgical care specialty	Medical care specialty	0.156
Design-based t-test. *: p-value <= 0.05

Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) (MRI = TRUE) {NAMCS 2019 PUF}
Level 1	Level 2	p-value
Primary care specialty	Surgical care specialty	0.856
Primary care specialty	Medical care specialty	0.657
Surgical care specialty	Medical care specialty	0.733
Design-based t-test. *: p-value <= 0.05

According to these tables, there is no statistical association between MRI and physician specialty. For each of the 3 specialty types, a minority of visits have MRI’s. For the visits with MRI’s, there was no statistical difference between specialty types.

As a general rule of thumb, since there is no statistical association between MRI and physician specialty, presenting this tabulation would not be particularly interesting, especially since the subsetting decreases the sample size and therefore also decreases the estimate reliability. Instead, it would generally make more sense to just tabulate MRI without subsetting by SPECCAT.

Numeric variables

The relationship between NUMMED and AGER:

tab_subset("NUMMED", "AGER", test = TRUE)

Number of medications coded (for different levels of Patient age recode) {NAMCS 2019 PUF}
Level	% known	Mean	SEM	SD
Under 15 years	100	1.58	0.168	1.75
15-24 years	100	1.64	0.112	1.7
25-44 years	100	2.15	0.225	2.74
45-64 years	100	3.49	0.303	4.49
65-74 years	100	4.44	0.431	5.03
75 years and over	100	5.53	0.494	5.59

Association between Number of medications coded and Patient age recode {NAMCS 2019 PUF}
p-value	Flag
0	*
Wald test. *: p-value <= 0.05

Comparison of Number of medications coded across all possible pairs of Patient age recode {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
Under 15 years	15-24 years	0.739
Under 15 years	25-44 years	0.043	*
Under 15 years	45-64 years	0	*
Under 15 years	65-74 years	0	*
Under 15 years	75 years and over	0	*
15-24 years	25-44 years	0.029	*
15-24 years	45-64 years	0	*
15-24 years	65-74 years	0	*
15-24 years	75 years and over	0	*
25-44 years	45-64 years	0	*
25-44 years	65-74 years	0	*
25-44 years	75 years and over	0	*
45-64 years	65-74 years	0.007	*
45-64 years	75 years and over	0	*
65-74 years	75 years and over	0.002	*
Design-based t-test. *: p-value <= 0.05

According to these tables, there is an association between the number of medications and age category. NUMMED is statistically similar for the “Under 15 years” and “15-24 years” AGER categories. It is statistically different for all other pairs of age categories.

Finally, let’s look at the relationship between NUMMED and SPECCAT:

tab_subset("NUMMED", "SPECCAT", test = TRUE)

Number of medications coded (for different levels of Type of specialty (Primary, Medical, Surgical)) {NAMCS 2019 PUF}
Level	% known	Mean	SEM	SD
Primary care specialty	100	3.7	0.309	4.46
Surgical care specialty	100	2.87	0.616	4.59
Medical care specialty	100	3.46	0.637	4.22

Association between Number of medications coded and Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
p-value	Flag
0.52
Wald test. *: p-value <= 0.05

Comparison of Number of medications coded across all possible pairs of Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
Level 1	Level 2	p-value
Primary care specialty	Surgical care specialty	0.254
Primary care specialty	Medical care specialty	0.738
Surgical care specialty	Medical care specialty	0.501
Design-based t-test. *: p-value <= 0.05

According to these tables, there is no association between the number of medications and physician specialty type. NUMMED is statistically similar for all pairs of physician specialties.

As a general rule of thumb, since there is no statistical association between the number of medications and physician specialty, presenting this tabulation would not be particularly interesting, especially since the subsetting decreases the sample size and therefore also decreases the estimate reliability. Instead, it would generally make more sense to just tabulate NUMMED without subsetting by SPECCAT.

Categorical variables (single variable)

To test whether any pair of SPECCAT levels is statistically similar or different, type:

tab("SPECCAT", test = TRUE)

Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Primary care specialty	521,466	31,136	463,840	586,252	50.3	2.6	45.1	55.5
Surgical care specialty	214,832	31,110	161,661	285,490	20.7	3	15.1	27.3
Medical care specialty	300,186	43,497	225,806	399,067	29	3.6	22.1	36.6
(Checked presentation standards. Nothing to report.)

Comparison of all possible pairs of Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
Level 1	Level 2	p-value	Flag
Primary care specialty	Surgical care specialty	0	*
Primary care specialty	Medical care specialty	0	*
Surgical care specialty	Medical care specialty	0.168
Design-based t-test. *: p-value <= 0.05

According to this, surgical and medical care specialties are statistically similar, and are statistically different from primary care.

Calculate rates

A rate is a ratio of count estimates based on the survey in question divided by population size, which is assumed to be known. For example, the number of physician visits per 100 people in the population is a rate: the number of physician visits is estimated from the namcs2019sv survey, while the number of people in the population comes from another source.

To calculate rates, in addition to the survey, we need a source of information on population size. You would typically use a function such as read.csv() to load the population figures and get them into the correct format. The surveytable package comes with an object called uspop2019 that contains several population figures for use in these examples.

Let’s examine uspop2019:

class(uspop2019)
#> [1] "list"
names(uspop2019)
#> [1] "total"       "MSA"         "AGER"        "Age group"   "SEX"        
#> [6] "AGER x SEX"  "Age group 5"

The overall population size for the country as a whole is:

uspop2019$total
#> [1] 323186697

Once we have the overall population size, the overall rate is:

total_rate(uspop2019$total)

Total (rate per 100 population) {NAMCS 2019 PUF}
Rate	SE	LL	UL
320.7	15.1	292.4	351.7
(Checked presentation standards. Nothing to report.)

To calculate the rates for a particular variable, we need to provide a data frame with a column called Level that matches the levels of the variable in the survey, and a column called Population that gives the size of the population for that level.

For example, for AGER, this data frame as follows:

uspop2019$AGER
#>               Level Population
#> 1    Under 15 years   60526656
#> 2       15-24 years   41718700
#> 3       25-44 years   85599410
#> 4       45-64 years   82562049
#> 5       65-74 years   31260202
#> 6 75 years and over   21519680

Now that we have the appropriate population figures, the rates table is obtained by typing:

tab_rate("AGER", uspop2019$AGER)

Patient age recode (rate per 100 population) {NAMCS 2019 PUF}
Level	Rate	SE	LL	UL
Under 15 years	194.8	23.3	154	246.4
15-24 years	155.5	16.8	125.6	192.5
25-44 years	198.9	16.3	169.3	233.7
45-64 years	374.9	28.2	323.4	434.6
65-74 years	661.8	46	577.3	758.5
75 years and over	776.4	70.5	649.4	928.1
(Checked presentation standards. Nothing to report.)

To calculate the rates for one variable (AGER) by another variable (SEX), we need population figures in the following format:

uspop2019$`AGER x SEX`
#>                Level Subset Population
#> 1     Under 15 years Female   29604762
#> 2        15-24 years Female   20730118
#> 3        25-44 years Female   43192143
#> 4        45-64 years Female   42508901
#> 5        65-74 years Female   16673240
#> 6  75 years and over Female   12421444
#> 7     Under 15 years   Male   30921894
#> 8        15-24 years   Male   20988582
#> 9        25-44 years   Male   42407267
#> 10       45-64 years   Male   40053148
#> 11       65-74 years   Male   14586962
#> 12 75 years and over   Male    9098236

With this data frame, the rates table is obtained by typing:

tab_subset_rate("AGER", "SEX", uspop2019$`AGER x SEX`)

Patient age recode (Patient sex = Female) (rate per 100 population) {NAMCS 2019 PUF}
Level	Rate	SE	LL	UL
Under 15 years	202.5	24.3	159.8	256.6
15-24 years	198.4	21.9	159.5	246.8
25-44 years	263.3	26.5	215.9	321
45-64 years	414	37.7	346.2	495.1
65-74 years	720.3	66.4	600.8	863.6
75 years and over	758.1	89.2	601.2	956
(Checked presentation standards. Nothing to report.)

Patient age recode (Patient sex = Male) (rate per 100 population) {NAMCS 2019 PUF}
Level	Rate	SE	LL	UL
Under 15 years	187.4	25	144.1	243.7
15-24 years	113.1	20.7	78.4	163
25-44 years	133.4	17.2	103.4	172
45-64 years	333.4	32.3	275.4	403.5
65-74 years	594.8	46.4	510.1	693.6
75 years and over	801.2	73.2	669.1	959.5
(Checked presentation standards. Nothing to report.)

Create or modify variables

In some situations, it might be necessary to modify survey variables, or to create new ones. This section describes how to do this.

Convert factor to logical. The variable MAJOR (major reason for this visit) has several levels.

tab("MAJOR")

Major reason for this visit {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Blank	15,887	3,354	10,335	24,419	1.5	0.3	1	2.3
New problem (less than 3 mos. onset)	275,014	19,691	238,955	316,514	26.5	1.5	23.7	29.5
Chronic problem, routine	380,910	35,080	317,916	456,386	36.8	2.5	31.8	41.9
Chronic problem, flare-up	74,017	9,329	57,706	94,939	7.1	0.9	5.5	9.1
Pre-surgery	12,864	2,151	9,188	18,010	1.2	0.2	0.9	1.7
Post-surgery	54,170	6,749	42,350	69,289	5.2	0.7	4	6.7
Preventive care	223,624	18,520	190,068	263,103	21.6	1.7	18.3	25.1
(Checked presentation standards. Nothing to report.)

Notice that one of the levels is called "Preventive care". Suppose an analyst is only interested in whether or not a visit is a preventive care visit – they are not interested in the other visit types. They can create a new variable called Preventive care visits that is TRUE for preventive care visits and FALSE for all other types of visits, as follows:

var_case("Preventive care visits", "MAJOR", "Preventive care")
tab("Preventive care visits")

Preventive care visits {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
FALSE	812,861	45,220	728,841	906,566	78.4	1.7	74.9	81.7
TRUE	223,624	18,520	190,068	263,103	21.6	1.7	18.3	25.1
(Checked presentation standards. Nothing to report.)

This creates a logical variable that is TRUE for preventive care visits and then tabulates it. When using the var_case() function, specify the name of the new logical variable to be created, an existing factor variable, and one or more levels of the factor variable that should be set to TRUE in the logical variable.

Thus, if an analyst is interested in surgery-related visits, which are indicated by two different levels of MAJOR, they could type:

var_case("Surgery-related visits"
  , "MAJOR"
  , c("Pre-surgery", "Post-surgery"))
tab("Surgery-related visits")

Surgery-related visits {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
FALSE	969,451	47,976	879,793	1,068,246	93.5	0.8	91.9	94.9
TRUE	67,034	7,810	53,273	84,348	6.5	0.8	5.1	8.1
(Checked presentation standards. Nothing to report.)

Collapse levels. The variable PRIMCARE (whether the physician is this patient’s primary care provider) has levels Unknown and Blank, among others.

tab("PRIMCARE")

Are you the patient’s primary care provider? {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL	Flags
Blank	1,150	478	440	3,005	0.1	0	0	0.2	Cx
Unknown	39,519	9,507	24,520	63,692	3.8	0.9	2.3	6
Yes	383,481	28,555	331,362	443,798	37	2.6	31.9	42.3
No	612,335	43,282	533,050	703,413	59.1	2.5	53.9	64.1
Cx: suppress count (and rate)

To collapse Unknown and Blank into a single level, type:

var_collapse("PRIMCARE", "Unknown if PCP", c("Unknown", "Blank"))
tab("PRIMCARE")

Are you the patient’s primary care provider? {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Unknown if PCP	40,669	9,479	25,619	64,560	3.9	0.9	2.4	6.1
Yes	383,481	28,555	331,362	443,798	37	2.6	31.9	42.3
No	612,335	43,282	533,050	703,413	59.1	2.5	53.9	64.1
(Checked presentation standards. Nothing to report.)

Convert numeric to factor. The variable AGE is numeric.

tab("AGE")

Patient age in years {NAMCS 2019 PUF}
% known	Mean	SEM	SD
100	51	1.04	24.3

To create a new variable of age categories based on AGE, type:

var_cut("Age group", "AGE"
        , c(-Inf, 0, 4, 14, 64, Inf)
        , c("Under 1", "1-4", "5-14", "15-64", "65 and over") )
tab("Age group")

Age group {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Under 1	31,148	5,282	22,269	43,566	3	0.5	2.1	4.1
1-4	38,240	5,444	28,864	50,662	3.7	0.5	2.7	4.8
5-14	48,529	5,741	38,430	61,282	4.7	0.5	3.7	5.9
15-64	544,632	36,082	478,254	620,223	52.5	2	48.6	56.5
65 and over	373,935	24,523	328,777	425,296	36.1	1.9	32.3	40
(Checked presentation standards. Nothing to report.)

In the var_cut() command, specify the following information:

name of the new categorical variable;
name of the existing numeric variable;
cut points – note that the intervals are inclusive on the right; and
category labels.

Check whether any variable is true. For a series of logical variables, you can check whether any of them are TRUE using the var_any() command.

A physician visit is considered to be an “imaging services” visit if it had any of a number of imaging services ordered or provided. Imaging services are indicated using logical variables, such as MRI and XRAY. To create the Imaging services variable, type:

var_any("Imaging services"
  , c("ANYIMAGE", "BONEDENS", "CATSCAN", "ECHOCARD", "OTHULTRA"
  , "MAMMO", "MRI", "XRAY", "OTHIMAGE"))
tab("Imaging services")

Imaging services {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
FALSE	901,115	43,298	820,085	990,151	86.9	1.1	84.6	89.1
TRUE	135,369	13,574	111,134	164,890	13.1	1.1	10.9	15.4
(Checked presentation standards. Nothing to report.)

Interact variables. The tab_cross() function creates a table of an interaction of two variables, but it does not save the interacted variable. To create the interacted variable, use the var_cross() command:

var_cross("Age x Sex", "AGER", "SEX")

Specify the name of the new variable as well as names of the two variables to interact.

Copy a variable. Create a new variable that is a copy of another variable using var_copy(). You can modify the copy, while the original remains unchanged. For example:

var_copy("Age group", "AGER")
#> Warning in var_copy("Age group", "AGER"): Age group: overwriting a variable
#> that already exists.
var_collapse("Age group", "65+", c("65-74 years", "75 years and over"))
var_collapse("Age group", "25-64", c("25-44 years", "45-64 years"))
tab("AGER", "Age group")

Patient age recode {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Under 15 years	117,917	14,097	93,229	149,142	11.4	1.3	8.9	14.2
15-24 years	64,856	7,018	52,387	80,292	6.3	0.6	5.1	7.5
25-44 years	170,271	13,966	144,925	200,049	16.4	1.1	14.3	18.8
45-64 years	309,506	23,290	266,994	358,787	29.9	1.4	27.2	32.6
65-74 years	206,866	14,366	180,481	237,109	20	1.2	17.6	22.5
75 years and over	167,069	15,179	139,746	199,735	16.1	1.3	13.7	18.8
(Checked presentation standards. Nothing to report.)

Age group {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Under 15 years	117,917	14,097	93,229	149,142	11.4	1.3	8.9	14.2
15-24 years	64,856	7,018	52,387	80,292	6.3	0.6	5.1	7.5
25-64	479,777	32,175	420,624	547,247	46.3	1.8	42.7	49.9
65+	373,935	24,523	328,777	425,296	36.1	1.9	32.3	40
(Checked presentation standards. Nothing to report.)

Here, the AGER variable remains unchanged, while the Age group variable has fewer categories.

The variables data frame. Recall that survey objects have an element called variables, which is a data frame that contains the survey variables.

class(namcs2019sv$variables)
#> [1] "data.frame"

More advanced users can create or modify variables in the variables data frame directly. After they modify these variables, they must call set_survey() again. For example:

namcs2019sv$variables$`Medicare and Medicaid` = ( 
  namcs2019sv$variables$PAYMCARE & namcs2019sv$variables$PAYMCAID)
set_survey(namcs2019sv)
#> * To adjust how counts are rounded, see ?set_count_int
#>                        _                                                                    
#> Survey name            NAMCS 2019 PUF                                                       
#> Number of variables    34                                                                   
#> Number of observations 8250                                                                 
#> Info1                  Stratified 1 - level Cluster Sampling design (with replacement)      
#> Info2                  With (398) clusters.                                                 
#> Info3                  survey::svydesign(ids = ~CPSUM, strata = ~CSTRATM, weights = ~PATWT, 
#> Info4                      data = namcs2019sv_df)

tab("Medicare and Medicaid")

Medicare and Medicaid {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
FALSE	1,016,202	47,395	927,389	1,113,520	98	0.5	96.9	98.9
TRUE	20,282	5,177	12,120	33,941	2	0.5	1.1	3.1
(Checked presentation standards. Nothing to report.)

Note, however, that the var_*() functions do not modify the survey object specified in set_survey() directly. Rather, they modify variables inside the following data frame: surveytable:::env$survey$variables. If you use the var_*() functions and then need access to the modified / created variables, that’s where you should look. For example:

var_cross("newvar", "MAJOR", "AGER")
# This should give NULL. The new variable does not exist here:
namcs2019sv$variables$newvar
#> NULL
# Rather, the new variable is here:
str(surveytable:::env$survey$variables$newvar)
#>  Factor w/ 42 levels "Blank : Under 15 years",..: 17 31 24 38 38 24 30 31 24 38 ...
#>  - attr(*, "label")= chr "(Major reason for this visit) x (Patient age recode)"

Save the output

The tab* and total* functions have an argument called csv that specifies the name of a comma-separated values (CSV) file to save the output to. Alternatively, you can name the default CSV output file using the set_output() function. For example, the following directs surveytable to send all future output to a CSV file, create some tables, and then turn off sending output to the file:

tmp_file = tempfile(fileext = ".csv")
suppressMessages( set_output(csv = tmp_file) )

tab("MDDO", "SPECCAT", "MSA")

Type of doctor (MD or DO) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
M.D. - Doctor of Medicine	980,280	48,388	889,842	1,079,910	94.6	0.7	93.1	95.8
D.O. - Doctor of Osteopathy	56,204	6,602	44,597	70,832	5.4	0.7	4.2	6.9
(Checked presentation standards. Nothing to report.)

Type of specialty (Primary, Medical, Surgical) {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
Primary care specialty	521,466	31,136	463,840	586,252	50.3	2.6	45.1	55.5
Surgical care specialty	214,832	31,110	161,661	285,490	20.7	3	15.1	27.3
Medical care specialty	300,186	43,497	225,806	399,067	29	3.6	22.1	36.6
(Checked presentation standards. Nothing to report.)

Metropolitan Statistical Area Status of physician location {NAMCS 2019 PUF}
Level	Number (000)	SE (000)	LL (000)	UL (000)	Percent	SE	LL	UL
MSA (Metropolitan Statistical Area)	973,676	50,515	879,490	1,077,947	93.9	1.7	89.6	96.8
Non-MSA	62,809	17,549	36,249	108,830	6.1	1.7	3.2	10.4
(Checked presentation standards. Nothing to report.)

set_output(csv = "")
#> * Turning off CSV output.
#> * ?set_output for other options.

If the tabulation functions are called from within an R Markdown notebook, they produce HTML tables. This makes it easy to incorporate the output of the surveytable package directly into documents, presentations, “shiny” web apps, and other output types.

Finally, the tabulation functions return the tables that they produce. More advanced analysts can use this functionality to integrate surveytable into other programming tasks.

Introduction to surveytable