Introduction to dtreg

library(dtreg)

You can use dtreg for the following tasks:

1. Load a DTR schema

To load a DTR schema, you need to know its identifier (see the help page), such as the ePIC datatype DOI or the ORKG template URL.

For instance, let us imagine you intend to use the “statistical test of difference” schema; it has the ePIC DOI “https://doi.org/21.T11969/ff5e3f857788d20dd1aa”.

If a valid identifier is used, you get an R object containing information about the DTR schema:

dt <- dtreg::load_datatype("https://doi.org/21.T11969/ff5e3f857788d20dd1aa")

In addition to the schema you requested (in this case, the “statistical test of difference”), you get schemata that you might need for using it (see Nested structure). You can look at the list of these schemata:

names(dt)
#>  [1] "string"                         "software"                      
#>  [3] "url"                            "software_library"              
#>  [5] "software_method"                "statistical_variable"          
#>  [7] "integer_in_string"              "sample_size"                   
#>  [9] "column"                         "cell"                          
#> [11] "row"                            "table"                         
#> [13] "data_input"                     "outcome_variable"              
#> [15] "grouping_factor"                "interaction_effect"            
#> [17] "covariate"                      "inferential_test_output"       
#> [19] "statistical_test_of_difference"

2. Create an instance

To write down your data in accordance with a DTR schema, dtreg uses R6 classes. Therefore, you need to create an instance of a specific class.

Fields

For doing that, you first need to know which fields you can use for the selected schema:

dtreg::show_fields(dt$statistical_test_of_difference())
#> [1] "has_specified_output" "has_characteristic"   "has_part"            
#> [4] "difference_between"   "evaluates"            "has_specified_input" 
#> [7] "executes"             "label"

For example, this schema has the field “label”. If your instance included only a label, it would be:

labelled_inst <- dt$statistical_test_of_difference(label = "my_test_results")

However, the data you want to write are usually more complex than this. The help page specifies which fields to use and what input is required by a field (e.g., another schema or a specific type of data, such as numeric).

String, numeric, and data frame

The most frequently used types of input are string, numeric, and data frame. Strings are used for labels and comments, and URLs are also presented as strings:

method_url <- dt$software_method(has_support_url = "https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/wilcox.test")

Numeric values are also frequently used:

sample_size <- dt$sample_size(has_specified_numeric_value = 50)

Some fields, such as “has_format”, require a data frame:

my_dataframe <- data.frame(W = 44.5, p = 2.2e-16)
output_dataframe <- dt$inferential_test_output(has_format = my_dataframe)

Please check which elements you include in your data frame and how you assign the columns. When in doubt, look at your column with:

class(my_dataframe$W)

The result should be either a basic R data type or an R factor.

By default, dtreg assigns a data frame a generic label “Table”. If you want to give your data frame a label, you can use a tuple. Please always include first the data frame and then the name as a string in the tuple:

library(sets)
my_tuple <- sets::tuple(my_dataframe, "the Wilcoxon test results")
output_tuple <- dt$inferential_test_output(has_format = my_tuple)

These are the most frequently used types of input that you write in a field.

More than one input in a field

Sometimes a few objects should be written in one field. In this case, simply concatenate them:

var_1 <- dt$statistical_variable(label = "var_1")
var_2 <- dt$statistical_variable(label = "var_2")
two_vars <- dt$statistical_test_of_difference(evaluates = c(var_1, var_2))

Nested structure

In the example above you can see that a field of a schema might require another schema. This nested structure is important for making the data machine readable. The example above can be also written this way, and you can choose which is more convenient for you:

two_vars <- dt$statistical_test_of_difference(evaluates = c(
  dt$statistical_variable(label = "var_1"),
  dt$statistical_variable(label = "var_2")
))

None of the fields are mandatory: you will not get an error message if you leave any field, or all of them, empty. It makes sense, though, to create a useful JSON-LD file.

Example: writing down the Wilcoxon test results

In this vignette, we report the results of the Wilcoxon rank sum test from the help page. As we emphasize on the help page, even for this simple test, you might want to report more than the most basic version described here. Your actual results might include descriptive statistics, the effect size, and visualizations of the data. For this example, we use the ePIC datatype statistical_test_of_difference (see Load a DTR schema above).

Let us write down the details of the software method we used:

software <- dt$software(label = "R",
                        versioninfo = "4.3.1")

soft_library <- dt$software_library(
  label = "stats::wilcoxon",
  part_of = software,
  versioninfo = "4.3.1",
  has_support_url = "https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/wilcox.test"
)

soft_method <- dt$software_method(
  label = "Wilcoxon rank sum test",
  uses_software = soft_library,
  is_implemented_by = "URL_code"
)

Now we can add information about the input data:

sample_size <- dt$sample_size(has_specified_numeric_value = 50)
input <- dt$data_input(label = "Iris: virginia and versicolor",
                       has_characteristic = sample_size,
                       source =
                         "https://search.r-project.org/CRAN/refmans/MVTests/html/iris.html")

And now the vectors, or the variables, that were compared:

virginia <- dt$statistical_variable(label = "petal length virginia",
                                    has_format = "numeric")
versicolor <-
  dt$statistical_variable(label = "petal length versicolor",
                          has_format = "numeric")
variables <- dt$outcome_variable(label = "virginia vs versicolor",
                                 requires = c(virginia, versicolor))

Finally, we have the results of the test:

df_result <- data.frame(W = 44.5, p = 2.2e-16)
output <- dt$inferential_test_output(has_format = df_result)

Let us now write everything together in the final instance:

wilcoxon_inst <- dt$statistical_test_of_difference(
  label = "Wilcoxon petal length, virginia vs versicolor",
  executes = soft_method,
  has_specified_input = input,
  evaluates = variables,
  has_specified_output = result
)

Please be attentive when writing the results. In case you misspell a variable, omit a comma separating two fields, forget a closing bracket, or make another seemingly tiny mistake, you will get an error message (something like “object ‘inputt’ not found”). You can experiment with such typos and see which error messages you get.

3. Convert the instance into JSON-LD format

This stage is simple. This line gives you a string in JSON-LD format:

wilcoxon_json <- dtreg::to_jsonld(wilcoxon_instance)

You can save the result as a machine-readable JSON file in your working directory:

write(wilcoxon_json, "wilcoxon_file.json")

The code to run the example

The code of the example above is given in one chunk here:

library(dtreg)
dt <-
  dtreg::load_datatype("https://doi.org/21.T11969/ff5e3f857788d20dd1aa")

software <- dt$software(label = "R",
                        versioninfo = "4.3.1")
soft_library <- dt$software_library(
  label = "stats::wilcoxon",
  part_of = software,
  versioninfo = "4.3.1",
  has_support_url = "https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/wilcox.test"
)
soft_method <- dt$software_method(
  label = "Wilcoxon rank sum test",
  uses_software = soft_library,
  is_implemented_by = "url_code"
)
sample_size <- dt$sample_size(has_specified_numeric_value = 50)
input <- dt$data_input(label = "Iris: virginia and versicolor",
                       has_characteristic = sample_size,
                       source = "https://search.r-project.org/CRAN/refmans/MVTests/html/iris.html")
virginia <- dt$statistical_variable(label = "petal length virginia",
                                    has_format = "numeric")
versicolor <-
  dt$statistical_variable(label = "petal length versicolor",
                          has_format = "numeric")
variables <- dt$outcome_variable(label = "virginia vs versicolor",
                                 requires = c(virginia, versicolor))
df_result <-
  data.frame(W = 44.5,
             p = 2.2e-16,
             stringsAsFactors = FALSE)
output <- dt$inferential_test_output(has_format = df_result)
wilcoxon_inst <- dt$statistical_test_of_difference(
  label = "Wilcoxon iris petal length, virginia vs versicolor",
  executes = soft_method,
  has_specified_input = input,
  evaluates = variables,
  has_specified_output = output
)

wilcoxon_json <- dtreg::to_jsonld(wilcoxon_inst)
write(wilcoxon_json, "wilcoxon_file.json")