parseRPDR

The Research Patient Data Registry (RPDR) is a centralized clinical data registry, or data warehouse, at Partners Hospitals. It is populated by data from several source systems, including the TSI hospital and IDX clinic/physician billing systems from BWH and MGH, as well as data from the Partners Clinical Data Repository (CDR), Epic and the Enterprise Master Patient Index (EMPI). In this tutorial we will go through how to use the parseRPDR R package to load and manipulate the text outputs generated by RPDR. The package does not support the Access databases provided by the system.

Installation

You can install the released version of parseRPDR from CRAN with:

install.packages("parseRPDR")

parseRPDR package functionalities

The aim of the package is to provide a standardized framework to analyze outputs provided by RPDR. All functions of parseRPDR are parallelized to assist large data queries (please see the section on parallelization for details). Data is loaded into the R environment using the load_abc functions, where abc is the three-letter abbreviation of the given datasource. Currently “mrn”, “con”, “dem”, “enc”, “rdt”, “lab”, “med”, “dia”, “rfv”, “prc”, “phy”, “lno”, “car”, “dis”, “end”, “hnp”, “opn”, “pat”, “prg”, “pul”, “rad” and “vis” datasources are supported. Data is loaded into data.table objects to provide fast and efficient manipulations even on large datasets. Besides importing the data, the load functions also modify the variable names in a standardized fashion to help later analyses:

The functions also do minimal data cleaning to help later analyses:

Due to potential issues with PHI and PPI, the example datasets can be downloaded from the Partners Gitlab repository under parserpdr-sample-data. For each supported datasource a raw version and a parsed version are provided. The raw data provided by RPDR is called data_abc_raw, where abc is the three-letter abbreviation of the given datasource. Results of the load functions on these example datasets are also provided in the form of data_abc, to show exactly what modifications the software makes to the inputs.

Besides providing an interface to import text outputs from RPDR into the R environment, parseRPDR also provides functions for common tasks. Similar to the load functions, the package contains another family of functions: convert_abc, where abc is the three-letter abbreviation of the given datasource, which perform common manipulations on a given datasource. Brief descriptions of these can be found below:

Besides these functions, there are ones that are not connected to specific datasources and provide other commonly used functionalities. Brief descriptions of these can be found below:

Parallelization and shared RAM management in parseRPDR

All functionalities of parseRPDR are parallelized to assist the analysis of large datasets. The user does not need to know what is being done in the background: setting the nThread parameter in the function calls sets everything up. In detail, on unix-based systems forking is used, while on Windows machines socket clusters are initiated. Please be aware that the optimal number of threads depends on the system running the application. By default nThread is set to 4, but on less powerful machines this might need to be set to a lower value, while on more powerful machines more cores can be used. Be aware that parallelization also requires additional memory to run the functions. The amount also depends on the operating system (generally unix-based systems require less), and therefore the optimal number of threads needs to be determined empirically. Setting nThread=1 results in sequential analysis, which might be beneficial for small datasets.
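
The platform-dependent backend described above can be illustrated with the base parallel package. This is only a sketch of the general idea (forking on unix-based systems, a socket cluster otherwise, and plain sequential execution for one thread), not the internal code of parseRPDR; pick_threads_sketch and run_parallel_sketch are hypothetical helper names.

```r
library(parallel)

# Hypothetical helper: never ask for more threads than the machine reports
pick_threads_sketch <- function(requested = 4L) {
  available <- detectCores()
  if (is.na(available)) available <- 1L
  max(1L, min(as.integer(requested), available))
}

# Hypothetical helper: apply `fun` to each element of `x` on `nThread` workers
run_parallel_sketch <- function(x, fun, nThread) {
  if (nThread == 1L) {
    return(lapply(x, fun))                    # sequential processing
  }
  if (.Platform$OS.type == "unix") {
    mclapply(x, fun, mc.cores = nThread)      # forking
  } else {
    cl <- makeCluster(nThread)                # socket cluster
    on.exit(stopCluster(cl))
    parLapply(cl, x, fun)
  }
}

n_thread <- pick_threads_sketch(2L)
res <- run_parallel_sketch(1:4, function(i) i^2, n_thread)
```

Timing the same call with different nThread values on a representative subset of the data is one way to find a reasonable setting for a given machine.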

The find_exam function offers optional shared RAM management. In case of large datasets (>1M rows), this may provide more efficient RAM usage. It results in slower run times, but may allow the search process to run on more threads. The balance needs to be determined empirically for each machine, but as a rule of thumb, if the datasets supplied to find_exam do not exceed 1M rows then shared RAM management is not beneficial.

NOTE!!!

On macOS, data.table may return a warning message similar to: “data.table 1.13.6 using 1 threads (see ?getDTthreads)…” Disregard the warning message as the package does not use functionalities affected by this limitation of macOS.

Detailed functionalities of parseRPDR

The first step of most analyses is requesting data from RPDR.

Requesting data from RPDR

RPDR provides three main data sources:

parseRPDR supports the analysis of detailed patient information text files which can be requested separately, in conjunction with a radiological image or biological specimen data request.

There are two main ways to request data from RPDR:

If we have a known list of MRNs, first we need to format them according to the standards of RPDR. parseRPDR provides the pretty_mrn function to help with this. RPDR requires different standard lengths depending on the type of MRN. For example, MGH MRNs must be 7 digits long, while BWH MRNs are 8 digits long. Also, RPDR requires concatenating the source of the MRN to the beginning of the ID. pretty_mrn takes care of these rules automatically depending on the MRN source and converts a vector of MRNs to the format required by RPDR. The result can then be exported using base R functions such as write.csv, or using fwrite from the data.table package. Detailed functionality can be found in the help documentation of pretty_mrn.

mrns <- sample(1e4:1e7, size = 10) #Simulate MRNs

#MGH format
pretty_mrn(v = mrns, prefix = "MGH")

#BWH format
pretty_mrn(v = mrns, prefix = "BWH")

#Multiple sources using space as a separator
pretty_mrn(v = mrns[1:3], prefix = c("MGH", "BWH", "EMPI"), sep = " ")

#Keeping the length of the IDs despite not adhering to the requirements
pretty_mrn(v = mrns, prefix = "EMPI", id_length = "asis")
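
To make the formatting rules concrete, the following base R sketch pads IDs with leading zeros to a fixed width and prepends the source. It is only an illustration of the rules described above, not the implementation of pretty_mrn; format_mrn_sketch is a hypothetical helper, and the ":" separator is an assumption.

```r
# Hypothetical helper: zero-pad IDs to `width` digits and prepend the source
format_mrn_sketch <- function(v, prefix, width, sep = ":") {
  padded <- formatC(as.integer(v), width = width, format = "d", flag = "0")
  paste(prefix, padded, sep = sep)
}

# MGH MRNs are 7 digits long, BWH MRNs are 8 digits long
mgh_ids <- format_mrn_sketch(c(12345, 7654321), prefix = "MGH", width = 7)
bwh_ids <- format_mrn_sketch(c(12345, 7654321), prefix = "BWH", width = 8)

# Export the formatted IDs to a file that can be uploaded
out_file <- tempfile(fileext = ".csv")
write.csv(data.frame(MRN = c(mgh_ids, bwh_ids)), out_file, row.names = FALSE)
```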

Once we have the IDs, then we can request data from RPDR and continue similarly as if we were to use the query tool of RPDR.

NOTE!!!

Be aware that RPDR uses Enterprise Master Patient Index (EMPI) IDs in the background. This means that the supplied MRNs are converted to EMPI IDs and these are used to fetch data for the patients. As a result, the returned IDs might not match the requested IDs, for example if the patient has a new MGH ID. It is therefore advised to use the EMPI IDs to merge different data sources (provided by all load functions in the column ID_MERGE) and to manually check instances where the requested and returned IDs do not match, to be sure that the right patient data has been retrieved.
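
The check for mismatching IDs can be sketched in base R as follows. The data below is made up for illustration; ID_MERGE is the EMPI column provided by the load functions, while the ID_MGH column name is a hypothetical placeholder for whichever column holds the returned MGH MRN.

```r
# IDs that were sent to RPDR
requested <- c("MGH:0000001", "MGH:0000002", "MGH:0000003")

# Made-up returned data; ID_MGH is a hypothetical column name
returned <- data.frame(
  ID_MERGE = c("100000001", "100000002", "100000003"),
  ID_MGH   = c("MGH:0000001", "MGH:0000009", "MGH:0000003"),
  stringsAsFactors = FALSE
)

# Requested IDs that did not come back, and returned rows that were not requested
not_returned <- setdiff(requested, returned$ID_MGH)
unexpected   <- returned[!returned$ID_MGH %in% requested, ]
```

Rows listed in unexpected are the ones worth checking by hand before merging on ID_MERGE.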

Requesting radiological images from RPDR

As stated above, hospital-provided IDs may change over time. While using EMPIs as IDs to merge different data sources solves this issue, in case of radiological images the supplied mi2b2 only works with MGH or BWH IDs, therefore a complete list of all MRNs the patients had at any time is needed.

NOTE!!!

Since mi2b2 only works with MGH and BWH IDs, if a predefined set of MRNs is used to request data from RPDR (not the query tool), then it is advised to parse out all IDs present in the con and mrn datasources using the all_ids_mi2b2 function of parseRPDR, as described under the load_con function in this document. This is needed as the mi2b2 workbench only works for the requested MRNs: if the MGH MRN changes for a given patient, and we wish to access an image that was saved using the previous MRN, then the most recent MRN won’t grant us access to that image. Therefore, when working with a predefined set of MRNs, after requesting data for all datasources that are needed, the user should parse out a complete list of MRNs and query mi2b2 using this list.

Loading data using parseRPDR

parseRPDR provides individual functions to load each type of dataset into the R environment. For note type files (car, dis, end, hnp, opn, pat, prg, pul, rad and vis), there is a load_notes function that can be used for any type of report file. The load_notes and load_abc functions, where abc is the three-letter abbreviation of the given datasource, are very similar in nature and have the same arguments. In most cases the default values of the input arguments are satisfactory and only the location of the data needs to be specified. Nevertheless, the arguments provide full flexibility if needed.

The package provides sample datasets for all supported datasources. Currently “mrn”, “con”, “dem”, “enc”, “rdt”, “lab”, “med”, “dia”, “rfv”, “prc”, “phy”, “lno”, “car”, “dis”, “end”, “hnp”, “opn”, “pat”, “prg”, “pul”, “rad” and “vis” are supported. data_abc_raw are the unprocessed datasets, while data_abc are the processed example datasets where abc is the three letter abbreviation of the supported datasource. Due to potential issues with PHI and PPI, the example datasets can be downloaded from the Partners Gitlab repository under parserpdr-sample-data. Examples corresponding to the mrn datasource can be found below.

#Print raw data
head(data_mrn_raw)

#Print processed data loaded using load_mrn
head(data_mrn)

As can be seen, parseRPDR makes several modifications to the data.

In the example datasets na and identical arguments are set to FALSE to provide all columns, however by default they are set to TRUE to minimize data sizes and only provide columns with meaningful information.
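
The effect of these two arguments can be sketched in base R. The sketch assumes that na removes columns that contain only missing values and identical removes columns in which all values are the same; this reading of the argument names is an assumption, so please check the documentation of the load functions. The data is made up.

```r
# Made-up table with one constant column and one empty column
d <- data.frame(
  ID_MERGE = c("1", "2", "3"),
  site     = c("MGH", "MGH", "MGH"),   # identical values in every row
  empty    = c(NA, NA, NA),            # only missing values
  value    = c(1.2, NA, 3.4),
  stringsAsFactors = FALSE
)

# Flag columns that carry no information
all_na   <- vapply(d, function(col) all(is.na(col)), logical(1))
constant <- vapply(d, function(col) length(unique(col)) == 1, logical(1))

# Keep only the informative columns
d_small <- d[, !(all_na | constant), drop = FALSE]
```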

load_con function

The con.txt dataset is a unique datasource as it has a Patient_ID_List column which contains all MRNs from all hospitals. As MRNs may change over time, this list may contain alternative IDs for each hospital that were previously used by the patient. Also, there are additional hospital IDs present in this list. parseRPDR converts this list into specific columns, which have “_list” appended to their names. As there may be IDs which are only present in a minority of patients, the load function has the argument perc to specify what fraction of patients should have a given ID for it to be added as an additional column to the output of load_con. The default is 0.6, corresponding to 60%. Also, by default the MRN_Type and MRN columns are parsed so that information from these columns is also provided. This should only be done for one datasource, as all datasources contain the same information in these columns.

#Print raw data containing columns which are processed using the load_con function
data_con_raw[, c("MRN_Type", "MRN", "Patient_ID_List")]

#Print processed ID data
data_con[, grep("ID_.*", colnames(data_con), value = TRUE)]
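
The role of the perc argument can be illustrated with a small base R sketch. It assumes, purely for illustration, that each ID list is a comma-separated string of SOURCE:ID entries; the real layout of Patient_ID_List may differ, and this is not the code used by load_con.

```r
# Made-up ID lists, one string per patient (the format is an assumption)
id_list <- c("MGH:0000001, BWH:00000011", "MGH:0000002", "MGH:0000003, FH:000123")

# Parse each string into a named vector: names are sources, values are IDs
parsed <- lapply(strsplit(id_list, ",\\s*"), function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  setNames(vapply(parts, `[`, character(1), 2),
           vapply(parts, `[`, character(1), 1))
})

# Fraction of patients that have an ID from each source
sources <- unique(unlist(lapply(parsed, names)))
share <- vapply(sources, function(s) {
  mean(vapply(parsed, function(p) s %in% names(p), logical(1)))
}, numeric(1))

# Only sources present in at least 60% of patients become columns
perc <- 0.6
keep <- names(share)[share >= perc]
```

With perc = 1 only sources present for every patient would be kept, while a value close to 0 would keep all of them.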

NOTE!!!

As mentioned previously, if a predefined set of MRNs are used to acquire data and radiological images are also required, then a full set of MGH or BWH IDs are needed to cover all possible IDs that the patients had during their encounters at Partners hospitals. For this the IDs should be gathered from all sources and combined into one list that can be used to request the mi2b2 workbench. For this we can use the all_ids_mi2b2 function. It requires the parsed con and mrn data.tables to provide a complete list of MRNs that the patients had during their visits to Partners hospitals. This list can then be used as a new data query for radiological images.

#Initially requested IDs
data_con$ID_con_MGH

#All MGH MRNs that the patients had anytime during their visits to Partners hospitals
#Due to fake sample data it is the same as above
all_MGH_mrn <- all_ids_mi2b2(type = "MGH", d_mrn = data_mrn, d_con = data_con)
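
Conceptually, the result is the union of every MGH ID found in the two tables. A base R sketch of this idea on made-up data is shown below; ID_con_MGH appears in the example above, while ID_con_MGH_list and ID_mrn_MGH are hypothetical column names, and the code is not the implementation of all_ids_mi2b2.

```r
# Made-up parsed con and mrn tables
con <- data.frame(
  ID_MERGE        = c("1", "2"),
  ID_con_MGH      = c("0000001", "0000002"),
  ID_con_MGH_list = c("0000001", "0000022"),  # hypothetical column name
  stringsAsFactors = FALSE
)
mrn <- data.frame(
  ID_MERGE   = c("1", "2"),
  ID_mrn_MGH = c("0000011", NA),              # hypothetical column name
  stringsAsFactors = FALSE
)

# Combine all MGH IDs, drop duplicates and missing values
all_mgh <- unique(c(con$ID_con_MGH, con$ID_con_MGH_list, mrn$ID_mrn_MGH))
all_mgh <- all_mgh[!is.na(all_mgh)]
```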

load_notes function

RPDR provides several types of note files: car, dis, end, hnp, opn, pat, prg, pul, rad and vis. The load_notes function provides a single interface to load all of them. Only the type of note needs to be specified, and a standard output is returned for each type.

#Using defaults
d_hnp <- load_notes(file = "test_Hnp.txt", type = "hnp")
#Use sequential processing
d_hnp <- load_notes(file = "test_Hnp.txt", type = "hnp", nThread = 1)
#Use parallel processing and parse data in MRN_Type and MRN columns and keep all IDs
d_hnp <- load_notes(file = "test_Hnp.txt", type = "hnp", nThread = 20, mrn_type = TRUE, perc = 1)

load_all function

parseRPDR provides a convenient function to load all RPDR data using a single line of code. The load_all function can be used for loading different datasources at once and/or to load multiple files of the same datasource, which occurs if our query results in more than 25,000 patients. The function requires a folder path instead of a file path. Also, we can use the which_data argument to specify which datasources we wish to load into the list. Currently “mrn”, “con”, “dem”, “enc”, “rdt”, “lab”, “med”, “dia”, “rfv”, “prc”, “phy”, “lno”, “car”, “dis”, “end”, “hnp”, “opn”, “pat”, “prg”, “pul”, “rad” and “vis” are supported. In case there are multiple files for a given datasource, add a “_” and a number to the file names so that they are merged into a single output in the order of the provided numbers.
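
Ordering by the numeric suffix matters because plain alphabetical sorting would place file 10 before file 2. The base R sketch below shows one way to order such files; the file names are hypothetical and the code is not taken from load_all.

```r
# Hypothetical file names of one datasource split into several files
files <- c("study_Con_10.txt", "study_Con_2.txt", "study_Con_1.txt")

# Extract the number after the last "_" and order the files by it
suffix <- as.integer(sub("^.*_([0-9]+)\\.txt$", "\\1", files))
ordered_files <- files[order(suffix)]
```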

The load_all function is parallelized for efficiency. It allows two different forms of parallelization: either on the level of different datasources (i.e. mrn, con, dem are loaded in parallel) or within the datasources (if multiple files are present per datasource, i.e. con_1, con_2 etc.). Using the many_sources argument we can specify which to use. If set to TRUE, then parallelization is done on the level of different datasources; if FALSE, then parallelization is done within the datasources. It should only be set to FALSE if there is more than one txt file per datasource. If the user does not wish to use parallel processing (i.e. for a small dataset), then setting nThread=1 results in sequential processing. However, be aware that the individual load function calls within load_all always run sequentially, that is, each loading sub-process uses 1 thread, so that load functions running in parallel do not cause issues.

NOTE!!!

Large datasets may crash R as they exceed the available memory. In this case consider loading specific data sources separately and filtering out cases to decrease the amount of memory needed.

RPDR returns large requests in multiple files (when the number of patients exceeds 25,000). These are arranged so that a given individual's data is present in only one batch. Therefore, in case of large queries, it is advised to process the data batch by batch: do all calculations on a given batch (i.e. its mrn, lab etc. datasources), repeat this for each batch, and then concatenate the results from each batch of 25,000 patients to obtain the final database.

#Load all Con, Dem and Mrn datasets processing all files within given datasource in parallel
load_all(folder = folder_rpdr, which_data = c("con", "dem", "mrn"), nThread = 2, many_sources = FALSE)

#Load all supported file types parallelizing on the level of datasources
load_all(folder = folder_rpdr, nThread = 2, many_sources = TRUE)
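
The batch-wise workflow recommended in the note above can be sketched in base R. The batches are made-up data frames standing in for the data loaded from each batch of files, and summarize_batch_sketch is a hypothetical helper; the point is only that each batch is reduced to a small result before the results are concatenated.

```r
# Made-up batches; each patient appears in exactly one batch
batches <- list(
  data.frame(ID_MERGE = c("1", "1", "2"), lab_value = c(1, 3, 5)),
  data.frame(ID_MERGE = c("3", "3"),      lab_value = c(2, 4))
)

# Hypothetical per-batch calculation: mean lab value per patient
summarize_batch_sketch <- function(batch) {
  means <- tapply(batch$lab_value, batch$ID_MERGE, mean)
  data.frame(ID_MERGE = names(means), mean_lab = as.vector(means),
             stringsAsFactors = FALSE)
}

# Process one batch at a time, then concatenate the small results
final <- do.call(rbind, lapply(batches, summarize_batch_sketch))
```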

Manipulating data using parseRPDR

parseRPDR provides a family of convert_abc functions, where abc is the three-letter abbreviation of the given datasource, which execute common manipulations on a given datasource. Besides these functions the package also provides functions to assist common tasks which are not dependent on a given datasource. The arguments are standard among the functions:

convert_enc - Parsing ICD codes and finding disease diagnoses

parseRPDR provides the convert_enc function to format ICD codes provided by RPDR in the encounter tables. It also identifies disease groups by searching through the registered diagnoses and providing an indicator column whether that encounter has any of the ICD codes associated with that given disease. For this we need to specify the following arguments:

The function returns the data.table provided in the argument d with the new parsed ICD columns starting with ICD_, the original ICD columns if requested, and the new indicator columns if a list is supplied in the codes_to_find argument.

The function can also provide summary information by collapsing the data.table based on the column specified in collapse. For example, if the ID_MERGE column is given, then a data.table is returned with one row for each unique value of that column, together with indicator columns showing whether that ID had any of the given ICD codes at any time. Furthermore, time_type defines whether the earliest or latest time, as defined by code_time, is returned if multiple occurrences are present. Also, if for example a complex ID is created prior to the function call based on the ID and the encounter timepoint, then the function can collapse the results based on this complex ID and return whether any of the ICD codes were registered during encounters on the same day.

#Parse encounter ICD columns and keep original ones as well
data_enc_parse <- convert_enc(d = data_enc, keep = TRUE, nThread = 2)

#Parse encounter ICD columns and discard original ones,
#and create indicator variables for the following diseases
diseases <- list(HT = c("I10"), Stroke = c("434.91", "I63.50"))
data_enc_disease <-  convert_enc(d = data_enc, keep = FALSE, codes_to_find = diseases, nThread = 2)

#Parse encounter ICD columns and discard original ones,
#and create indicator variables for the following diseases and summarize per patient
#whether there are any encounters where the given diseases were registered
diseases <- list(HT = c("I10"), Stroke = c("434.91", "I63.50"))
data_enc_disease <-  convert_enc(d = data_enc, keep = FALSE, codes_to_find = diseases, nThread = 2,
                                 collapse = "ID_MERGE", time_type = "earliest")
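
The indicator and collapse logic can be illustrated on made-up data in base R. This is a simplified sketch with a single hypothetical ICD column, not the implementation of convert_enc.

```r
# Made-up encounters with one ICD code each
enc <- data.frame(
  ID_MERGE       = c("1", "1", "2", "3"),
  time_enc_admit = as.Date(c("2020-01-05", "2019-03-01", "2020-02-01", "2021-06-10")),
  ICD            = c("I10", "I10", "434.91", "E11.9"),   # hypothetical column name
  stringsAsFactors = FALSE
)
diseases <- list(HT = c("I10"), Stroke = c("434.91", "I63.50"))

# One indicator column per disease group
for (d in names(diseases)) enc[[d]] <- enc$ICD %in% diseases[[d]]

# Collapse per patient: any occurrence, plus the earliest time of hypertension
per_patient <- do.call(rbind, lapply(split(enc, enc$ID_MERGE), function(p) {
  data.frame(
    ID_MERGE = p$ID_MERGE[1],
    HT       = any(p$HT),
    time_HT  = if (any(p$HT)) min(p$time_enc_admit[p$HT]) else as.Date(NA),
    Stroke   = any(p$Stroke),
    stringsAsFactors = FALSE
  )
}))
```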

convert_dia - Searching for given diagnosis codes

Similar to convert_enc, the convert_dia function is used to search for given diagnosis groups within the diagnosis data.table. The difference is that the diagnoses do not need to be parsed as they are stored separately. Also, since not only ICD codes are present, the code type needs to be defined as well. For this, instead of a simple code, a complex code should be given in the form of code_type:code, i.e. ICD9:250.00. All other functionality is similar to convert_enc.

#Search for Hypertension and Stroke ICD codes
diseases <- list(HT = c("ICD10:I10"), Stroke = c("ICD9:434.91", "ICD10:I63.50"))
data_dia_parse <- convert_dia(d = data_dia, codes_to_find = diseases, nThread = 2)

#Search for Hypertension and Stroke ICD codes and summarize per patient providing earliest time
diseases <- list(HT = c("ICD10:I10"), Stroke = c("ICD9:434.91", "ICD10:I63.50"))
data_dia_disease <-  convert_dia(d = data_dia, codes_to_find = diseases, nThread = 2,
                                 collapse = "ID_MERGE", time_type = "earliest")
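
The code_type:code convention can be illustrated in base R by pasting the code type and the code together before matching. The column names dia_code_type and dia_code are hypothetical, and the sketch is not the implementation of convert_dia.

```r
# Made-up diagnosis table; column names are hypothetical
dia <- data.frame(
  ID_MERGE      = c("1", "2", "3"),
  dia_code_type = c("ICD10", "ICD9", "ICD10"),
  dia_code      = c("I10", "434.91", "E11.9"),
  stringsAsFactors = FALSE
)

# Build the complex code and compare it with the requested codes
dia$complex_code <- paste(dia$dia_code_type, dia$dia_code, sep = ":")
diseases <- list(HT = c("ICD10:I10"), Stroke = c("ICD9:434.91", "ICD10:I63.50"))
for (d in names(diseases)) dia[[d]] <- dia$complex_code %in% diseases[[d]]
```

Including the code type prevents a code from one coding system from matching an identical string in another.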

convert_rfv - Searching for given reason for visit codes

Similar to convert_enc, the convert_rfv function is used to search for given reason for visit groups within the reason for visit data.table. The difference is that the reasons for visit do not need to be parsed as they are stored separately.

#Parse reason for visit columns and create indicator variables for the following reasons
#and summarize per patient, whether there are any encounters where the given reasons were registered
reasons <- list(Pain = c("ERFV:160357", "ERFV:140012"), Visit = c("ERFV:501"))
data_rfv_disease <-  convert_rfv(d = data_rfv, keep = FALSE, codes_to_find = reasons, nThread = 2,
                                 collapse = "ID_MERGE")

convert_prc - Searching for given procedures

Similar to convert_enc, the convert_prc function is used to search for given procedures within the procedures data.table. The difference is that the procedures do not need to be parsed as they are stored separately.

#Parse procedure columns and create indicator variables for the following procedures
#and summarize per patient, whether there are any procedures registered
procedures <- list(Anesthesia = c("CPT:00410", "CPT:00104"))
data_prc_procedures <- convert_prc(d = data_prc, codes_to_find = procedures, nThread = 2,
                                   collapse = "ID_MERGE", time_type = "earliest")

convert_phy - Searching for given health history codes

Similar to convert_enc, the convert_phy function is used to search for given health history codes within the health history data.table. The difference is that the health history codes do not need to be parsed as they are stored separately.

#Search for Height and Weight codes and summarize per patient providing earliest time
anthropometrics <- list(Weight = c("LMR:3688", "EPIC:WGT"), Height = c("LMR:3771", "EPIC:HGT"))
data_phy_parse <- convert_phy(d = data_phy, codes_to_find = anthropometrics, nThread = 2,
                              collapse = "ID_MERGE", time_type = "earliest")

convert_lab - Parsing lab results and identifying abnormal values

Laboratory results can be loaded using the load_lab function. However, RPDR only provides the raw results of the tests (i.e. <0.03), which makes them difficult to use in later analyses. parseRPDR provides the convert_lab function, which converts the results to numerical values. This is done by replacing <x or >x notations with the value x, as the only thing we know for certain is that the value is not larger or smaller than the boundary value. When ranges are returned, the upper bound of the range is provided. Also, qualitative results are converted to standard outputs of POS: positive, NEG: negative and BORD: borderline. These converted values are returned in a new column called lab_result_pretty. The function also returns a column lab_result_abn_pretty, which uses the normal range values to determine whether the value is normal or abnormal. Borderline values are considered normal.

#Convert loaded lab results
data_lab_pretty <- convert_lab(d = data_lab)
data_lab_pretty[, c("lab_result", "lab_result_pretty", "lab_result_range", "lab_result_abn_pretty")]
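
The conversion rules described above can be sketched in base R with a few regular expressions. clean_lab_sketch is a hypothetical helper that only covers the simple cases listed in the text; the actual convert_lab function may handle many more formats.

```r
# Hypothetical helper implementing the three rules from the text
clean_lab_sketch <- function(x) {
  x <- trimws(x)
  out <- x
  # "<0.03" or ">500": keep the boundary value
  bound <- grepl("^[<>]=?\\s*[0-9.]+$", x)
  out[bound] <- sub("^[<>]=?\\s*", "", x[bound])
  # "1.0-2.5": keep the upper bound of the range
  rng <- grepl("^[0-9.]+\\s*-\\s*[0-9.]+$", x)
  out[rng] <- sub("^[0-9.]+\\s*-\\s*", "", x[rng])
  # Qualitative results are mapped to standard labels
  out[grepl("^pos", x, ignore.case = TRUE)]  <- "POS"
  out[grepl("^neg", x, ignore.case = TRUE)]  <- "NEG"
  out[grepl("^bord", x, ignore.case = TRUE)] <- "BORD"
  out
}

raw <- c("<0.03", ">500", "1.0-2.5", "Positive", "negative", "Borderline", "4.2")
pretty <- clean_lab_sketch(raw)

# Numeric version; qualitative labels become NA
pretty_num <- suppressWarnings(as.numeric(pretty))
```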

convert_med - Parsing medication results and finding given medications

When putting together a database, a common question is whether a patient has received a given medication or not. parseRPDR provides the convert_med function to search for the presence of given medications. Its use is similar to convert_enc without converting the columns, as medication info columns are not standard. Therefore, we only have to provide the data.table parsed using load_med and a list of medications, which can contain multiple list entries. The function also provides summary statistics using the collapse argument, which checks whether a given medication is present among the rows belonging to each ID in the specified column. If this column is a patient ID, then the summary is done per person. However, composite IDs can also be created and provided, for example to summarize within each encounter of a given patient. This functionality works similarly to convert_enc. The function is case insensitive, therefore differences in capitalization do not matter.

#Define medication groups and add an indicator column whether the
#given medication group was administered
#(Acetaminophen and Paracetamol are names of the same analgesic, which is not an NSAID)
meds <- list(statin    = c("Simvastatin", "Atorvastatin"),
             analgesic = c("Acetaminophen", "Paracetamol"))

data_med_indic <- convert_med(d = data_med, codes_to_find = meds, nThread = 2)
data_med_indic[, c("statin", "analgesic")]

#Summarize per patient if they ever had the given medication groups registered
data_med_indic_any <- convert_med(d = data_med, codes_to_find = meds, nThread = 2,
                                  collapse = "ID_MERGE", time_type = "earliest")
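
A case-insensitive search followed by a per-patient summary can be sketched in base R as below. The data and the med column name are made up, and matching the names as substrings is an assumption about how such a search could work, not a description of convert_med.

```r
# Made-up medication table; the med column name is hypothetical
med <- data.frame(
  ID_MERGE = c("1", "1", "2", "3"),
  med      = c("SIMVASTATIN 20 MG TABLET", "Acetaminophen 500 mg",
               "atorvastatin 40mg", "Metformin 500 mg"),
  stringsAsFactors = FALSE
)
meds <- list(statin = c("Simvastatin", "Atorvastatin"))

# Indicator column per medication group, ignoring capitalization
for (g in names(meds)) {
  med[[g]] <- grepl(paste(meds[[g]], collapse = "|"), med$med, ignore.case = TRUE)
}

# Summarize per patient: was the group ever registered?
statin_any <- tapply(med$statin, med$ID_MERGE, any)
```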

convert_notes - Extracting information from note free text

There is a lot of information in free-text note reports. parseRPDR does not provide natural language processing or other text analysis techniques; however, the convert_notes function splits the reports into sections specified by the user. This function should be used for data imported using the load_notes or load_lno functions. Many reports have identical sections, such as findings or impressions. These potential section names, called anchors, may be provided, in which case the convert_notes function extracts the text following the specific section name, until the next section. This way the report is cut into different parts, which are easier to analyze later. The function needs an array of anchor points, which is different for each type of note. For potential defaults please see the documentation of the function.

A few things to note. The order of the section elements does not matter: the function finds the next closest anchor, therefore any arbitrary order can be given. “report_end” should always be present among the anchors, as it is the last text located in all reports. The anchor points are case sensitive, therefore be careful when specifying them. Multiple spelling versions of an anchor may also be given, in which case the user can merge the resulting columns later to cover all possible differences in spelling. The anchor points may also be standard values reported in the text, such as CAD-RADS on cardiac CTA or EF on cardiac echo.
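
The anchor-based splitting can be sketched in base R as follows. split_by_anchors_sketch is a hypothetical helper that works on a single made-up report; it only illustrates why the order of the anchors does not matter and why matching is case sensitive, and it is not the implementation of convert_notes.

```r
# Hypothetical helper: cut one report into sections at the given anchors
split_by_anchors_sketch <- function(text, anchors) {
  # Position of each anchor in the text (case sensitive, -1 if absent)
  pos <- vapply(anchors, function(a) as.integer(regexpr(a, text, fixed = TRUE)),
                integer(1))
  found <- sort(pos[pos > 0])            # order anchors as they occur in the text
  if (length(found) == 0) return(character(0))
  ends <- c(found[-1] - 1L, nchar(text)) # each section ends where the next begins
  sections <- substring(text, found + nchar(names(found)), ends)
  sections <- trimws(sub("^[[:space:]:]+", "", sections))
  setNames(sections, names(found))
}

report <- "HISTORY: chest pain. FINDINGS: no acute disease. IMPRESSION: normal study. report_end"
sections <- split_by_anchors_sketch(report, c("IMPRESSION", "HISTORY", "FINDINGS", "report_end"))
```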

#Create columns with specific parts of the radiological report defined by anchors
data_rad_parsed <- convert_notes(d = data_rad, code = "rad_rep_txt",
                                 anchors = c("Exam Code", "Ordering Provider", "HISTORY",
                                             "Associated Reports", "Report Below", "REASON",
                                             "REPORT", "TECHNIQUE", "COMPARISON", "FINDINGS",
                                             "IMPRESSION", "RECOMMENDATION", "SIGNATURES",
                                             "report_end"), nThread = 2)

find_exam - Finding exams within a timeframe of a timepoint

One of the most common tasks in creating a database from clinical data, is to find given examinations (lab results, radiological examinations, diagnoses etc.) within a given timeframe of an event or encounter. For this, parseRPDR provides the find_exam function, which tries to find the earliest, closest or all the examinations within a given timeframe of an event or encounter. The function runs the search in parallel on multiple threads which can be specified using nThread. If it is set to 1, then no parallel backends are created and the function is executed sequentially. The function allows flexibility using its input arguments:

Here we present example cases of using the function to locate radiological examinations within a timeframe of given encounters.

#Filter encounters for first emergency visits at one of MGH's ED departments
data_enc_ED <- data_enc[enc_clinic == "MGH EMERGENCY (10020010608)"]
data_enc_ED <- data_enc_ED[!duplicated(data_enc_ED$ID_MERGE)]

#Find all radiological examinations within 3 days of the ED registration
rdt_ED <- find_exam(d_from = data_rdt, d_to = data_enc_ED,
                    d_from_ID = "ID_MERGE", d_to_ID = "ID_MERGE",
                    d_from_time = "time_rdt_exam", d_to_time = "time_enc_admit",
                    time_diff_name = "time_diff_ED_rdt", before = TRUE, after = TRUE,
                    time = 3, time_unit = "days", multiple = "all",
                    nThread = 2, shared_RAM = FALSE)

#Find earliest radiological examinations within 3 days of the ED registration
rdt_ED <- find_exam(d_from = data_rdt, d_to = data_enc_ED,
                    d_from_ID = "ID_MERGE", d_to_ID = "ID_MERGE",
                    d_from_time = "time_rdt_exam", d_to_time = "time_enc_admit",
                    time_diff_name = "time_diff_ED_rdt", before = TRUE, after = TRUE,
                    time = 3, time_unit = "days", multiple = "earliest",
                    nThread = 2, shared_RAM = FALSE)

#Find earliest radiological examinations within 1 day on or after the ED registration
#and add primary diagnosis column from encounters
rdt_ED <- find_exam(d_from = data_rdt, d_to = data_enc_ED,
                    d_from_ID = "ID_MERGE", d_to_ID = "ID_MERGE",
                    d_from_time = "time_rdt_exam", d_to_time = "time_enc_admit",
                    time_diff_name = "time_diff_ED_rdt", before = FALSE, after = TRUE,
                    time = 1, time_unit = "days", multiple = "earliest",
                    add_column = "enc_diag_princ", nThread = 2, shared_RAM = FALSE)

#Find earliest radiological examinations within 1 day on or after the ED registration but
#also provide empty rows for patients with exam data but not within the timeframe
rdt_ED <- find_exam(d_from = data_rdt, d_to = data_enc_ED,
                    d_from_ID = "ID_MERGE", d_to_ID = "ID_MERGE",
                    d_from_time = "time_rdt_exam", d_to_time = "time_enc_admit",
                    time_diff_name = "time_diff_ED_rdt", before = FALSE, after = TRUE,
                    time = 1, time_unit = "days", multiple = "earliest",
                    add_column = "enc_diag_princ", keep_data = TRUE,
                    nThread = 2, shared_RAM = FALSE)

The function only supports intervals that end at or begin at the given encounter. If, for example, the user wishes to find out whether a patient had any radiological examination at least 1 year prior to the encounter, then the search criteria should be set to a large interval prior to the encounter, and then using one line of code the user can filter out all exams which are less than 365 days from the encounter using the resulting time_diff_name column in the returned data.table.
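
The window search and the subsequent filtering on the time difference can be sketched in base R on made-up data. This is a simplified, single-threaded illustration, not the implementation of find_exam, and the sign convention of time_diff (negative for exams before the encounter) is an assumption of the sketch.

```r
# Made-up index encounters and examinations
encounters <- data.frame(
  ID_MERGE       = c("1", "2"),
  time_enc_admit = as.Date(c("2020-01-10", "2020-03-01")),
  stringsAsFactors = FALSE
)
exams <- data.frame(
  ID_MERGE      = c("1", "1", "1", "2"),
  time_rdt_exam = as.Date(c("2018-06-01", "2020-01-08", "2020-01-11", "2020-06-01")),
  stringsAsFactors = FALSE
)

# Pair every exam with the encounter of the same patient
pairs <- merge(exams, encounters, by = "ID_MERGE")
pairs$time_diff <- as.numeric(pairs$time_rdt_exam - pairs$time_enc_admit)  # in days

# All exams within 3 days before or after the encounter
within_3d <- pairs[abs(pairs$time_diff) <= 3, ]

# Closest exam per patient within that window
within_3d <- within_3d[order(within_3d$ID_MERGE, abs(within_3d$time_diff)), ]
closest <- within_3d[!duplicated(within_3d$ID_MERGE), ]

# Exams at least 1 year before the encounter: search backwards, then filter
before <- pairs[pairs$time_diff <= 0, ]
older_than_1y <- before[before$time_diff <= -365, ]
```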

A similar task might be to find out whether a patient had a diagnosis of a given disease within a timeframe of an index encounter or event. For this we can use the find_exam and convert_enc functions to locate all encounters within a timeframe and see whether given diseases occurred within this timeframe.

First we create a data.table containing the index encounters. Then we create a unique ID combining the patient ID and the time variable, so that if a patient has multiple index encounters, they are not combined but handled as separate encounters. Then we search for all encounters (this could also be diagnoses) within a given timeframe. Finally, we convert the ICD codes, search for the given diagnoses and collapse the results based on the combined ID to create a data.table which shows whether a given disease was diagnosed within the timeframe of each index encounter.

#Filter encounters for first emergency visits at one of MGH's ED departments
data_enc_ED <- data_enc[enc_clinic == "MGH EMERGENCY (10020010608)"]

#Create new column adding a time stamp to ID if individual had multiple encounters 
data_enc_ED$ID_MERGE_time <- paste0(data_enc_ED$ID_MERGE, "_", data_enc_ED$time_enc_admit)

#Find all encounters within 30 days after the registration to ED
data_enc_ED_30d <- find_exam(d_from = data_enc, d_to = data_enc_ED,
                             d_from_ID = "ID_MERGE", d_to_ID = "ID_MERGE",
                             d_from_time = "time_enc_admit", d_to_time = "time_enc_admit",
                             time_diff_name = "time_diff_ED_enc", before = FALSE, after = TRUE,
                             time = 30, time_unit = "days", multiple = "all",
                             add_column = "ID_MERGE_time", keep_data = FALSE, nThread = 2)

#Combine encounters and search if any of them registered a given diagnosis for each ID
#and admission time unique ID and return the earliest date of occurrence
enc_cols <- colnames(data_enc)[19:30]
diseases <- list(HT = c("I10"), Stroke = c("434.91", "I63.50"))

data_enc_ED_30d_summ <- convert_enc(d = data_enc_ED_30d, code = enc_cols,
                                    keep = FALSE, codes_to_find = diseases,
                                    collapse = "ID_MERGE_time", nThread = 2)

#Merge original encounter data with 30 day summary data
data_enc_ED_ALL <- data.table::merge.data.table(x = data_enc_ED, y = data_enc_ED_30d_summ,
                                                by = "ID_MERGE_time", all.x = TRUE, all.y = FALSE)

Be aware that RPDR is a clinical research database that utilizes information present in the hospital information systems, therefore there can be anomalies and missing values in the exported datasources. Always conduct quality control of the data. There can be missing MRNs, different patients with identical MRNs, inability to locate individuals in the system, unstandardized examination results, multiple coding of diseases etc. These need to be tested and taken care of to end up with a clean database. Also be aware that parseRPDR was created with the best intention and knowledge of the creator and developer, but it may still have bugs. Therefore, the software does not offer any warranty of any sort and the author cannot be held liable for any claim.

Conclusions

parseRPDR is an R package specifically built to assist large scale analyses on RPDR outputs. Besides providing functions to load and clean datasets, it also provides functions for most commonly used data manipulations. All of its functions support parallelization and therefore provide fast and efficient analyses of RPDR data. The package is updated regularly with new functionalities. For support contact: mkolossvary@mgh.harvard.edu