# Basic VWP Preprocessing

## Before using VWPre

Before using this package a number of steps are required: First, your eye gaze data must have been collected using an SR Research Eyelink eye tracker. Second, your data must have been exported using SR Research Data Viewer software. For this basic example, it is assumed that you have specified an interest period relative to the onset of the critical stimulus in Data Viewer. However, this package is also able to preprocess data without a specified relative interest period. If you have not aligned your data to a particular message in Data Viewer, please refer to the Message Alignment vignette for functions related to this.

The Sample Report should be exported along with all available columns (this will ensure that you have all of the necessary columns for the functions contained in this package to work). Additionally, it is preferable to export to a .txt file rather than a .xlsx file.

The following preprocessing assumes that, in your experiment, interest area IDs and Labels were assigned consistently to the object types displayed on the screen. For example, in a typical VWP experiment, the target was always in interest area 1, the competitor was always in interest area 2, et cetera. This is typically done by dynamically moving the interest areas trial-by-trial to correspond with the position of the objects. If, instead, your interest areas were static and you have columns indicating the location of each object for each trial, you will need to reassign your interest areas. Specific functions for this are available in this package; please see the Interest Areas vignette for illustration. Once that is complete, you can follow the preprocessing procedure below. Note that the functions presented here are capable of handling data with a maximum of 8 interest areas. If you have more than 8 interest areas, it is necessary to adjust the source code to accommodate the number needed (please contact the package maintainer for an example).

Lastly, the functions included here, internally make use of dplyr for manipulating and restructuring data. For more information about dplyr, please refer to its reference manual and extensive collection of vignettes.

First, load the sample report. By default, Data Viewer will assign “.” to missing values; therefore it is important to include this in the na.strings parameter, so R will know how to handle any missing data.

library(VWPre)
VWdat <- read.table("1000HzData.txt", header = T, sep = "\t", na.strings = c(".", "NA"))

However, for the purposes of this vignette we will use the sample dataset included in the package.

data(VWdat)

## Preparing the data

### Verifying and creating necessary columns

In order for the functions in the package to work appropriately, the data need to be in a specific format. The prep_data function examines the presence and class of specific columns (LEFT_INTEREST_AREA_ID, RIGHT_INTEREST_AREA_ID, LEFT_INTEREST_AREA_LABEL, RIGHT_INTEREST_AREA_LABEL, TIMESTAMP, and TRIAL_INDEX) to ensure they are present in the data and appropriately assigned (e.g., categorical variables are encoded as factors). It also checks for columns SAMPLE_MESSAGE, RIGHT_GAZE_X, RIGHT_GAZE_Y, LEFT_GAZE_X, and LEFT_GAZE_Y, which are not required for basic preporcessing, but are needed to use the functions align_msg and custom_ia.

Additionally, the Subject parameter is used to specify the column corresponding to the subject identifier. Typical Data Viewer output contains a column called RECORDING_SESSION_LABEL which is the name of the column containing the subject identifier. The function will rename it Subject and will ensure it is encoded as a factor.

If your data contain a column corresponding to an item identifier please specify it in the Item parameter. In doing so, the function will standardize the name of the column to Item and will ensure it is encoded as a factor. If you don’t have an item identifier column, by default the value of this parameter is NA.

Lastly, a new column called Event will be created which indexes each unique recording sequence and corresponds to the combination of Subject and TRIAL_INDEX. This Event variable is required internally for subsequent operations. Should you choose to define the Event variable differently, you can override the default; however, do so cautiously as this may impact the performance of subsequent operations because it must index each time sequence in the data uniquely. Upon completion, the function prints a summary indicating the results.

dat0 <- prep_data(data = VWdat, Subject = "RECORDING_SESSION_LABEL", Item = "itemid")
## Checking required columns...
##     All required columns are present in the data.
## Checking optional columns...
##     The following optional is not present in the data:  EYE_TRACKED
## Working on required columns...
##     RECORDING_SESSION_LABEL renamed to Subject.
##     itemid renamed to Item.
##     Subject converted to factor.
##     LEFT_INTEREST_AREA_ID converted to numeric.
##     LEFT_INTEREST_AREA_LABEL converted to factor.
##     RIGHT_INTEREST_AREA_ID converted to numeric.
##     RIGHT_INTEREST_AREA_LABEL converted to factor.
##     TIMESTAMP converted to numeric.
##     TRIAL_INDEX converted to numeric.
##     Event variable created from Subject and TRIAL_INDEX
## Working on optional columns...
##     SAMPLE_MESSAGE converted to factor.
##     LEFT_GAZE_X converted to numeric.
##     LEFT_GAZE_Y converted to numeric.
##     RIGHT_GAZE_X converted to numeric.
##     RIGHT_GAZE_Y converted to numeric.
##     LEFT_IN_BLINK converted to numeric.
##     RIGHT_IN_BLINK converted to numeric.
##     LEFT_IN_SACCADE converted to numeric.
##     RIGHT_IN_SACCADE converted to numeric.

### Remove unnecessary columns

At this point, it is safe to remove the columns which were output by Data Viewer, but that are not needed for preprocessing in using this package. Removing these will reduce the amount of system memory consumed and result in a final dataset that consume less disk space. This is done straightforwardly using the function rm_extra_DVcols. By default it will remove all the Data Viewer columns that are not needed for preprocessing (if they are present in the data). However, if desired, it is possible to keep specific columns from this set using the Keep parameter, which accommodates a string or character vector. If using the sample data set included in this package, it is not necessary to do this step, as these columns have already been removed.

dat0 <- rm_extra_DVcols(dat0, Keep = c("RIGHT_PUPIL_SIZE", "LEFT_PUPIL_SIZE"))

### Relabel NA samples as outside any interest area

When the data were loaded, samples that were outside of any interest area were labeled as NA. The relabel_na function examines the interest area columns (LEFT_INTEREST_AREA_ID, RIGHT_INTEREST_AREA_ID, LEFT_INTEREST_AREA_LABEL, and RIGHT_INTEREST_AREA_LABEL) for cells containing NAs. It then assigns 0 to the ID columns and “Outside” to the LABEL columns) to indicate those eye gaze samples which fell outside of the interest areas defined in the study. The number of interest areas you defined in your experiment should be supplied to the parameter NoIA.

dat1 <- relabel_na(data = dat0, NoIA = 4)
## LEFT_INTEREST_AREA_LABEL: Number of levels DO NOT match NoIA.
## RIGHT_INTEREST_AREA_LABEL: Number of levels match NoIA.

Notice that the output informs us that the number of levels LEFT_INTEREST_AREA_LABEL does not match the number of interest areas listed in NoIA. This is because we only have data from the right eye (hence, all samples in LEFT_INTEREST_AREA_LABEL are listed as “Outside”).

### Check encoding of interest areas

The subsequent preprocessing requires that the interest area IDs are numerically coded, with values ranging from 0 (i.e., outside all interest areas) up to a maximum of 8. So, it’s important to check that the IDs present in the data set, conform to this. The check_ia functions does just this and indicates how those IDs are mapped to the interest area labels.

check_ia(data = dat1)
##  RIGHT_IA_ID RIGHT_IA_LABEL
##            0        Outside
##            1     Target_IA
##            2  RhymeComp_IA
##            3  OnsetComp_IA
##            4   Distract_IA
##  LEFT_IA_ID LEFT_IA_LABEL
##           0       Outside
## Interest Area IDs for the right eye are coded appropriately between 0 and 8.
## Interest Area IDs for the left eye are coded appropriately between 0 and 8.
## Interest Area ID and label mapping combinations for the right eye are consistent.
## Interest Area ID and label mapping combinations for the left eye are consistent.

If your interest area IDs do not conform to the required coding, or you would like to create new labels for your existing interest areas, please consult the Interest Areas vignette. That vignette illustrates how to relabel existing interest area codings (as well as remap the gaze data to entirely new interest areas, should you so desire).

## Creating the Time series column

The function create_time_series creates a time series (a new column called Time) which is required for subsequent processing, plotting, and modeling of the data. It is common to export a period of time prior to the onset of the stimulus as a baseline. In this case, an adjustment (equal to the duration of the baseline period) must be applied to the time series, specified in the Adjust parameter. In effect, the adjustment simply subtracts the given value to each time point. So, a positive value will shift the zero point forward (making the initial zero a negative time value), while a negative value will shift the zero point backward (making the initial zero a positive time value). An example illustrating this can be found in the Message Alignment vignette. In the example below, the data were exported with a 100ms pre-stimulus interval.

dat2 <- create_time_series(data = dat1, Adjust = 100)
## 100 ms adjustment applied.

Note that if you have used the align_msg function (illustrated in the Message Alignment vignette), you may need to specify a column name in Adjust. That column can be used to apply the recording event specific adjustment to each trial. Consult that vignette for further details.

The function check_time_series can be used to verify the time series. It outputs the unique start times present in the data. These will be the same standardized time point relative to the stimulus if you have exported your data from Data Viewer with pre-defined interest period relative to a message. By specifying the parameter ReturnData = T, the function can return a summary data frame that can be used to inspect the start time of each event. As you can see below, by providing Adjust with a postive value, we have effectively shifted the zero point forward along the number line, causing the first sample to have a negative time value.

check_time_series(data = dat2)
## # A tibble: 1 x 1
##   Start_Time
##        <dbl>
## 1       -100
## Set ReturnData to TRUE to output full, event-specific information.

Another way to check that your time series has been created correctly is to use the check_msg_time function. By providing the appropriate message text, we can see that the onset of our target now occurs at Time = 0. Note that the Msg parameter can handle exact matches or matches based on regular expressions. As with check_time_series, the parameter ReturnData = T will return a summary data frame that can be used to inspect the message time of each event.

check_msg_time(data = dat2, Msg = "TargetOnset")
## # A tibble: 1 x 2
##   SAMPLE_MESSAGE  Time
##   <fct>          <dbl>
## 1 TargetOnset        0
## Set ReturnData to TRUE to output full, event-specific information.

If you do not remember the messages in your data, you can output all existing messages and their corresponding timestamps using check_all_msgs. Additionally and optionally, the output of the function can be saved using the parameter ReturnData = T.

check_all_msgs(data = dat2)
## # A tibble: 4 x 1
##   SAMPLE_MESSAGE
##   <fct>
## 1 Preview
## 2 TargetOnset
## 3 VowelOnset
## 4 TIMER_search
## Set ReturnData to TRUE to output full, event-specific information.

## Selecting which eye to use

Depending on the design of the study, right, left, or both eyes may have been recorded during the experiment. Data Viewer outputs gaze data by placing it in separate columns for each eye (LEFT_INTEREST_AREA_ID, LEFT_INTEREST_AREA_LABEL, RIGHT_INTEREST_AREA_ID, RIGHT_INTEREST_AREA_LABEL). However, it is preferable to have gaze data in a single set of columns, regardless of which eye was recorded during the experiment. The function select_recorded_eye provides the functionality for this purpose, returning three new columns (IA_ID, IA_LABEL, IA_Data).

The function select_recorded_eye requires that the parameter Recording be specified. This parameter instructs the function about which eye(s) was used to record the gaze data. It takes one of four possible strings: “LandR”, “LorR”, “L”, or “R”. “LandR” should be used when any participant had both eyes recorded. “LorR” should be used when some participants had their left eye recorded and others had their right eye recorded “L” should be used when all participant had their left eye recorded. “R” should be used when all participant had their right eye recorded.

If in doubt, use the function check_eye_recording which will do a quick check to see if LEFT_INTEREST_AREA_ID and RIGHT_INTEREST_AREA_ID contain data. It will then suggest the appropriate Recording parameter setting. When in complete doubt, use “LandR”. The “LandR” setting requires an additional parameter (WhenLandR) to be specified. This instructs the function to select either the right eye or the left eye when data exist for both.

check_eye_recording(data = dat2)
## Checking gaze data using Data Viewer columns LEFT_INTEREST_AREA_ID and RIGHT_INTEREST_AREA_ID.
## The dataset contains recordings for ONLY the right eye.
##  Set the Recording parameter in select_recorded_eye() to 'R'.

After executing, the function prints a summary of the output. While the function check_eye_recording indicated that the parameter Recording should be set to “R”, the example below sets the parameter to “LandR”, which can act as a “catch-all”. Consequently, in the summary, it can be seen that there were only recordings in the right eye.

dat3 <- select_recorded_eye(data = dat2, Recording = "R", WhenLandR = "Right")
## Selecting gaze data using Data Viewer columns LEFT_INTEREST_AREA_ID and RIGHT_INTEREST_AREA_ID and the Recording argument: R
## Gaze data summary for 160 events:
## 0 event(s) contained gaze data for both eyes, for which the Right eye has been selected.
## The final data frame contains 158 event(s) using gaze data from the right eye.
## The final data frame contains 2 event(s) with no samples falling within any interest area during the given time series.

## Trackloss

Prior to binning the data, some researchers might prefer to remove trials with excessive trackloss. Because Data Viewer does not provide a specific column for trackloss, it is possible to determine this using a combination of information, namely, the column In_Blink and/or the X and Y coordinates (Gaze_X and Gaze_Y).
The function mark_trackloss uses this information to determine the status of a given sample. The argument Type can be set to “Blink”, “OffScreen”, or “Both”. When set to “OffScreen” or “Both”, ScreenSize must be supplied as a numeric vector of the X and Y dimensions of the computer sceen used during the experiment.

dat3 <- mark_trackloss(dat3, Type = "Both", ScreenSize = c(1920, 1080))

Once the samples corresponding to trackloss have been identified, events with less than the required amount of quality data can be removed from the data set, using the function rm_trackloss_events. The argument RequiredData represents the percentage of data (non-trackloss) required in order to retain the event. In the example below, each event must contain 75% quality data, in other words, no more than 25% trackloss.

dat3 <- rm_trackloss_events(dat3, RequiredData = 75)

## Binning the data

In order to obtain proportion looks, it is necessary to bin the data. That is, group samples in chunks of time, count the number of samples in each of the interest areas, and calculate the proportions based on the counts. The sampling rate at which the eye gaze data were recorded must be provided. For Eyelink trackers, this is typically 250Hz, 500Hz, or 1000Hz. If in doubt, use the function check_samplingrate to determine it. The sampling rate can then be supplied to the function bin_prop.

check_samplingrate(dat3)
## Sampling rate(s) present in the data are: 1000 Hz.
## Set ReturnData to TRUE to output full, event-specific information.

Note that the check_samplingrate function returns a printed message indicating the sampling rate(s) present in the data. Optionally, it can return a new column called SamplingRate by specifying the parameter ReturnData as TRUE. In the event that data were collected at different sampling rates, this column can be used to subset the dataset by the sampling rate before proceeding to the next processing step.

The function bin_prop calculates the proportion of looks (samples) to each interest area in a particular span of time (bin size). In order to do this, it is necessary to supply the parameters BinSize and SamplingRate. BinSize should be specified in milliseconds, representing the chunk of time within which to calculate the proportions.

Not all bin sizes work for all sampling rates, due to downsampling constraints. If unsure which are appropriate for your current sampling rate, use the ds_options function. When provided with the current sampling rate in SamplingRate (see above), the function will return a printed summary of the bin size options and their corresponding downsampled rate. By default, this returns the whole number downsampling rates users are likely to want; however, it can also return all possible (valid) downsampling rates, even if they are not round numbers.

ds_options(SamplingRate = 1000)
## Suggested binning/downsampling options:
## Bin size: 1 ms; Samples per bin: 1 samples; Downsampled rate: 1000 Hz
## Bin size: 2 ms; Samples per bin: 2 samples; Downsampled rate: 500 Hz
## Bin size: 4 ms; Samples per bin: 4 samples; Downsampled rate: 250 Hz
## Bin size: 5 ms; Samples per bin: 5 samples; Downsampled rate: 200 Hz
## Bin size: 8 ms; Samples per bin: 8 samples; Downsampled rate: 125 Hz
## Bin size: 10 ms; Samples per bin: 10 samples; Downsampled rate: 100 Hz
## Bin size: 20 ms; Samples per bin: 20 samples; Downsampled rate: 50 Hz
## Bin size: 25 ms; Samples per bin: 25 samples; Downsampled rate: 40 Hz
## Bin size: 40 ms; Samples per bin: 40 samples; Downsampled rate: 25 Hz
## Bin size: 50 ms; Samples per bin: 50 samples; Downsampled rate: 20 Hz
## Bin size: 100 ms; Samples per bin: 100 samples; Downsampled rate: 10 Hz

The SamplingRate parameter in bin_prop should be specified in Hertz (see check_samplingrate), representing the original sampling rate of the data and the BinSize should be specified in milliseconds (see ds_options), representing the span of time over which to calculate the proportion. The bin_prop function returns new columns corresponding to each interest area ID (e.g., IA_1_C, IA_1_P). The extension ‘_C’ indicates the count of samples in the bin and the extension ‘_P’ indicates the proportion.

dat4 <- bin_prop(dat3, NoIA = 4, BinSize = 20, SamplingRate = 1000)
## Binning information:
##  Original rate of 1000 Hz with one sample every 1 ms.
##  Downsampled rate of 50 Hz using 20 ms bins.
##  New bins contain 20 samples.
## Binning...
## Calculating proportions...
## There are 103 data points with less than 20 samples per bin.
## These can be examined and/or removed using the column 'NSamples'.
## Subsequent Empirical Logit calculations may be influenced by the number of samples (depending on the number of observations requested).
## These all occur in the last bin of the time series (typical of Data Viewer output).

In performing the calculation, the function effectively downsamples the data. To check this and to know the new sampling rate, simply call the function check_samplingrate again.

check_samplingrate(dat4)
## Sampling rate(s) present in the data are: 50 Hz.
## Set ReturnData to TRUE to output full, event-specific information.

## Empirical logits

Proportions are inherently bound between 0 and 1 and are therefore not suitable for many types of analysis. Logits provide a transformation resulting in an unbounded measure (as well as weights which estimate the variance). The calculations contained in this package are based on: Barr, D. J., (2008) Analyzing ‘visual world’ eyetracking data using multilevel logistic regression, Journal of Memory and Language, 59(4), 457–474. However, they have been modified to allow greater flexibility.

When using an empirical logit transformation it is important to keep two things in mind. The first is the number of observations (or samples) on which to base the calculation. Typically, this is the number of samples per bin, which varies depending on your original sampling rate and bin size.

To determine the number of samples per bin present in the data, use the function check_samples_per_bin.

check_samples_per_bin(dat4)
## There are 20 samples per bin.
## One data point every 20 millisecond(s)
##
## There are data points with less than 20 samples per bin.
## Subsequent Empirical Logit calculations may be influenced by the number of samples (depending on the number of observations requested).
## These all occur in the last bin of the time series (typical of Data Viewer output).

However, a user may choose to define a different number of observations (because the number of samples is inherently linked to the sampling rate). Though, it is important to note that changing this value can drastically impact the results of the transformation and weight calculations. There are some safeguards within the transformation function to prevent users from choosing inadvisable values (though these safeguards can be overridden with the parameter ObsOverride). So, if in doubt, it is safest to use the number of samples present in your data (as indicated by check_samples_per_bin). The second things to keep in mind is the constant to be added in the transformation. Note that by default the calculation uses a constant of 0.5; however, the user can specify a different value to be used.

If you are interested in visualizing the effect of both number of observations and constant on the result of the empirical logit transformation and weight calculations, please refer to the Plotting vignette, which illustrates and discusses the function plot_transformation_app.

The function transform_to_elogit transforms the proportions to empirical logits and also calculates a weight for each value. The weight estimates the variance in each bin (because the variance of the logit depends on the mean). This is particularly important for regression analyses and should be specified in the model call (e.g., weight = 1 / IA_1_wts). As mentioned above, the function takes the number of observations in the parameter ObsPerBin. Here we use the number of samples per bin present in the data.

dat5 <- transform_to_elogit(dat4, NoIA = 4, ObsPerBin = 20)
## Number of Observations equal to Number of Samples.
##  Calculation will be based on Number of Samples.

## Binomial data

Some researchers may prefer to perform a binomial analysis. Therefore the function create_binomial uses (previously calculated) proportions and number of observations to create a success/failure column for each IA. This column is then a suitable response variable for logistic regression of the time series. As with the empirical logit transformation, a user may choose to define a number of observations that is different from the number of samples per bin. Because this can create artifacts in the scaling or more samples than are present in the data, safeguards are in place to prevent users from choosing inadvisable values (though these safeguards can be overridden with the parameter ObsOverride).

dat5a <- create_binomial(data = dat4, NoIA = 4, ObsPerBin = 20)
## Number of Observations equal to Number of Samples.
##  Counts will remain as present in the data.

By default the function will create a success/failure column for each IA in the data; however, it is also possible to create a custom column comparing looks between two specific interest areas. This is done by specifying the parameter CustomBinom with a vector of two integers (e.g., CustomBinom = c(1,2)) in which the two integers correspond to the IDs of the desired interest areas.

## Fastrack function

For advanced users who have worked with the package functions before and who are familiar with the required steps and output, there is a meta-function, called fasttrack, which runs through the previous functions and outputs a dataframe with either empirical logits or binomial data. Note that using this function will still require the user to manually remove unneeded columns (see above). This meta-function takes as parameters all the required arguments to the component functions. Also, this function assumes that dynamic interest areas were used and do not need to be relabelled/reassigned. It also assumes an interest period was defined in Data Viewer relative to the critical stimulus, thus not requiring separate message alignment. Again, this is only recommended for users who have previously worked with visual world data, the functions contained in this package, and are confident that their data meet the requirements/assumptions of the fasttrack function.

dat5b <- fasttrack(data = VWdat, Subject = "RECORDING_SESSION_LABEL", Item = "itemid",
EventColumns = c("Subject", "TRIAL_INDEX"), NoIA = 4, Adjust = 100, Recording = "LandR",
WhenLandR = "Right", BinSize = 20, SamplingRate = 1000,
ObsPerBin = 20, Constant = 0.5, Output = "ELogit")

## Renaming interest area columns

Some may wish to rename the interest area columns created by the functions to something more meaningful than the numeric coding scheme. To do so, use the function rename_columns. This will convert column names like IA_1_C and IA_2_P to IA_Target_C and IA_Rhyme_P, respectively. This will perform the operation on all the IA_ columns for upto 8 interest areas.

dat6 <- rename_columns(dat5, Labels = c(IA1="Target", IA2="Rhyme",
IA3="OnsetComp", IA4="Distractor")) 
## Renaming 4 interest areas.

You can now check the column names in the data.

colnames(dat6) 

[1] “Subject” “LEFT_GAZE_X”
[7] “LEFT_INTEREST_AREA_LABEL” “RIGHT_GAZE_X”
[13] “RIGHT_INTEREST_AREA_LABEL” “SAMPLE_MESSAGE”
[15] “TIMESTAMP” “TRIAL_INDEX”
[17] “talker” “Rating”
[19] “Exp” “Item”
[21] “Event” “Time”
[23] “EyeRecorded” “EyeSelected”
[25] “IA_ID” “IA_LABEL”
[27] “Gaze_X” “Gaze_Y”
[31] “IA_Data” “DS”
[33] “NSamples” “IA_outside_C”
[35] “IA_Target_C” “IA_Rhyme_C”
[37] “IA_OnsetComp_C” “IA_Distractor_C”
[39] “IA_outside_P” “IA_Target_P”
[41] “IA_Rhyme_P” “IA_OnsetComp_P”
[43] “IA_Distractor_P” “Obs”
[45] “IA_outside_ELogit” “IA_outside_wts”
[47] “IA_Target_ELogit” “IA_Target_wts”
[49] “IA_Rhyme_ELogit” “IA_Rhyme_wts”
[51] “IA_OnsetComp_ELogit” “IA_OnsetComp_wts”
[53] “IA_Distractor_ELogit” “IA_Distractor_wts”

## Saving the data

### Subsetting and ordering

Before embarking on a statistical analysis, it is probably necessary to take a couple steps, such as paring down the data to only include the columns which will be needed later and ensuring the data are ordered appropriately. This is straightforward using dplyr.

FinalDat <- dat5 %>%
# Select just the columns you want
select(Subject, Item, Time, starts_with("IA"), Event, TRIAL_INDEX, Rating, Exp) %>%
# Order the data by Subject, Trial, and Time
arrange(Subject, TRIAL_INDEX, Time)

### Saving to a file

Save the resulting dataset to a .rda file and use compression to make it more compact (though this will add to the amount of time it takes to save).

save(FinalDat, file = "FinalDat.rda", compress = "xz")

## Plotting

You are now ready to plot your data. Please refer to the Plotting vignette for details on the various plotting functions contained in the package.