library(webtrackR)
webtrackR is an R package to preprocess and analyze web tracking data, i.e., web browsing histories of participants in an academic study. Web tracking data is oftentimes collected and analyzed in conjunction with survey data of the same participants.
webtrackR is part of a series of R packages to analyse web tracking data.
You can install the development version of webtrackR from GitHub with:
# install.packages("devtools")
devtools::install_github("schochastics/webtrackR")
The CRAN version can be installed with:
install.packages("webtrackR")
wt_dt
The package defines an S3 class called wt_dt which inherits most of the functionality from the data.frame class. A summary and print method are included in the package.
Each row in a web tracking data set represents a visit. Raw data need to have at least the following variables:
panelist_id: the individual from which the data was collected
url: the URL of the visit
timestamp: the time of the URL visit

The function as.wt_dt assigns the class wt_dt to a raw web tracking data set. It also allows you to specify the names of the raw variables corresponding to panelist_id, url and timestamp. Additionally, it turns the timestamp variable into POSIXct format.
All preprocessing functions check if these three variables are present. Otherwise an error is thrown.
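To make the required input format concrete, here is a minimal base-R sketch of the preparation steps described above: renaming the raw variables, checking that the three required variables are present, and converting the timestamp to POSIXct. It does not call as.wt_dt and is not the package's implementation. The helper prepare_raw(), the raw column names (user, link, time) and the example rows are hypothetical.

```r
# Hypothetical raw data with non-standard column names
raw <- data.frame(
  user = c("p1", "p1", "p2"),
  link = c("https://example.com/a", "https://example.com/b", "https://example.org/"),
  time = c("2023-01-01 10:00:00", "2023-01-01 10:00:30", "2023-01-01 11:15:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of webtrackR):
# varnames maps each required name to the corresponding raw column name
prepare_raw <- function(x, varnames, tz = "UTC") {
  # rename the raw columns to the required names
  for (target in names(varnames)) {
    names(x)[names(x) == varnames[[target]]] <- target
  }
  # stop with an error if any required variable is missing
  required <- c("panelist_id", "url", "timestamp")
  missing_vars <- setdiff(required, names(x))
  if (length(missing_vars) > 0) {
    stop("missing required variables: ", paste(missing_vars, collapse = ", "))
  }
  # convert the timestamp column to POSIXct
  x$timestamp <- as.POSIXct(x$timestamp, tz = tz)
  x
}

prepared <- prepare_raw(
  raw,
  varnames = c(panelist_id = "user", url = "link", timestamp = "time")
)
str(prepared)
```

In practice as.wt_dt handles these steps for you; the sketch only spells out what a valid raw data set has to provide.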
Several other variables can be derived from the raw data with the following functions:
add_duration() adds a variable called duration based on the sequence of timestamps. The basic logic is that the duration of a visit is set to the time difference to the subsequent visit, unless this difference exceeds a certain value (defined by argument cutoff), in which case the duration will be replaced by NA or some user-defined value (defined by replace_by).
add_session() adds a variable called session, which groups subsequent visits into a session until the difference to the next visit exceeds a certain value (defined by cutoff).
extract_host(), extract_domain() and extract_path() extract the host, domain and path of the raw URL and add variables named accordingly. See function descriptions for definitions of these terms. drop_query() lets you drop the query and fragment components of the raw URL.
add_next_visit() and add_previous_visit() add the next or the previous URL, domain, or host (defined by level) as a new variable.
add_referral() adds a new variable indicating whether a visit was referred by a social media platform. It follows the logic of Schmidt et al. (2023).
add_title() downloads the title of a website (the text within the <title> tag of a website's <head>) and adds it as a new variable.
add_panelist_data() joins a data set containing information about participants, such as a survey.
classify_visits() categorizes website visits by either extracting the URL's domain or host and matching them to a list of domains or hosts, or by matching a list of regular expressions against the visit URL.
deduplicate() flags or drops (as defined by argument method) consecutive visits to the same URL within a user-defined time frame (as set by argument within). As an alternative to dropping or flagging visits, the function can aggregate the durations of such duplicate visits.
sum_visits() and sum_durations() aggregate the number or the durations of visits, by participant and by a time period (as set by argument timeframe). Optionally, the functions aggregate the number or duration of visits to a certain class of visits.
sum_activity() counts the number of active time periods (defined by timeframe) by participant.

A typical workflow including preprocessing, classifying and aggregating web tracking data looks like this (using the in-built example data):
library(webtrackR)

# load example data and turn it into wt_dt
data("testdt_tracking")
wt <- as.wt_dt(testdt_tracking)

# add duration
wt <- add_duration(wt)

# extract domains
wt <- extract_domain(wt)

# drop duplicates (consecutive visits to the same URL within one second)
wt <- deduplicate(wt, within = 1, method = "drop")

# load example domain classification and classify domains
data("domain_list")
wt <- classify_visits(wt, classes = domain_list, match_by = "domain")

# load example survey data and join with web tracking data
data("testdt_survey_w")
wt <- add_panelist_data(wt, testdt_survey_w)

# aggregate number of visits by day and panelist, and by domain class
wt_summ <- sum_visits(wt, timeframe = "date", visit_class = "type")
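
The duration and session logic described for add_duration() and add_session() can be illustrated with a small base-R sketch. This is a simplified reading of the description, not the package's code: it handles a single participant, the helpers visit_duration() and session_id() and the example timestamps are hypothetical, and the last visit is simply left with a missing duration because it has no successor.

```r
# Hypothetical visits of one participant, sorted by time
visits <- data.frame(
  panelist_id = "p1",
  timestamp = as.POSIXct(
    c("2023-01-01 10:00:00", "2023-01-01 10:00:40",
      "2023-01-01 10:05:40", "2023-01-01 12:00:00"),
    tz = "UTC"
  )
)

# Seconds until the subsequent visit; the last visit has no successor
gap <- c(as.numeric(diff(visits$timestamp), units = "secs"), NA)

# Duration: the gap to the next visit, unless it exceeds the cutoff,
# in which case it is replaced by replace_by
visit_duration <- function(gap, cutoff, replace_by = NA) {
  ifelse(!is.na(gap) & gap > cutoff, replace_by, gap)
}

# Session: a visit starts a new session when the gap since the
# previous visit exceeded the cutoff
session_id <- function(gap, cutoff) {
  prev_gap <- c(NA, gap[-length(gap)])
  1 + cumsum(!is.na(prev_gap) & prev_gap > cutoff)
}

visits$duration <- visit_duration(gap, cutoff = 600)  # 40, 300, NA, NA
visits$session <- session_id(gap, cutoff = 1800)      # 1, 1, 1, 2
print(visits)
```

With a cutoff of 600 seconds, the long pause before the fourth visit is not counted as time spent on the third page, and with a session cutoff of 1800 seconds the same pause opens a second session.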