This package collects some of the functions I have used when working with data.
Currently, there is only one function:
html_decode, which replaces HTML entities such as &amp;amp; with the characters they represent (here, &).
This function is a thin wrapper around C++ code provided by Christoph on Stack Overflow.
An example would be
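The snippet below is a minimal usage sketch rather than the package's official example. The input strings are made up for illustration, and the call assumes that the package providing html_decode() is already attached and that the function accepts a character vector. The exists() guard is only there so the snippet also runs without the package.

```r
# Hypothetical input: a few strings containing common HTML entities.
x <- c("Tom &amp; Jerry", "1 &lt; 2", "&quot;quoted&quot;")

# Assumption: html_decode() is available and works on character vectors.
# The guard lets the snippet run even when the package is not attached.
if (exists("html_decode")) {
  print(html_decode(x))
}
```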
It works very well!
As far as I know, several solutions to this problem already exist, so why wrap yet another function? Because of performance.
First, the existing textutils package contains many functions for handling text data. The one of interest here is textutils::HTMLdecode.
Second, Stack Overflow user Stibu posted a function (here) that uses the xml2 package. It appears as unescape_html2 in the benchmark below.
Third, I took the code from Christoph (here) and wrote an R wrapper around the C++ function. This is html_decode.
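For reference, an xml2-based decoder of this kind can be sketched as follows. This is a sketch under stated assumptions, not necessarily the exact code from the answer: it joins the input with a separator, lets xml2 parse the result as HTML (which decodes the entities), and splits the text back apart. The separator "#_|" and the sample vector strings are illustrative choices, and the xml2 package must be installed.

```r
# Sketch of a vectorised decoder built on xml2 (assumed approach).
unescape_html2 <- function(str) {
  sep <- "#_|"  # assumed separator; must not occur in the input strings
  # Wrap everything in one dummy element so the input is parsed only once.
  html <- paste0("<x>", paste0(str, collapse = sep), "</x>")
  # Parsing as HTML decodes the entities; xml_text() extracts the plain text.
  parsed <- xml2::xml_text(xml2::read_html(html))
  # Split the decoded text back into one element per input string.
  strsplit(parsed, sep, fixed = TRUE)[[1]]
}

# Hypothetical sample input for trying the function out.
strings <- c("Tom &amp; Jerry", "1 &lt; 2")
decoded <- unescape_html2(strings)
```

Parsing one combined document instead of one document per string keeps the overhead of read_html() to a single call, which matters when the input vector is long.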
Now, let's compare the performance of the three functions using the bench package.
    bench::mark(
      html_decode(strings),
      unescape_html2(strings),
      textutils::HTMLdecode(strings)
    )
    #> # A tibble: 3 x 6
    #>   expression                          min   median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr>                     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    #> 1 html_decode(strings)             6.28µs   7.03µs   137740.    2.49KB     13.8
    #> 2 unescape_html2(strings)        101.68µs 106.37µs     9256.  445.59KB     18.9
    #> 3 textutils::HTMLdecode(strings)   4.33ms   4.47ms      223.  379.19KB     35.1
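Clearly, html_decode is the fastest of the three: by median time it is about 15 times faster than unescape_html2 and more than 600 times faster than textutils::HTMLdecode, and it also allocates the least memory.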
When testing the results, I discovered a bug in textutils::HTMLdecode and reported it here. The maintainer fixed it immediately. As of this writing (Feb. 16, 2021), the development version of textutils has this bug fixed, but the CRAN version may not. This means that if you run the benchmark yourself with an earlier version of textutils, you may run into an error; installing the development version resolves it.