Fastest way to parse regex in R

1

I need to apply around 1.6k regular expressions, such as the pair shown below.

I also have around 7k documents (each about half a page long on average) that need to be matched against these regular expressions.

Right now I am using

library(rebus)
library(stringr)

# or1() takes a single character vector of alternatives
regex_exp <- rebus::or1(c("(?i-mx:\\b(?:actroid\\b))", "(?i-mx:\\b(?:robot\\w*\\b))"))

# Wrap the whole alternation in word boundaries
regex_exp <- BOUNDARY %R% regex_exp %R% BOUNDARY

stringr::str_extract_all("This is my text talking about technology, but also about the actroid", regex_exp)

to find matches, but it takes approximately 3.5 minutes per file, which does not scale.

Is there a more efficient library or method for matching regular expressions in R? I also do not know whether using reticulate to do the matching in Python and return the results to R could be faster.

Luisda

Posted 2020-06-11T18:33:21.077

Reputation: 31

This is an example of a task that can be executed in an "embarrassingly parallel" manner: scanning each document is an independent task. If your machine has multiple cores, there are R packages that let you allocate more than one core to the task. There are a couple of ways to do this, and I'm not an expert. Some good information and simple examples can be found here: https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html#embarrassing-parallelism.

– Ben Norris – 2020-06-14T03:53:40.450
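
To illustrate the parallel approach suggested in the comment above, here is a minimal sketch, not a benchmarked solution. It assumes the patterns can be joined into a single alternation and that the documents are already loaded as a character vector. The names `patterns`, `documents`, `extract_matches`, and `n_workers` are hypothetical, and the two patterns are simplified stand-ins for the ones in the question. To stay dependency-free it uses base R's `gregexpr()`/`regmatches()` with `perl = TRUE` instead of stringr; `stringr::str_extract_all()` could be substituted inside the helper.

```r
library(parallel)

# Hypothetical stand-ins for the 1.6k patterns and 7k documents.
patterns <- c("(?i:\\bactroid\\b)", "(?i:\\brobot\\w*\\b)")
documents <- c(
  "This is my text talking about technology, but also about the actroid",
  "Robots and robotics are discussed here",
  "Nothing relevant in this one"
)

# Build one combined alternation up front, rather than once per document.
combined <- paste0("(?:", paste(patterns, collapse = "|"), ")")

# Extract every match of `pattern` from a single document.
extract_matches <- function(doc, pattern) {
  regmatches(doc, gregexpr(pattern, doc, perl = TRUE))[[1]]
}

# Start a small local cluster; the worker count is an assumption
# and should be adjusted to the machine.
n_workers <- 2L
cl <- makeCluster(n_workers)

# Each document is scanned independently, so the work splits across workers.
matches <- parLapply(cl, documents, extract_matches, pattern = combined)

stopCluster(cl)
```

`matches` is a list with one character vector of hits per document. On Unix-like systems `parallel::mclapply()` is an alternative that avoids setting up a cluster. Whether either gives a worthwhile speed-up for this workload would need to be measured.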

No answers