Quite often it is useful to have random data with a known structure for testing and developing algorithms when no real data is readily available.
library(dplyr)

# set the random number seed to make the data set repeatable
set.seed(0)

# two parameters for the size of the internal and external nodes
externalN <- 5
internalN <- 2
maxNodeName <- 10
minNodeName <- 4

# We want to create two data frames with node names
# and an id; the names are random
# strings between 4 and 10 characters long.
# http://stackoverflow.com/questions/29344795/generating-a-random-sequence-of-multiple-characters-in-r
# stringi::stri_rand_strings(10, as.integer(runif(10, 4, 10)))
# as.integer() truncates, so the upper bound is maxNodeName + 1;
# otherwise a name of exactly maxNodeName characters could never occur.
externalNodes <- data.frame(
  idExternal = 1:externalN,
  nodeExternal = stringi::stri_rand_strings(
    externalN, as.integer(runif(externalN, minNodeName, maxNodeName + 1))),
  stringsAsFactors = FALSE)

internalNodes <- data.frame(
  idInternal = 1:internalN,
  nodeInternal = stringi::stri_rand_strings(
    internalN, as.integer(runif(internalN, minNodeName, maxNodeName + 1))),
  stringsAsFactors = FALSE)

# now we do a full merge of the data frames to get all possible combinations
# http://stackoverflow.com/questions/10600060/how-to-do-cross-join-in-r
combinations <- merge(internalNodes, externalNodes, all = TRUE)
This script creates two data frames, each holding a numerical id and random names between 4 and 10 characters long. Because the two data frames share no column names, merge() generates every possible combination (a cross join) of the names in both.
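The cross-join behaviour of merge() is easier to see on a toy example. The sketch below uses two small, made-up data frames (the names left, right and pairs are purely illustrative) and passes by = NULL to make the Cartesian product explicit:

```r
# Toy data frames with no column names in common (hypothetical names).
left  <- data.frame(idLeft = 1:2, nameLeft = c("ab", "cd"),
                    stringsAsFactors = FALSE)
right <- data.frame(idRight = 1:3, nameRight = c("x", "y", "z"),
                    stringsAsFactors = FALSE)

# With no columns to join on, merge() pairs every row of left
# with every row of right.
pairs <- merge(left, right, by = NULL)

nrow(pairs)  # 2 * 3 = 6 rows
```

The number of rows grows as the product of the two input sizes, so this approach is only practical for fairly small lists of names.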
Now it’s possible to compare the names in the list one by one. We create a simple function for this:
# http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
psim <- function(string1, string2) {
  # edit distance between string1 and the best-matching substring of string2
  fsimp <- as.numeric(adist(string1, string2, partial = TRUE))
  # edit distance between the two complete strings
  fsim <- as.numeric(adist(string1, string2))
  data.frame(fsimp = fsimp, fsim = fsim)
}
and then compute the same two distances for every pair using dplyr:
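combinations <- combinations %>%
  # one group per pair, so adist() compares a single pair at a time
  group_by(idExternal, idInternal) %>%
  mutate(psim = as.numeric(adist(nodeInternal, nodeExternal, partial = TRUE))) %>%
  mutate(sim = as.numeric(adist(nodeInternal, nodeExternal))) %>%
  ungroup()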
On a Linux machine this could also be done in parallel (assuming there is more than one CPU) using the parallel library.
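library(parallel)

# mcmapply() forks worker processes, so it does not run in parallel on Windows.
# SIMPLIFY = FALSE keeps one small data frame per pair; bind_rows() stacks them.
system.time(
  combinations <- bind_cols(
    combinations,
    bind_rows(mcmapply(psim,
                       combinations$nodeInternal,
                       combinations$nodeExternal,
                       SIMPLIFY = FALSE, USE.NAMES = FALSE,
                       mc.cores = detectCores())))
)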
This results in two new columns in the combinations data frame, fsimp and fsim, holding the two different distance measures calculated in psim().
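To see how the two measures differ, here is a minimal sketch using base R's adist() on a pair of hand-picked strings (the strings and the variable names full and part are only for illustration):

```r
# Full distance: "lasy" must be turned into the whole of "1 lazy 2",
# which takes four insertions plus one substitution.
full <- as.numeric(adist("lasy", "1 lazy 2"))

# Partial distance: "lasy" only needs to match some substring of
# "1 lazy 2"; the best candidate is "lazy", one substitution away.
part <- as.numeric(adist("lasy", "1 lazy 2", partial = TRUE))

c(full = full, partial = part)
```

The partial distance can never exceed the full one, so it is the more forgiving measure when one name may be embedded in a longer string.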