Random data in R

Quite often it is useful to have random data with a known structure for testing and developing algorithms when no real data is readily available.

# set the random number seed to make the data set repeatable
# (42 is an arbitrary choice; any fixed integer works)
set.seed(42)

# number of external and internal nodes, and the bounds on the node name length
externalN <- 5
internalN <- 2
maxNodeName <- 10
minNodeName <- 4

# We want to create two data frames, each with node names
# and an id. The node names are random
# strings between 4 and 10 characters long (requires the stringi package).
# http://stackoverflow.com/questions/29344795/generating-a-random-sequence-of-multiple-characters-in-r
# The lengths are drawn with sample() so that maxNodeName itself can occur;
# as.integer(runif(n, 4, 10)) truncates and would never return 10.
externalNodes <- data.frame(
  idExternal = 1:externalN,
  nodeExternal = stringi::stri_rand_strings(externalN, sample(minNodeName:maxNodeName, externalN, replace = TRUE)),
  stringsAsFactors = FALSE)

internalNodes <- data.frame(
  idInternal = 1:internalN,
  nodeInternal = stringi::stri_rand_strings(internalN, sample(minNodeName:maxNodeName, internalN, replace = TRUE)),
  stringsAsFactors = FALSE)

# now we do a full merge of the dataframes to get all possible combinations
# http://stackoverflow.com/questions/10600060/how-to-do-cross-join-in-r

# the two data frames share no column names; by = NULL makes the cross join explicit
combinations <- merge(internalNodes, externalNodes, by = NULL)

This script creates two data frames, each holding a numerical id and random names between 4 and 10 characters long. merge() then generates a data frame with all possible combinations (a cross join) of the names in both.
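To make the cross join concrete, here is a minimal sketch with hand-written node tables. The data frames ext and int and their contents are made up for illustration; only the column names follow the script above. It relies on the documented behaviour that merge() returns the Cartesian product when by has length zero.

```r
# Hypothetical miniature versions of the two node tables
ext <- data.frame(idExternal = 1:3,
                  nodeExternal = c("abcd", "efgh", "ijkl"),
                  stringsAsFactors = FALSE)
int <- data.frame(idInternal = 1:2,
                  nodeInternal = c("mnop", "qrst"),
                  stringsAsFactors = FALSE)

# by = NULL asks merge() for the Cartesian product of the rows
crossed <- merge(int, ext, by = NULL)

# every internal node is paired with every external node
nrow(crossed) == nrow(int) * nrow(ext)
```

The row count grows as the product of the two table sizes, so it is worth keeping externalN and internalN small while developing.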

Now the names can be compared pair by pair. We create a simple function for this:

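# http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
# returns both distances for one pair of strings as a named numeric vector
psim <- function(string1, string2) {
  # partial distance: string1 against the best-matching substring of string2
  fsimp <- as.numeric(adist(string1, string2, partial = TRUE))
  # full (generalized Levenshtein) edit distance between the two strings
  fsim <- as.numeric(adist(string1, string2))
  c(fsimp = fsimp, fsim = fsim)
}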

The same two distances can also be computed row by row with dplyr, calling adist() directly:

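library(dplyr)

# each (idExternal, idInternal) group holds one row, so adist() sees one pair at a time
combinations <- combinations %>% group_by(idExternal, idInternal) %>%
  mutate(psim = as.numeric(adist(nodeInternal, nodeExternal, partial = TRUE))) %>%
  mutate(sim = as.numeric(adist(nodeInternal, nodeExternal)))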
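The grouping matters because adist() is not element-wise: given two vectors it returns a matrix of all pairwise distances. The short sketch below, with example strings chosen only for illustration, shows that behaviour and the difference between the full and the partial distance.

```r
# Two vectors give a full distance matrix, not one value per position
d <- adist(c("lazy", "dog"), c("1 lazy 2", "dig", "cat"))
dim(d)  # rows follow the first vector, columns the second

# Full distance: four characters must be inserted to turn "lazy" into "1 lazy 2"
full <- as.numeric(adist("lazy", "1 lazy 2"))

# Partial distance: "lazy" occurs verbatim inside "1 lazy 2",
# so the best substring match needs no edits
partial <- as.numeric(adist("lazy", "1 lazy 2", partial = TRUE))
```

Because the partial distance only has to match a substring of the second string, it is never larger than the full distance for the same pair.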

On a Linux machine psim() can also be applied in parallel (assuming there is more than one CPU core) using the parallel package. mcmapply() relies on forking, so it does not run in parallel on Windows.

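# mc.cores is the number of worker processes (the original run used 32); USE.NAMES = FALSE drops the repeated names
combinations <- bind_cols(combinations, as.data.frame(t(parallel::mcmapply(psim, combinations$nodeInternal,
  combinations$nodeExternal, USE.NAMES = FALSE, mc.cores = parallel::detectCores()))))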

This adds two new columns, fsimp and fsim, to the combinations data frame, holding the two distance measures calculated in psim().
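Where forking is not available, the same result can be sketched with base mapply(), which runs sequentially on any platform. The names toy and psimVec below are hypothetical stand-ins for the combinations data frame and for psim(); the example strings are made up.

```r
# Toy stand-in for the combinations data frame
toy <- data.frame(
  nodeInternal = c("alpha", "alpha", "beta", "beta"),
  nodeExternal = c("alphabet", "gamma", "alphabet", "gamma"),
  stringsAsFactors = FALSE
)

# Same two measures as psim(), returned as a named numeric vector
psimVec <- function(string1, string2) {
  c(fsimp = as.numeric(adist(string1, string2, partial = TRUE)),
    fsim  = as.numeric(adist(string1, string2)))
}

# mapply() returns a 2 x n matrix; transposing gives one row per pair
distances <- t(mapply(psimVec, toy$nodeInternal, toy$nodeExternal,
                      USE.NAMES = FALSE))

# append the two distance columns to the toy data frame
toy <- cbind(toy, as.data.frame(distances))
```

Returning a plain named vector rather than a data frame keeps the simplified mapply() result a numeric matrix, which converts cleanly into two numeric columns.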