# Random data in R

It is often useful to have random data with a known structure for developing and testing algorithms when no real data is readily available.

```
library(dplyr)
# set the random number seed to make the data set repeatable
set.seed(0)

# number of external and internal nodes, and the bounds for the node name length
externalN <- 5
internalN <- 2
maxNodeName <- 10
minNodeName <- 4

# We want two data frames, each with an id and a node name.
# The node names are random strings; as.integer() truncates the
# runif() draw, so the lengths run from minNodeName to maxNodeName - 1.
# http://stackoverflow.com/questions/29344795/generating-a-random-sequence-of-multiple-characters-in-r
# stringi::stri_rand_strings(10, as.integer(runif(10,4,10)))
#
externalNodes <- data.frame(
  idExternal = 1:externalN,
  nodeExternal = stringi::stri_rand_strings(
    externalN, as.integer(runif(externalN, minNodeName, maxNodeName))),
  stringsAsFactors = FALSE)

internalNodes <- data.frame(
  idInternal = 1:internalN,
  nodeInternal = stringi::stri_rand_strings(
    internalN, as.integer(runif(internalN, minNodeName, maxNodeName))),
  stringsAsFactors = FALSE)

# now we do a full merge of the data frames to get all possible combinations
# (the two data frames share no column names, so merge() returns the cross join)
# http://stackoverflow.com/questions/10600060/how-to-do-cross-join-in-r

combinations <- merge(internalNodes, externalNodes, by = NULL)
```

This script creates two data frames, each holding a numerical id and a random name between 4 and 9 characters long (`as.integer()` truncates the `runif()` draw, so the upper bound of 10 is never reached). `merge()` then generates every possible combination (a cross join) of the names in the two data frames.
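To see the cross join in isolation, here is a minimal sketch on two tiny hand-made data frames. The names `left`, `right` and `crossed` are made up for this illustration and are not part of the script above; the only thing it relies on is that `merge()` returns the Cartesian product when there is nothing to join on.

```r
# Toy data frames with no column names in common (hypothetical example data)
left  <- data.frame(idLeft = 1:2, nameLeft = c("a", "b"),
                    stringsAsFactors = FALSE)
right <- data.frame(idRight = 1:3, nameRight = c("x", "y", "z"),
                    stringsAsFactors = FALSE)

# by = NULL asks merge() for the Cartesian product explicitly
crossed <- merge(left, right, by = NULL)

# every row of left is paired with every row of right
stopifnot(nrow(crossed) == nrow(left) * nrow(right))
print(crossed)
```

The row count grows as the product of the two input sizes, so for larger node lists it is worth checking `nrow()` before running any per-pair computation.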

Now the names in each row can be compared pair by pair. We create a simple function for this:

```
# http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
psim <- function(string1, string2)
{
  # partial (substring) edit distance between string1 and string2
  fsimp <- as.numeric(adist(string1, string2, partial = TRUE))
  # placeholder for a second distance measure, not computed here
  fsim <- NA
  simfun <- data.frame(fsimp = fsimp, fsim = fsim)
  return(simfun)
}
```
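It may help to see what `partial = TRUE` changes in `adist()`. The sketch below uses two made-up strings (not taken from the generated data): the full distance counts every edit needed to turn one string into the other, while the partial distance looks for the best match of the first string inside the second.

```r
# Hypothetical example strings: "abc" occurs inside "xxabcxx"
pattern <- "abc"
target  <- "xxabcxx"

# full edit distance: the four surrounding "x" characters must be inserted
fullDist <- as.numeric(adist(pattern, target))

# partial distance: "abc" is found as a substring of the target
partialDist <- as.numeric(adist(pattern, target, partial = TRUE))

# with partial = TRUE the argument order matters: here the long string
# is the pattern and cannot fit inside the short one
reversedDist <- as.numeric(adist(target, pattern, partial = TRUE))

print(c(full = fullDist, partial = partialDist, reversed = reversedDist))
```

Because the partial distance is not symmetric, it is worth deciding which column plays the role of the pattern before comparing internal and external names.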

The same partial distance can then be computed for every pair using dplyr:

```
combinations <- combinations %>%
  group_by(idExternal, idInternal) %>%
  mutate(psim = as.numeric(adist(nodeInternal, nodeExternal, partial = TRUE))) %>%
  ungroup()
```
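For comparison, the same row-by-row idea can be sketched in base R with `mapply()`, which walks over two vectors in step. The data frame `pairs` is a made-up stand-in for `combinations`, using the same column names, so the example runs on its own.

```r
# Hypothetical stand-in for the combinations data frame
pairs <- data.frame(nodeInternal = c("abc", "abd"),
                    nodeExternal = c("xxabcxx", "abc"),
                    stringsAsFactors = FALSE)

# one partial distance per row; USE.NAMES = FALSE keeps the result unnamed
pairs$psim <- mapply(
  function(a, b) as.numeric(adist(a, b, partial = TRUE)),
  pairs$nodeInternal, pairs$nodeExternal,
  USE.NAMES = FALSE)

print(pairs)
```

This avoids creating one group per row, which can become slow in dplyr once the cross join has many rows.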

On a Linux machine this could also be done in parallel (assuming there is more than one CPU core) using the `parallel` package.

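```
library(parallel)
# mcmapply() forks worker processes, so it only runs in parallel on Unix-like systems
nCores <- max(1L, detectCores() - 1L, na.rm = TRUE)
system.time({
  # psim() returns a one-row data frame per pair; bind_rows() stacks them
  distances <- mcmapply(psim, combinations$nodeInternal, combinations$nodeExternal,
                        mc.cores = nCores, SIMPLIFY = FALSE, USE.NAMES = FALSE)
  combinations <- bind_cols(combinations, bind_rows(distances))
})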
```

This adds two new columns to the `combinations` data frame, `fsimp` and `fsim`, one for each distance measure returned by `psim()` (as written, `fsim` is still an `NA` placeholder).
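A quick way to gain confidence in the parallel version is to check it against a serial run on a small input. The following is only a sketch under a few assumptions: the strings and the helper `partialDist()` are invented for the example, the core count is capped at two, and on Windows it falls back to a single core because `mcmapply()` relies on forking.

```r
library(parallel)

# Hypothetical input vectors standing in for the two name columns
internal <- c("abc", "abd", "xyz")
external <- c("xxabcxx", "abc", "xyzzy")

# hypothetical helper: partial edit distance for one pair of strings
partialDist <- function(a, b) as.numeric(adist(a, b, partial = TRUE))

# use at most two cores; a single core on Windows, where forking is unavailable
nCores <- if (.Platform$OS.type == "windows") 1L else
  min(2L, max(1L, detectCores() - 1L, na.rm = TRUE))

serial   <- mapply(partialDist, internal, external, USE.NAMES = FALSE)
parallelRes <- mcmapply(partialDist, internal, external,
                        mc.cores = nCores, USE.NAMES = FALSE)

# both runs should give the same distances in the same order
stopifnot(identical(serial, parallelRes))
```

For inputs this small the forking overhead will outweigh any gain; the speed-up only shows once the cross join has many rows.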