A simple benchmark implementation

Let's create a simple benchmark implementation and simulate text generation as a Markov chain. The idea is the following: we will estimate the probability of a character, c, appearing after a history, h, has been observed, where h has a fixed length. This length is called the memory. For example, if we have a tiny corpus consisting of:

"My name is Pablo"

and fix a memory length of 4, we get a training set that looks like this:

h         c
"My n"    "a"
"y na"    "m"
" nam"    "e"

Our task is to estimate the conditional probability distribution:

P(c | h) = count(h, c) / count(h)

That is, the conditional probability is estimated simply by counting the number of times c appears after h, and dividing by the number of times the history h appears.
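To make this concrete, here is a minimal sketch (using a toy data frame of our own, purely for illustration) that computes one of these probabilities by counting:

# Toy training set from the "My name is Pablo" example (illustration only)
toy <- data.frame(
  history   = c("My n", "y na", " nam"),
  next_char = c("a", "m", "e"),
  stringsAsFactors = FALSE
)

# P(c | h) = count(h, c) / count(h)
count_h  <- sum(toy$history == "y na")
count_hc <- sum(toy$history == "y na" & toy$next_char == "m")
count_hc / count_h  # 1: in this toy corpus, "y na" is always followed by "m"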

The goal of this chapter is to build a benchmark model. We should clarify what we mean, as there is no gold standard for assessing performance in generative models. The way we will evaluate the model is by looking at the quality of the text generated, for which we need information about the context of the problem, namely, the corpus we want to learn to generate text from.

For this example, we will use the text of Alice in Wonderland, the book by Lewis Carroll, which is available online thanks to Project Gutenberg. You can find it on their website: https://www.gutenberg.org/.
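If you prefer to fetch the file programmatically, something along the following lines should work; note that the ebook number (11) and the exact file URL are our assumptions, so check the Project Gutenberg catalog if the link has changed:

# Assumed URL for Project Gutenberg ebook #11 (Alice's Adventures in Wonderland)
dir.create("./data", showWarnings = FALSE)
download.file("https://www.gutenberg.org/files/11/11-0.txt",
              destfile = "./data/alice.txt", mode = "wb")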

We start by loading some libraries:

library(readr)
library(stringr)
library(purrr)
library(tokenizers)
library(dplyr)

Now, we load the data (available on the book's website), and set up a memory length of 5:

orig <- read_lines("./data/alice.txt")
maxlen <- 5

Next, we clean the text: we convert everything to lowercase for simplicity, collapse the lines into a single string (removing newlines), and split it into characters:

text <- orig %>%
  str_to_lower() %>%
  str_c(collapse = " ") %>%
  tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)

We set the variable chars as our set of tokens:

chars <- text %>% unique() %>% sort()
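
It is worth a quick look at the token set; since we kept non-alphanumeric characters when tokenizing, it contains punctuation and the space character in addition to letters:

length(chars)   # number of distinct tokens in the corpus
head(chars, 10) # first few tokens, in alphabetical order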

We now initialize an empty data frame and define a helper function that collapses a tokenized character vector into a single string:

records <- data.frame()

# Collapse a character vector such as c("m", "y", " ", "n")
# into the single string "my n"
vec2str <- function(history){
  str_c(history, collapse = "")
}
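
A quick sanity check that the helper behaves as expected (the example input is our own):

vec2str(c("m", "y", " ", "n"))  # returns "my n"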

Now, we loop through the text and store each history, together with the character that follows it, in the records data frame. We advance in steps of 3 characters rather than 1 to keep the number of training pairs manageable (growing a data frame with rbind inside a loop is slow for large corpora, but it keeps the example simple):

idxs <- seq(1, length(text) - maxlen - 1, by = 3)
for(i in idxs){
  history <- text[i:(i + maxlen - 1)]
  next_char <- text[i + maxlen]
  records <- rbind(records,
                   data.frame(history = vec2str(history),
                              next_char = next_char,
                              stringsAsFactors = FALSE))
}

Finally, we calculate the conditional probabilities introduced previously. This can be done very easily using the dplyr package, as follows:

tot_histories <- records %>%
  group_by(history) %>%
  summarize(total_h = n())
tot_histories_char <- records %>%
  group_by(history, next_char) %>%
  summarize(total_h_c = n())
probas <- left_join(tot_histories, tot_histories_char, by = "history")
probas$prob <- probas$total_h_c / probas$total_h
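
As a sanity check, the estimated probabilities should sum to one within each history; a minimal verification sketch:

probas %>%
  group_by(history) %>%
  summarize(total = sum(prob)) %>%
  filter(abs(total - 1) > 1e-9)  # should return zero rows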

Now, we are ready to start generating text! We define a text-generating function that samples the next character conditional on the history:

generate_next <- function(h){
  sub_df <- probas %>% filter(history == h)
  if(nrow(sub_df) > 0){
    # Sample one character from the estimated distribution P(c | h)
    sample(sub_df$next_char, size = 1, prob = sub_df$prob)
  } else {
    # Unseen history: return an empty string so the generation loop can continue
    ""
  }
}

The preceding function helps us sample characters from the distribution we estimated. 
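For instance, with a memory length of 5, we can ask for the next character after a history that certainly occurs in the corpus (we seed the random number generator only to make the call reproducible; the sampled character will vary with the data):

set.seed(123)
generate_next("alice")  # one character drawn from the estimated P(c | "alice")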

With the following code, we can generate pieces of text of different lengths, to make it more interesting:

n_iter <- 100
for(iter in 1:n_iter){

  # Pick a random seed history from the corpus
  generated <- " "
  start_index <- sample(1:(length(text) - maxlen), size = 1)
  h <- vec2str(text[start_index:(start_index + maxlen - 1)])

  # Sample a random output length between 5 and 10 characters
  random_len <- sample(5:10, 1)

  for(i in 1:random_len){
    next_char <- generate_next(h)
    generated <- str_c(generated, next_char)
    # Slide the window: keep only the last maxlen characters as the new history
    h <- str_sub(str_c(h, next_char), -maxlen)
  }
  cat(generated)
  cat(" ")
}