Generating new text from old

Let's compare our simple Markov chain benchmark against LSTM networks. You can reuse the Markov chain implementation from earlier in this chapter; here we will show how to use the Keras API for this task, as we did in the previous chapter.

We will illustrate this with an example close to one of the authors' hearts, generating names in Spanish.

First, we should load the required libraries:

library(keras)
library(readr)
library(stringr)
library(purrr)
library(tokenizers)

Then, we define a sampling function based on the probabilities we will estimate:

sample_mod <- function(preds, temperature = 0.8){
  # Rescale the scores by the temperature, renormalize, and draw
  # one character index from the resulting distribution
  preds <- log(preds) / temperature
  exp_preds <- exp(preds)
  preds <- exp_preds / sum(exp_preds)
  rmultinom(1, 1, preds) %>%
    as.integer() %>%
    which.max()
}

Strictly speaking, we do not estimate probabilities, but rather scores between 0 and 1 that we can interpret as such. The point is that these scores help us generate text, and they preserve the ordering of the true probabilities: a higher score corresponds to a higher probability.
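
To see what the temperature parameter does, here is a small, self-contained illustration; the score vector is made up for demonstration purposes and is not produced by our model:

# Hypothetical scores, used only to illustrate the effect of temperature
preds <- c(0.6, 0.3, 0.1)
rescale <- function(preds, temperature){
  scaled <- exp(log(preds) / temperature)
  scaled / sum(scaled)
}
round(rescale(preds, temperature = 1.0), 3)  # unchanged: 0.6 0.3 0.1
round(rescale(preds, temperature = 0.3), 3)  # sharper: most of the mass on the top score
round(rescale(preds, temperature = 2.0), 3)  # flatter: scores move closer together

A low temperature makes the sampling more conservative (it almost always picks the most likely character), while a high temperature makes it more adventurous.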

Now, we read the file and do some parsing. This is very similar to before:

orig <- read_lines("./data/Spanish.txt")
text <- orig %>%
  str_to_lower() %>%
  str_c(collapse = " ") %>%
  tokenize_characters(strip_non_alphanum = FALSE, simplify = TRUE)

We define the vocabulary, that is, the collection of tokens in our text:

chars <- text %>%
  str_c(collapse = " ") %>%
  tokenize_characters(simplify = TRUE) %>%
  unique() %>%
  sort()
chars
 [1] "a" "á" "à" "b" "c" "d" "e" "é" "f" "g" "h" "i" "í" "j" "l" "m" "n" "ñ" "o" "ó" "p" "q" "r" "s" "t" "u"
[27] "ú" "v" "x" "y" "z"

We keep the accented vowels, as they are part of the language. Next, we set up the window size and cut the text into overlapping sequences of that size; a toy illustration follows the code below:

max_length <- 5
dataset <- map(
  seq(1, length(text) - max_length - 1, by = 3),
  ~list(name = text[.x:(.x + max_length - 1)], next_char = text[.x + max_length])
)
dataset <- transpose(dataset)
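
To make the windowing concrete, here is a toy example on a made-up string (not our actual data file), using the same max_length and the same step of 3:

# Toy illustration of the windowing, not part of the actual pipeline
toy <- strsplit("maria carmen", "")[[1]]
toy_dataset <- map(
  seq(1, length(toy) - max_length - 1, by = 3),
  ~list(name = toy[.x:(.x + max_length - 1)], next_char = toy[.x + max_length])
)
toy_dataset[[1]]
# $name
# [1] "m" "a" "r" "i" "a"
#
# $next_char
# [1] " "

Each element pairs a window of max_length characters with the character that follows it; the network will learn to predict the latter from the former.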

We use one-hot vectorization to encode our data into numerical inputs for the neural network that we will build using the keras library:

# One-hot vectorization
X <- array(0, dim = c(length(dataset$name), max_length, length(chars)))
y <- array(0, dim = c(length(dataset$name), length(chars)))

Finally, we set up the training set:

for(i in 1:length(dataset$name)){
  X[i,,] <- sapply(chars, function(x){
    as.integer(x == dataset$name[[i]])
  })
  y[i,] <- as.integer(chars == dataset$next_char[[i]])
}
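
As an optional sanity check (not part of the original pipeline), we can verify the shapes of the encoded arrays before training:

# X: one slice per window, max_length time steps, one column per character
dim(X)  # (number of windows, 5, 31)
# y: one row per window, one column per character in the vocabulary
dim(y)  # (number of windows, 31)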

Then, we define the architecture of our network:

model <- keras_model_sequential()
model %>%
  layer_lstm(128, input_shape = c(max_length, length(chars))) %>%
  layer_dense(length(chars)) %>%
  layer_dropout(0.1) %>%
  layer_activation("softmax")

We require a softmax activation at the end, as we would like to interpret the final scores as probabilities; softmax maps the raw outputs to values between 0 and 1 that sum to 1.

We now call the optimizer and compile our model:

optimizer <- optimizer_rmsprop(lr = 0.01)
model %>% compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer
)


Finally, we train our model:

history <- model %>% fit(
  X, y,
  batch_size = 128,
  epochs = 20
)
plot(history)

If everything went well, the history plot will show us a learning curve that looks like the following. This suggests that our model is training correctly:

Figure: Training of our model in Keras

Perhaps we could increase the number of epochs? That's something you can try as an exercise!
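
For instance, one possible variation (the epoch count and validation split here are arbitrary choices, not values from the original run) is to train for longer while holding out part of the data to watch for overfitting:

history_longer <- model %>% fit(
  X, y,
  batch_size = 128,
  epochs = 40,            # arbitrary; try different values
  validation_split = 0.1  # hold out 10% of the windows for validation
)
plot(history_longer)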

It is now time to generate some of the samples:

start_idx <- sample(1:(length(text) - max_length), size = 1)
name <- text[start_idx:(start_idx + max_length - 1)]
generated <- ""
for(i in 1:10){
  # One-hot encode the current window of characters
  x <- sapply(chars, function(x){
    as.integer(x == name)
  })
  dim(x) <- c(1, dim(x))
  # Score the next character and sample it with a low temperature
  preds <- predict(model, x)
  next_idx <- sample_mod(preds, 0.3)
  next_char <- chars[next_idx]

  # Append the sampled character and slide the window by one position
  generated <- str_c(generated, next_char, collapse = "")
  name <- c(name[-1], next_char)
  cat(generated)
  cat(" ")
}

With a simple modification of the preceding code, you can generate names of different lengths. For this, we replace the fixed length with a randomly sampled length, denoted random_len in the following code snippet:

n_iter <- 100
for(iter in 1:n_iter){
  start_idx <- sample(1:(length(text) - max_length), size = 1)
  name <- text[start_idx:(start_idx + max_length - 1)]
  generated <- " "

  random_len <- sample(5:10, 1)

  for(i in 1:random_len){
    x <- sapply(chars, function(x){
      as.integer(x == name)
    })
    dim(x) <- c(1, dim(x))

    preds <- predict(model, x)
    next_idx <- sample_mod(preds, 0.1)
    next_char <- chars[next_idx]

    generated <- str_c(generated, next_char)
    name <- c(name[-1], next_char)
    # cat(generated)
    # cat(" ")
  }
  cat(generated)
}

Among the generated samples (which might be different on your computer) are:

Asarara, Laralaso

These sound rather convincingly like Spanish last names.

Of course, you can adapt the preceding model to any other text, perhaps including text like HTML or LaTeX.

A subtlety here is that we used character-level prediction. You can try to adapt the code to word-level prediction, to make it work like the auto-complete function of a smartphone; a rough sketch of the change in preprocessing is shown below.
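
As a sketch of how the preprocessing might change for word-level prediction (this is an assumption on our part, not code from this chapter, and supposes that orig now holds running text rather than a list of names), we could tokenize into words with tokenize_words and build the vocabulary and windows over words instead of characters:

# Sketch only: word-level preprocessing; the rest of the pipeline
# (encoding, model, sampling) follows the same structure as before
words <- orig %>%
  str_to_lower() %>%
  str_c(collapse = " ") %>%
  tokenize_words(simplify = TRUE)
vocab <- sort(unique(words))
max_length_w <- 3
dataset_w <- map(
  seq(1, length(words) - max_length_w - 1, by = 1),
  ~list(seq = words[.x:(.x + max_length_w - 1)], next_word = words[.x + max_length_w])
)

Note that for a realistic vocabulary, one-hot vectors over words become very large, so replacing them with a layer_embedding at the input of the network is usually a better choice.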
