RNN without derivatives — the cross-entropy method

We will replace the backpropagation part of the preceding neural network with a Monte Carlo algorithm called the cross-entropy method. This is a general-purpose algorithm, introduced by Reuven Rubinstein, that is useful in many settings, especially rare-event simulation. It has proven effective on many reinforcement learning tasks, so why not give it a try?

The method consists of two parts:

  1. Generate a random data sample (trajectories, vectors) according to a specified mechanism.
  2. Update the parameters of the random mechanism based on the data, to produce a better sample in the next iteration. This step involves minimizing the cross-entropy or Kullback–Leibler divergence; for a Gaussian sampling mechanism, it reduces to refitting the mean and covariance to the best-scoring samples.

Let's first illustrate the situation with a small code example. Suppose we want to maximize the function:

# the function we need to maximize
f <- function(theta){
  reward <- -sum((solution - theta)^2)
  return(reward)
}

The maximum of this function, as a function of theta, is reached when theta == solution, since the reward is never positive and equals zero exactly there.

Let's fix a value for solution as the vector:

solution <- c(0.5, 0.1, -0.3)

And set up some initial parameters:

dim_theta <- 3
theta_mean <- matrix(0, dim_theta, 1)
theta_std <- diag(dim_theta)  # initial covariance matrix: the identity

The cross-entropy method we will use works as follows:

  • Generate candidate solutions by sampling from a multivariate normal distribution, using the current estimates of the mean and covariance of the parameters.
  • Draw batch_size candidates per iteration (batch_size is a hyper-parameter).
  • Consider only a fraction of them, called the elite fraction (elite_frac). This is another hyper-parameter.
  • The elite candidates are the sampled solutions that obtained the highest reward in terms of the function we want to maximize.
  • Fit the mean and covariance of these elite parameters and use them to generate the next batch of candidate solutions.

The following code example shows how to implement the logic described previously:

library(MASS) # provides mvrnorm

cem <- function(f, n_iter, theta_mean, theta_std, batch_size=25, elite_frac=0.2){
  dim_theta <- length(theta_mean)
  for(it in 1:n_iter){
    # Get a sample of batch_size candidates using the previous parameters
    thetas <- mvrnorm(n=batch_size, mu=as.vector(theta_mean), Sigma=theta_std)
    rewards <- apply(thetas, 1, f)
    # Now choose the best: the elite fraction with the highest reward
    n_elite <- as.integer(batch_size * elite_frac)
    elite_inds <- order(rewards, decreasing=TRUE)[1:n_elite]
    elite_thetas <- thetas[elite_inds, , drop=FALSE]
    # Update theta_mean, theta_std; the small diagonal term keeps the
    # covariance estimate positive definite
    theta_mean <- colMeans(elite_thetas)
    theta_std <- 0.01*diag(dim_theta) + 0.99*cov(elite_thetas)
  }
  return(theta_mean)
}

We call our function:

cem(f, 300, theta_mean, theta_std)

And we get a very reasonable approximation of those values in relatively few iterations.
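As a quick sanity check (the exact numbers will vary from run to run, since the method is stochastic):

theta_hat <- cem(f, 300, theta_mean, theta_std)
round(theta_hat, 2)
## should be close to c(0.5, 0.1, -0.3)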

For the text generation problem, we need to gather our matrices self$U, self$V, and self$W into one big vector, theta, run the forward pass of the network, and compute the negative of the log loss. We then have a function that maps our vector to a scalar reward, which is exactly the situation shown previously. The train function is the only one that gets modified, and it will look as follows:

train = function(text){
  p <- 1
  smooth_loss <- -log(1.0/self$vocab_size)*self$seq_length # loss at iteration 0
  for(n in 1:self$n_iter){
    if(p + self$seq_length + 1 >= length(text) || n == 1){
      # reset RNN memory
      ## s_prev is the hidden state of the RNN
      s_prev <- matrix(0, self$hidden_size, 1)
      # go from start of data
      p <- 1
    }
    inputs <- unlist(lapply(text[p:(p+self$seq_length)], function(c){which(self$chars==c)}))
    targets <- unlist(lapply(text[(p+1):(p+self$seq_length+1)], function(c){which(self$chars==c)}))
    if(n %% 100 == 0){
      txt <- self$sample_char(s_prev, inputs[[1]], 200)
      ## Find the spaces in the sampled string
      line_breaks <- which(txt==" ")
      if(length(line_breaks) < 2){
        print(txt)
      } else {
        for(ix in 2:length(line_breaks)){
          first_ix <- line_breaks[ix-1]+1
          last_ix <- line_breaks[ix]-1
          print(paste(txt[first_ix:last_ix], collapse=""))
        }
      }
      print('---- sample -----')
      cat("Iteration number: ", n, " ")
      cat("Loss: ", smooth_loss, "\n")
    }
    ## UPDATES
    ## Flatten the weights into one vector; the U, W, V order
    ## matches the unpacking below
    theta_m <- c(as.vector(self$U), as.vector(self$W), as.vector(self$V))
    n_U <- self$hidden_size*self$vocab_size
    n_W <- self$hidden_size*self$hidden_size
    ## Reward for cem: unpack theta into the weight matrices, run the
    ## forward pass, and return the negative log loss (cem maximizes)
    reward_f <- function(theta){
      self$U <- matrix(theta[1:n_U], nrow=self$hidden_size)
      self$W <- matrix(theta[(n_U+1):(n_U+n_W)], nrow=self$hidden_size)
      self$V <- matrix(theta[(n_U+n_W+1):length(theta)], nrow=self$vocab_size)
      return(-self$forward(inputs, targets, s_prev))
    }
    # one cem iteration per window of text; increase for a better (but slower) update
    new_m <- cem(reward_f, 1, theta_m, diag(length(theta_m))*0.01)

    self$U <- matrix(new_m[1:n_U], nrow=self$hidden_size)                    # input to hidden
    self$W <- matrix(new_m[(n_U+1):(n_U+n_W)], nrow=self$hidden_size)        # hidden to hidden
    self$V <- matrix(new_m[(n_U+n_W+1):length(new_m)], nrow=self$vocab_size) # hidden to output

    loss <- self$forward(inputs, targets, s_prev)
    # keep a running, smoothed estimate of the loss for reporting
    smooth_loss <- 0.999*smooth_loss + 0.001*loss
    p <- p + self$seq_length
  }
  return(1)
}
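To make the index arithmetic in the update step explicit, here is a minimal standalone sketch of how a flattened parameter vector splits back into U, W, and V. The sizes here are hypothetical, and we rely on as.vector and matrix sharing R's column-major ordering, just as the code above does:

# hypothetical sizes, for illustration only
hidden_size <- 4
vocab_size <- 6
n_U <- hidden_size * vocab_size   # number of elements in U (input to hidden)
n_W <- hidden_size * hidden_size  # number of elements in W (hidden to hidden)
theta <- rnorm(n_U + n_W + vocab_size * hidden_size)
U <- matrix(theta[1:n_U], nrow = hidden_size)
W <- matrix(theta[(n_U + 1):(n_U + n_W)], nrow = hidden_size)
V <- matrix(theta[(n_U + n_W + 1):length(theta)], nrow = vocab_size)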

You can try this on your own! The results are comparable, although note that it takes much longer to run. This is, unfortunately, typical of algorithms that rely on stochastic optimization, whether Monte Carlo or evolutionary. Often, however, the waiting time is worth it: such methods can escape local extrema and have a high probability of reaching the global optimum.

With the preceding code, you have a good template for trying out different evolutionary algorithms, as in the sketch below.
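For example, here is a minimal sketch of a (1+1) evolution strategy, one of the simplest evolutionary algorithms, applied to the same toy objective f from before. The name one_plus_one_es and the fixed step size sigma are illustrative choices of ours, not part of the preceding code; a full implementation would typically also adapt sigma, for instance with the 1/5 success rule:

one_plus_one_es <- function(f, n_iter, theta, sigma=0.1){
  best_reward <- f(theta)
  for(it in 1:n_iter){
    # mutate the current solution with Gaussian noise
    candidate <- theta + rnorm(length(theta), sd=sigma)
    reward <- f(candidate)
    # keep the mutation only if it improves the reward
    if(reward > best_reward){
      theta <- candidate
      best_reward <- reward
    }
  }
  return(theta)
}

one_plus_one_es(f, 2000, rep(0, 3))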
