Hyper-parameter tuning

Having designed our deep neural network according to the previous sections, we end up with quite a few parameters to tune. Some of them have default or recommended values and do not require expensive fine-tuning. Others strongly depend on the underlying data, the specific application domain, and a set of other components. Thus, the only way to find the best values is to perform model selection, validating against the desired metric computed on the validation data fold.

The following table lists the parameters we might want to consider tuning. Bear in mind that each library or framework may expose additional parameters and its own way of setting them. The table is derived from the tuning options available in H2O and summarizes the most common parameters, though not all of them, for building a deep auto-encoder network in production:

Parameter: activation
Description: The differentiable activation function.
Recommended value(s): Depends on the nature of the data. Popular functions are Sigmoid, Tanh, Rectifier, and Maxout. Each function can be mapped to its corresponding dropout variant; refer to the network design section.

Parameter: hidden
Description: Size and number of the hidden layers.
Recommended value(s): For an autoencoder, the number of layers is odd and the layer sizes are symmetric between the encoding and decoding parts. The size depends on both the network design and the regularization technique. Without regularization, each encoding layer should be smaller than the previous one; with regularization, the capacity can be higher than the input size.

Parameter: epochs
Description: Number of iterations over the training set.
Recommended value(s): Generally between 10 and a few hundred; depending on the algorithm, extra epochs may be required to converge. For model selection via grid search it is better to keep it small (less than 100); when using early stopping we do not need to worry about specifying too many epochs.

Parameter: train_samples_per_iteration
Description: Number of training examples per Map/Reduce iteration.
Recommended value(s): Applies only to distributed learning and strongly depends on the implementation. H2O offers an auto-tuning option. Refer to the Distributed learning via Map/Reduce section.

Parameter: adaptive_rate
Description: Enable the adaptive learning rate.
Recommended value(s): Each library may implement a different strategy; H2O uses ADADELTA by default. With ADADELTA, the additional parameters rho (between 0.9 and 0.999) and epsilon (between 1e-10 and 1e-4) must be specified. Refer to the Adaptive learning section.

Parameter: rate, rate_decay
Description: Learning rate and decay factor (if adaptive learning is disabled).
Recommended value(s): High rates may lead to unstable models, while low rates slow down convergence. A reasonable value is 0.005. The decay factor is the rate at which the learning rate decays across layers.

Parameter: momentum_start, momentum_ramp, momentum_stable
Description: Parameters of the momentum technique (if adaptive learning is disabled).
Recommended value(s): When there is a gap between the starting and the stable momentum value, the momentum ramp is measured in number of training samples; its default is typically a large value, for example 1e6.

Parameter: input_dropout_ratio, hidden_dropout_ratio
Description: Fraction of input nodes (respectively, hidden nodes in each layer) to omit during training.
Recommended value(s): Defaults are 0 for the input layer (keep all features) and around 0.5 for the hidden layers.

Parameter: l1, l2
Description: L1 and L2 regularization parameters.
Recommended value(s): High values of L1 drive many weights to 0, while high values of L2 shrink the weights but keep most of them.

Parameter: max_w2
Description: Maximum sum of squared incoming weights for a node.
Recommended value(s): A useful parameter for unbounded activation functions such as ReLU or Maxout.

Parameter: initial_weight_distribution
Description: The distribution of the initial weights.
Recommended value(s): Typical values are Uniform, Normal, or UniformAdaptive; the latter is generally preferred.

Parameter: loss
Description: The loss function used during back-propagation.
Recommended value(s): Depends on the problem and the nature of the data. Typical functions are CrossEntropy, Quadratic, Absolute, and Huber. Refer to the network design section.

Parameter: rho_sparsity, beta_sparsity
Description: Parameters of sparse auto-encoders.
Recommended value(s): Rho is the average activation frequency and beta is the weight associated with the sparsity penalty.
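To make the mapping concrete, here is a minimal sketch of how these parameters can be passed to H2O's deep learning autoencoder through its Python API. The data paths, the column selection, and the specific values are placeholder assumptions rather than recommendations:

```python
import h2o
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

h2o.init()

# Placeholder paths: replace with your own training and validation data.
train = h2o.import_file("train.csv")
valid = h2o.import_file("valid.csv")

# Illustrative parameter values only; they are not tuned recommendations.
autoencoder = H2OAutoEncoderEstimator(
    activation="TanhWithDropout",      # activation (dropout variant)
    hidden=[64, 16, 64],               # symmetric encoder/decoder layers
    epochs=50,                         # iterations over the training set
    adaptive_rate=True,                # ADADELTA
    rho=0.99, epsilon=1e-8,            # ADADELTA parameters
    input_dropout_ratio=0.0,
    hidden_dropout_ratios=[0.5, 0.5, 0.5],
    l1=1e-5, l2=1e-5,                  # weight regularization
    max_w2=10.0,                       # useful with unbounded activations
    initial_weight_distribution="UniformAdaptive",
    loss="Quadratic"
)
autoencoder.train(x=train.columns, training_frame=train, validation_frame=valid)
```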

These parameters can be tuned using search space optimization techniques. Two of the most basic and popular techniques, both supported by H2O, are grid search and random search.

Grid search is an exhaustive approach: each dimension is given a limited set of possible values, and their Cartesian product generates the search space. The points are evaluated in parallel, and the point with the lowest (best) score is selected, where the scoring function is defined by the validation metric.

On the one hand, the computational cost grows exponentially with the number of dimensions (the curse of dimensionality). On the other hand, the search is embarrassingly parallel: each point can be evaluated independently of the others.
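As a sketch of how such an exhaustive search over a small space can be expressed with H2O's grid search API (reusing the train and valid frames from the earlier snippet; all values are illustrative assumptions):

```python
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

# train, valid: H2OFrames prepared as in the earlier sketch.
# Each list is one dimension; the Cartesian product (3 x 2 x 2 = 12 points)
# is the search space.
hyper_params = {
    "hidden": [[32, 8, 32], [64, 16, 64], [128, 32, 128]],
    "l1": [0.0, 1e-5],
    "input_dropout_ratio": [0.0, 0.2],
}

grid = H2OGridSearch(
    model=H2OAutoEncoderEstimator(activation="Tanh", epochs=50),
    hyper_params=hyper_params,
)
grid.train(x=train.columns, training_frame=train, validation_frame=valid)

# Rank the trained models by the validation metric (lower reconstruction error is better).
print(grid.get_grid(sort_by="mse", decreasing=False))
```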

Alternatively, randomly sampling points from a denser search space can be more efficient and can lead to similar results with much less computation. The number of wasted grid search trials is exponential in the number of search dimensions that turn out to be irrelevant for a particular dataset; not every parameter has the same importance during tuning, and random search is not affected by those low-importance dimensions.

In random search, each parameter must be given a distribution, continuous or discrete depending on its values. The trials are points sampled independently from those distributions.
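H2O exposes random search through the same grid search API by switching the search strategy; each parameter is given a discrete list of candidate values from which trials are sampled (continuous distributions would have to be discretized). A minimal sketch, under the same assumptions as above:

```python
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

# train, valid: H2OFrames prepared as in the earlier sketch.
# Candidate values per dimension; random search samples combinations from this
# space instead of enumerating the full Cartesian product.
hyper_params = {
    "hidden": [[32, 8, 32], [64, 16, 64], [128, 32, 128], [256, 64, 256]],
    "l1": [0.0, 1e-6, 1e-5, 1e-4],
    "l2": [0.0, 1e-6, 1e-5, 1e-4],
    "input_dropout_ratio": [0.0, 0.1, 0.2],
}

# RandomDiscrete = sample combinations at random, up to a fixed budget.
search_criteria = {
    "strategy": "RandomDiscrete",
    "max_models": 30,   # budget: at most 30 trials
    "seed": 42,
}

random_grid = H2OGridSearch(
    model=H2OAutoEncoderEstimator(activation="Tanh", epochs=50),
    hyper_params=hyper_params,
    search_criteria=search_criteria,
)
random_grid.train(x=train.columns, training_frame=train, validation_frame=valid)
```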

The main advantages of random search are:

  • You can fix the budget (maximum number of points to explore or maximum allowed time)
  • You can set a convergence criterion
  • Adding parameters that do not influence the validation performance does not affect the efficiency
  • During tuning, you can add extra parameters dynamically without having to adjust the grid and increase the number of trials
  • If one trial run fails for any reason, it could either be abandoned or restarted without jeopardizing the entire tuning algorithm

Random search is commonly combined with early stopping. Especially in high-dimensional spaces covering many different models, the number of trials needed to converge to a global optimum can be very large. Early stopping terminates the search when the learning curve (training) or the validation curve (tuning) flattens out.

Because we can also constrain the computation budget, we could set a criterion such as: stop when the RMSE has improved over the moving average of the best five models by less than 0.0001, but take no more than one hour.

Metric-based early stopping combined with a maximum runtime generally gives the best trade-off.
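In H2O's random search, such a criterion could be expressed roughly as the following search_criteria dictionary; the exact values mirror the example above and are assumptions rather than recommendations:

```python
# Stop the random search when the best models' RMSE stops improving,
# and cap the total runtime at one hour. Passed as search_criteria to H2OGridSearch.
search_criteria = {
    "strategy": "RandomDiscrete",
    "stopping_metric": "RMSE",
    "stopping_rounds": 5,          # moving average over the best 5 models
    "stopping_tolerance": 1e-4,    # minimum relative improvement
    "max_runtime_secs": 3600,      # hard budget: one hour
}
```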

It is also common to have multi-stage tuning where, for example, you run a random search to identify the sub-space where the best configuration is likely to lie, and then run further tuning stages restricted to that sub-space.
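As a sketch of such a two-stage approach (continuing the hypothetical example above), a second, narrower grid can be restricted to the region around the best configuration found by the coarse random search:

```python
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.deeplearning import H2OAutoEncoderEstimator

# train, valid: H2OFrames prepared as in the earlier sketch.
# Stage 1 (coarse random search, previous snippet) is assumed to have pointed
# to a promising region, e.g. hidden around [64, 16, 64] and l1 around 1e-5.
stage2_hyper_params = {
    "hidden": [[48, 12, 48], [64, 16, 64], [80, 20, 80]],
    "l1": [5e-6, 1e-5, 2e-5],
}

# Stage 2: finer, exhaustive search inside the selected sub-space.
stage2_grid = H2OGridSearch(
    model=H2OAutoEncoderEstimator(activation="Tanh", epochs=100),
    hyper_params=stage2_hyper_params,
)
stage2_grid.train(x=train.columns, training_frame=train, validation_frame=valid)
```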

More advanced techniques exploit sequential, adaptive search/optimization algorithms, where the result of one trial affects the choice of the next trials and/or the hyper-parameters are optimized jointly. There is ongoing research into predetermining the variable importance of hyper-parameters. Also, domain knowledge and manual fine-tuning can be valuable for those systems where automated techniques struggle to converge.
