Evolutionary Strategies (ES) - Evolutionary Computation Elective by Prof. Dennis Wilson, ISAE-SUPAERO

Introduction

Evolutionary Strategies (ES) are a powerful class of stochastic search algorithms that have been used to solve a wide range of optimization problems. While traditional optimization often relies on calculating exact derivatives, ES excels in environments where the objective function is unknown, non-differentiable, or noisy. In the notebook of the elective, we used standard test functions like the Himmelblau function or the Rosenbrock function.

Evolutionary Strategies

Building on what we previously explored in the Genetic Algorithms (GA) section, we now dive into Evolutionary Strategies (ES). While GA often focuses on discrete populations, ES is the fundamental tool for continuous optimization.

From Individual Survival to Population Exploration

As we’ve seen before :

The (1 + 1) ES : is the simplest form of ES, a single parent produces one child and the parent is replaced only if the child is better.
The (1 + \lambda) ES : is a more complex form of ES, a single parent produces $\lambda$ children and the parent is replaced only if the best child is better.

Exploration vs Speed

While the (1 + 1) ES moves quickly, the (1 + \lambda) ES is more exploratory, which avoid getting stuck in local optima. But be careful, generating too many points (will give you better data on the landscape but :) can become computationally expensive (at least for complex simulations).

Approximating the Gradient

What is brilliant with ES is that it allows us to approximate the gradient of a function without calculating its derivative.

Based on Performance

The algorithm samples multiple points around the parent using for example a normal distribution and by observing which offspring perform better than the others, we can identify a promising direction to move in the search space.

Normalization

In order to make this process robust, we normalize the fitness function.

The Approximation

After some non intuitive calculus, we can approximate the gradient of the function using the following formula :

$$ \nabla f = \pm \frac{A \cdot N}{\lambda} $$

with :

$N$ is the matrix of noise vectors sampled from a normal distribution $N(0, 1)$. Here, for the whole population, $N$ is a matrix of size $[\lambda, d]$ where $d$ is the dimension of the problem.
$A$ is the standardized fitness score of the population. It is a vector that contains the standardized fitness score of each individual in the population. This means that for each individual $i$, $A_i = \frac{f(x_i) - \mu(f(x))}{\sigma(f(x))}$ where $f(x)$ is the fitness function and $\mu(f(x))$ and $\sigma(f(x))$ are the mean and standard deviation of the population.
$\lambda$ is the number of offspring.

The Learning Rate

Lastly we define the new center of the future normal distribution by :

$$ x = x + \alpha \frac{A \cdot N}{\lambda} $$

The Algorithm

With all of this we can build our algorithm :

Algorithm 1: Evolutionary Strategies $(\mu, \lambda)$

Initializing parent x in the search space
Set learning rate alpha and the population size of the offspring lambda and the standard deviation of the noise sigma

For each generation g from 1 to G : 

    # 1. Generating the offsprings 
    N = [N1, N2, ..., Nlambda] from a normal distribution N(0, 1)
    F = [f1=0, f2=0, ..., flambda=0]

    For each offspring i from 1 to lambda : 
        ind_i = x + sigma *N[i]
        F[i] = f(ind_i)

    # Here we can keep track of the best individual in the population. 

    End For 

    # 2. Normalization
    mu_f = mean(F)
    sigma_f = std(F)

    A = [A1, A2, ..., Alambda] = [(f1 - mu_f) / sigma_f, (f2 - mu_f) / sigma_f, ..., (flambda - mu_f) / sigma_f]

    # 3. Approximation of the gradient
    gradient_approx = dot(A, N) / lambda

    # 4. Update parent 
    x = x + alpha * gradient_approx

End For 

Return x or the best individual in the population.

To Summary

In this approach we turned a blind and random search into a gradient descent. We no longer need to know the derivative of our landscape, we just feel where we should move.

Rank-Based Updates

In ES, rank-based updates are a robust method for moving toward a solution by focusing on the relative ordering of individuals rather than their absolute fitness scores.

The concept

Instead of focusing on fitness values to approximate a gradient, the algorithm sorts the individuals (offsprings here) from the best to the worst based on their performance.

Selection of $\mu$ individuals : Usually, not all individuals are useful for finding the center of the search. In the notebook, we typically select the best 50% percent of the population.
Weight update : Then, we create a set of weights that decreases logarithmically from the best to the worst selected individual. This ensures that the most promising individuals are given more importance and influence on the algorithm moves.

Why

Reducing noises : By only using ranks, worst individuals do not count as the best ones, and the algorithm is not bothered by individuals with bad performance.
Avoiding local optima : focusing only on the top individuals prevent from getting pull away from a nearby optimum.
Invariance property : I didn’t really notice this one but : rank-based updates are invariant to order preserving transformations. That means if you scale the fitness values (multiply by a scalar take the log), the update remains exactly the same as long as the ranking remains the same.

The Algorithm

Here is the associated algorithm :

Algorithm 2: Rank-based Evolutionary Strategies $(\mu, \lambda)$

Initializing parent x in the search space
Set the learning rate sigma (or standard deviation) and the population size of the offspring lambda and the number of elites mu 

# Compute the weights based on log rank 
w = [w1 = (log(mu + 1/2) - log(1)), w2 = (log(mu + 1/2) - log(2)), ..., wmu = (log(mu + 1/2) - log(mu))] 
w = w / sum(w)

For each generation g from 1 to G : 

    epsilon = Normal(0, 1, size=(lambda, d)) # where d is the dimension of the problem
    f = [f1, f2, ..., flambda]

    # 1. Generating the offsprings
    For each individual i from 1 to lambda : 

        x_i = x + sigma * epsilon[i]
        f[i] = f(x_i)

    end For 

    # 2. Sorting the offsprings
    sorted_f = argsort(f) # where the best is at index 0
    
    # 3. Approximation of the gradient
    gradient_approx = dot(w, epsilon[sorted_f[:mu]]) = N[first]*w1 + N[second]*w2 + ... + N[last]*wmu

    # 4. Update parent 
    x = x + sigma * gradient_approx

End For 

Return x or the best individual in the population.

This algorithm is called Canonical ES.

Introduction#

Evolutionary Strategies#

From Individual Survival to Population Exploration#

Exploration vs Speed#

Approximating the Gradient#

Based on Performance#

Normalization#

The Approximation#

The Learning Rate#

The Algorithm#

To Summary#

Rank-Based Updates#

The concept#

Why#

The Algorithm#

Introduction

Evolutionary Strategies

From Individual Survival to Population Exploration

Exploration vs Speed

Approximating the Gradient

Based on Performance

Normalization

The Approximation

The Learning Rate

The Algorithm

To Summary

Rank-Based Updates

The concept

Why

The Algorithm