
Visualising the Path of a Genetic Algorithm

The paths of 8 runs of a genetic algorithm optimising a multivariate normal mixture function. Three of the runs break out of a local maximum (presumably through mutation) to find the global solution. The circles are the starting points.

We quite regularly use genetic algorithms to optimise over the ad hoc functions we develop when trying to solve problems in applied mathematics. However, it’s a bit disconcerting to have your algorithm roam through a high-dimensional solution space while not being able to picture what it’s doing or how close one solution is to another. With this in mind, and also just out of curiosity, I’ve tried to visualise the path of a genetic algorithm using principal components analysis.

If you are not familiar with this extremely useful technique then there’s plenty of information online. To put it very briefly, PCA rotates the axes of whatever dimensional space you are working in so that the first axis points in the direction of the greatest variation in your data, the second axis in the direction that captures the greatest part of the remaining variation, and so on. The upshot is that if we take the first two axes we should get the best two-dimensional view of the shape of our data.
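
To make that concrete, here is a minimal sketch in R (my own illustration on made-up data, using prcomp rather than anything from the genetic algorithm):

set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)  # 100 made-up points in 5 dimensions
x[, 1] <- 3 * x[, 1]                   # stretch one axis to create a dominant direction of variation
pc <- prcomp(x)
summary(pc)                            # proportion of variance captured by each component
plot(pc$x[, 1], pc$x[, 2],             # the "best" 2D view of the 5D cloud
     xlab = "Principal Component 1", ylab = "Principal Component 2")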

So what I’ve done is run a genetic algorithm several times on the same function, record the path of each run and then apply PCA to visualise these paths. To run the genetic algorithm I’ve used the excellent rgenoud package. The code below implements a function that I hope is flexible enough to apply this process to any function that rgenoud can handle. You just need to specify the function, the number of parameters, the number of runs you want and a few other optional parameters.
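
If you haven’t met rgenoud before, the core call looks something like this minimal sketch (my own example, not from the post). The important detail for what follows is the project.path argument: it tells genoud to write its output to a file, and it is this file, written with print.level=2, that the function below parses to recover the population at each generation.

library(rgenoud)

# Minimal sketch: maximise a trivial one-parameter function
out <- genoud(function(x) -(x - 2)^2, nvars = 1, max = TRUE,
              pop.size = 100, max.generations = 20,
              print.level = 2, project.path = "genoud_run.txt")
out$par    # best solution found (should be close to 2)
out$value  # fitness at that solution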

I’ve tried it on a multivariate normal mixture function (above) and Colville’s function (below). I’d like to try some others when I’ve time and would love to see the results if anyone else would like to make use of it. I would recommend using the parallel option if you can, as otherwise plotting many runs can take some time.

Eight runs eventually converge on the global minimum of the Colville function. The circles are the starting points of the algorithm.

If you enjoyed this then please check out more visualisation in Mark’s latest post.

Now here’s the code for the general function.


# *--------------------------------------------------------------------
# | FUNCTION: visGAPath
# | Function for visualising the path of a genetic algorithm using 
# | principal components analysis
# *--------------------------------------------------------------------
# | Version |Date      |Programmer  |Details of Change                
# |     01  |18/04/2012|Simon Raper |first version.      
# *--------------------------------------------------------------------
# | INPUTS:  func        The function to be optimised
# |          npar        The number of parameters to optimise over
# |          nruns       The number of runs of the GA to visualise
# |          parallel    If true runs on multiple cores
# |          file.path   File name and path for saving GA output
# |          domains     If needed for restricted optimisation
# |                      (see rgenoud documentation for more info)
# |          max         If true searches for maxima. Default=TRUE
# |          mut.weight  Option to upweight the mutation operator
# |                      (see rgenoud documentation for more info)
# |
# *--------------------------------------------------------------------
# | OUTPUTS: A ggplot object
# *--------------------------------------------------------------------
# | USAGE:   visGAPath(func, npar, nruns, parallel, file.path, 
# |          domains, max, mut.weight)
# |
# *--------------------------------------------------------------------
# | DEPENDS: rgenoud, plyr, foreach, doSNOW, ggplot2
# |
# *--------------------------------------------------------------------
# | NOTES:   You may need to customise the multicore part of the code
# |          to suit your machine
# |
# *--------------------------------------------------------------------


visGAPath<-function(func, npar, nruns, parallel=TRUE, file.path='C:\\pro', domains=NULL, max=TRUE, mut.weight=50){
  
  #Function for building a data set of the GA populations from the rgenoud project file
  
  createGenDS<-function(gen.obj, path, run.name){
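    # The project file written by genoud contains a two-line header and, for
    # each generation, a block of pop.size rows of individuals separated by
    # four lines of summary text. The code below indexes those non-numeric
    # lines so they can be dropped before reading the populations with
    # read.table. (This is the layout the parsing assumes; see the rgenoud
    # documentation if your version writes the file differently.)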
    gens<-gen.obj$generations
    pop<-gen.obj$popsize
    var.num<-length(gen.obj$par)
    project.in<-readLines(path)
    z<-2
    skip<-1:2                      # the two header lines of the project file
    for (i in 1:gens){
      z<-z+pop                     # step over this generation's individuals
      skip<-c(skip, z+1:4)         # mark the four summary lines that follow
      z<-z+4
    }
    numeric.only<-read.table(textConnection(project.in[-skip]))
    generation<-rep(0:gens, each=pop)
    gen.df<-data.frame(generation, numeric.only)
    names(gen.df)<-c("Generation", "Ind", "Fit", paste("Var", 1:var.num, sep=""))
    
    # Extract the best individual and the population average for one generation
    bestInGen<-function(gen){
      # NB: takes the highest recorded fitness as "best"; if you are
      # minimising, check the sign convention used in the project file
      best<-which.max(gen$Fit)
      best.ind<-gen[best,3:(var.num+3)]
      avg.ind<-colMeans(gen[,3:(var.num+3)])
      names(avg.ind)<-paste("avg.", names(avg.ind), sep="")
      data.frame(c(best.ind, avg.ind))
    }
    
    bestGen<-ddply(gen.df, .(Generation) ,bestInGen)
    run<-rep(run.name, (gens+1))
    bestGen<-data.frame(run, bestGen)
    bestGen
    
  }
  
  #Run genoud once for a given seed and parse its project file (applied to each run by ddply)
  
  wrap.gen<-function(arg){
    seed<-arg[1,2]
    run<-arg[1,1]
    pro.path<-paste(file.path, run, ".txt", sep="")
    gen<-genoud(func, nvars=npar, pop.size=500, max=max,
                project.path=pro.path, print.level=2, max.generations=100,
                unif.seed=seed, Domains=domains, P2=mut.weight)
    createGenDS(gen, pro.path, run)
  }
  
  runs<-data.frame(run=paste("run", 1:nruns, sep=" "), seed=floor(runif(nruns)*10000))
  
  if (parallel==FALSE){
    
    #Single core
    all<-ddply(runs, .(run), wrap.gen, .parallel = FALSE)
    
  } else {
    #Multi core
    cl <- makeCluster(getOption("cl.cores", min(8, nruns)))
    clusterExport(cl, objects(.GlobalEnv))
    clusterEvalQ(cl, library(plyr))
    clusterEvalQ(cl, library(mvtnorm))  # needed only for the mnMix example below; load whatever your function needs
    clusterEvalQ(cl, library(rgenoud))
    registerDoSNOW(cl)
    all<-ddply(runs, .(run), wrap.gen, .parallel = TRUE)
    
    stopCluster(cl)
    
  }
  
  #Visualise
  
  #PCA over the per-generation population averages (the avg.Var columns of all)
  pc<-princomp(all[(5+npar):(5+npar*2-1)])
  scores<-data.frame(Run=factor(all$run), Fit=all$Fit, pc$scores)
  start<-scores[which(all$Generation==0),]
  
  g<-ggplot(data=scores, aes(x=Comp.1, y=Comp.2, colour=Run, group=Run))+
    geom_path()
  g+geom_point(data=start, size=5, aes(x=Comp.1, y=Comp.2))+
    scale_x_continuous('Principal Component 1')+
    scale_y_continuous('Principal Component 2')
}

… and here is the code for the examples I’ve included.


#Maximise a mixture of multivariate normal distributions

library(mvtnorm)

mnMix<-function(args){
  
  # Three multivariate normal components: means and diagonal covariance
  # matrices (the "std.vec" objects are covariance matrices, not vectors)
  mean.vec.d1<-rep(0.3,5)
  std.vec.d1<-diag(rep(1,5))
  
  mean.vec.d2<-rep(1,5)
  std.vec.d2<-diag(rep(1.5,5))
  
  mean.vec.d3<-c(1, 5, 2, 1, 0)
  std.vec.d3<-diag(rep(0.5, 5))
  
  # Piecewise surface: a sum of the densities when args[1]<0 and a product
  # otherwise, which gives the function more than one local maximum
  if (args[1]<0){
    y<-dmvnorm(args, mean.vec.d1, std.vec.d1)+
      dmvnorm(args, mean.vec.d2, std.vec.d2)+
      dmvnorm(args, mean.vec.d3, std.vec.d3)
  } else {
    y<-dmvnorm(args, mean.vec.d1, std.vec.d1)*
      dmvnorm(args, mean.vec.d2, std.vec.d2)*
      dmvnorm(args, mean.vec.d3, std.vec.d3)
  }
  
  y
  
}

visGAPath(mnMix, 5, 8, parallel=TRUE, file.path='B:\\RFiles\\RInOut\\pro')

#Colville's function: global minimum of 0 at (1, 1, 1, 1)
Colville<-function(args){
  x0<-args[1]
  x1<-args[2]
  x2<-args[3]
  x3<-args[4]
  ret<-100*(x0-x1^2)^2+(1-x0)^2+90*(x3-x2^2)^2+(1-x2)^2+
    10.1*((x1-1)^2+(x3-1)^2)+19.8*(x1-1)*(x3-1)
  ret
}
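
As a quick sanity check on the implementation:

Colville(c(1, 1, 1, 1))  # 0, the global minimum
Colville(c(0, 0, 0, 0))  # 1 + 1 + 20.2 + 19.8 = 42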

domains<-matrix(rep(c(-10,10),4), nrow=4, byrow=TRUE)

visGAPath(Colville, 4, 8, parallel=TRUE, file.path='B:\\RFiles\\RInOut\\pro', domains=domains, max=FALSE)
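
Since visGAPath returns a ggplot object, the result can be customised further; for example (the title here is just an illustration):

p <- visGAPath(Colville, 4, 8, parallel=TRUE,
               file.path='B:\\RFiles\\RInOut\\pro', domains=domains, max=FALSE)
p + ggtitle('GA paths on the Colville function')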



