
Mahout for R Users

I have a few posts coming up on Apache Mahout so I thought it might be useful to share some notes. I came at it primarily as an R coder, with some very rusty Java and C++ somewhere in the back of my head, so that will be my point of reference. I’ve also included at the bottom some notes for setting up Mahout on Ubuntu.

What is Mahout?

A machine learning library written in Java that is designed to be scalable, i.e. to run over very large data sets. It achieves this by ensuring that most of its algorithms are parallelizable (they fit the map-reduce paradigm and can therefore run on Hadoop). Using Mahout you can do clustering, recommendation, prediction etc. on huge datasets by increasing the number of CPUs it runs over. Any job that you can split up into little jobs that can be done at the same time is going to see vast improvements in performance when parallelized.

Like R it’s open source and free!

So why use it?

This should be obvious from the last point. The parallelization trick brings data and tasks that were once beyond the reach of machine learning suddenly into view. But there are other virtues. Java’s strictly object-oriented approach is a catalyst to clear thinking (once you get used to it!). And there is a much shorter path to integration with web technologies. If you are thinking of a product rather than just a one-off piece of analysis then this is a good way to go.

How is it different from doing machine learning in R or SAS?

Unless you are highly proficient in Java, the coding itself is a big overhead. There’s no way around it: if you don’t know it already you are going to need to learn Java, and it’s not a language that flows! For R users who are used to seeing their thoughts realised immediately, the endless declaration and initialisation of objects is going to seem like a drag. For that reason I would recommend sticking with R for any kind of data exploration or prototyping and switching to Mahout as you get closer to production.

What do you need to do to get started?

You’ll need to install the JDK (Java Development Kit) and some kind of Java IDE (I like netbeans). You’ll also need maven (see below) to organise your code and its dependencies. A book is always useful. The only one around, it seems, is Mahout in Action, but it’s good and all the code for the examples is available for download. If you plan to run it on hadoop (which is recommended) then of course you need that too. If you’re going to be using hadoop in earnest you’ll need an AWS account (assuming you don’t have your own grid). Finally you’ll need the mahout package itself. I found this all a lot easier on Linux with its natural affinity with other open source projects. You are welcome to follow my notes below on how to get this all up and running on an AWS Ubuntu instance.

Object Oriented

R is a nice gentle introduction to object-oriented programming. If you’ve declared your own classes and methods using S3 you’re on your way. Even more so if you’ve used S4 (I must admit I haven’t). Even so, there’s a big jump to the OO world of Java. Here are a few tips:

  • To get something that executes, include a method inside your class that begins public static void main(String[] args). An IDE like netbeans will pick this up and allow you to run that file. See here for a Hello World example.
  • Remember every variable needs to be both declared and initialised, and for everything that is not a Java literal this means creating a new instance of an object (I keep forgetting to include new when initialising).
  • The easy R world of a script and a few functions is not an option. Everything should be an object or something pertaining to it. I find the easiest way to make this jump is to imagine I’m making bits of a machine and make an effort to keep this in my head. Everything is now like a little robot with data on the inside and predefined actions and responses.
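
If it helps, here is roughly what those tips look like in practice. This is a toy class I’ve made up purely for illustration (nothing Mahout-specific): data on the inside, predefined actions as methods, and a main method the IDE can run.

```java
// A toy "little robot" class: data on the inside, behaviour as methods.
// All names here are invented for illustration only.
public class Counter {

    // every field is declared with an explicit type
    private int total;

    public Counter(int start) {
        this.total = start;
    }

    // behaviour lives in methods attached to the object
    public void add(int x) {
        total = total + x;
    }

    public int getTotal() {
        return total;
    }

    // the entry point an IDE like netbeans looks for
    public static void main(String[] args) {
        // note the 'new' keyword: objects must be explicitly instantiated
        Counter c = new Counter(0);
        c.add(2);
        c.add(3);
        System.out.println(c.getTotal()); // prints 5
    }
}
```

Compare this with R, where you would just write `total <- total + x` in a script: in Java even this tiny task needs a class, a constructor and an explicit instantiation.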

Some useful terms

Maven a piece of software used by Mahout for managing project builds. It is similar to the package writing tools in R but more flexible.

JDK and JRE The first is the Java Development Kit, the software needed to write code in Java; the second is the Java Runtime Environment, the software that executes Java code. The JRE will be on the machine of anyone who runs anything that uses Java (i.e. most people).

AWS Amazon web services – a cloud computing platform. We’ve quite a few posts on this subject. Here it is significant because it’s what you’ll need to run hadoop if you’ve not got your own grid.

Hadoop and map reduce There are a million online tutorials on these but, very quickly: map-reduce is a powerful programming model for parallelizing a very large class of tasks, and Hadoop is an open source software framework that implements it. If you’ve used the parallel library in R then it does something similar on a much smaller scale (although I’m not sure whether it is formally map-reduce).
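
To make the idea concrete, here is a toy single-machine sketch of map-reduce in plain Java, counting words. It is invented purely for illustration (real Hadoop jobs implement Mapper and Reducer classes and run over a distributed file system), but the shape is the same: a map step that processes each record independently, and a reduce step that combines results per key.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy single-machine sketch of the map-reduce idea (illustration only).
public class ToyMapReduce {

    // map step: each line is processed independently, emitting (word, 1)
    // pairs -- this independence is what makes the phase parallelizable
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            pairs.add(Map.entry(word, 1));
        }
        return pairs;
    }

    // reduce step: gather all emitted pairs and sum the values per key
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        emitted.addAll(map("the cat sat"));
        emitted.addAll(map("the cat ran"));
        // counts come out as the=2, cat=2, sat=1, ran=1 (print order may vary)
        System.out.println(reduce(emitted));
    }
}
```

On a cluster the map calls would run on many machines at once, and Hadoop would shuffle the pairs so that all values for one key arrive at the same reducer.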

netbeans A good IDE for Java (there are many others). If you use R Studio for R, it’s the same kind of thing but less stripped down; if you use Eclipse (which can also be used for Java) then you are already familiar with the set up.

Some general tips

  • When Mahout installs it does a lot of checks. I found it kept failing certain ones and this prevented the whole thing from installing. I disabled the checks by running maven with the -DskipTests option (i.e. mvn -DskipTests install) and so far I’ve had no issues
  • I found it very useful when running the examples in Mahout In Action to explore the objects using the Netbeans debugger. This allows you to inspect the objects, giving you a good sense of how it all hangs together
  • Here’s a nice post explaining the map-reduce algorithm
  • Don’t forget to install the Maven plug-in in netbeans otherwise you’ll be struggling when executing the Mahout examples
  • Do a bit of Java programming to get your head into it (it might not be your thing but I downloaded and adapted this space invaders example)

My notes for setting up Mahout and running a simple example

This worked for me as of April 2013 on an AWS Ubuntu image (see earlier posts for setting this up). Obviously I’m referring to my particular directory set up: you’ll need to change it appropriately here and there, and in particular change the versions of hadoop, maven and mahout to the latest. Thanks to the following post for the example.

Apologies, it’s a bit raw but it gets you from the beginning to the end.

Install Java JDK 7

java -version [check whether Java is installed]

sudo apt-get update

sudo apt-get install openjdk-7-jdk

Download and install hadoop

cd /home/ubuntu

wget http://mirror.rmg.io/apache/hadoop/common/hadoop-1.0.4/hadoop-1.0.4.tar.gz

sudo cp hadoop-1.0.4.tar.gz /usr/local [Move the file to /usr/local]

cd /usr/local

sudo tar -zxvf  hadoop-1.0.4.tar.gz [unzip the package]

sudo rm hadoop-1.0.4.tar.gz [remove the archive]

Set up environment variables

printenv

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64

export HADOOP_HOME=/usr/local/hadoop-1.0.4

export PATH=$PATH:$HADOOP_HOME/bin

Set up variable permanently

sudo vi /etc/environment

Add

    JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64

    HADOOP_HOME=/usr/local/hadoop-1.0.4

Append to the path line “:/usr/local/hadoop-1.0.4/bin” (note that /etc/environment does not expand variables like $HADOOP_HOME, so use the full path)

Test hadoop is working

$HADOOP_HOME/bin/hadoop [displays help files]

Run the standalone example

cd /usr/local/hadoop-1.0.4

sudo mkdir input

sudo cp conf/*.xml input

sudo bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

sudo cat output/*

Install maven

sudo apt-get update

sudo apt-get install maven

mvn -version [to check it installed ok]

Install mahout

cd /home/ubuntu

wget http://apache.mirrors.timporter.net/mahout/0.7/mahout-distribution-0.7-src.tar.gz

sudo tar -zxvf  mahout-distribution-0.7-src.tar.gz

sudo cp -r /home/ubuntu/mahout-distribution-0.7 /usr/local  

cd /usr/local

sudo mv mahout-distribution-0.7 mahout

cd mahout/core

sudo mvn -DskipTests install

cd /usr/local/mahout/examples

sudo mvn install

Create a maven project

cd /usr/local/mahout

sudo mvn archetype:create -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=com.unresyst -DartifactId=mahoutrec

cd mahoutrec

sudo mvn compile

sudo mvn exec:java -Dexec.mainClass="com.unresyst.App" [to print hello world]

sudo vi pom.xml

Then insert into pom.xml

<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>3.8.1</version>
  <scope>test</scope>
</dependency>
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-core</artifactId>
  <version>0.7</version>
</dependency>
<dependency>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout-math</artifactId>
  <version>0.7</version>
</dependency>

Also add

<relativePath>../pom.xml</relativePath> 

to the parent clause
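
For reference, the parent clause should then end up looking something like the following. The coordinates here are my guess at the mahout 0.7 parent pom, so check them against what the archetype actually generated in your pom.xml:

```xml
<parent>
  <groupId>org.apache.mahout</groupId>
  <artifactId>mahout</artifactId>
  <version>0.7</version>
  <!-- points maven at the mahout distribution's top-level pom -->
  <relativePath>../pom.xml</relativePath>
</parent>
```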

Create recommender

Create a datasets directory in the mahoutrec folder

Add the csv file from https://code.google.com/p/unresyst/wiki/CreateMahoutRecommender

Create the java file listed in the post above in src/main/java/com/unresyst/

Go back to the project directory and run

sudo mvn compile

sudo mvn exec:java -Dexec.mainClass="com.unresyst.UnresystBoolRecommend"
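
If you’re wondering what the recommender is actually doing under the hood, here is a plain-Java sketch of the idea behind a boolean user-based recommender. To be clear, this is not the Mahout Taste API and the data and names are made up; it just shows the concept of scoring unseen items via users with overlapping tastes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Conceptual sketch of a boolean user-based recommender (NOT the Mahout API).
public class ToyRecommender {

    // similarity: how many liked items two users have in common
    static int overlap(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // recommend items liked by similar users but not yet by the target user
    static List<String> recommend(String user, Map<String, Set<String>> prefs) {
        Map<String, Integer> scores = new HashMap<>();
        Set<String> mine = prefs.get(user);
        for (Map.Entry<String, Set<String>> other : prefs.entrySet()) {
            if (other.getKey().equals(user)) continue;
            int sim = overlap(mine, other.getValue());
            if (sim == 0) continue; // ignore users with no shared tastes
            for (String item : other.getValue()) {
                // weight each unseen item by the similarity of its fans
                if (!mine.contains(item)) scores.merge(item, sim, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((x, y) -> scores.get(y) - scores.get(x));
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> prefs = new HashMap<>();
        prefs.put("ann", new HashSet<>(Arrays.asList("book1", "book2")));
        prefs.put("bob", new HashSet<>(Arrays.asList("book1", "book3")));
        prefs.put("cal", new HashSet<>(Arrays.asList("book2", "book3")));
        System.out.println(recommend("ann", prefs)); // prints [book3]
    }
}
```

Mahout’s version of this adds proper similarity measures, neighbourhoods and the machinery to do it at scale, but the shape of the computation is the same.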

Follow steps in D&L post to set up NX server

Set up netbeans

Download and install netbeans using the Ubuntu software centre

Tools >> Plugins >> Settings

Enable all centres

Install the Maven plug-in

Install Git

sudo apt-get install git

Download the repository for Analysing Data with Hadoop

cd /home/ubuntu

mkdir repositories

cd repositories

git clone https://github.com/tomwhite/hadoop-book.git

Download the repository for Mahout in Action

git clone https://github.com/tdunning/MiA.git

Running the hadoop MaxTemperature example

Set up a new directory and copy across the example files:

cp /home/ubuntu/repositories/hadoop-book/ch02/src/main/java/* /home/ubuntu/hadoopProjects/maxTemp

Make a build/classes directory within maxTemp

javac -verbose -classpath /usr/local/hadoop-1.0.4/hadoop-core-1.0.4.jar MaxTemperature*.java -d build/classes

export HADOOP_CLASSPATH=/home/ubuntu/hadoopProjects/maxTemp/build/classes

hadoop MaxTemperature /home/ubuntu/repositories/hadoop-book/input/ncdc/sample.txt output

To run the mahout example through netbeans just go to the mahoutrec maven directory and execute


About Simon Raper

I am an RSS accredited statistician with over 15 years’ experience working in data mining and analytics and many more in coding and software development. My specialities include machine learning, time series forecasting, Bayesian modelling, market simulation and data visualisation. I am the founder of Coppelia, an analytics startup that uses agile methods to bring machine learning and other cutting edge statistical techniques to businesses that are looking to extract value from their data. My current interests are in scalable machine learning (Mahout, Spark, Hadoop), interactive visualisations (D3 and similar) and applying the methods of agile software development to analytics. I have worked for Channel 4, Mindshare, News International, Credit Suisse and AOL. I am co-author with Mark Bulling of Drunks and Lampposts – a blog on computational statistics, machine learning, data visualisation, R, python and cloud computing. It has had over 310K visits and appeared in the online editions of The New York Times and The New Yorker. I am a regular speaker at conferences and events.

Discussion

5 thoughts on “Mahout for R Users”

  1. Interesting introduction.

    I think there is a mistake regarding the available hadoop version. There is no version 1.0.4 available (anymore?). Therefore, the wget command 404s. The latest non-alpha version available is 1.2.0. So the wget command should be:

    wget http://mirror.rmg.io/apache/hadoop/common/hadoop-1.2.0/hadoop-1.2.0.tar.gz

    Subsequently, all other version-related entries should be changed. I did not test the whole installation procedure, so if there are version-dependent changes the post may have some more issues that should be changed.

    Since you wrote that the installation procedure worked for you in June, I am a little confused whether I am missing some point here?!

    Anyway, good starting point – thx for sharing!

    Posted by Robert Adams | June 10, 2013, 11:27 am
    • That’s a good point! When I started the notes it was some time in April and when I finished June so it was inaccurate to say that it works in June. I’ll make that clear.

      Thanks for feedback

      Simon

      Posted by simonraper | June 12, 2013, 11:34 am
  2. Thanks for posting. This worked very well for me on EC2, even without installing Hadoop. It’s the best step by step guide I could find anywhere online.

    Posted by Tom R | June 12, 2013, 6:04 pm

Trackbacks/Pingbacks

  1. Pingback: Book Recommendations from Beyond the Grave: A Mahout Example | Drunks&Lampposts - August 26, 2013

  2. Pingback: Mahout for R Users - R Project Aggregate - January 1, 2014
