- It was interesting to see the role of Zookeeper in this diagram of the Hadoop Ecosystem.
I was playing around with Mahout, and one of the things I wanted to try out was Mahout’s Spark shell on my local machine.
There is a nice example of doing this, but I hit a stack dump the moment I tried to start up the Mahout shell using `bin/mahout spark-shell`:
```
java.lang.RuntimeException: java.io.InvalidClassException: org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream classdesc serialVersionUID = -2221986757032131007, local class serialVersionUID = -5447855329526097695
	at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:616)
	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1630)
	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1521)
```
The problem was that the Spark version Mahout was looking for was 1.6.2 (specified in the POM file), while the Spark cluster I had started was running the latest version, 2.0.1. The mismatched `serialVersionUID`s in the stack trace are a symptom of exactly this kind of version skew.
Here are the steps I took to get it going:
1. `git clone https://github.com/apache/mahout mahout`
2. `cd` into the `mahout` directory and build Mahout using `mvn -DskipTests clean install`
3. In the Spark 1.6.2 directory, run `sbt/sbt assembly` to build Spark
4. Run `sbin/start-all.sh` to locally start Spark
Then set the environment variables Mahout needs:

```bash
abgoswam@abgoswam-ubuntu:~/repos/mahout$ cat mymahoutsparksettings.sh
#!/usr/bin/env bash
export MAHOUT_HOME=/home/abgoswam/repos/mahout
export SPARK_HOME=/home/abgoswam/packages/spark-1.6.2
export MASTER=spark://abgoswam-ubuntu:7077
echo "Set variables for Mahout"
abgoswam@abgoswam-ubuntu:~/repos/mahout$
```
Now when you run `bin/mahout spark-shell`, you should see the shell starting and get the prompt.
The best way to get started with Hadoop is to play with it in a single-node setting.
These links give a good intro to the Apache Beam Model.
I went to this talk titled ‘Apache Mahout – What’s next?’ by Trevor Grant.
A few things struck me after attending the talk:
Windowing is a very common operation in stream analytics.
Beneath the surface, a lot of complex data structuring goes on to support these windowing operations. I would love to dig deeper into it someday.
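To get some intuition for the bookkeeping involved, here is a minimal Python sketch (my own toy model, not how any streaming engine actually implements it) of how a hopping window assigns each event to several overlapping windows. It uses the same 2-hour duration and 1-hour hop as the query below; windows are assumed to be aligned to the epoch.

```python
from datetime import datetime, timedelta

def hopping_windows(event_time, duration, hop):
    """Return all [start, end) windows containing event_time, where a new
    window of length `duration` opens every `hop` (aligned to the epoch).
    With duration > hop, windows overlap and each event lands in several."""
    epoch = datetime(1970, 1, 1)
    # Start of the latest window that opens at or before event_time.
    start = epoch + (event_time - epoch) // hop * hop
    windows = []
    while start + duration > event_time:  # window still covers the event
        windows.append((start, start + duration))
        start -= hop
    return list(reversed(windows))

# A 2-hour window hopping every hour: each event falls into 2 windows.
ev = datetime(2016, 11, 5, 10, 30)
for start, end in hopping_windows(ev, timedelta(hours=2), timedelta(hours=1)):
    print(start.strftime("%H:%M"), "-", end.strftime("%H:%M"))
# → 09:00 - 11:00
# → 10:00 - 12:00
```

In general an event falls into `duration / hop` windows, which is why the engine has to keep per-window state for every window an event touches until that window closes.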
Here is an example of a query I wrote recently using windowing operators in Azure Stream Analytics. It shows 3 interesting things:
3. Aggregation over string columns (using TopOne)
```sql
WITH ContextReward AS (
    SELECT
        eventid,
        TopOne() OVER (ORDER BY [EventEnqueuedUtcTime] ASC) CR,
        MAX(reward) AS reward
    FROM Input
    GROUP BY eventid, HoppingWindow(Duration(hour, 2), Hop(hour, 1))
)

SELECT
    reward,
    eventid,
    CR.actionname AS actionname,
    CR.age AS age,
    CR.gender AS gender,
    CR.weight AS weight,
    CR.actionprobability
INTO OutputWindow
FROM ContextReward

SELECT * INTO Output FROM Input

SELECT * INTO OutputCSV FROM Input
```
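For intuition, here is a small Python sketch of what the `ContextReward` step computes within a single window: per `eventid`, keep the earliest record by enqueue time, which is what `TopOne() OVER (ORDER BY [EventEnqueuedUtcTime] ASC)` returns, alongside `MAX(reward)` over the whole group. The sample events and their values are invented; only the field names come from the query above.

```python
# Toy events for one window; field names mirror the query (values invented).
events = [
    {"eventid": "e1", "enqueued": 1, "reward": 0.0, "actionname": "news"},
    {"eventid": "e1", "enqueued": 2, "reward": 1.0, "actionname": "sports"},
    {"eventid": "e2", "enqueued": 5, "reward": 0.5, "actionname": "music"},
]

groups = {}  # eventid -> {"top": earliest record, "reward": running max}
for ev in events:
    g = groups.setdefault(ev["eventid"], {"top": ev, "reward": ev["reward"]})
    if ev["enqueued"] < g["top"]["enqueued"]:
        g["top"] = ev                                 # TopOne(): earliest row
    g["reward"] = max(g["reward"], ev["reward"])      # MAX(reward)

for eid, g in sorted(groups.items()):
    print(eid, g["reward"], g["top"]["actionname"])
# → e1 1.0 news
# → e2 0.5 music
```

Note how the string columns (`actionname` etc.) ride along with the `TopOne()` record even though you cannot `MAX()` over them directly, which is exactly the trick the query uses.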