Toe Dipping Into Apache Mahout

I went to a talk titled ‘Apache Mahout – What’s next?’ by Trevor Grant.

A few things struck me after attending the talk:

  • Apache Mahout seems to be a pretty interesting framework for distributed matrix operations.
    • The slides can be found here.
  • Trevor’s blog post has great pointers for getting started with a lot of the technologies on the fringes, like Flink, Mahout, etc.

kafka-spark integration

I am trying to get to a point where I will have Kafka + Spark Streaming running locally on my machine.

There are several things to figure out along the way:

  • kafka
  • scala
  • spark
  • spark-streaming

References:

 

Distributed Locks

One reason why Redis has custom locking, instead of using operating system–level locks, language-level locks, and so forth, is a matter of scope. Clients want to have exclusive access to data stored on Redis, so clients need to have access to a lock defined in a scope that all clients can see—Redis.

Redis does have a basic sort of lock already available as part of the command set (SETNX), which we use, but it’s not full-featured and doesn’t offer advanced functionality that users would expect of a distributed lock.

In fact, two patterns have emerged for locking in Redis:

  1. Locking with SETNX
    • Simple, but not full-featured: the caller has to handle expiry and safe release itself.
  2. Redlock
    • Redis’s proposed distributed locking algorithm, which acquires the lock across several independent Redis instances.
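As a rough sketch of the SETNX pattern: acquire by setting the lock key only if it does not already exist, using a unique token per client, and release by deleting the key only if it still holds your token. To keep it runnable standalone, the example below uses a tiny in-memory stub in place of a real client; with redis-py the acquire maps onto `set(key, token, nx=True, ex=ttl)` and the release check-and-delete would need to be atomic (e.g. a Lua script).

```python
import uuid

class FakeRedis:
    """In-memory stand-in for a Redis client (illustration only)."""
    def __init__(self):
        self.store = {}

    def set_nx(self, key, value):
        # SETNX semantics: set only if the key does not already exist.
        if key in self.store:
            return False
        self.store[key] = value
        return True

    def get(self, key):
        return self.store.get(key)

    def delete(self, key):
        self.store.pop(key, None)

def acquire_lock(client, name, token):
    # Each client uses a unique token so it can later prove ownership.
    return client.set_nx("lock:" + name, token)

def release_lock(client, name, token):
    # Delete only if we still own the lock. On a real server this
    # check-and-delete must be atomic (a Lua script or WATCH/MULTI),
    # or a slow client could delete a lock someone else now holds.
    if client.get("lock:" + name) == token:
        client.delete("lock:" + name)
        return True
    return False

client = FakeRedis()
token = str(uuid.uuid4())
got_it = acquire_lock(client, "inventory", token)     # acquired
blocked = acquire_lock(client, "inventory", "other")  # refused: already held
released = release_lock(client, "inventory", token)   # released by owner
```

A production lock also needs a TTL on the key so a crashed client cannot hold the lock forever; that is part of the “advanced functionality” bare SETNX does not give you.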

 

References:

Redis Internals

References:

 

Data Types:

On Persistence:

Pub / Sub:

Memory:

Redis Cluster Design and Specification

Pipelining/Transactions

 

Scaling real time processing jobs in Azure

I was recently faced with the problem of how to scale real-time processing jobs in Azure.

I finally managed to do it by using the concept of partitions. Using partitions in Event Hubs along with Azure Stream Analytics got the job done for me.

References:

802.3 vs. 802.11

This paper gives a nice overview of the differences between Ethernet and Wi-Fi at the protocol level.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.456.9874&rep=rep1&type=pdf

  • The crux of the problem is this: “The CSMA/CD protocol is not used in a wireless environment due to the user has no capability to sense/listen to the channel for collision while sending the packet [12].”
  • This necessitates collision-avoidance techniques (CSMA/CA) for Wi-Fi, which impose limits on how fast you can transmit packets in a given frequency band, leading to slower speeds.

 

Making REST calls to send data to an Azure EventHub

I recently encountered a situation where I had to use pure REST Calls to send data to an Azure Event Hub.

Tips:

  • If you are used to the client libraries (C#, Python), you will find that they do a lot behind the scenes. It’s not trivial to go from using a library to making pure REST calls.
  • My first approach – using Fiddler to capture the traffic and re-purpose those calls – failed.
    • I am not sure why the calls fail to show up in Fiddler. I tried a few things, like enabling HTTPS decryption, but I wasn’t able to get the outgoing traffic to show up in Fiddler.
  • The references below give a good idea of how I made some progress.

REST Call to send data:

I finally got it to work with something like this:

POST https://simplexagpmeh.servicebus.windows.net/simplexagpmeh/messages?timeout=60&api-version=2014-01 HTTP/1.1
User-Agent: Fiddler
Authorization: SharedAccessSignature sr=http%3a%2f%2fsimplexagpmeh.servicebus.windows.net%2f&sig=RxvSkhotfGEwERdiaA8oLr7X9u5XLeDI8TCK5DhDPP8%3d&se=1476214239&skn=RootManageSharedAccessKey
Content-Type: application/atom+xml;type=entry;charset=utf-8
Host: simplexagpmeh.servicebus.windows.net
Content-Length: 153
Expect: 100-continue

{ "DeviceId" : "ArduinoYun",
  "SensorData" : [ { "SensorId" : "awk",
        "SensorType" : "temperature",
        "SensorValue" : 24.5
      } ]
}
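
A sketch of what the libraries do behind the scenes for the Authorization header above: the SharedAccessSignature is an HMAC-SHA256 over `<url-encoded-uri>\n<expiry>` keyed with the shared access key, base64-encoded and URL-encoded. The namespace, hub, key name, and key below are placeholders, and the actual POST is left commented out.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse

def generate_sas_token(uri, key_name, key):
    # Event Hubs SAS token: HMAC-SHA256 over "<url-encoded-uri>\n<expiry>".
    expiry = int(time.time()) + 3600  # valid for one hour
    encoded_uri = urllib.parse.quote_plus(uri)
    to_sign = (encoded_uri + "\n" + str(expiry)).encode("utf-8")
    digest = hmac.new(key.encode("utf-8"), to_sign, hashlib.sha256).digest()
    signature = urllib.parse.quote_plus(base64.b64encode(digest).decode("utf-8"))
    return ("SharedAccessSignature sr={}&sig={}&se={}&skn={}"
            .format(encoded_uri, signature, expiry, key_name))

# Placeholder values -- substitute your own namespace, hub, and key.
uri = "https://mynamespace.servicebus.windows.net/myhub"
token = generate_sas_token(uri, "RootManageSharedAccessKey", "fake-key==")

# The actual send (e.g. with the requests library) would then look like:
# requests.post(uri + "/messages?timeout=60&api-version=2014-01",
#               headers={"Authorization": token,
#                        "Content-Type": "application/atom+xml;type=entry;charset=utf-8"},
#               data='{"DeviceId": "ArduinoYun", "SensorValue": 24.5}')
```

Note that the token embeds an expiry timestamp (`se=`), which is why a captured call like the one above stops working once that time passes.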

References:

Code:

 

Getting Started with Apache Kafka

I am doing some toe-dipping into Apache Kafka.

Linux:

Windows:

Commands from the Apache quick start documentation:

  • This gave me a good overview of what the system is doing.
> bin/zookeeper-server-start.sh config/zookeeper.properties
> bin/kafka-server-start.sh config/server.properties
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
> bin/kafka-topics.sh --list --zookeeper localhost:2181
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
> cp config/server.properties config/server-1.properties
> cp config/server.properties config/server-2.properties
[Now edit these new files and set the following properties:
config/server-1.properties:
    broker.id=1
    listeners=PLAINTEXT://:9093
    log.dir=/tmp/kafka-logs-1
config/server-2.properties:
    broker.id=2
    listeners=PLAINTEXT://:9094
    log.dir=/tmp/kafka-logs-2]
> bin/kafka-server-start.sh config/server-1.properties &
> bin/kafka-server-start.sh config/server-2.properties &
> bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic my-replicated-topic
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic test
> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-replicated-topic
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic my-replicated-topic
> ps | grep server-1.properties
> kill -9 7564
[Leadership has switched to one of the slaves and node 1 is no longer in the in-sync replica set:]
> bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic my-replicated-topic
> bin/kafka-console-consumer.sh --zookeeper localhost:2181 --from-beginning --topic my-replicated-topic