Adventures in HttpContext All the stuff after 'Hello, World'

Running an Akka Cluster with Docker Containers

Update! You can now use SBT-Docker with SBT-Native Packager for a better sbt/docker experience. Here’s the new approach with an updated GitHub repo.

We recently upgraded our vagrant environments to use docker. One of our projects relies on akka’s cluster functionality. I wanted to easily run an akka cluster locally using docker as sbt can be somewhat tedious. The example project is on github and the solution is described below.

The solution relies on:

  • Sbt Native Packager to package dependencies and create a startup file.
  • Typesafe’s Config library for configuring the app’s ip address and seed nodes. We setup cascading configurations that will look for docker link environment variables if present.
  • A simple bash script to package the app and build the docker container.

I’m a big fan of Typesafe’s Config library and the environment variable overrides come in handy for providing sensible defaults with optional overrides. It’s the preferred way we configure our applications in upper environments.

The tricky part of running an akka cluster with docker is knowing the ip address each remote node needs to listen on. An akka cluster relies on each node listening on a specific port and hostname or ip. It also needs to know the port and hostname/ip of a seed node the cluster. As there’s no catch-all binding we need specific ip settings for our cluster.

A simple bash script within the container will figure out the current IP for our cluster configuration and docker links pass seed node information to newly launched nodes.

First Step: Setup Application Configuration

The configuration is the same as that of a normal cluster, but I’m using substitution to configure the ip address, port and seed nodes for the application. For simplicity I setup a clustering block with defaults for running normally and environment variable overrides:

clustering {
 ip = "127.0.0.1"
 ip = ${?CLUSTER_IP}
 port = 1600
 port = ${?CLUSTER_PORT}
 seed-ip = "127.0.0.1"
 seed-ip = ${?CLUSTER_IP}
 seed-ip = ${?SEED_PORT_1600_TCP_ADDR}
 seed-port = 1600
 seed-port = ${?SEED_PORT_1600_TCP_PORT}
 cluster.name = clustering-cluster
}

akka.remote {
    log-remote-lifecycle-events = on
    netty.tcp {
      hostname = ${clustering.ip}
      port = ${clustering.port}
    }
  }
  cluster {
    seed-nodes = [
       "akka.tcp://"${clustering.cluster.name}"@"${clustering.seed-ip}":"${clustering.seed-port}
    ]
    auto-down-unreachable-after = 10s
  }
}

As an example the clustering.seed-ip setting will use 127.0.0.1 as the default. If it can find a _CLUSTERIP or a SEED_PORT_1600_TCP_ADDR override it will use that instead. You’ll notice the latter override is using docker’s environment variable pattern for linking: that’s how we set the cluster’s seed node when using docker. You don’t need the _CLUSTERIP in this example but that’s the environment variable we use in upper environments and I didn’t want to change our infrastructure to conform to docker’s pattern. The cascading settings are helpful if you’re forced to follow one pattern depending on the environment. We do the same thing for the ip and port of the current node when launched.

With this override in place we can use substitution to set the seed nodes in the akka cluster configuration block. The expression "akka.tcp://"${clustering.cluster.name}"@"${clustering.seed-ip}":"${clustering.seed-port} builds the proper akka URI so the current node can find the seed node in the cluster. Seed nodes avoid potential split-brain issues during network partitions. You’ll want to run more than one in production but for local testing one is fine. On a final note the cluster-name setting is arbitrary. Because the name of the actor system and the uri must match I prefer not to hard code values in multiple places.

I put these settings in resources/reference.conf. We could have named this file application.conf, but I prefer bundling configurations as reference.conf and reserving application.conf for external configuration files. A setting in application.conf will override a corresponding reference.conf setting and you probably want to manage application.conf files outside of the project’s jar file.

Second: SBT Native Packager

We use the native packager plugin to build a runnable script for our applications. For docker we just need to run universal:stage, creating a folder with all dependencies in the target/ folder of our project. We’ll move this into a staging directory for uploading to the docker container.

Third: The Dockerfile and Start script

The dockerfile is pretty simple:

FROM dockerfile/java

MAINTAINER Michael Hamrah m@hamrah.com

ADD tmp/ /opt/app/
ADD start /opt/start
RUN chmod +x /opt/start

EXPOSE 1600

ENTRYPOINT [ "/opt/start" ]

We start with Dockerfile’s java base image. We then upload our staging tmp/ folder which has our application from sbt’s native packager output and a corresponding executable start script described below. I opted for ENTRYPOINT instead of CMD so the container is treated like an executable. This makes it easier to pass in command line arguments into the sbt native packager script in case you want to set java system properties or override configuration settings via command line arguments.

The start script is how we tee up the container’s IP address for our cluster application:

#!/bin/bash

CLUSTER_IP=/sbin/ifconfig eth0 | grep 'inet addr:' | cut -d: -f2 | awk '{ print $1}' /opt/app/bin/clustering $@

The script sets an inline environment variable by parsing ifconfig output to get the container’s ip. We then run the clustering start script produced from sbt native packager. The $@ lets us pass along any command line settings set when launching the container into the sbt native packager script.

Fourth: Putting It Together

The last part is a simple bash script named dockerize to orchestrate each step. By running this script we run sbt native packager, move files to a staging directory, and build the container:

#!/bin/bash

echo "Build docker container"

#run sbt native packager
sbt universal:stage

#cleanup stage directory
rm -rf docker/tmp/

#copy output into staging area
cp -r target/universal/stage/ docker/tmp/

#build the container, remove intermediate nodes
docker build -rm -t clustering docker/

#remove staged files
rm -rf docker/tmp/

With this in place we simply run

bin/dockerize

to create our docker container named clustering.

Running the Application within Docker

With our clustering container built we fire up our first instance. This will be our seed node for other containers:

$ docker run -i -t -name seed clustering
2014-03-23 00:20:39,918 INFO  akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2014-03-23 00:20:40,392 DEBUG com.mlh.clustering.ClusterListener - starting up cluster listener...
2014-03-23 00:20:40,403 DEBUG com.mlh.clustering.ClusterListener - Current members:
2014-03-23 00:20:40,418 INFO  com.mlh.clustering.ClusterListener - Leader changed: Some(akka.tcp://clustering-cluster@172.17.0.2:1600)
2014-03-23 00:20:41,404 DEBUG com.mlh.clustering.ClusterListener - Member is Up: akka.tcp://clustering-cluster@172.17.0.2:1600

Next we fire up a second node. Because of our reference.conf defaults all we need to do is link this container with the name seed. Docker will set the environment variables we are looking for in the bundled reference.conf:

$ docker run -name c1 -link seed:seed -i -t clustering
2014-03-23 00:22:49,332 INFO  akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2014-03-23 00:22:49,788 DEBUG com.mlh.clustering.ClusterListener - starting up cluster listener...
2014-03-23 00:22:49,797 DEBUG com.mlh.clustering.ClusterListener - Current members:
2014-03-23 00:22:50,238 DEBUG com.mlh.clustering.ClusterListener - Member is Up: akka.tcp://clustering-cluster@172.17.0.2:1600
2014-03-23 00:22:50,249 INFO  com.mlh.clustering.ClusterListener - Leader changed: Some(akka.tcp://clustering-cluster@172.17.0.2:1600)
2014-03-23 00:22:50,803 DEBUG com.mlh.clustering.ClusterListener - Member is Up: akka.tcp://clustering-cluster@172.17.0.3:1600

You’ll see the current leader discovering new nodes and the appropriate broadcast messages sent out. We can even do this a third time and all nodes will react:

$ docker run -name c2 -link seed:seed -i -t clustering
2014-03-23 00:24:52,768 INFO  akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2014-03-23 00:24:53,224 DEBUG com.mlh.clustering.ClusterListener - starting up cluster listener...
2014-03-23 00:24:53,235 DEBUG com.mlh.clustering.ClusterListener - Current members:
2014-03-23 00:24:53,470 DEBUG com.mlh.clustering.ClusterListener - Member is Up: akka.tcp://clustering-cluster@172.17.0.2:1600
2014-03-23 00:24:53,472 DEBUG com.mlh.clustering.ClusterListener - Member is Up: akka.tcp://clustering-cluster@172.17.0.3:1600
2014-03-23 00:24:53,478 INFO  com.mlh.clustering.ClusterListener - Leader changed: Some(akka.tcp://clustering-cluster@172.17.0.2:1600)
2014-03-23 00:24:55,401 DEBUG com.mlh.clustering.ClusterListener - Member is Up: akka.tcp://clustering-cluster@172.17.0.4:1600

Try killing a node and see what happens!

Modifying the Docker Start Script

There’s another reason for the docker start script: it opens the door for different seed discovery options. Container linking works well if everything is running on the same host but not when running on multiple hosts. Also setting multiple seed nodes via docker links will get tedious via environment variables; it’s doable but we’re getting into coding-cruft territory. It would be better to discover seed nodes and set that configuration via command line parameters when launching the app.

The start script gives us control over how we discover information. We could use etcd, serf or even zookeeper to manage how seed nodes are set and discovered, passing this to our application via environment variables or additional command line parameters. Seed nodes can easily be set via system properties set via the command line:

-Dakka.cluster.seed-nodes.0=akka.tcp://ClusterSystem@host1:2552
-Dakka.cluster.seed-nodes.1=akka.tcp://ClusterSystem@host2:2552

The start script can probably be configured via sbt native packager but I haven’t looked into that option. Regardless this approach is (relatively) straight forward to run akka clusters with docker. The full project is on github. If there’s a better approach I’d love to know!

Testing Akka’s FSM: Using setState for unit testing

I wrote about Akka’s Finite State Machine as a way to model a process. One thing I didn’t discuss was testing an FSM. Akka has great testing support and FSM’s can easily be tested using the TestFSMRef class.

An FSM is defined by its states and the data stored between those states. For each state in the machine you can match on both an incoming message and current state data. Our previous example modeled a process to check data integrity across two systems. We’ll continue that example by adding tests to ensure the FSM is working correctly. These should have been before we developed the FSM but late tests are (arguably) better than no tests.

It’s important to test combinations of messages against various states and data. You don’t want to be in a position to run through a state machine to the state you want for every test. Luckily, there’s a handy setState method to explicitly set the current state and data of the FSM. This lets you “fast forward” the FSM to the exact state you want to test against.

Let’s say we want to test a DataRetrieved message in the PendingComparison state. We also want to test this message against various Data combinations. We can set this state explicitly:

"The ComparisonEngine" - {
  "in the PendingComparison state" - {
    "when a DataRetrieved message arrives from the old system" - {
      "stays in PendingComparison with updated data when no other data is present" in {
        val fsm = TestFSMRef(new ComparisonEngine())
        
        //set our initial state with setState
        fsm.setState(PendingComparison, ComparisonStatus("someid", None, None))

        fsm ! DataRetrieved("oldSystem", "oldData")

        fsm.stateName should be (PendingComparison)
        fsm.stateData should be (ComparisonStatus("someid", Some("oldData"), None))
      }
    }
  }
}

It may be tempting to send more messages to continue verifying the FSM is working correctly. This will yield a large, unwieldy and brittle test. It will make refactoring difficult and make it harder to understand what should be happening.

Instead, to further test the FSM, be explicit about the current state and what should happen next. Add a test for it:

"when a DataRetrieved message arrives from the old system" - {
  "moves to AllRetrieved when data from the new system is already present" in {
    val fsm = TestFSMRef(new ComparisonEngine())
        
    //set our initial state with setState
    fsm.setState(PendingComparison, ComparisonStatus("someid", None, Some("newData")))

    fsm ! DataRetrieved("oldSystem", "oldData")
    fsm.stateName should be (AllRetrieved)
    fsm.stateData should be (ComparisonStatus("someid", Some("oldData"), Some("newData")))
  }
}

By mixing in ImplicitSender or using TestProbes we can also verify messages the FSM should be sending in response to incoming messages.

Testing is an essential part of developing applications. Unit tests should be explicit and granular. For higher level orchestration integration tests, taking a black-box approach, provide ways to oversee entire processes. Don’t let your code become too unwieldy to manage: use the tools at your disposal and good coding practices to stay lean. Akka’s FSM provides ways of programming transitional behavior over time and Akka’s FSM Testkit support provides a way of ensuring that code works over time.

Using Typesafe’s Config for Scala (and Java) for Application Configuration

I recently leveraged Typesafe’s Config library to refactor configuration settings for a project. I was very pleased with the API and functionality of the library.

The documentation is pretty solid so there’s no need to go over basics. One feature I like is the clear hierarchy when specifying configuration values. I find it helpful to put as much as possible in a reference.conf file in the /resources directory for an application or library. These can get overridden in a variety of ways, primarily by adding an application.conf file to the bundled output’s classpath. The sbt native packager, helpful for deploying applications, makes it easy to attach a configuration file to an output. This is helpful if you have settings which you normally wouldn’t want to use during development, say using remote actors with akka. I find placing a reasonable set of defaults in a reference.conf file allows you to easily transport a configuration around while still overriding it as necessary. Otherwise you can get into copy and paste hell by duplicating configurations across multiple files for multiple environments.

Alternative Overrides

There are two other interesting ways you can override configuration settings: using environment variables or java system properties. The environment variable approach comes in very handy when pushing to cloud environments where you don’t know what a configuration is beforehand. Using the ${?VALUE} pattern a property will only be set if a value exists. This allows you to provide an option for overriding a value without actually having to specify one.

Here’s an example in a conf file using substitution leveraging this technique:

http {
 port = 8080
 port = ${?HTTP_PORT}
}

We’re setting a default port of 8080. If the configuration can find a valid substitute it will replace the port value with the substitute; otherwise, it will keep it at 8080. The configuration library will look up its hierarchy for an HTTP_PORT value, checking other configuration files, Java system properties, and finally environment variables. Environment variables aren’t perfect, but they’re easy to set and leveraged in a lot of places. If you leave out the ? and just have ${HTTP_PORT} then the application will throw an exception if it can’t find a value. But by using the ? you can override as many times as you want. This can be helpful when running apps on Heroku where environment variables are set for third party services.

Using Java System Properties

Java system properties provide another option for setting config values. The shell script created by sbt-native-packager supports java system properties, so you can also set the http port via the command line using the -D flag:

bin/bash_script_from_native_packager -Dhttp.port=8081

This can be helpful if you want to run an akka based application with a different log level to see what’s going on in production:

bin/some_akka_app_script -Dakka.loglevel=debug

Unfortunately sbt run doesn’t support java system properties so you can’t tweak settings with the command line when running sbt. The sbt-revolver plugin, which allows you to run your app in a forked JVM, does allow you to pass java arguments using the command line. Once you’re set up with this plugin you can change settings by adding your Java overrides after ---:

re-start --- -Dhttp.port=8081

With c3p0

I was really excited to see that the c3p0 connection pool library also supports Typesafe Config. So you can avoid those annoying xml-based files and merge your c3p0 settings directly with your regular configuration files. I’ve migrated an application to a docker based development environment and used this c3p0 feature with docker links to set mysql settings:

app {
 db {
  host = localhost
  host = ${?DB_PORT_3306_TCP_ADDR}
  port = "3306"
  port = ${?DB_PORT_3306_TCP_PORT}
 }
}

c3p0 {
 named-configs {
  myapp {
      jdbcUrl = "jdbc:mysql://"${app.db.host}":"${app.db.port}"/MyDatabase"
  }
 }
}

When I link a mysql container to my app container with --link mysql:db Docker will inject the DB_PORT_3306_TCP_* environment variables which are pulled by the above settings.

Accessing Values From Code

One other practice I like is having a single “Config” class for an application. It can be very tempting to load a configuration node from anywhere in your app but that can get messy fast. Instead, create a config class and access everything you need through that:

object MyAppConfig {
  private val config =  ConfigFactory.load()

  private lazy val root = config.getConfig("my_app")

  object HttpConfig {
    private val httpConfig = config.getConfig("http")

    lazy val interface = httpConfig.getString("interface")
    lazy val port = httpConfig.getInt("port")
  }
}

Type safety, Single Responsibility, and no strings all over the place.

Conclusion

When dealing with configuration think about what environments you have and what the actual differences are between those environments. Usually this is a small set of differing values for only a few properties. Make it easy to change just those settings without changing–or duplicating–anything else. This could done via environment variables, command line flags, even loading configuration files from a url. Definitely avoid copying the same value across multiple configurations: just distill that value down to a lower setting in a hierarchy. By minimizing configuration files you’ll be making your life a lot easier.

If you’re developing an app for distribution, or writing a library, providing a well-documented configuration file (spray’s spray-can reference.conf is an excellent example) you can allow users to override defaults easily in a manner that is suitable for them and their runtimes.

Using Akka’s ClusterClient

I’ve been spending some time implementing a feature which leverages Akka’s ClusterClient (api docs). A ClusterClient can be useful if:

  • You are running a service which needs to talk to another service in a cluster, but you don’t that service to be in the cluster (cluster roles are another option for interconnecting services where separate hosts are necessary but I’m not sold on them just yet).
  • You don’t want the overhead of running an http client/server interaction model between these services, but you’d like similar semantics. Spray is a great akka framework for api services but you may not want to write a Spray API or use an http client library.
  • You want to use the same transparency of local-to-remote actors but don’t want to deal with remote actorref configurations to specific hosts.

The documentation was a little thin on some specifics so getting started wasn’t as smooth sailing as I’d like. Here are some gotchas:

  • You only need akka.extensions = ["akka.contrib.pattern.ClusterReceptionistExtension"] on the Host (Server) Cluster. (If your client isn’t a cluster, you’ll get a runtime exception).
  • “Receptionist” is the default name for the Host Cluster actor managing ClusterClient connections. Your ClusterClient connects first to the receptionist (via the set of initial contacts) then can start sending messages to actors in the Host Cluster. The name is configurable.
  • The client actor system using the ClusterClient needs to have a Netty port open. You must use either actor.cluster.ClusterActorRefProvider or actor.remote.RemoteActorRefProvider. Otherwise the Host Cluster and ClusterClient can’t establish proper communication. You can use the ClusterActorRefProvider on the client even you’re not running a cluster.
  • As a ClusterClient you wrap messages with a ClusterClient.send (or sendAll) message first. (I was sending vanilla messages and they weren’t going through, but this is in the docs).

ClusterClients are worth checking out if you want to create physically separate yet interconnected systems but don’t want to go through the whole load-balancer or http-layer setup. Just another tool in the Akka toolbelt!

First Class Function Example in Scala and Go

Go and Scala both make functions first-class citizens of their language. I recently had to recurse a directory tree in Go and came across the Walk function which exemplifies first-class functions. The Walk function talks a path to start a directory traversal and calls a function WalkFunc for everything it finds in the sub-tree:

func Walk(root string, walkFn WalkFunc) error 

If you’re coming from the Kingdom of Nouns you may assume WalkFunc is a class or interface with a method for Walk to call. But that cruft is gone; WalkFunc is just a regular function with a defined signature given its own type, WalkFunc:

type WalkFunc func(path string, info os.FileInfo, err error) error

Why is this interesting? I wasn’t surprised Go would have a built-in method for crawling a directory tree. It’s a pretty common scenario, and I’ve written similar code many times before. What’s uncommon about directory crawling is what you want to do with those files: open them up, move them around, inspect them. Separating the common from the uncommon is where first-class functions come into play. How much code have you had to write to just write the code you want?

Scala hides the OOP-ness of its underlying runtime by compile-time tricks, putting a first-class function like:

val walkFunc = (file: java.io.File) => { /* do something with the file */ }

into a class of Function1. C# does something similar with its various function classes and delegate constructs. Go makes the interesting design decision of forcing function declarations outside of structs, putting an emphasis on stand-alone functions and struct extensibility. There are no classes in Go to encapsulate functions.

We can write a walk method for our walkFunc in Scala by creating a method which takes a function as a parameter (methods and functions have nuanced differences in Scala, but don’t worry about it):

object FileUtil {
  def walk(file: File, depth: Int, walkFunc: (File, Int) => Unit): Unit = {
    walkFunc(file, depth)
    Option(file.listFiles).map(_.map(walk(_, depth + 1, walkFunc)))
  }
}

In our Scala walk function we added a depth parameter which tracks how deep you are in the stack. We’re also wrapping the listFiles method in an Option to avoid a possible null pointer exception.

We can tweak our walkFunc and use our Scala walk function:

import FileUtil._
val walkFunc = (path: File, depth: Int) => { println(s"$depth, ${path}") }
walk(new File("/path/to/dir"), 0, walkFunc)

Because typing (File, Int) => Unit is somewhat obscure, type aliases come in handy. We can refactor this with a type alias:

type WalkFunc = (File, Int) => Unit

And update our walk method accordingly:

def walk(file: File, depth: Int, walkFunc: WalkFunc): Unit = { ... }

First class functions are powerful constructs making code flexible and succinct. If all you need is to call a function than pass that function as a parameter to your method. Just as classes have the single responsibility principle functions can have them too; avoid doing too much at once like file crawling and file processing. Instead pass a file processor call to your file crawling function.

Programming Akka’s Finite State Machines in Scala

Over the past few months my team has been building a new suite of services using Scala and Akka. An interesting aspect of Akka

we leverage is its Finite State Machine support. Finite State Machines

are a staple of computer programming although not often used in practice. A conceptual process can usually be represented with a finite state machine: there are a defined number of states with explicit transitions between states. If we have a vocabulary

around these states and transitions we can program the state machine.

A traditional implementation of an FSM is to check and maintain state explicitly via if/else conditions, use the state design pattern, or implement some other construct. Using Akka’s FSM support, which explicitly defines states and offers transition hooks, allows us to easily implement our conceptual model of a process. FSM is built on top of Akka’s actor model giving excellent concurrency controls so we can run many of these state machines simultaneously. You can implement your own FSM with Akka’s normal actor behavior with the become method to change the partial function handling messages. However FSM offers some nice hooks plus data management in addition to just changing behavior.

As an example we will use Akka’s FSM support to check data in two systems. Our initial process is fairly simplistic but provides a good overview of leveraging Finite State Machines. Say we are rolling out a new system and we want to ensure data flows to both the old and new system. We need a process which waits a certain

amount of time for data to appear in both places. If data is found in both systems we will check the data for consistency,

if data is never found after a threshold we will alert data is out of sync.

Based on our description we have four states. We define our states using a sealed trait:

sealed trait ComparisonStates
case object AwaitingComparison extends ComparisonStates
case object PendingComparison extends ComparisonStates
case object AllRetrieved extends ComparisonStates
case object DataUnavailable extends ComparisonStates

Next we define the data we manage between state transitions. We need to manage an identifier with data from

the old and new system. Again we use a sealed trait:

sealed trait Data
case object Uninitialized extends Data
case class ComparisonStatus(id: String, oldSystem: Option[SomeData] = None, newSystem: Option[SomeData] = None) extends Data

A state machine is just a normal actor with the FSM trait mixed in. We declare our

ComparisonEngine

actor with FSM support,

specifying our applicable state and data types:

class ComparisonEngine extends Actor with FSM[ComparisonStates, Data] {
}

Instead of handling messages directly in a receive method FSM support creates an additional layer of messaging handling.

When using FSM you match on both message and current state. Our FSM only handles two messages: Compare(id: Int) and

DataRetrieved(system: String, someData: SomeData). You can construct your data types and messages any way

you please. I like to keep states abstract as we can generalize on message handling. This

prevents us from dealing with too many states and state transitions.

Let’s start implementing the body of our ComparisonEngine. We will start with our initial state:

startWith(AwaitingComparison, Uninitialized)

when(AwaitingComparison) {
  case Event(Compare(id), Uninitialized) =>
    goto(PendingComparison) using ComparisonStatus(id)
}

We simply declare our initial state is AwaitingComparison, and the only message we are willing to process is a Compare.

When we receive this message we go to a new state–PendingComparison–and set some data. Notice how we aren’t actually doing anything else?

A great aspect of FSM is the ability to listen on state transitions. This allows us to separate state transition logic from state transition

actions. When we transition from an initial state to a PendingComparison state we want to ask our two systems for data. We simply match

on state transitions and add our applicable logic:

onTransition {
    case AwaitingComparison -> PendingComparison =>
      nextStateData match {
        case ComparisonStatus(id, old, new) => {
          oldSystemChecker ! VerifyData(id)
          newSystemChecker ! VerifyData(id)
        }
      }
    }

oldSystemChecker and newSystemChecker are actors responsible for verifying data in their respective systems. These can be passed in to the FSM as constructor arguments, or you can have the FSM create the actors and supervise their lifecycle.

These two actors will send a DataRetrieved message back to our FSM when data is present. Because we are now in the PendingComparison

state we specify our new state transition actions against a set of possible scenarios:

when(PendingComparison, stateTimeout = 15 minutes) {
  case Event(DataRetrieved("old", old), ComparisonStatus(id, _, None)) => {
    stay using ComparisonStatus(id, Some(old), None)
  }
  case Event(DataRetrieved("new", new), ComparisonStatus(id, None, _)) => {
    stay using ComparisonStatus(id, None, Some(new))
  }
  case Event(StateTimeout, c: ComparisonStatus) => {
    goto(IdUnavailable) using c
  }
  case Event(DataRetrieved(system, data), cs @ ComparisonStatus(_, _, _)) => {
    system match {
      case "old" => goto(AllRetrieved) using cs.copy(old = Some(data))
      case "new" => goto(AllRetrieved) using cs.copy(new = Some(data))
    }
  }
}

Our snippet says we will wait 15 minutes for our systemChecker actors to return with data, otherwise, we’ll timeout and go to the unavailable state. Either the old

system or new system will return first, in which case, one set of data in our ComparisonStatus will be None. So we stay in the PendingComparison state until

the other system returns. If our previous pattern matches do not match, we know the current message we are processing is the final message. Notice how we don’t care how these actors are getting their data. That’s the responsibility of the child actors.

Once we have all our data,

so we go to the AllRetrieved state with the data from the final message.

There are a couple of ways we could have defined our states. We could have a state for the oldSystem returned or newSystem returned. I find it easier to

create a generic PendingComparison state to keep our pattern matching for pending comparisons consolidated in a single partial function.

Our final states are pretty simple: we just stop our state machine!

when(IdUnavailable) {
  case Event(_, _) => {
    stop
  }
}
when(AllRetrieved) {
  case Event(_, _) => {
    stop
  }
}

Our last step is to add some more onTransition checks to handle our final states:

case PendingComparison -> AllRetrieved =>
    nextStateData match {
      case ComparisonStatus(id, old, new) => {
        //Verification logic
     }
   }
 case _ -> IdUnavailable =>
   nextStateData match {
     case ComparisonStatus(id, old, new) => {
      //Handle timeout
      }
   }

We don’t care how we got to the AllRetrieved state; we just know we are there and we have the data we need. We can offload our verification logic

to another actor or inline it within our FSM as necessary.

Implementing processing workflows can be tricky involving a lot of boilerplate code. Conditions must be checked, timeouts handled, error handling implemented.

The Akka FSM approach provides a foundation for implementing workflow based processes on top of Akka’s great supervision support. We create a ComparisonEngine

for every piece of data we need to check. If an engine dies we can supervise and restart. My favorite feature is the separation of what causes a state transition

with what happens during a state transition. Combined with isolated behavior amongst actors this creates a cleaner, isolated and composable application to manage.

Overview on Web Performance and Scalability

I recently gave a talk to some junior developers on performance and scalability. The talk is relatively high-level, providing an overview of non-programming topics which are important for performance and scalability. The original deck is here and on speaker deck.

A few months ago I also gave a talk on spdy which is also on speaker deck.

Spray API Development: Getting Started with a Spray Web Service Using JSON

Spray is a great library for building http api’s with Scala. Just like Play! it’s built with Akka and provides numerous low and high level tools for http servers and clients. It puts Akka and Scala’s asynchronous programming model first for high performance, composable application development.

I wanted to highlight the spray-routing library which provides a nice DSL for defining web services. The routing library can be used with the standalone spray-can http server or in any servlet container.

We’ll highlight a simple entity endpoint, unmarshalling Json data into an object and deferring actual process to another Akka actor. To get started with your own spray-routing project, I created a giter8 template to bootstrap your app:

$g8 mhamrah/sbt -b spray

The documentation is quite good and the source code is worth browsing. For a richer routing example check out Spray’s own routing project which shows off http-streaming and a few other goodies.

Creating a Server

We are going to create three main structures: An actor which contains our Http Service, a trait which contains our route definition, and a Worker actor that will do the work of the request.

The service actor is launched in your application’s main method. Here we are using Scala’s App class to launch our server feeding in values from typesafe config:

#!scala
val service= system.actorOf(Props[SpraySampleActor], "spray-sample-service")
IO(Http) ! Http.Bind(service, system.settings.config.getString("app.interface"), system.settings.config.getInt("app.port"))

println("Hit any key to exit.")
val result = readLine()
system.shutdown()

Because Spray is based on Akka, we are just creating a standard actor system and passing our service to Akka’s new IO library. This is the high performance foundation for our service built on the spray-can server.

The Service Actor

Our service actor is pretty lightweight, as the functionality is deferred to our route definition in the HttpService trait. We only need to set the actorRefFactory and call runRoutes from our trait. You could simply set routes directly in this class, but the separation has its benefits, primarily for testing.

#!scala
class SpraySampleActor extends Actor with SpraySampleService with SprayActorLogging {
  def actorRefFactory = context
  def receive = runRoute(spraysampleRoute)
}

The Service Trait – Spray’s Routing DSL

Spray’s Routing DSL is where Spray really shines. It is similar to Sinatra inspired web frameworks like Scalatra, but relies on composable function elements so requests pass through a series of actions similar to Unfiltered. The result is an easy to read syntax for routing and the Dont-Repeat-Yourself of composable functions.

To start things off, we’ll create a simple get/post operation at the /entity path:

#!scala
trait SpraySampleService extends HttpService {
  val spraysampleRoute = {
    path("entity") {
      get { 
        complete("list")
      } ~
      post {
        complete("create")
      }
    }
  }
}

The path, get and complete operations are Directives, the building blocks of Spray routing. Directives take the current http request and process a particular action against it. The above snippet doesn’t much except filter the request on the current path and the http action. The path directive also lets you pull out path elements:

#!scala
path ("entity" / Segment) { id =>
    get {
      complete(s"detail ${id}")
    } ~
    post {
      complete(s"update ${id}")
    }
  }

There are a number ways to pull out elements from a path. Spray’s unit tests are the best way to explore the possibilities.

You can use curl to test the service so far:

#!bash
curl -v http://localhost:8080/entity
curl -v http://localhost:8080/entity/1234

Unmarshalling

One of the nice things about Spray’s DSL is how function composition allows you to build up request handling. In this snippet we use json4s support to unmarshall the http request into a JObject:

#!scala
/* We need an implicit formatter to be mixed in to our trait */
object Json4sProtocol extends Json4sSupport {
  implicit def json4sFormats: Formats = DefaultFormats
}

trait SpraySampleService extends HttpService {
  import Json4sProtocol._

  val spraysampleRoute = {
    path("entity") {
      /* ... */
      post {
        entity(as[JObject]) { someObject =>
          doCreate(someObject)
        }
      } 
     /* ... */
  }
}

We use the Entity to directive to unmarshall the request, which finds the implicit json4s serializer we specified earlier. SomeObject is set to the JObject produced, which is passed to our yet-to-be-built doCreate method. If Spray can’t unmarshall the entity an error is returned to the client.

Here’s a curl command that sets the http method to POST and applies the appropriate header and json body:

#!bash
curl -v -X POST http://localhost:8080/entity -H "Content-Type: application/json" -d "{ \"property\" : \"value\" }"

Leveraging Akka and Futures

We want to keep our route structure clean, so we defer actual work to another Akka worker. Because Spray is built with Akka this is pretty seamless. We need to create our ActorRef to send a message. We’ll also implement our doCreate function called within the earlier POST /entity directive:

#!scala
//Our worker Actor handles the work of the request.
val worker = actorRefFactory.actorOf(Props[WorkerActor], "worker")

def doCreate[T](json: JObject) = {
  //all logic must be in the complete directive
  //otherwise it will be run only once on launch
  complete {
    //We use the Ask pattern to return
    //a future from our worker Actor,
    //which then gets passed to the complete
    //directive to finish the request.
    (worker ? Create(json))
                .mapTo[Ok]
                .map(result => s"I got a response: ${result}")
                .recover { case _ => "error" }
  }
}

There’s a couple of things going on here. Our worker class is looking for a Create message, which we send to the actor with the ask (?) pattern. The ask pattern lets us know the task completed so we call then tell the client. When we get the Ok message we simply return the result; in the case of an error we return a short message. The response future returned is passed to Spray’s complete directive, which will then complete the request to the client. There’s no blocking occurring in this snippet: we are just wiring up futures and functions.

Our worker doesn’t do much but out the message contents and return a random number:

#!scala
class WorkerActor extends Actor with ActorLogging {
import WorkerActor._

def receive = {
  case Create(json) => {
    log.info(s"Create ${json}")
    sender ! Ok(util.Random.nextInt(10000))
    }
  }
}

You can view how the entire request is handled by viewing the source file.

Wrapping Up

Reading the documentation and exploring the unit tests are the best way to understand the power of Spray’s routing DSL. The performance of the standalone spray-can service is outstanding, and the Akka platform adds resiliency through its lifecycle management tools. Akka’s remoting feature allows systems to build out their app tiers. A project I’m working on is using Spray and Akka to publish messages to a pub/sub system for downstream request handling. It’s an excellent platform for high performance API development. Full spray-sample is on GitHub.

Updating Flickr Photos with Gpx Data using Scala: Getting Started

If you read this blog you know I’ve just returned from six months of travels around Asia, documented on our tumblr, The Great Big Adventure with photos on Flickr. Even though my camera doesn’t have a GPS, I realized toward the second half of the trip I could mark GPS waypoints and write a program to link that data later. I decided to write this little app in Scala, a language I’ve been learning since my return. The app is still a work in progress, but instead of one long post I’ll spread it out as I go along.

The Workflow

When I took a photo I usually marked the location with a waypoint in my GPS. I accumulated a set of around 1000 of these points spread out over three gpx (xml) files. My plan is to:

  1. Read in the three gpx files and combine them into a distinct list.
  2. For each day I have at least one gpx point, get all of my flickr images for that data.
  3. For each image, find the waypoint timestamp with the least difference in time.
  4. Update that image with the waypoint data on Flickr.

Getting Started

If you’re going to be doing anything with Scala, learning sbt is essential. Luckily, it’s pretty straightforward, but the documentation across the internet is somewhat inconsistent. As of this writing, Twitter’s Scala School SBT Documentation, which I used as a reference to get started, incorrectly states that SBT creates a template for you. It no longer does, with the preferred approach to use giter8, an excellent templating tool. I created my own simplified version which is based off of the excellently documented template by Yuvi Masory. Some of the versions in build.sbt are a outdated, but it’s worthwhile reading through the code to get a feel for the Scala and SBT ecosystem. The g8 project also contains a good working example of custom sbt commands (like g8-test). One gotcha with SBT: if you change your build.sbt file, you must call reload in the sbt console. Otherwise, your new dependencies will not be picked up. For rubyists this is similar to running bundle update after changing your gemfile.

Testing

I’m a big fan of TDD, and strive for a test-first approach. It’s easy to get a feel for the small stuff in the scala repl, but orchestration is what programming is all about, and TDD allows you to design and throughly test functionality in a repeatable way. The two main libraries are specs (actually, it’s now specs2) and ScalaTest. I originally went with specs2. It was fine, but I wasn’t too impressed with the output and not thrilled with the matchers. I believe these are all customizable, but to get a better feel for the ecosystem I switched to ScalaTest. I like ScalaTest’s default output better and the flexible composition of testing styles (I’m using FreeSpec) and matchers (ShouldMatchers) provide a great platform for testing. Luckily, both specs2 and scalatest integrate with SBT which provides continuous testing and growl support, so you don’t need to fully commit to either one too early.

Six Months of Computer Science Without Computers

A few weeks ago I returned from a six month trip around Asia. I didn’t have a computer while abroad, but I was able to catch up on several tech books I never had time for previously. Reading about programming without actually programming was an interesting and rewarding circumstance. It provided a unique mental model: it was no longer about “how you do this” but about “why would you do this”. Accomplishment of a task via implementation was not an end goal. The end goal was simply absorbing information; once read, it didn’t need to be applied. It only needed to be reasoned about and hypothetically applied under a specific situation (which I usually did on a trek or on a beach). Before I would have been eager to try it out, hacking away, but without a computer, I couldn’t. It was liberating. Given a problem, and a set of constraints, what’s the ideal solution? I realize this is somewhat of an ivory-tower mentality, however, I also realized some of the best software has emerged from an idealism to solve problems in an opinionated way. Sometimes we are too consumed by the here-and-now we fail to step back for the bigger picture. Conversely, we hold onto our ideals and fail to adapt to changing circumstances.

My favorite aspect of learning technology while traveling abroad did not come from any book or video. A large part of computer science is about optimizing systems under the pressure of constraints. Efficient algorithms, clean code, improving performance. The world is full of sub-optimal processes. Burmese hotels, the Lao transportation system, and Nepalese immigration to name a few. On a larger scale sun-optimal problems are created by geographic, socio-economic, or political constraints. People try the best they can to improve their way of life, and unfortunately, the processes are often “implemented” with a “naïve” solution. Some are also inspiring. It was powerful to see these systems up close, with cultural and historical factors so foreign. One thing is certain: when you optimize for efficiency, everyone wins.

Below are a selection of books and resources I found particularly interesting. I encourage you to check them out, hopefully away from a computer in a foreign land:

97 Things Every Programmer Should Know : A great selection of tidbits from a variety of sources. Nothing new for the experienced programmer, but reading through the sections is a great refresher to keep core principles fresh. Worthwhile to randomly select a chapter now and again for those “oh yeah” moments.

Beautiful Code by Andy Oram and Greg Wilson: My favorite book. Not so much about code, but the insight about solving problems makes it a great read. I appreciate the intelligent thought process which went into some of the chapters. Python’s hashtable implementation and debugging prioritization in the Linux kernel are two highlights.

Exploring Everyday Things with R and Ruby by Sau Sheong Chang: This is a short book with great content. You only need an elementary knowledge of programming and mathematics to appreciate the concepts. It’s also a great way to get a taste of R. The book covers a variety of topics from statistics, machine learning, and simulations. My favorite aspect is how to use modeling to verify a hypothesis or create a simulation. The chapters involving emergent behavior are particularly interesting.

Machine Learning for Hackers by Drew Conway and John Myles White: I’ve been interested in machine learning for a while, and I was very happy with this read. Far more technical and mathematical than Exploring Everyday Things, this book digs into supervised and unsupervised learning and several aspects of statistics. If you’re interested in data science and are comfortable with programming, this book is for you.

Scala in Action by Nilanjan Raychaudhuri: Scala and Go have been on my radar for a while as new languages to learn. It’s funny to learn a new programming language without being able to test-drive it, but I appreciated the separation. My career has largely been focused on OOP: leveraging design patterns, class composition, SOLID principles, enterprise architecture. After reading this book I realize I was missing out on great functional programming paradigms I was only unconsciously using. Languages like Clojure and Haskell are gaining steam for a radically different approach to OOP, and Scala provides a nice balance between the two. It’s also wonderfully expressive: traits, the type system, and for-comprehension are beautiful building blocks to managing complex behavior. Since returning I’ve been doing Scala full-time and couldn’t be happier. It’s everything you need with a statically typed language with everything you want from a dynamic one (well, there’s still no method-missing, at least not yet). I looked at a few Scala books and this is easily on the top of the list. Nilanjan does an excellent job balancing language fundamentals with applied patterns.

HBase: The Definitive Guide by Lars George: I’ve been deeply interested in distributed databases and performance for some time. I purchased this book a few years ago when first exploring NoSQL databases. Since then, Cassandra has eclipsed the distributed hashtable family of databases (Riak, Hbase, Voldemort) but I found this book a great read. No matter what implementation you go with, this book will help you think in a column-orientated way, offering great tidbits into architectural tradeoffs which went into HBase’s design. At the very least, this book will give you a solid foundation to compare against other BigTable/Dynamo clones.

The Architecture of Open Source Applications: I was excited when I stumbled upon this website. It offers a plethora of information from elite contributors. The applied-practices and deep architectural insight are valuable lessons to learn from. Andrew Alexeev on Nginx, Kate Matsudaira on Scalable Web Architecture and Martin Sústrik on ZeroMQ are highlights.

iTunes U

I was also able to check out some courses on iTunes U while traveling. The MIT OCW Performance Engineering of Software Systems was my favorite. Prof. Saman Amarasinghe and Prof. Charles Leiserson were both entertaining lecturers, and the course provided great insight into memory management, parallel programming, hardware architecture, and bit hacking. I also watched several lectures on algorithms giving me a new found appreciation for Big-O notation (I wish I remembered more while on the job interview circuit). I’ve been gradually neglecting the importance of algorithmic design since graduating ten years ago, but found revisiting sorting algorithms, dynamic programming, and graph algorithms refreshing. Focusing on how well code runs is as important as how well it’s written. Like most things, there’s a naïve brute-force solution and an elegant, efficient other solution. You may not know what the other solution is, but knowing there’s one lurking behind the curtain will make you a better engineer.

So, if you can (and you definitely can!) take a break, grab a book, read it distraction free, gaze out in space and think. You’ll like what you’ll find!