Klout Engineering

Iteratees in Big Data at Klout

January 25th, 2013 by Naveen Gattu

At Klout we calculate social influence of users across several social networks and we must be able to collect and aggregate data from these networks scalably and within a meaningful amount of time. We then deliver these metrics to our users visually.

Our users and clients expect data to be up to date and accurate and it has been a significant technical challenge to reliably meet these goals. In this blog post we describe the usage of Play! Iteratees in our redesigned data collection pipeline. This post is not meant to be a tutorial on the concept of Iteratees, for which there are many great posts already such as James Roper’s post and Josh Suereth’s post. Rather, this post is a detailed look at how Klout uses Iteratees in the context of large scale data collection and why it is an appropriate and effective programming abstraction for this use case. In a later post we will describe our distributed Akka-based messaging infrastructure, which allowed us to scale and distribute our Iteratee based collectors across clusters of machines.

Iteratees in a Nutshell

In a sentence, Iteratees are the functional way to model producers and consumers of streams of data. Single chunks of data are iteratively produced by an Enumerator, optionally mapped/adapted by an Enumeratee and then consumed by an Iteratee. Each stage can be composed and combined together in a pipeline like fashion.

Enumerator (produce data) → Enumeratee (map data) → Iteratee (consume data)

Other compositions are also possible with the Play! Iteratee library, such as Enumerator interleaving (multiple concurrent enumerators), Enumeratee chaining, and Iteratee folding (grouping enumerated data into larger chunks).

Legacy Data Collection

Our legacy collection framework was written in Java and built on the java.util.concurrent library. As a consequence, data collection consisted of isolated nodes fetching user data sequentially with convoluted succeed-or-retry semantics in blocking threads. As our datasets grew with an ever increasing user base, the inefficiencies of the system started to become apparent.

Data was fetched and written to disk in much the same way an Apache based web-server services requests; with a single thread responsible for all possible IO encountered in the code path for a single user collection request. At a high level this IO consists of the following 3 stages (for every user/network/data-type combination):

1. Dispatch the appropriate api calls and parse json responses
2. Write data to disk / HDFS
3. Update progress metrics

These three stages are necessarily sequential, but individually they are highly paralelizable, and more importantly may be executed asynchronously. For instance, in stage 1 we can issue multiple simultaneous API calls to construct a particular user activity, i.e. John Doe posted “I love pop-tarts for breakfast!” to Facebook and received 20 comments and 200 likes. The activity, which consists of the status message and all 20 comments and 200 likes, can be constructed asynchronously and in parallel.

New Collection

Recognizing the gains to be made from a non-blocking, parallelized implementation, we decided to re-architect the collector framework with the awesome Play!+Scala+Akka stack. This stack has many nice feature sets and libraries, of particular interest is the Typesafe implementation of Iteratees. The nice thing about this implementation are the pre-written utilities such as Iteratee.foreach and its rich Enumeratee implementation. We also made very heavy use of the Play! WebServices library, which provides a thin scala wrapper around the Ning asynchronous http client and integrates beautifully with the iteratee library using Play Promises (to be completely integrated with Akka promises in the Play 2.1 release).

Paging Enumerator

Data from a network, accessible via api calls, is conveniently returned as paginatable json. To support this we needed a generic and abstract paging mechanism for each type of data we were fetching, being posts, likes or comments. We exploited the fact that each page of data included a next url for easy pagination, which is very elegantly handled with a fromCallback Enumerator:

This example leaves out error handling, retry and backoff logic but provides a good initial intuition. Given a starting url, the enumerator simply keeps track of one piece of mutable state, the next url, which it continually updates each time the retriever function is invoked. This paging enumerator gives us a way to reactively fetch data, where we do not fetch more data than requested. For instance we could apply this enumerator to a file writing Enumeratee and be assured that we will not overwhelm our disk. Or use a ‘take’ Enumeratee to limit the number of ws calls to a predefined limit. We could then attach a status updating iteratee to this chain and be assured that we will not overwhelm our database. An Enumeratee, for those squinting their eyes, can be thought of as an adapter which transforms data types produced by an Enumerator, to types consumed by an Iteratee.

Enumerator of Enumerators

The paging enumerator is great for keeping track of next page urls, however the json data on each page is typically a list of posts which need individual processing. Each post typically contains an associated set of likes and comments with corresponding fetch urls, which also need to be paged through and joined to the original post json in order to construct a full activity which we can then finally consume. We want to be able to generate and process each activity as a single json document, with all its associated likes and comments meta-data, while still maintaining our requirement of not overwhelming our system, with the additional goal of parallelizing API calls as much as possible. Exploiting the highly composable nature of the Iteratee library, we can process a stream of posts while fetching the associated likes and comments and build each activity in parallel using a combination of Enumeratee.interleave and Iteratee.fold:

We can now apply our buildActivity method to each post in each list of posts:

As a final step, we need to ‘flatten’ our Enumerator of Enumerators to create an Enumerator of Activities. However at the time of this writing, flattening Enumerators was not part of the standard Play! Iteratee library, so we took the liberty of writing one ourselves:

File Writing Enumeratee

Armed with our shiny new reactive, paginated and parallelized Activity Enumerator, we now need to hook it up to our file writing logic. Lets assume the internals of the file writing have been abstracted away into one function:

writeToFile(json: JsValue): Promise[Either[Failure, Success]]

From the type signature of writeToFile we can assume it executes asynchronously, finally returning either Failure or Success objects. From this we can construct an Enumeratee which we could then apply our Activity Enumerator too (as part of the overall Iteratee pipeline):

Again, flatMap is not part of the standard Iteratee library:

The file writing Enumeratee simply maps each Activity to an Either, containing either the Failure from writeToFile if it failed, or the Activity for further processing. Note that although conceptually, file writing seems more like the job of an Iteratee, an Enumeratee is the appropriate construct since we do not want to ‘consume’ input from the Enumerator, we want to map input and pass on for later processing. We now have stage 2 of our 3 stage pipeline.

Status Updating Iteratee

As we process each Activity, we want to iteratively collect and report stats, cursor information, errors and other meta-data. Since this is the final stage, we appropriately model it as an Iteratee which acts as the ‘sink’ in our pipeline. For reasons of clarity and brevity, this is a simplified version of the actual Iteratee we use, but it illustrates the point:

Putting it All Together

The final step is to hook all these guys up so that it can actually do something meaningful:

The beauty is in the simplicity, and more importantly, the composability. We could add other pipeline stages simply by implementing an Enumeratee or Iteratee with the proper types, and we will get all the other benefits for free.

Data collection is the foundation of the Klout experience, enabling us to aggregate, analyze and track influence across our social lives. It’s what allows us to highlight our most influential moments.

Want to dabble with Iteratees in production? Come join us and ping us on Twitter at @navgattu@AdithyaRao, @prantik, and @dyross.

This entry was posted on Friday, January 25th, 2013 at 9:00 am and is filed under Big Data, Platform Engineering, Science. You can follow any responses to this entry through the RSS 2.0 feed.

You can leave a response, or trackback from your own site.