At Klout, we love data, and as Dave Mariani, Klout’s VP of Engineering, stated in his latest blog post, we’ve got lots of it! Klout currently uses Hadoop to crunch large volumes of data, but what do we do with all of it? You already know about the Klout score, but I want to talk about a new feature I’m extremely excited about: search!
Problem at Hand
I just want to start off by saying: search is hard! Yet the requirements were pretty simple: we needed a robust solution for searching across all scored Klout users. Did I mention it had to be fast? Everyone likes to go fast! The catch is that 100 million people have Klout (and that was as of this past September, an eternity in social media time), which means our search solution had to scale, and scale horizontally.
So how did we accomplish that?
Share Nothing and Don’t Block
We use Node.js in our front end to help scale to thousands of concurrent users, and we follow the same philosophy in our search backend. Given the size of our dataset and its substantial growth rate, we needed a search solution that could scale horizontally. On the application side, we wanted a stateless web layer, not only for performance but also for manageability. So share nothing and block as little as possible!
Let’s Play! and be “cool, bonsai cool”
The technology stack chosen to address the problem was ElasticSearch and the Play! Framework. Why did we choose that stack? At Klout, we like to choose the right tool for the job, regardless of the platform it runs under or the company that’s behind it. We chose ElasticSearch and Play! because both of these were designed to use fast, non-blocking IO, both of these provide powerful infrastructure, and both of these were designed to be easy to extend. These tools help us build powerful search now, and continue improving search to give you more relevant results.
ElasticSearch is a powerful, scalable, distributed search solution built on strong foundations: JBoss Netty and Apache Lucene. Lucene, a personal favorite of mine, was created by Doug Cutting, who has had a huge impact on many tools we use at Klout; he is also the creator of Hadoop (and Nutch, for that matter!). Lucene is a search library, now more than ten years old, that provides powerful capabilities such as relevancy ranking, fuzzy matching, wildcards, proximity operators, fielded search, spell-checking, multi-lingual support, and all that jazz, all while remaining completely portable since it’s a JVM-based solution. Most importantly, it’s blazing fast!
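To give a taste of those capabilities, here are a few queries in Lucene’s classic query syntax (the field names are illustrative, not Klout’s actual schema):

```text
name:chris          fielded search: match "chris" in the name field
chris~              fuzzy matching: also matches close spellings
chri*               wildcard: matches chris, christina, ...
"san francisco"~2   proximity: both words within 2 positions of each other
name:chris^4        boost: weight matches on name 4x in relevancy ranking
```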
ElasticSearch uses JBoss Netty as its network library for asynchronous, non-blocking IO. In a traditional blocking IO model, searching across multiple shards would be extremely expensive: we could either retrieve results serially, meaning searches would slow down as our data grew, or run the searches in parallel threads, which would require ever-increasing processing resources. Netty lets ElasticSearch retrieve results from multiple search nodes in parallel, with no threads blocked waiting for each shard to finish.
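The scatter-gather idea behind that can be sketched in plain Java with `CompletableFuture`. This is only an illustration of the pattern, not ElasticSearch’s internals; the `searchShard` function and its results are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ScatterGather {
    // Hypothetical per-shard search: returns matching ids from one shard.
    static CompletableFuture<List<String>> searchShard(int shard, String query) {
        return CompletableFuture.supplyAsync(() ->
                List.of("shard" + shard + ":" + query));
    }

    // Fan out to all shards at once, then merge. Total latency tracks the
    // slowest shard rather than the sum of all shards.
    static CompletableFuture<List<String>> search(int shards, String query) {
        List<CompletableFuture<List<String>>> parts =
                IntStream.range(0, shards)
                        .mapToObj(s -> searchShard(s, query))
                        .collect(Collectors.toList());
        return CompletableFuture.allOf(parts.toArray(new CompletableFuture[0]))
                .thenApply(done -> parts.stream()
                        .flatMap(p -> p.join().stream())
                        .collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        // Merged results keep shard order because we join in list order.
        System.out.println(search(3, "klout").join());
    }
}
```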
We used the Play! Framework for the web layer, which also uses JBoss Netty as its network library. Why Play!? To find out more about this great framework, watch my Dreamforce presentation from this past September here in San Francisco, CA: “Introducing Play! Framework: Painless Java and Scala Web Applications”. Just recently, Play! joined Typesafe, the company behind Scala, as an official part of its Scala-based technology stack and the provider of its web solution.
Akka, also part of Typesafe’s stack, is an event-driven, self-healing concurrency platform for the JVM based on an Erlang-style actor model. In summary, Akka helps Klout’s search go fast! We have actors for each type of search we support, and messages are dispatched to their mailboxes as Play’s controller actions are invoked. Akka actors, which are quite similar to Scala actors, let us effortlessly run searches in parallel, minimizing overall response time and giving our users the best experience possible.
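For readers new to the actor model: an actor is, roughly, an object with a mailbox (a queue) drained by a single worker, so senders only enqueue messages and never block on the work. Below is a bare-bones plain-Java analogy of that pattern; it is not Akka’s API, and the query strings are made up for illustration.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Rough plain-Java analogy of an actor: one mailbox, one thread draining it.
public class SearchActor {
    private static final String POISON = "__STOP__"; // shutdown sentinel
    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private final StringBuilder handled = new StringBuilder();
    private final Thread worker = new Thread(() -> {
        try {
            while (true) {
                String query = mailbox.take();      // wait for next message
                if (query.equals(POISON)) return;   // stop cleanly
                handled.append(query).append(';');  // "perform the search"
            }
        } catch (InterruptedException e) { /* shut down */ }
    });

    public SearchActor() { worker.start(); }

    // tell(): fire-and-forget dispatch to the mailbox; the caller never blocks.
    public void tell(String query) { mailbox.add(query); }

    // Drain remaining messages, stop the worker, and report what was handled.
    public String drainAndStop() throws InterruptedException {
        mailbox.add(POISON);
        worker.join();
        return handled.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        SearchActor actor = new SearchActor();
        actor.tell("user:clay");
        actor.tell("topic:coffee");
        System.out.println(actor.drainAndStop()); // user:clay;topic:coffee;
    }
}
```

Messages are handled one at a time in arrival order, which is what makes actors safe without explicit locks; parallelism comes from running many actors, one per search type, at once.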