Klout Engineering

Scoozie: Creating Big Data Workflows at Klout

August 1st, 2013 by Matthew Johnson

Workflow specification has been a major pain point for us at Klout. Workflows, a way of chaining together Big Data jobs (MapReduce jobs, Hive scripts, distributed copies, file system tasks, etc.), are an essential tool for organizing large batch jobs into scalable and maintainable pipelines. One standard Hadoop workflow system is Apache Oozie, an XML-based technology. Scoozie is our solution: it improves the developer experience of using Oozie.

Oozie requires writing copious amounts of XML and often leads to copying and pasting a mashup of jobs and configurations from file to file. These specifications are not type safe, are prone to typos, are difficult to read and understand, and are often difficult to test. We wanted to make workflows easy to write, understand, and maintain while encouraging best practices.

On the path to a solution to this problem, the team evaluated several options, including replacing Oozie and implementing a new workflow engine. We decided, however, that Oozie actually works great; the major pain is the specification. So we thought, how can we make Oozie easier to use? The answer was to develop Scoozie.

The Philosophy Behind Scoozie


The natural choice to solve these problems was to implement a Scala DSL on top of Oozie’s XML schema (Scala + Oozie = Scoozie). Scala’s strong typing and support for common-sense readability (examples to come), as well as the fact that Klout is already a huge Scala shop, made Scala the perfect fit. Developers with no experience with Scala at Klout have found Scoozie and Scala easy to pick up and intuitive to use.

In a traditional Oozie workflow, nodes are forward-chaining: the developer specifies which node runs next after each node finishes. This can be hard to track, and it requires the developer to keep every node in the chain in mind while developing the workflow. Instead of following this design philosophy, we decided to focus the Scoozie DSL around dependencies. In this way the developer only needs to look at one node in a workflow at a time and think only about what that node depends on. Scoozie will figure out the rest. This greatly reduces cognitive overhead and improves productivity.
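To make the contrast concrete, here is a minimal toy sketch of a dependency-oriented specification. It is not Scoozie's actual API: `Node`, `dependsOn`, `allNodes`, and `transitions` are hypothetical names chosen for illustration. Each node declares only what it depends on, and the forward "which node runs next" transitions that Oozie wants are derived by inverting those edges.

```scala
// Toy model only: these names are illustrative assumptions, not Scoozie's real API.
object DependencySketch {
  // A node records only what it depends on, never what comes after it.
  final case class Node(name: String, deps: List[Node] = Nil) {
    def dependsOn(others: Node*): Node = copy(deps = others.toList)
  }

  val start   = Node("start")
  val extract = Node("extract") dependsOn start
  val score   = Node("score") dependsOn extract
  val finish  = Node("finish") dependsOn score

  // Walk backwards from the last node to collect every node in the workflow.
  def allNodes(last: Node): List[Node] =
    (last :: last.deps.flatMap(allNodes)).distinct

  // Invert the dependency edges to recover forward transitions (node -> next nodes).
  def transitions(last: Node): Map[String, List[String]] = {
    val edges = for (n <- allNodes(last); d <- n.deps) yield d.name -> n.name
    edges.groupBy(_._1).map { case (from, pairs) =>
      from -> pairs.map(_._2).distinct.sorted
    }
  }
}
```

Calling `DependencySketch.transitions(DependencySketch.finish)` yields the forward edges, so the developer never has to write them by hand.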

What it Looked Like Before

Oozie XML is nasty, and in order to give a taste of exactly how nasty it is, here’s an example of what a very simple pipeline of two parallel jobs looks like without Scoozie.

[Screenshot: Oozie XML for a simple pipeline of two parallel jobs]

Clearly, many of these parameters will be common to most, if not all, of the MapReduce jobs that need to be specified, yet these parameters and configurations must be manually copied from file to file or even from job to job, as in the example above. Simply missing a parameter or introducing a typo in a minor change could lead to hours of frustrated debugging later on.

What it Looks Like Now 

Focus on Dependencies

[Screenshot: general structure of a Scoozie workflow specification]

This is an example of the general structure of any Scoozie workflow specification. As can be seen, the developer only needs to worry about dependencies and hence only needs to focus on one node at a time.

Don’t Worry About Forks and Joins

[Screenshot: Scoozie workflow example for forks and joins]

Again, the Scoozie developer only needs to worry about the dependencies of any given node.

Scoozie is smart enough to figure out fork/join structures for you, and even verify workflows against Oozie’s strict fork/join requirements. Already, it can be seen how much clearer this is than the XML example provided above.
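As a rough sketch of how fork and join points can be inferred from dependencies alone, consider the toy model below. It is an assumption-laden illustration, not Scoozie's implementation: `Node`, `forks`, and `joins` are hypothetical names, and real validation against Oozie's fork/join rules is more involved than counting edges.

```scala
// Toy model only: hypothetical names, not Scoozie's real API or algorithm.
object ForkJoinSketch {
  final case class Node(name: String, deps: List[Node] = Nil) {
    def dependsOn(others: Node*): Node = copy(deps = others.toList)
  }

  // Two parallel jobs that both depend on start; the last node depends on both.
  val start  = Node("start")
  val jobA   = Node("jobA") dependsOn start
  val jobB   = Node("jobB") dependsOn start
  val finish = Node("finish").dependsOn(jobA, jobB)

  def allNodes(last: Node): List[Node] =
    (last :: last.deps.flatMap(allNodes)).distinct

  // A node with more than one dependency needs a join in front of it.
  def joins(last: Node): List[String] =
    allNodes(last).filter(_.deps.size > 1).map(_.name).sorted

  // A node that more than one other node depends on needs a fork after it.
  def forks(last: Node): List[String] = {
    val edges = for (n <- allNodes(last); d <- n.deps) yield d.name -> n.name
    edges.groupBy(_._1)
      .collect { case (from, outs) if outs.distinct.size > 1 => from }
      .toList
      .sorted
  }
}
```

In this sketch a fork is needed after `start` and a join is needed before `finish`, even though neither was written explicitly.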

Modularity and Scalability

[Screenshot: Scoozie workflow that nests the previous workflow as “Subwf”]

Scoozie lets the developer nest workflows, allowing for better readability, abstraction, and reuse of code (notice that “Subwf” on line 4 is the entire workflow specified in the previous example). This is another step toward making best practices easy to follow when specifying big-data workflows.
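The idea of nesting can be sketched with a small toy model in which a workflow is itself a unit of work that can appear inside another workflow. The types `Work`, `Job`, and `Workflow` and the function `jobNames` are hypothetical and only illustrate the composition; they are not Scoozie's real classes.

```scala
// Toy model only: hypothetical types, not Scoozie's real classes.
object NestingSketch {
  sealed trait Work { def name: String }
  final case class Job(name: String) extends Work
  // A workflow is also a unit of work, so it can be nested inside another one.
  final case class Workflow(name: String, steps: List[Work]) extends Work

  // Flatten nested workflows into the ordered list of leaf job names.
  def jobNames(w: Work): List[String] = w match {
    case Job(n)             => List(n)
    case Workflow(_, steps) => steps.flatMap(jobNames)
  }

  // A reusable sub-workflow, defined once.
  val scoring = Workflow("scoring", List(Job("extract"), Job("score")))

  // The parent workflow treats the sub-workflow like any other step.
  val pipeline = Workflow("pipeline", List(Job("ingest"), scoring, Job("publish")))
}
```

Because the sub-workflow is an ordinary value, it can be reused in several parent workflows without copying its definition.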

Using Scala’s Super-Flexible Syntax

[Screenshot: Scoozie syntax example]

Scoozie strives to create a syntax that makes the DSL intuitive to read and pick up. The Scoozie engine makes this common-sense syntax possible by doing the legwork for you in the background.

In addition, custom job factory methods are easy to create and use, making it natural to reuse jobs to prevent the duplication of configuration code in workflow specifications. Scoozie helps you minimize boilerplate!
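A job factory along these lines might look like the sketch below. Everything here is an illustrative assumption rather than Scoozie's API: the `MapReduceJob` case class, the `mrJob` helper, the `com.example` class names, and the chosen default values are all hypothetical. The point is that shared settings live in one place and each job states only what differs.

```scala
// Toy model only: MapReduceJob and mrJob are hypothetical, not Scoozie's real API.
object JobFactorySketch {
  final case class MapReduceJob(name: String, config: Map[String, String])

  // Settings every job shares, defined once instead of copied into each job.
  // The values are placeholders chosen for illustration.
  val commonConfig: Map[String, String] = Map(
    "mapred.job.queue.name" -> "default",
    "mapred.reduce.tasks"   -> "10"
  )

  // Factory method: callers supply only the job-specific pieces.
  // Later maps win in ++, so overrides replace the shared defaults.
  def mrJob(
      name: String,
      mapper: String,
      overrides: Map[String, String] = Map.empty
  ): MapReduceJob =
    MapReduceJob(name, commonConfig ++ Map("mapred.mapper.class" -> mapper) ++ overrides)

  val scoreJob = mrJob("score", "com.example.ScoreMapper")
  val bigJob   = mrJob("big", "com.example.BigMapper", Map("mapred.reduce.tasks" -> "50"))
}
```

A typo in a shared setting is then fixed in one place instead of in every workflow file.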

Open Sourcing Scoozie

Scoozie has been a success at Klout, so we hope others in the Scala and Big Data communities can take advantage of this project as well. Please check out the Scoozie repo for more code examples and an in-depth tutorial of how to get started using Scoozie. And if you’re interested in working with Scala and big data at Klout, visit our careers page or take a look at Klout’s other Big Data open source project called Brickhouse.

Additionally, we would like to give a shout-out to a great open-source project, scalaxb, an sbt plugin that takes .xsd files and creates matching Scala case classes. Scoozie populates these case classes, which are then automatically converted to XML by scalaxb. This plugin saved us a lot of headaches in the actual process of conversion to XML.
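The case-class-to-XML step can be pictured with the hand-written sketch below. This is not scalaxb output and does not use scalaxb: the `Action` case class and `toXml` function are hypothetical stand-ins, the element is heavily simplified (a real Oozie action also carries a job body), and no XML escaping is performed.

```scala
// Toy model only: hand-written stand-in for what generated case classes
// and their XML serialization roughly look like. Not scalaxb's API.
object XmlSketch {
  // Simplified view of an Oozie action: a name plus success and failure targets.
  final case class Action(name: String, okTo: String, errorTo: String)

  // Render the case class as an XML string (no escaping, for illustration only).
  def toXml(a: Action): String =
    s"""<action name="${a.name}"><ok to="${a.okTo}"/><error to="${a.errorTo}"/></action>"""
}
```

Working with typed case classes means a misspelled field is a compile error rather than a malformed XML file discovered at run time.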

This entry was posted on Thursday, August 1st, 2013 at 4:21 pm and is filed under Big Data, Open Source.


  • Abhishek Gadiraju

    I wanna be like Matt Johnson when I grow up