By continuing to use this site you are agreeing to our use of cookies, as set out in our privacy policy.

« go back Tech comment

Keeping up with data streaming demands

Article by Chris Sawyer Data_streaming.jpg

Clients with a continuous, large stream of data coming in that needs to be processed really quickly and must always be up – regardless of hardware reliability – is becoming an increasingly common scenario. So what’s the solution?

I’m glad you asked. There are several quite different solutions and the one that’s chosen will depend on a variety of factors. Yet all of the solutions have some common factors:

  • Java/Scala. Mostly due to its networking and scalability and general Java Virtual Machine flexibility and underpinnings.
  • Little or no concurrency. Concurrency is hard. Really hard. To err is human; to really foul things up requires a developer who thinks it is OK to toss threads around willy-nilly. Industry experience over a couple of decades has demonstrated that performance of hand-written multi-threading systems is rarely what was expected and it’s too difficult to foresee circumstances where threads require synchronisation or yielding.
  • Anything with ‘Streams’ in the name usually has a Domain Specific Language (DSL) to describe how the data flows from Producers, through Processors, to Sinks. They all look very similar, in a kind of fluent-y way.
  • Data can be streamed because that’s the way the world is going – no one wants to wait any longer.

The solutions to scalable performance lies in writing mostly single-threaded classes that are controlled by multi-threaded frameworks or libraries, which are based on the idea of it doing all the tricky threading stuff for you. It’s also preferable to have little or no state. So here are four good solutions:

  1. Akka and Akka Streams. Akka is a great solution. It is quite difficult to get heads around the Actor paradigm for describing ‘units of work’, but it’s marvellous once they do. Each Actor has its own message queue and state, so careful thought needs to go into how they’ll communicate and how the Actor hierarchy is structured. Akka Streams has a DSL to facilitate describing how data flows through the system.
  2. Micro-service architecture with Kafka as the communication bus. Kafka is brilliant. Think standard messaging but far higher performance and scalability/fault tolerance. LinkedIn uses it to pump over 750TB of data per day. A cluster of three mediocre PCs can write 2 million Kafka messages per second to disk. There are also Kafka Streams if you need true streaming rather than micro-batching and millisecond response. It also has a DSL for the usual reasons. For a micro-service architecture, it’s essential that you deal with fault-tolerance and service failures, so you should also investigate the Netflix libraries and the Spring Cloud services as they’re fantastic when used together. Eureka, Hystrix, Zuul and Cloud Config – fantastic.
  3. JEE with Enterprise JavaBeans (EJBs). Java Enterprise Edition also has scalability and resilience sewn up. EJBs work on the principle that you describe in Java the work they do and once you’ve deployed them to an EJB container server – such as IBM WebSphere or Oracle WebLogic – they’ll run them concurrently. JEE lost the enterprise space a long time ago to Spring Framework due to very poor design and implementation. So you’ll need a good reason to adopt them.
  4. Apache Storm and Storm Trident. In many ways, this is similar to Akka and Akka Streams but with a more intelligent failover and routing. It has the tremendous advantage of guaranteed message delivery. Storm is based on Task-Based Parallelism whereas, for example, Apache Spark Streaming is based on Data-Based Parallelism; one distributes tasks (often threads) across the same data and the other runs the same task on different data.
  5. Lambda Architecture with Apache Spark. Lambda architecture is a ‘have your cake and eat it’ approach in that it’s designed to give batch processing, like Hadoop’s MapReduce while also carrying out Stream processing while you wait. A lot of Lambda solutions use Apache Spark under the covers with some other framework floating on top to force these two separate paradigms together.
  6. Other solutions include Apache Samza, Druid and Apache Ignite. They are all worth a quick glance, but are unlikely to make it into any tier-one financial institutions.