The big data Spark

11 April 2016

The open source big data processing framework Apache Spark has become the one-size-fits-all solution for big data and big calc problems. Chris Sawyer offers his insight into this development.

Market Analysis has forecast that Apache Spark’s market will grow at a compound annual growth rate (CAGR) of 67% between 2017 and 2020, to be worth approximately $4.2bn (£3bn) by 2020, with a cumulative market value of $9.2bn (£6.5bn) over 2017-2020.
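As a quick sanity check on those figures, the sketch below works backwards from the 2020 value at a constant 67% growth rate. It assumes the rate applies to each of the three year-on-year steps from 2017 to 2020; the forecast itself does not spell this out, and the function name is purely illustrative.

```python
# Figures quoted in the forecast (USD billions).
VALUE_2020 = 4.2
CAGR = 0.67


def implied_yearly_values(final_value, rate, first_year, last_year):
    """Work backwards from the final year's value at a constant growth rate.

    Assumption: the same rate applies to every year-on-year step.
    """
    values = {}
    value = final_value
    for year in range(last_year, first_year - 1, -1):
        values[year] = value
        value /= 1 + rate
    # Return the years in ascending order for readability.
    return dict(sorted(values.items()))


yearly = implied_yearly_values(VALUE_2020, CAGR, 2017, 2020)
cumulative = sum(yearly.values())

for year, value in yearly.items():
    print(f"{year}: ${value:.2f}bn")
print(f"Cumulative 2017-2020: ${cumulative:.2f}bn")
```

Under that assumption the implied market starts at roughly $0.9bn in 2017 and the four years sum to about $9.1bn, which is broadly in line with the quoted cumulative figure.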

The main reason is obvious: flexibility. Spark can be a one-size-fits-all solution for big data or big calc problems, and it supports a wide range of languages (yes, even C# via the Microsoft Mobius project). And if you've ever been frustrated by the time taken for a batch process running on Hadoop, Spark provides near real-time responsiveness.

It also doesn't hurt that it's open source and runs on commodity hardware (which could come from the standard churn of desktop computers), so it is possible to get supercomputer-class processing power almost for free.

Its flexibility comes from the supplied extension modules. These cover SQL queries (Spark SQL), machine learning (MLlib), graph processing (GraphX) and streaming data (Spark Streaming) – and it’s the last of these that’s picking up the most interest, taking big data into the more responsive, near real-time arena.

With Spark, it is possible to build an intraday risk engine on commodity hardware with impressive responsiveness and storage capacity, and still query the results using ANSI(-ish) SQL. It can work with a range of storage layers, such as Hadoop’s HDFS, Amazon’s S3 or just plain file storage, so anyone with an existing Hadoop implementation has a head start.

The majority of the sales calls I have been involved with in recent months have featured several questions about our Spark skills – no mention of Hadoop any longer, just Spark.

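The core of Spark grew out of research at UC Berkeley's AMPLab, by a team that went on to found Databricks. They saw that if you could remove the file reads and writes between the steps of Hadoop’s MapReduce process and keep intermediate data in memory instead, you could substantially speed things up – by up to 100 times for some in-memory workloads, according to the project's own figures.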
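To make the idea concrete, here is a deliberately tiny, pure-Python sketch of a two-stage job run both ways. It is not how Hadoop or Spark are implemented; the stage functions and the JSON temp file are stand-ins chosen for illustration only. The point is that both routes give the same answer, but one pays for a write and a read between the stages.

```python
import json
import os
import tempfile


def stage_double(values):
    """First stage: a simple per-record transformation."""
    return [v * 2 for v in values]


def stage_sum(values):
    """Second stage: an aggregation over the first stage's output."""
    return sum(values)


def run_disk_backed(values):
    """Write the intermediate result to a file, then read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stage1.json")
        with open(path, "w") as fh:
            json.dump(stage_double(values), fh)
        with open(path) as fh:
            intermediate = json.load(fh)
    return stage_sum(intermediate)


def run_in_memory(values):
    """Hand the intermediate result straight to the next stage."""
    return stage_sum(stage_double(values))


data = [1, 2, 3]
print(run_disk_backed(data), run_in_memory(data))
```

With a handful of numbers the difference is invisible; across many stages and large datasets, the intermediate I/O is where the time goes.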

The initial developer setup is reasonably straightforward – it's a bit messy if you're on a Windows client, but great if you're running a Linux client. Either way, you can be testing out ideas pretty quickly. Watch out when referencing files from a local installation: those files might not be available at the same location on every worker node once you scale up.
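One cheap safeguard is to flag node-local paths before a job leaves your laptop. The helper below is a hypothetical sketch, not part of Spark: it simply treats anything without a shared-storage URI scheme as local, and the list of schemes is an assumption you would adjust for your own cluster.

```python
from urllib.parse import urlparse

# Assumption: URI schemes that point at storage every worker can reach.
SHARED_SCHEMES = {"hdfs", "s3", "s3a", "s3n"}


def is_node_local(path):
    """Return True if the path is not on one of the shared schemes above.

    Bare paths, file:// URIs and Windows drive paths all count as local.
    """
    return urlparse(path).scheme.lower() not in SHARED_SCHEMES


for candidate in (
    "/data/trades.csv",
    "file:///data/trades.csv",
    "hdfs://namenode:8020/data/trades.csv",
):
    label = "node-local" if is_node_local(candidate) else "shared"
    print(f"{candidate}: {label}")
```

A check like this only catches the obvious cases; a network mount that looks like a plain path would still need to exist on every worker.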

The programming model isn’t just the standard ‘write a program that controls Spark and returns the results to you’ type. You can submit your transformation or query code from the command line (using spark-submit) and then retrieve the file(s) of results. You can also start the interactive shell (spark-shell for Scala, pyspark for Python) and enter one line at a time.
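As a rough sketch of the command-line route, the script below is a minimal word count written against PySpark's RDD API. It assumes PySpark is installed; the script name, function names and paths are hypothetical and chosen only for illustration. The Spark import sits inside the function so the small helper can be tried without a Spark installation.

```python
import sys
from operator import add


def split_words(line):
    """Lower-case a line and split it on whitespace."""
    return line.lower().split()


def run(input_path, output_path):
    """Count words in input_path and write (word, count) pairs to output_path."""
    # Imported here so split_words can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="WordCountSketch")
    try:
        counts = (
            sc.textFile(input_path)
            .flatMap(split_words)
            .map(lambda word: (word, 1))
            .reduceByKey(add)
        )
        # Spark writes a directory of part files, one per partition.
        counts.saveAsTextFile(output_path)
    finally:
        sc.stop()


# Only start a job when both paths are supplied on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    run(sys.argv[1], sys.argv[2])
```

Saved as, say, wordcount_sketch.py, it could be launched with `spark-submit wordcount_sketch.py <input> <output>`, after which the results sit in the output directory. The same chain of calls can be typed one line at a time into the pyspark shell, where a SparkContext is already available as `sc`.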

If SQL is your main interest, then there are other choices such as Cloudera Impala, which focuses solely on SQL over a Hadoop-based cluster and is faster for that workload. Facebook Presto also focuses on SQL processing and is even more impressive. If SQL is less important and you just need a graph database, then Titan is probably the best solution. If you simply require streaming, then maybe Apache Storm or Akka would be a better fit.

Yet the bottom line is that none of these other platforms has the breadth of capabilities of Apache Spark. Furthermore, if you're not sure what other requirements your big data grid may need to meet in a year’s time, or if a new requirement could arrive to move from batch-driven processing to streaming, then Spark may well be the safest choice.
