Spark Summit Europe 2016 review
Spark Summit Europe took place in Brussels, Belgium just about a week ago. I had a pleasure to be there for conference days where I attended mostly Data Science track, as this is our bread and butter in Semantive. This summit could be summarized in a couple of words: Spark 2.0 and Streaming, as it was the hottest topic around the sessions.
The conference opened with a keynote from Matei Zaharia about Spark 2.0 features and how fast is it due to Catalyst Optimizer, which can greatly optimize all transformations iff they are done using SQL, DataFrames or Datasets. RDDs do not benefit from optimizations because you can execute an arbitrary code in your functions so that Spark has no idea what is going on, whereas if you use DataFrames and you filter then Spark knows about a filter operation so it can use that knowledge to optimize execution.
The first keynote ended with Greg Owen showing a great demo of Spark Streaming, which processed tweets related to Brexit and performed sentiment analysis of each tweet. We looked at a popularity of tweets, which mentioned terms such as #theresa, #boris and surprisingly #marmite (a dark savory spread made from yeast extract and vegetable extract), which is loved by Brits.
Marmite had indeed positive sentiment and it happened to be more popular than expected in relation to Brexit tweets.
What’s interesting in the demo is that sentiment analysis was done simply in scikit-learn pipeline trained prior to the demo. This kind of mix of scikit-learn and Spark is a nice combination, where you have a trained model, which you just apply to huge amounts of incoming data.
In the next session, Ion Stoica talked about a history of Spark and the future, which is all about streaming, real-time decisions and security of data. Projects which are supposed to deliver the technology are Drizzle and Opaque. More info about it here.
Sessions I like the most are:
- Making the switch: predictive maintenance on railway switches
- Vegas, the Missing MatPlotLib for Spark
- OrderedRDD: A distributed time series analysis framework for Spark
- Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs
- Extreme-scale Ad-Tech using Spark and Databricks at MediaMath
- Prediction as a Service with Ensemble Model Trained in SparkML on 1 billion Observed Flight Prices Daily
And here are my key takeaways from each one:
Chris Pool and Jeroen Vlek from Anchorman talked about their recent project for a Dutch railway subcontractor Strukton. Thing is, you need to keep switches which move rails working. Otherwise, you get delays and cancellations of trains. Dutch railway has a reward-fine policy so that a subcontractor gets a reward when it’s all working fine, but gets a fine if there are malfunctions. In the end, it’s better to replace or repair a potentially malfunctioning switch than to pay a fine.
Anchorman had data about switches gathered from sensors. They were supposed to predict switch failure using that data, which happen once or twice a year. They came up with a simple model, but not a simplistic one, because the client wanted to actually understand all the features so that they used really simple features, which represent time-series of a flip, such as min, max, average, length per each segment in the curve of a flip. The model was also supposed to be understandable so they used a decision tree model, which is explainable and can be visualized. They used Spark for all data preparation, model training, and running predictions. They normalized data using a sliding window so that seasonality is not discarded. The whole project right now is just MS-SQL and Spark.
Guys from Netflix presented a plotting library Vegas, which is declarative so that most of the goodness comes out of the box and it is made for Scala. They based their solution on Vega, which is JS plotting lib. Vegas can be used from a Zeppelin notebook or a Scala console. It has built-in time-series support and of course Spark support. Charts look good so just take a look at their docs!
This was a funny one, presented by Larisa Sawyer from Two Sigma Investments. She presented a library Flint for working with time-series data, which allows you effectively do temporal joins. Temporal join is a join which just matches criteria over time and it can look forward or look backward to find a match. Compared to spark-ts, Flint can work with time-series which do not fit into a single machine and it has support for streaming.
TimeSeriesRDD, a new RDD that comes with Flint, preserves temporal order so that after temporal joins data stays in order so that sorts and shuffles are not required. This yields 50-100 times speed up compared to just working with a standard RDDs and time-series data.
And what is funny about that? Well, it’s just how Larisa presented and made jokes about trading based on Anne Hathaway’s tweets.
This reminded me of my time in CERN. Luca Canali presented how flame graphs can help you to investigate where CPU cycles are consumed. This technique requires some experience to know what to look at, but if you do, then you can understand what is going on. Take a look at Brendan Gregg’s resources on flame graphs to learn more.
Prasad Chalasani presented a truly impressive use of Spark for optimized bidding on ad campaigns, which maximize impressions, such as ad clicks or page visits. They have massive amounts of data, as there are over 200 billion daily ad-opportunities, 1 billion of users, millions of features and this all sums up to terabytes of data a day.
Prediction as a Service with Ensemble Model Trained in SparkML on 1 billion Observed Flight Prices Daily
Another large-scale presentation was given by Josef Habdank from Infare Solutions. Josef talked about predicting airfares. I want to briefly summarize his solution: There are billions of time-series and we need to predict what will be next in each of them. Given a variety of time-series, it is better to cluster them into tens of thousands of clusters and train a prediction model for each cluster. However, clustering time-series is a problem itself because it has many features, many data points in time. To solve this problem, Josef first applied Linear Regression model to each time series and used coefficients of each model, as features for clustering. This way he had a few amount of features in a clustering algorithm. He used Guassian Mixture Model for the clustering. In the end, clustering segmented the space into smaller subspaces.
The next challenge is to train models for all subspaces in parallel. Josef said that simple models such as Linear Regression worked well. He also added extra features to time-series such as average price between all connections on a particular route like London-Berlin. The key functionality of Spark which allowed to train all models at the same time is collect_list, which allowed merging millions of rows into a single row with millions of values, to which an UDF function was applied that performed model training. The elegance of the solution is brilliant.
It was a great conference and I’ve learned quite a bit. I had a quick tour of Brussels, must-eat fries and waffles.
Spark 2.0 is a hit, which delivers its promise of being fast with all the optimizations that come from Catalyst optimizer and project Tungsten. We will have to wait a bit till streaming connectors arrive, to fully utilize structured streaming and performance gains, but that should happen by end of the year. Merci beaucoup Databricks!