Spark Summit Europe 2017 just concluded here in Dublin. More than 102 speakers and 1,200 attendees, along with an impressive Databricks team, joined the three-day celebration.
Spark is reaching a new phase: more people are interested in monitoring, optimizing, and extending it. It is a clear sign that our favorite Apache project is gaining in maturity.
Many sessions covered benchmarking and performance, including a new version of Spark Bench, built and open sourced by IBM and Emily Curtin (@emilymaycurtin)’s team (from Atlanta, GA, as ATL matters to Emily). This impressive tool can test various configurations of Apache Spark (and variables within each configuration), so you can “automagically” find a good, near-optimal configuration for your workload. I absolutely need to convince my Product Owner to allocate time for this!
I attended sessions from CERN’s Luca Canali (@LucaCanaliDB) and Jakub Wozniak. The CERN team gave a few sessions on optimizing, architecting, benchmarking, and running Spark in production… with Java. Yes, in production with Spark and Java. Their goal is to process 900 GB of data per day. Fair enough. You need to do what you have to do to help Sheldon Cooper, right?
Holden Karau (@holdenkarau), Boo (@BooProgrammer), and Nick Pentreath (@MLnick) talked about how to extend ML pipelines and add your own algorithms, as the Spark team clearly will never be able to add them all. I contributed with a talk titled “Extending Apache Spark’s Ingestion: Building Your Own Java Data Source,” also in the general field of extending the product.
The ecosystem is maturing too: new products are appearing, like Databricks Delta, announced by Matei Zaharia (@matei_zaharia); IBM Event Store, which preceded it by a few months; and GridGain’s commercial support for Apache Ignite. All are in-memory databases that plug into Spark (OK, I oversimplify): there is a need for a database attached closer to the processing engine.
Testing is also on people’s minds, with some valuable examples and frameworks for both batch and streaming in Holden’s first presentation, “Testing Apache Spark—Avoiding the Fail Boat Beyond RDDs.”
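One broader point such testing talks make is worth illustrating: transformation logic that is factored into plain functions can be unit tested without a cluster or even a SparkSession, then reused inside a Spark job. A minimal sketch in Java (the class and method names are mine, not from the talk):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: keep per-record logic out of Spark-specific types so a
// plain unit test can exercise it; the same method can later be called from
// a Dataset.map(...) or an RDD transformation.
public class TransformLogic {

    // Count word occurrences in a batch of lines.
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        // A "test" that needs no cluster: feed a tiny batch, check the counts.
        Map<String, Long> counts = wordCount(List.of("spark spark summit", "summit"));
        System.out.println(counts.get("spark") + " " + counts.get("summit"));
        // prints "2 2"
    }
}
```

The design choice is the point, not the word count: the smaller the surface area that actually touches Spark, the less of your code needs a fail-boat-prone integration test.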
Monitoring is on everyone’s lips, but no new tool is actually out. However, Red Hat’s Michael McCune demonstrated an interface with Prometheus, and Luca explained how you can easily access Spark’s logs from within Spark itself, in a dataframe.
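Reading Spark’s own logs back into a dataframe starts with parsing the log lines, which is ordinary string work. A hedged, Spark-free sketch of that first step (the log format shown and all names here are my assumptions, not from Luca’s talk):

```java
import java.util.Optional;
import java.util.Set;

// Illustrative sketch: pull the log level out of a typical Spark log line,
// e.g. "17/10/25 09:14:02 INFO SparkContext: Running Spark version 2.2.0".
// Once each line parses into fields like this, loading them into a dataframe
// (one column per field) is straightforward.
public class LogLineParser {

    private static final Set<String> LEVELS = Set.of("INFO", "WARN", "ERROR", "DEBUG");

    // Returns the log level if the line matches the expected shape:
    // date, time, level, source, message...
    public static Optional<String> level(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length >= 3 && LEVELS.contains(parts[2])) {
            return Optional.of(parts[2]);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String sample = "17/10/25 09:14:02 INFO SparkContext: Running Spark version 2.2.0";
        System.out.println(level(sample).orElse("?"));
        // prints "INFO"
    }
}
```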
Data science is alive and kicking, with more and more tips and tricks, including a few books… and some not-so-subtle self-promotion (and no, I am not thinking about you, Holden).
All these signs clearly show that the product is maturing and that its users are becoming more demanding.
The community is also reinforcing itself with the help of Jules Damji (@2twitme). We will try to make next year even more interesting for the growing community; in fact, it starts as soon as this December in the Triangle area.
It is clear that Databricks and IBM, the major contributors to Spark, now need to foster this growing and awesome community. Meetup memberships around the world have almost doubled since San Francisco’s Spark Summit back in June, but are meetups enough?