Next month, I’ll be heading to Dublin, the capital of Ireland. I have been to Ireland quite a few times – the first time when I was 3. This time, however, I am going there on a mission: to convince more people to use Java with Apache Spark, especially for data ingestion.
As a quick side note, Ireland is special to me: it is the first foreign country I have memories of – a little boy walking on a wall in a green garden near the sea. I know, it sounds very postcardish. Ireland is also, first and foremost, the country where I learnt English: I went there a few times, alone, as a tween (as we say now). Thanks again to Margaret, Les, and their son, Julian, for coping with me and my few blunders there.
There are many reasons why I love Spark. Some of my favorite features involve data ingestion: out of the box, Spark can read from CSV, Hive, JDBC, and more. However, you may have your own data sources or formats you want to use (e.g., anything on the HL7 bandwagon). One solution would be to convert your data to a CSV or JSON file and then ask Spark to ingest it through its built-in tools. This process can be very expensive in terms of I/O, subject to corruption and data loss, and a potential security issue. In my talk, I will explore how to build a custom data source, in Java, extending Spark’s ingestion capabilities. You will be able to reuse the code to plug in any data source and convert it into a Dataset&lt;Row&gt;, aka a dataframe.
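To give a flavor of what the talk covers, here is a minimal sketch of what a custom data source can look like with Spark 2.x’s data source API (the `RelationProvider` and `TableScan` interfaces). The class names, the schema, and the hard-coded rows are all illustrative assumptions on my part, not the talk’s actual code, and it assumes Spark 2.x is on your classpath.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.sources.BaseRelation;
import org.apache.spark.sql.sources.RelationProvider;
import org.apache.spark.sql.sources.TableScan;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical provider: Spark instantiates it when you call
// spark.read().format("com.example.PatientDataSource").load()
public class PatientDataSource implements RelationProvider {
  @Override
  public BaseRelation createRelation(SQLContext sqlContext,
      scala.collection.immutable.Map<String, String> parameters) {
    return new PatientRelation(sqlContext);
  }
}

// The relation describes the schema and knows how to produce the rows.
class PatientRelation extends BaseRelation implements TableScan, Serializable {
  private final SQLContext sqlContext;

  PatientRelation(SQLContext sqlContext) {
    this.sqlContext = sqlContext;
  }

  @Override
  public SQLContext sqlContext() {
    return sqlContext;
  }

  @Override
  public StructType schema() {
    // Illustrative schema; a real HL7-style source would expose richer fields.
    return new StructType()
        .add("id", DataTypes.IntegerType, false)
        .add("name", DataTypes.StringType, true);
  }

  @Override
  public RDD<Row> buildScan() {
    // In a real source, you would read and parse your proprietary format here.
    List<Row> rows = Arrays.asList(
        RowFactory.create(1, "Alice"),
        RowFactory.create(2, "Bob"));
    JavaSparkContext jsc =
        JavaSparkContext.fromSparkContext(sqlContext.sparkContext());
    return jsc.parallelize(rows).rdd();
  }
}
```

Reading it back would then look like `Dataset<Row> df = spark.read().format("com.example.PatientDataSource").load();` – a dataframe, with no intermediate CSV or JSON round trip.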
Spark Summit Europe 2017 will be held from October 24th to October 26th in Dublin, Ireland. If you are interested in hearing more about my talk, use discount code SPK20 to get 20% off. If you are not interested in my talk, come anyway and don’t use discount code SPK20 – just pay full price…
See you in Dublin! Leave a message in the comments to schedule an enjoyable Guinness together; if you do not like Guinness, that’s totally OK – it’s completely different (read: good) in Dublin anyway.