Successful first deployment of Apache Spark on a production server. Yep… I can add that line to my resume. Right now, we have allocated 24 cores and 72 GB of memory to this very powerful engine. The data being analyzed is over 50 quadrillion datapoints (that is 50·10¹⁵ datapoints). This nice 2U server just beat my development Mac by a factor of ∞: the process never finished on the Mac, while it takes 10 hours on the new server.

But wait… Micha has two sisters, Gonza and Hopla, as well as a monitoring cousin, Mocka. Yes, you’re right, there is no need for a monitoring server in a Spark architecture, but Mocka’s role is also to host the database, and she will be in charge of other tasks, leaving the three damsels to focus on data crunching in their nice little computational cluster.
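For the curious, an allocation like the one above could be expressed at submit time roughly like this. This is only a sketch: the master URL, host name (`micha`), and job script are placeholders, and it assumes a Spark standalone cluster with a single worker per box (so one big executor holds the whole 72 GB).

```shell
# Sketch: cap the job at 24 cores / 72 GB on a standalone master.
# Host name, port, and script name are placeholders, not our real setup.
spark-submit \
  --master spark://micha:7077 \
  --total-executor-cores 24 \
  --executor-memory 72g \
  crunch_job.py
```

In practice you would likely split that memory across several smaller executors rather than one giant JVM heap, but the totals are what matter for the benchmark.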
We will not touch the algorithms “too much” in the future, to make sure the benchmark still means something when we add the sisters to the mix. That will be a total of 72 cores and 216 GB of memory. Why so much? We expect the data to multiply by 10 in the next 6 months, with a natural growth rate of 2 or 3 per year after that. We will see down the road. Also, our ML (Machine Learning) algorithms and models are still pretty primitive, and we know we will need more power. The data needs to be processed in less than 24 hours, as new data comes in every 24 hours…
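To make the sizing argument concrete, here is a back-of-envelope projection using the numbers above (50·10¹⁵ datapoints today, ×10 in 6 months, then a ×2.5 yearly growth rate, taken as the midpoint of “2 or 3”). Everything here is arithmetic on the post’s own figures, nothing more.

```python
# Back-of-envelope capacity projection from the numbers in this post.
current = 50e15           # datapoints per daily batch today
after_6_months = current * 10
yearly_growth = 2.5       # midpoint of the "2 or 3" yearly growth rate

def projected(years):
    """Datapoints per daily batch, `years` years after the 10x jump."""
    return after_6_months * yearly_growth ** years

print(f"at the jump: {projected(0):.2e}")   # 5e17 datapoints
print(f"two years later: {projected(2):.2e}")
```

Two years after the jump, we would be crunching roughly 3·10¹⁸ datapoints per day, over 60× today’s volume, which is why tripling the cores now is the cautious move, not the extravagant one.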
Another way to boost performance is to use a more powerful datastore like Informix. You can read more in my friend Pradeep’s presentation. Although our data is not coming from IoT, it fits the same model.
Another solution would be to use MongoDB’s new Spark connector.
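For a taste of what that would look like, here is a minimal read sketch. It assumes the MongoDB Spark connector v10+ (which registers the `mongodb` data source short name and the `spark.mongodb.read.connection.uri` setting); the host, database, and collection names are placeholders, and the connector JAR itself would be supplied via `--packages` at submit time.

```python
# Sketch: reading a MongoDB collection into a Spark DataFrame.
# Assumes MongoDB Spark connector v10+; URI and names are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mongo-read-sketch")
         .config("spark.mongodb.read.connection.uri",
                 "mongodb://mocka:27017/analytics.datapoints")
         .getOrCreate())

df = spark.read.format("mongodb").load()
df.printSchema()
```

The appeal is that the connector pushes filters down to MongoDB, so the sisters would only pull the datapoints they actually need.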
More on this great project coming up soon.