Spark Streaming reduceByKeyAndWindow: unstable application

I wanted to run a job 24×7 that reports whenever certain keywords occur more than N times in the stream. Spark Streaming looked like an ideal candidate for this task, and Spark has a reduceByKeyAndWindow function, which was exactly what I was looking for.

I decided to use a window length of 30 minutes and a slide interval of 1 minute. I had assumed that this would discard all the keys every 30 minutes. While running the job I noticed that memory usage and processing time kept increasing to the point that the application could never be stable, i.e. each batch took longer to process than the slide interval, so new jobs kept getting added to the processing queue.
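For context, here is a minimal sketch of roughly what the job looked like at this point, assuming a PySpark StreamingContext reading from a socket source; the source, batch interval, and app name are illustrative:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="KeywordCounter")   # illustrative app name
    ssc = StreamingContext(sc, 10)                # 10-second batches (illustrative)
    ssc.checkpoint("/tmp/keyword-counter")        # windowed state requires checkpointing

    # Hypothetical source; in practice this could be Kafka, Flume, etc.
    words = ssc.socketTextStream("localhost", 9999) \
               .flatMap(lambda line: line.split()) \
               .map(lambda word: (word, 1))

    def reduceRows(a, b):
        return a + b    # add counts for elements entering the window

    def invReduceRows(a, b):
        return a - b    # subtract counts for elements leaving the window

    # 30-minute window, sliding every minute. Without a filter function,
    # keys whose counts drop to zero are never removed from the window
    # state, so memory grows with every distinct key ever seen.
    counts = words.reduceByKeyAndWindow(reduceRows, invReduceRows, 1800, 60)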

The documentation on this function is so sparse that I had to ask on the mailing list for help. After a few days someone responded that I needed to add a filter function that discards the unwanted keys. The actual function call now looks something like this:

reduceByKeyAndWindow(reduceRows, invReduceRows, 1800, 60, filterFunc=filterOldKeys)

filterOldKeys is a simple function in which I check whether the count for a key is greater than 0; keys whose windowed counts have dropped to zero are removed from the window state, so memory stays bounded. This solved my problem.
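In PySpark, filterFunc receives each (key, value) pair from the window state and must return True for pairs that should be retained. Here is a minimal sketch continuing the code above; the threshold N and the reporting step are my own illustration of the "more than N times" check:

    def filterOldKeys(pair):
        key, count = pair
        return count > 0    # retain only keys still present in the window

    counts = words.reduceByKeyAndWindow(
        reduceRows, invReduceRows, 1800, 60, filterFunc=filterOldKeys)

    # Report keywords seen more than N times in the current window.
    N = 100
    counts.filter(lambda pair: pair[1] > N).pprint()

    ssc.start()
    ssc.awaitTermination()

With the inverse reduce function in place, a key's count falls back to 0 once all of its occurrences slide out of the window, so this filter is enough to keep the state from growing without bound.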