Random notes from crashing and hanging EMR Spark jobs

It sometimes happens that your EMR job crashes or hangs indefinitely with no meaningful log. You can try to capture a memory dump, but that is not very useful when your cluster machines have hundreds of gigabytes of memory each. Below are “fixes” that worked for me.

  • If the job just crashes with a lost slave or lost task, make sure that you are not running out of memory, especially when you broadcast variables
  • Disable Kryo. In my case it caused application crashes; I have no clue what was wrong, but the default Java serializer didn’t have this problem. What’s more, Kryo was slower
  • Use one core per executor by setting spark.executor.cores to 1; it helps when the job hangs in the middle (the configuration sketch after this list shows these settings)
  • Call System.exit(0) when your job is done; sometimes the step doesn’t terminate even though everything has finished (see the job skeleton after this list)
  • Do not use cache or persist
  • If you overwrite files in S3, make sure that you remove them early. It looks like you can sometimes get a “file already exists” error even though you removed the file (see the S3 cleanup sketch after this list)
  • Catch Throwable at the top of your job. I know that this is generally a bad idea, but otherwise you may not get any logs when an OutOfMemoryError occurs (the job skeleton after this list shows this wrapper)
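
Several of the tips above are plain Spark configuration. Here is a minimal sketch of how they could be set in code; the property names (spark.serializer, spark.executor.cores, spark.executor.memory, spark.executor.memoryOverhead) are standard Spark settings, but the values are placeholders to tune for your own cluster, and on EMR you would typically pass them via spark-submit --conf or the step configuration instead.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Placeholder values; tune them for your own cluster.
val conf = new SparkConf()
  // Stick with the default Java serializer instead of Kryo
  // (this is already the default, so simply not enabling Kryo is enough).
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
  // One core per executor helped when jobs hung in the middle.
  .set("spark.executor.cores", "1")
  // Leave headroom for broadcast variables and other overhead.
  .set("spark.executor.memory", "8g")
  .set("spark.executor.memoryOverhead", "2g")

val spark = SparkSession.builder().config(conf).getOrCreate()
```

Note that these settings only take effect if they are applied before the SparkContext is created, which is one reason passing them on the command line is usually safer.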
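
For the S3 overwrite issue, one option is to delete the old output right at the start of the job using the Hadoop FileSystem API, instead of relying on the write at the end to overwrite the files. This is only a sketch: the bucket and prefix are placeholders, the result DataFrame is a stand-in for the real output, and spark is assumed to be an existing SparkSession.

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// The output path is a placeholder; "spark" is an existing SparkSession.
val output = "s3://my-bucket/path/to/output"

// Remove the previous output as early as possible, long before the write.
val fs = FileSystem.get(new URI(output), spark.sparkContext.hadoopConfiguration)
fs.delete(new Path(output), true) // recursive; returns false if nothing was there

// ... the actual computation ...
val result = spark.range(10).toDF("id") // stand-in for the real job output

result.write.parquet(output)
```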
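
Finally, a minimal job skeleton that combines catching Throwable at the top level with the explicit System.exit(0) at the end; the object name and the placeholder work are made up.

```scala
import org.apache.spark.sql.SparkSession

object MyEmrJob { // hypothetical name
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("my-emr-job").getOrCreate()
    try {
      // ... the real job goes here ...
      spark.range(1000).count() // placeholder work
    } catch {
      // Catching Throwable is normally a bad idea, but without it an
      // OutOfMemoryError can kill the job without leaving anything in the logs.
      case t: Throwable =>
        t.printStackTrace()
        spark.stop()
        System.exit(1) // non-zero so the EMR step is marked as failed
    }
    spark.stop()
    // The step sometimes does not terminate even though all work is done;
    // exiting explicitly works around that.
    System.exit(0)
  }
}
```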