It sometimes happens that your EMR job crashes or hangs indefinitely with no meaningful log. You can try to capture a memory dump, but it is not very useful when your cluster machines have hundreds of gigabytes of memory each. Below are “fixes” which worked for me.

  • If it just crashes with a lost slave or lost task, make sure that you are not running out of memory, especially when you broadcast variables
  • Disable Kryo. In my case it caused application crashes. No clue what was wrong, but the default Java serializer didn’t have this problem. What’s more, Kryo was slower
  • Use one core per executor by setting spark.executor.cores to 1. It helps when the job hangs in the middle
  • Execute System.exit(0) at the end when your job is done. Sometimes the step doesn’t terminate even though everything is done
  • Do not use cache or persist
  • If you overwrite files in S3, make sure that you remove them early. It looks like sometimes you can get an error that the file already exists even though you removed it
  • Catch Throwable at the top of your job. I know that it is generally a bad idea, but otherwise you may not get any logs when you get an OutOfMemoryError
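
Several of the points above (default serializer, one core per executor, catching Throwable, explicit exit) can be combined in a small wrapper around the job. The following is only a minimal sketch in plain Java, not the code I actually ran: `JobRunner`, `conservativeSettings` and `runJob` are hypothetical names, the job body is a placeholder, and returning a non-zero exit code on failure is my assumption. The two property names are standard Spark settings; each entry could be passed to `SparkConf.set` or as a `--conf` flag to `spark-submit`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Spark or EMR.
final class JobRunner {

    // Settings matching the list above: the default Java serializer
    // instead of Kryo, and a single core per executor.
    static Map<String, String> conservativeSettings() {
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("spark.serializer", "org.apache.spark.serializer.JavaSerializer");
        settings.put("spark.executor.cores", "1");
        return settings;
    }

    // Runs the job and turns any Throwable, including OutOfMemoryError,
    // into a printed stack trace and a non-zero exit code.
    static int runJob(Runnable job) {
        try {
            job.run();
            return 0;
        } catch (Throwable t) {
            System.err.println("Job failed: " + t);
            t.printStackTrace();
            System.err.flush();
            return 1;
        }
    }

    public static void main(String[] args) {
        int exitCode = runJob(() -> {
            // Placeholder: build the Spark session with conservativeSettings()
            // and run the actual job here.
        });
        // Exit explicitly so the JVM terminates once the job is done.
        System.exit(exitCode);
    }
}
```

Keeping `runJob` separate from `main` means the error handling can be exercised without actually terminating the JVM. Note that printing inside the catch block may itself fail if the heap is completely exhausted, so treat the logged stack trace as best effort.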