Recently I was debugging the following crash in Spark: Disabling Kryo solves the issue. To do that, set spark.serializer to org.apache.spark.serializer.JavaSerializer. Another workaround is to change Kryo's reference management, as explained on GitHub:
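For the first workaround, here is a minimal sketch of the setting. It assumes the property lives in the cluster's spark-defaults.conf, which is an assumption on my side; the same key and value can also be passed to spark-submit with --conf or set on a SparkConf in code.

```properties
# spark-defaults.conf: switch from Kryo back to Java serialization
spark.serializer    org.apache.spark.serializer.JavaSerializer
```

Java serialization is generally slower and produces larger output than Kryo, so this trades some performance for avoiding the crash.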
Tag: Spark
Spark and NullPointerException in UTF8String.contains
Recently I was debugging a NullPointerException in Spark. The stack trace indicated this: After some digging I found out that the following query causes the problem: If I commented out the line marked with the comment, the NPE was no longer there. Also, when I replaced either df2("ref") or df1("ref") with lit("ref"), it was not crashing …
Machine Learning Part 1 — Linear regression in MXNet
This is the first part of the Machine Learning series. For your convenience you can find other parts using the links below (or by guessing the address): Part 1 — Linear regression in MXNet Part 2 — Linear regression in SQL Part 3 — Linear regression in SQL revisited Part 4 — Linear regression in …
Random notes from crashing and hanging EMR Spark job
It sometimes happens that your EMR job crashes or hangs indefinitely with no meaningful log. You can try to capture a memory dump, but that is not very useful when each of your cluster machines has hundreds of gigabytes of memory. Below are “fixes” that worked for me. If it just crashes with a lost slave or lost task, …
Investigating AWS SDK conflicts in EMR
When you deploy your package to Amazon Elastic MapReduce (EMR), you can access the AWS SDK provided by the platform. This gets tricky if you compile your code against a different version of the SDK, because you may then get very cryptic errors at runtime, such as a class not being found or a method not existing. You should always …