Spark and NegativeArraySizeException

June 22, 2019 / April 22, 2019 ~ afish

Recently I was debugging a NegativeArraySizeException crash in Spark. Disabling Kryo solves the issue: just set spark.serializer to org.apache.spark.serializer.JavaSerializer. Another workaround is to change Kryo's reference management, as explained on GitHub.
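As a minimal sketch of the first workaround (switching away from Kryo), the serializer can be set on the SparkConf before the session is created. The object name and application name below are made up for illustration, and the snippet assumes it is launched through spark-submit, which supplies the master URL.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical example object; only the spark.serializer setting comes from the post.
object JavaSerializerExample {
  def main(args: Array[String]): Unit = {
    // The serializer has to be configured before the SparkContext starts;
    // changing it on an already running context has no effect.
    val conf = new SparkConf()
      .setAppName("java-serializer-example") // assumed name, pick your own
      .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    // Print the effective setting to confirm Kryo is no longer in use.
    println(spark.sparkContext.getConf.get("spark.serializer"))

    spark.stop()
  }
}
```

The same setting can be passed without touching the code, for example with `--conf spark.serializer=org.apache.spark.serializer.JavaSerializer` on the spark-submit command line. Java serialization is usually slower and produces larger output than Kryo, so this is a trade-off rather than a free fix.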

Spark and NullPointerException in UTF8String.contains

June 15, 2019 / January 16, 2019 ~ afish

Recently I was debugging a NullPointerException in Spark. The stack trace pointed at UTF8String.contains. After some digging I found the query that caused the problem. If I commented out the line marked with a comment, the NPE was no longer there. Also, when I replaced either df2("ref") or df1("ref") with lit("ref"), it was not crashing …

Machine Learning Part 1 — Linear regression in MXNet

October 20, 2018 / July 27, 2019 ~ afish ~ 4 Comments

This is the first part of the Machine Learning series. For your convenience you can find other parts using the links below (or by guessing the address):
Part 1 — Linear regression in MXNet
Part 2 — Linear regression in SQL
Part 3 — Linear regression in SQL revisited
Part 4 — Linear regression in …

Random notes from crashing and hanging EMR Spark job

September 1, 2018 / October 24, 2018 ~ afish

It sometimes happens that your EMR job crashes or hangs indefinitely with no meaningful log. You can try to capture a memory dump, but it is not very useful when your cluster machines have hundreds of gigabytes of memory each. Below are "fixes" which worked for me. If it just crashes with lost slave or lost task, …

Investigating AWS SDK conflicts in EMR

August 25, 2018 / July 11, 2018 ~ afish

When you deploy your package to Amazon Elastic MapReduce (EMR), you can access the AWS SDK provided by the platform. This gets tricky if you compile your code against a different version of the SDK, because then you may get very cryptic errors at runtime, such as a class not being found or a method not existing. You should always …
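One generic way to start investigating such a conflict is to ask the JVM which jar a class was actually loaded from. The sketch below is not taken from the full post; the object and method names are made up for illustration, and the class name to inspect is passed as an argument (any AWS SDK class on your classpath would do).

```scala
// Hypothetical helper: reports where the JVM loaded a given class from.
object WhichJar {
  // Returns the jar or directory URL for the class, or a placeholder when
  // the JVM does not expose one (for example for core JDK classes).
  def locationOf(cls: Class[_]): String =
    Option(cls.getProtectionDomain.getCodeSource)
      .flatMap(source => Option(source.getLocation))
      .map(_.toString)
      .getOrElse("<bootstrap or unknown>")

  def main(args: Array[String]): Unit = {
    // Defaults to a Scala library class so the example runs without extra jars.
    val className = args.headOption.getOrElse("scala.Option")
    println(s"$className loaded from ${locationOf(Class.forName(className))}")
  }
}
```

Running this on the cluster with the same classpath as the job shows whether the class comes from your own package or from a jar provided by the platform.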