Recently I was debugging a NullPointerException
in Spark. The stacktrace looked like this:
java.lang.NullPointerException
	at org.apache.spark.unsafe.types.UTF8String.contains(UTF8String.java:284)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
After some digging I found out that the following query causes the problem:
df1
  .join(df2,
    df1("id1") === df2("id2") &&
    !isnull(df1("ref")) &&
    !isnull(df2("ref")) &&
    df2("ref").contains(df1("ref")) // <--- this is the problem
    , "left_outer"
  )
  .drop("id2")
If I commented out the marked line, the NPE was gone. Also, when I replaced either df2("ref")
or df1("ref")
with lit("ref")
it did not crash either, so there was something wrong with contains
operating on columns from the two DataFrames.
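
To make that comparison concrete, here is a minimal sketch of the non-crashing lit variant, as it could be pasted into spark-shell. The toy contents of df1 and df2 below are made up just so the snippet stands on its own; only the column names match the real case.

import org.apache.spark.sql.functions.{isnull, lit}

// Made-up stand-ins for the real DataFrames, with the same column names.
val df1 = Seq((1, "abc"), (2, "xyz")).toDF("id1", "ref")
val df2 = Seq((1, "xxabcxx"), (3, "uvw")).toDF("id2", "ref")

// Same join shape as above, but with df1("ref") replaced by a literal column.
// With lit(...) on one side of contains, the query did not hit the NPE.
val joined = df1
  .join(df2,
    df1("id1") === df2("id2") &&
    !isnull(df1("ref")) &&
    !isnull(df2("ref")) &&
    df2("ref").contains(lit("ref")) // lit(...) instead of df1("ref")
    , "left_outer"
  )
  .drop("id2")

joined.show()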
In my case removing the cache helped: I was caching df2
with the cache()
method before running the join, and when I removed the caching the problem disappeared. This was Spark 2.1.0 with EMR 5.5.3.
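
Finally, a sketch of where the cache() call sat, continuing with the toy df1 and df2 from the snippet above. This only illustrates the shape of the code, not the original pipeline; whether toy data actually reproduces the crash depends on the Spark version and the data itself.

import org.apache.spark.sql.functions.isnull

// The condition that triggered the problem: column-to-column contains.
val cond = df1("id1") === df2("id2") &&
  !isnull(df1("ref")) &&
  !isnull(df2("ref")) &&
  df2("ref").contains(df1("ref"))

// Crashing variant in my setup: df2 cached before the join.
// df2.cache()
// df1.join(df2, cond, "left_outer").drop("id2").show() // NPE in UTF8String.contains

// Working variant: the same join with no cache() on df2.
df1.join(df2, cond, "left_outer").drop("id2").show()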