Recently I was debugging a NullPointerException in Spark. The stacktrace was indicating this:

After some digging I found out that the following query causes the problem:

If I commented out the line with the comment the NPE was no longer there. Also, when I replaced either df2("ref") or df1("ref") with lit("ref") it was not crashing as well so there was something wrong with the contains running on two dataframes.

In my case removing the cache helped — I was caching df2 with cache() method before running the join. When I removed the caching the problem disappeared. Spark version 2.1.0 with EMR 5.5.3.