RuntimeError: generator raised StopIteration in PySpark in Python 3.7.3
https://blog.adamfurmanek.pl/2020/08/15/runtimeerror-generator-raised-stopiteration-in-pyspark-in-python-3-7-3/
Sat, 15 Aug 2020 08:00:28 +0000

Recently I was debugging this simple PySpark code:

someDataframe = ...  # collectAsMap() comes from the pair-RDD API, so this is effectively an RDD of key-value pairs
someDict = someDataframe.collectAsMap()
someOtherDataframe.filter(lambda x: x in someDict).take(1)

First, we get some DataFrame. Next, we collect it to a dictionary. It doesn’t matter how you create the dictionary — a set or a list would behave the same way. Finally, we filter another dataset with a lambda that uses the in operator, and take the first element.
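Stripped of Spark, the intended logic is plain dictionary-membership filtering. Here is a minimal plain-Python sketch of the same idea (the data and names are made up for illustration):

```python
# Plain-Python analogue of the snippet above, no Spark required.
# Build a lookup dict from one dataset (like collectAsMap), then keep
# only the elements of another dataset that appear as keys in it.
some_dict = dict([("a", 1), ("b", 2)])  # stands in for collectAsMap()
other_data = ["a", "c", "b"]            # stands in for someOtherDataframe

filtered = [x for x in other_data if x in some_dict]
print(filtered)  # ['a', 'b']
```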

I was running this in Python 3.7.3 and was getting this exception:

[Stage 8:>                                                          (0 + 1) / 1]20/02/20 20:55:37 WARN TaskSetManager: Lost task 0.0 in stage 8.0 (TID 343, ip-10-0-0-2.ec2.internal, executor 20): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/rdd.py", line 1371, in takeUpToNumLeft
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1/container_1/pyspark.zip/pyspark/worker.py", line 230, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1/container_1/pyspark.zip/pyspark/worker.py", line 225, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1/container_1/pyspark.zip/pyspark/serializers.py", line 372, in dump_stream
    vs = list(itertools.islice(iterator, batch))
RuntimeError: generator raised StopIteration

It looks like this is related to the in operator, as replacing the lambda with something like list(someDict.keys()).count(x) >= 0 made the exceptions stop.
This is due to the new generator behavior introduced by PEP 479 in Python 3.7: a StopIteration that escapes inside a generator is now re-raised as a RuntimeError instead of silently ending iteration.
Similar issues were reported here, here and here.
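The failure can be reproduced without Spark at all. The traceback points at takeUpToNumLeft in pyspark/rdd.py, which used a pattern much like the one below; this is a simplified sketch, not Spark’s actual code:

```python
# Simplified sketch of the pattern that broke: calling next() inside a
# generator lets the inner StopIteration escape when the iterator runs
# dry, and Python 3.7 converts that into a RuntimeError (PEP 479).
def take_up_to(iterator, n):
    taken = 0
    while taken < n:
        yield next(iterator)  # StopIteration escapes here when exhausted
        taken += 1

try:
    list(take_up_to(iter([1]), 5))  # ask for more elements than exist
except RuntimeError as e:
    print(e)  # generator raised StopIteration
```

On Python 3.6 and earlier the same code would quietly return a short list; from 3.7 on, the escaping StopIteration is treated as a bug in the generator itself.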
