RuntimeError: generator raised StopIteration in PySpark in Python 3.7.3
Recently I was debugging this simple PySpark code: First, we get some DataFrame. Next, we collect it into a dictionary. It doesn’t matter how you create the dictionary; it could just as well be a set or a list. Finally, we do some filtering with a lambda that uses the in operator. I was running this in Python 3.7.3 and …
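The excerpt is cut off before it says what triggered the error, so the following is not the code from the post. It is a minimal pure-Python sketch of the general mechanism behind the message in the title: since Python 3.7 (PEP 479), a StopIteration that escapes inside a generator body is converted into a RuntimeError. The function names `first_match` and `lookup_all` are hypothetical and used only for illustration.

```python
def first_match(items, wanted):
    # next() without a default raises StopIteration when nothing matches.
    return next(x for x in items if x == wanted)


def lookup_all(items, queries):
    # This is a generator. If first_match() raises StopIteration here,
    # the exception escapes inside the generator body, and Python 3.7+
    # turns it into a RuntimeError (PEP 479).
    for q in queries:
        yield first_match(items, q)


try:
    # 99 is not in the list, so the second lookup fails.
    list(lookup_all([1, 2, 3], [1, 99]))
    outcome = "no error"
except RuntimeError as exc:
    outcome = str(exc)

print(outcome)
```

Passing a default to `next()` (for example `next(..., None)`) is one way to keep the StopIteration from escaping in a sketch like this.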
Tag: Spark
Data encryption in s3 in Spark in EMR with multiple encryption schemes
Spark supports multiple encryption schemes. You can use client-side encryption, server-side encryption, etc. What didn’t work for me for a long time was reading encrypted data and writing it back as plain text. Configuring encryption before reading worked fine. However, writing as plain text didn’t work (the data was still encrypted), …
Running Anaconda with DGL and mxnet on CUDA GPU in Spark running in EMR
Today I’m going to share my configuration for running a custom Anaconda Python with DGL (Deep Graph Library) and the mxnet library, with GPU support via CUDA, in Spark hosted in EMR. I also have a Redshift configuration, with support for gensim, tensorflow, keras, theano, pygpu, and cloudpickle. You can also install more libraries if …
Running any query in Redshift or JDBC from Spark in EMR
Last time we saw how to connect to Redshift from Spark running in EMR. The provided solution was nice but only allowed reading data. Sometimes we might want to run an arbitrary DDL or DML query, not just simple read statements. To do that, we need to connect to Redshift directly over JDBC. I assume you …
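The excerpt stops before showing how the direct JDBC connection is made, so the following is only one possible sketch and not the post's own code. It assumes a PySpark session whose driver JVM already has a suitable JDBC driver on its classpath, and it reaches `java.sql.DriverManager` through `spark.sparkContext._jvm`, a private PySpark attribute that may change between versions. The helper name `run_statement` and its parameters are hypothetical.

```python
def run_statement(spark, jdbc_url, user, password, sql):
    """Run one DDL or DML statement over JDBC via the JVM behind `spark`.

    Assumptions: `spark` is a SparkSession, and the JDBC driver for
    `jdbc_url` is visible to java.sql.DriverManager in the driver JVM.
    """
    # _jvm is a private attribute exposing the py4j view of the JVM.
    jvm = spark.sparkContext._jvm
    conn = jvm.java.sql.DriverManager.getConnection(jdbc_url, user, password)
    try:
        stmt = conn.createStatement()
        try:
            # Statement.execute() accepts any SQL; it returns True when the
            # first result is a ResultSet and False otherwise.
            return stmt.execute(sql)
        finally:
            stmt.close()
    finally:
        # Always release the connection, even if the statement fails.
        conn.close()
```

A call would look like `run_statement(spark, url, user, password, "CREATE TABLE ...")`, where `url`, `user`, and `password` are placeholders for your own connection details.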
Connecting to Redshift from Spark running in EMR
Today I’ll share my configuration for Spark running in EMR to connect to a Redshift cluster. First, I assume the cluster is accessible (so configure the virtual subnet, allowed IPs, and the rest of the networking before running this). I’m using Zeppelin, so I’ll show two interpreters configured for the connection, but the same thing should work with standalone …