Connecting to Redshift from Spark running in EMR

Today I’ll share my configuration for Spark running in EMR to connect to a Redshift cluster. First, I assume the cluster is reachable from the EMR nodes (so configure the VPC subnets, security groups, allowed IPs, and the rest of the networking before running this).

I’m using Zeppelin, so I’ll show two interpreters configured for the connection, but the same thing should work with a standalone job (as long as it has the same libraries configured). I tested this with EMR 5.17.2, but it should work with other versions as well.

Redshift interpreter

First, let’s configure a separate interpreter to use in Zeppelin. SSH into the master node of the cluster and install the JDBC interpreter.
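A minimal sketch, assuming the default EMR layout where Zeppelin lives under /usr/lib/zeppelin:

```sh
# Install the Zeppelin JDBC interpreter (path assumes the default EMR location)
sudo /usr/lib/zeppelin/bin/install-interpreter.sh --name jdbc
```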

Next, download the Redshift JDBC driver.
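For example, assuming the driver version that was current around the time of writing (check the AWS documentation for the latest download URL):

```sh
# Download the Redshift JDBC driver into Zeppelin's JDBC interpreter directory
# (version and URL are assumptions; use whatever AWS currently publishes)
cd /usr/lib/zeppelin/interpreter/jdbc/
sudo wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.20.1043/RedshiftJDBC42-no-awssdk-1.2.20.1043.jar
```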

Restart Zeppelin.
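On EMR 5.x, Zeppelin runs as an upstart service, so (assuming that setup):

```sh
# Restart the Zeppelin service (upstart on EMR 5.x; newer releases use systemctl)
sudo stop zeppelin
sudo start zeppelin
```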

Go to the interpreter configuration in Zeppelin and add a new JDBC interpreter named redshift. Use the following settings.
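The endpoint, database, and credentials below are placeholders; the driver class and port 5439 are the standard defaults for the Redshift JDBC42 driver:

```
default.driver    com.amazon.redshift.jdbc42.Driver
default.url       jdbc:redshift://<cluster-endpoint>:5439/<database>
default.user      <user>
default.password  <password>
```

Then add the path to the downloaded jar (e.g. /usr/lib/zeppelin/interpreter/jdbc/RedshiftJDBC42-no-awssdk-1.2.20.1043.jar) as an artifact under the interpreter’s Dependencies section.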

Now create a new paragraph like the one below.
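For example, with a hypothetical table name:

```sql
%redshift
SELECT COUNT(*) FROM my_table
```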

And it should work.

Spark interpreter

Download the driver the same way as before. Now, go to the interpreter settings and add a dependency to the Spark interpreter.
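Concretely, add the local path of the jar as an artifact under the Spark interpreter’s Dependencies section (same assumed path and version as above):

```
/usr/lib/zeppelin/interpreter/jdbc/RedshiftJDBC42-no-awssdk-1.2.20.1043.jar
```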

Now you can start reading data like this.
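A sketch with a hypothetical endpoint, credentials, and table; the query goes through the dbtable option as a parenthesized subquery, which works with the Spark 2.x that ships on EMR 5.x:

```scala
%spark
// Hypothetical connection details; substitute your own
val jdbcUrl = "jdbc:redshift://<cluster-endpoint>:5439/<database>"
val minDate = "2018-01-01"

val df = spark.read
  .format("jdbc")
  .option("driver", "com.amazon.redshift.jdbc42.Driver")
  .option("url", jdbcUrl)
  .option("user", "<user>")
  .option("password", "<password>")
  // dbtable accepts a subquery wrapped in parentheses with an alias
  .option("dbtable", s"(SELECT * FROM events WHERE created_at >= '$minDate') AS q")
  .load()

df.show()
```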

This is even nicer because you can use Scala string interpolation to provide parameters for queries, as with minDate in the subquery above.