Today I’ll share my configuration for Spark running on EMR connecting to a Redshift cluster. First, I assume the cluster is accessible (so configure the virtual subnet, allowed IPs, and the rest of the network setup before running this).
I’m using Zeppelin, so I’ll show two interpreters configured for the connection, but the same thing should work with a standalone job (as long as it has the same libraries configured). I tested things with EMR 5.17.2, but it should work with other versions as well.
Redshift interpreter
First, let’s configure a separate interpreter to use in Zeppelin. SSH into the master node of the cluster and install the JDBC interpreter:
sudo /usr/lib/zeppelin/bin/install-interpreter.sh --name jdbc
Next, download the driver:
cd /usr/lib/zeppelin/interpreter/jdbc/
sudo wget https://s3.amazonaws.com/redshift-downloads/drivers/jdbc/1.2.20.1043/RedshiftJDBC42-no-awssdk-1.2.20.1043.jar
Restart Zeppelin:

sudo stop zeppelin
sudo start zeppelin
Go to the interpreters configuration in Zeppelin and add a new JDBC interpreter named redshift. Use the following settings:
default.driver   com.amazon.redshift.jdbc42.Driver
default.url      jdbc:redshift://your-redshift-instance-address.redshift.amazonaws.com:5439/your-database
default.user     redshift_user
default.password redshift_password
Now create a new paragraph like below:
%redshift
SELECT * FROM table
And it should work.
Spark interpreter
Download the driver the same way as before. Now, go to the interpreter settings and add the dependency to the Spark interpreter:
/usr/lib/zeppelin/interpreter/jdbc/RedshiftJDBC42-no-awssdk-1.2.20.1043.jar
Now you can start reading data like this:
%spark
import org.apache.spark.sql._

def getConnectionString() = {
  val url = "cluster url"
  val port = 5439 // 5439 is the default Redshift port, matching the JDBC URL configured earlier
  val database = "database name"
  val user = "user"
  val password = "password"
  s"jdbc:redshift://${url}:$port/$database?user=$user&password=$password"
}

def runQuery(query: String) = {
  val df: DataFrame = sqlContext.read
    .format("jdbc")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .option("url", getConnectionString())
    .option("dbtable", s"($query) tmp")
    .load()
  df
}

val table = runQuery(s"""
SELECT *
FROM Table AS t
""")
This is even nicer because you can use string interpolation to provide parameters for queries.
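For example (a hypothetical helper — the table and column names here are made up), you can assemble the query text separately and hand it to runQuery:

```scala
// Hypothetical example: build the query with string interpolation.
// The table and column names are invented; pass the result to the
// runQuery function defined above, e.g.
//   val sales = runQuery(buildQuery("sales", "2018-01-01"))
def buildQuery(tableName: String, since: String): String =
  s"SELECT * FROM $tableName WHERE created_at >= '$since'"
```

Keep in mind that plain interpolation does no escaping or quoting, so only use it with values you control, never with untrusted input.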