Reza Rashetnia, PhD
3 min read · Aug 29, 2020


Query Amazon Redshift Data Using Python (Pandas and NumPy)

Amazon Redshift is a data warehouse that is part of the cloud-computing platform Amazon Web Services (AWS). The name Redshift suggests a shift away from Oracle. AWS is an efficient solution for collecting and storing all of an organization's data and analyzing it with various business intelligence tools. Python, a flexible and open-source language, is one of the most popular tools in the data science and machine learning community.

In this post, we review how to access your Amazon Redshift data using Python. One easy way to access your data with Python is the Pandas library, but its heavy memory usage can be an issue for cloud applications. Therefore, this post considers both Pandas and NumPy. In general, there are two main steps to accessing your data:

  1. Connecting to the Redshift warehouse instance and Querying the data
  2. Using the Queried Data for Data Analysis

Amazon Redshift is based on PostgreSQL 8.0. Amazon Redshift is specifically designed for online analytical processing (OLAP) and business intelligence (BI) applications, which require complex queries against large datasets. We use the Python psycopg2 library to access and query the data from Redshift.

  1. Connecting to the Redshift warehouse instance and Querying the data:

To access your Redshift data using Python, we first need to connect to our instance, which we do with psycopg2. The following parameters are needed to connect to any database:

  • Name of the database
  • Host Name
  • Port
  • User Name
  • Password

The following is a general description of how to connect to the Redshift warehouse instance and query data:

Connecting to the Redshift warehouse instance and Querying the data
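The step above can be sketched as follows. This is a minimal illustration, not a production implementation: all connection values, the host name, and `your_table` are placeholders you must replace with your own cluster's details.

```python
# All connection values below are placeholders -- substitute your own.
CONN_PARAMS = {
    "dbname": "your_database",
    "host": "your-cluster.abc123.us-east-1.redshift.amazonaws.com",
    "port": 5439,
    "user": "your_user",
    "password": "your_password",
}


def fetch_rows(sql, connect=None):
    """Open a connection, execute sql, and return all result rows.

    By default this connects with psycopg2 using CONN_PARAMS; a custom
    zero-argument `connect` callable can be injected instead, which also
    makes the function easy to exercise without a live cluster.
    """
    if connect is None:
        import psycopg2  # third-party driver: pip install psycopg2-binary
        connect = lambda: psycopg2.connect(**CONN_PARAMS)
    conn = connect()
    try:
        cursor = conn.cursor()
        try:
            cursor.execute(sql)
            return cursor.fetchall()  # list of row tuples
        finally:
            cursor.close()
    finally:
        conn.close()
```

With real credentials in place, a call such as `fetch_rows("SELECT * FROM your_table LIMIT 100;")` returns the query results as a list of tuples, ready to be handed to NumPy or Pandas below.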

We can then store the query results either as a Pandas DataFrame (using the SQLAlchemy library) or as a NumPy array.

2. Using the Queried Data for Data Analysis:

So far we have queried Redshift data and fetched it for further analysis. Now we need data analysis tools to work with it. In Python, the most popular libraries for this are NumPy and Pandas. NumPy is the leading Python library for numerical computing, while Pandas is the most widely used library for data manipulation and analysis; it provides a high-performance data structure called the DataFrame that is a staple of data science projects. Here, we will use both of these well-known libraries to represent our initial data.

  • Loading the Data using NumPy

It is quite straightforward to turn your data into a NumPy array: initialize a new array and pass it the rows fetched from the cursor (the result of `cursor.fetchall()`). To do so, run the following code in your Python console:
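A short sketch of this conversion; since a live cluster is not assumed here, the `rows` list below simulates what `cursor.fetchall()` would return.

```python
import numpy as np

# Simulated query results -- in practice this is what cursor.fetchall()
# returns: a list of row tuples.
rows = [(1, 3.5), (2, 4.0), (3, 5.25)]

# np.array() converts the list of tuples into a 2-D array
# (here all values are promoted to float64).
data = np.array(rows)
print(data.shape)  # (3, 2)
```

From here, the full NumPy toolkit (slicing, `mean`, `std`, and so on) applies directly to `data`.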

  • Loading the Data using Pandas

You can also use Pandas instead of NumPy for your data analysis; the steps involved are slightly different. Refer to the following code snippet:
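A minimal sketch of the Pandas route; as above, the `rows` and `columns` values simulate what would come back from the cursor (`cursor.fetchall()` for the rows, `cursor.description` for the column names).

```python
import pandas as pd

# Simulated query output: rows as returned by cursor.fetchall(), and
# column names, which in practice come from cursor.description.
rows = [(1, "alice", 3.5), (2, "bob", 4.0)]
columns = ["id", "name", "score"]

# Build a labeled DataFrame from the raw row tuples.
df = pd.DataFrame(rows, columns=columns)
print(df.head())
```

Alternatively, with a SQLAlchemy engine pointed at your cluster, `pd.read_sql("SELECT ...", engine)` builds the DataFrame directly from the query in one step.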
