Do you want your business intelligence (BI) users to be able to use “self-service” for analytics and reporting in a modern cloud architecture? Then this article is for you. Amazon has a perfect combination of storage, transformation and visualization that can solve this very common request, empowering data analysts to do their job without the long wait to load, model and prepare data.
First, let’s review the AWS services you'll need for this solution:
- AWS S3 is a simple storage service that can be used to store and retrieve any amount of data. It's an object store and very reliable.
- AWS Athena is a query service that makes it easy to analyze data directly from files in S3 using standard SQL statements.
- AWS QuickSight is a next generation Business Intelligence (BI) application that can help build interactive visualizations on top of various data sources hosted on the Amazon cloud infrastructure.
The solution for this self-service BI is described in the figure below. Let’s review the data flow from left to right.
- On the extreme left is the source for the data which is from the public data.gov website
- Next is S3 which acts as the data store
- Next is Athena which is the serverless query layer on top of S3
- Last is QuickSight for data preparation and visualization
Figure 1: Solution Architecture
In the next few sections, we will review each of the key tasks in detail.
Uploading Data to S3
For this demonstration, we will use all files from the Data.gov College Scorecard. The dataset is available from the following public URL: https://catalog.data.gov/dataset/college-scorecard
Here are the detailed steps to upload a file to S3 filesystem:
1. Download the CollegeScorecard_Raw_Data.zip to your local system (laptop) and unzip the file.
2. Next, upload the file to AWS S3, login to our account and select S3 from the Services menu.
3. Next, select the S3 bucket or create a new S3 bucket. In the following screenshot, I have the select
collegescorecard bucket that I created earlier.
4. Next, create a folder CollegeRaw and then sub-folders, one for each year 2010, 2011, 2012 and 2013 as shown in the following figure:
Figure 2: S3 Data loading
5. To each sub-folder, upload the corresponding data file; for example, MERGED2010_PP.csv file goes to year=2010 sub-folder and repeat the same for all years.
This completes the data loading to S3. Next, we'll see how to view this data in Athena.
Creating tables in Athena
Now that the data is loaded into S3, we can query it using Athena with the following steps:
1. Open the AWS management console for Athena using this link https://console.aws.amazon.com/athena/home OR search for Athena in the AWS services search bar.
2. Using the Query Editor, run the create database collegestatsathenadb; statement.
3. The new database, collegestatsathenadb, should appear in the drop down on the left-hand side. Select the new database.
4. Next, create a new table for the files in S3 on CollegeScorecard Raw [RP1] data with partition clause. The query is in GitHub here: https://github.com/rnadipalli/quicksight/blob/master/sqlscripts/loadtoAthena.sql
Figure 3: Athena create table
5. After the table is created, verify it by browsing for it on the left-hand panel.
6. Next, to load all partitions of the table, run the following command:
- MSCK REPAIR TABLE CollegeStatsAthenaDB.CollegeStats;
7. Next, you can query the table and view data as shown in the following figure:
- select * from CollegeStatsAthenaDB.CollegeStats limit 5;
Figure 4: Athena review results
This completes the creation of the table in Athena database. Next, we will see how to visualize this data in QuickSight.
Visualizing using QuickSight
Now that data is accessible via Athena, follow the steps below to create a new dataset in QuickSight:
1. From the QuickSight home page, click on Manage Data.
2. Next, select New data set and then select Athena option.
3. For the Data source name, enter the same name as the Athena database CollegeStatsAthenaDB.
4. Click on Validate connection to confirm QuickSight can connect to Athena. After it is Validated, click on Create data source to complete the data source creation as shown in the following figure:
Figure 5: QuickSight Athena data source
5. Next, select the table collegestats from the table selection and then select Edit/Preview data.
6. The table has over 1700 fields; for this demonstration we will focus only on enrollment related fields and only for public colleges. For this we will use the QuickSight option to report data based on custom SQL as checked in https://github.com/rnadipalli/quicksight/blob/master/sqlscripts/loadtoAthena.sql.
Figure 6: QuickSight data preparation
7. Next update the data type of all numeric fields to integer as this will help with the reporting and visualization.
8. Next, click on save and visualize to analyze this data.
9. Next, change the visualization type to vertical stacked bar chart; set the Enrollment_All_Count to a measure; select X axis as Year, Value as Enrollement_All_Count and Group Color as STABBAR. Finally, to focus only on top states by count, filter the chart with Enrollment_All_Count greater than 200,000.
10. This will give you a pretty useful trend that shows that the state of Ohio has a drop of enrollment from 219K in year 2010 to 217K in year 2013 in comparison to the state of Texas where enrollment grew from 438K to 467K in same time period as shown in the following figure:
Figure 7: QuickSight bar chart
This shows how we can go from data in flat files to meaningful charts with zero infrastructure administration, no custom software installation and easy to use cloud services.
Further, If you are interested to also move your complex transformations on S3 to a transient cluster, read "Save Money in the Cloud with Transient Clusters for Big Data".
If you are looking at strategies on Data Lake migration from a local data center to cloud-base Data Lakes, read "Migrating On-Premises Data Lakes to Cloud".
About the AuthorMore Content by Rajesh Nadipalli