PySpark - Glue

Now you are going to perform more advanced transformations using AWS Glue jobs.

Step 1: Go to the AWS Glue console, open Jobs, and select n1_c360_dispositions, a PySpark job.

The transformation inside this job performs a join across three tables (general banking, account, and card) to calculate the disposition type and acquisition information.
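
For intuition, here is a minimal PySpark sketch of this kind of three-way join. The table and column names (disposition, account, card, account_id, disp_id) are assumptions for illustration, not the job's exact schema:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dispositions_sketch").getOrCreate()

# Hypothetical catalog tables; the job's actual table names may differ.
disp = spark.table("general_banking.disposition").alias("d")
account = spark.table("general_banking.account").alias("a")
card = spark.table("general_banking.card").alias("c")

joined = (
    disp
    .join(account, F.col("d.account_id") == F.col("a.account_id"), "inner")
    # Left join: not every disposition necessarily has a card attached.
    .join(card, F.col("d.disp_id") == F.col("c.disp_id"), "left")
    .select(
        F.col("a.account_id"),
        F.col("d.type").alias("disposition_type"),
        F.col("c.type").alias("card_type"),
        F.col("a.date").alias("acquisition_date"),
    )
)
```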

Step 2: Click on Edit job.

Step 3: Change Glue version to Spark 2.4, Python 3 with improved startup times (Glue Version 2.0).

Step 4: Set the Temporary directory to your stage S3 bucket, c360view-us-west-2-your_account_id-stage, plus '/tmp/'.

Save.
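
As a side note, the Temporary directory you set here is handed to the script as the --TempDir job argument, which a Glue script can read with getResolvedOptions. A minimal sketch:

```python
import sys
from awsglue.utils import getResolvedOptions

# Glue exposes the configured Temporary directory to the script as --TempDir.
args = getResolvedOptions(sys.argv, ["TempDir"])
temp_dir = args["TempDir"]  # e.g. s3://c360view-us-west-2-your_account_id-stage/tmp/
```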

Step 5: Select n1_c360_dispositions and click on Action, Run job.

Step 6: Wait for completion.

Step 7: Check the script and logs.

Notice that this Python script converts a query result from Amazon Athena into a Pandas data frame, and then writes the result of the Pandas transformation as Parquet files to Amazon S3 using a Spark write operation.
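
As a rough sketch of that pattern, assuming the PyAthena library and hypothetical bucket, table, and output paths (not the exact code in the job):

```python
import pandas as pd
from pyathena import connect
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Run the query on Athena and load the result into a Pandas data frame.
conn = connect(
    s3_staging_dir="s3://c360view-us-west-2-your_account_id-stage/tmp/",
    region_name="us-west-2",
)
pdf = pd.read_sql("SELECT * FROM general_banking.disposition", conn)

# ... Pandas transformations on pdf would happen here ...

# Hand the result back to Spark and write Parquet files to S3.
spark.createDataFrame(pdf).write.mode("overwrite").parquet(
    "s3://c360view-us-west-2-your_account_id-stage/dispositions/"
)
```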

Step 8: Now select Jobcust360etlmftrans, another PySpark job, then click on Action, Edit job.

Step 9: Change Glue version to Spark 2.4, Python 3 with improved startup times (Glue Version 2.0), and set the Temporary directory to your stage S3 bucket, c360view-us-west-2-your_account_id-stage, plus '/tmp/'.

Save.

Step 10: Now select Jobcust360etlmftrans. Click on Action, and then on Run job.

Step 11: Wait for completion.

Step 12: Check the script and logs.

In this PySpark script we compute aggregations over the transactions from the relational database, grouped by account_id, for the last 3 months and the last 6 months. For this we used AWS Glue DynamicFrames.
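
A condensed sketch of that idea, assuming a transactions table in the Glue Data Catalog with date and amount columns (the database, table, and column names are illustrative):

```python
import pyspark.sql.functions as F
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Load the transactions table as a DynamicFrame, then drop to a Spark
# DataFrame for the aggregation (database/table names are assumptions).
txn = glueContext.create_dynamic_frame.from_catalog(
    database="general_banking", table_name="transactions"
).toDF()

# Keep the last 6 months, then compute per-account sums for both windows.
recent = txn.filter(F.col("date") >= F.add_months(F.current_date(), -6))
agg = recent.groupBy("account_id").agg(
    F.sum(
        F.when(F.col("date") >= F.add_months(F.current_date(), -3),
               F.col("amount"))
    ).alias("amount_last_3m"),
    F.sum("amount").alias("amount_last_6m"),
)

# Wrap the result back in a DynamicFrame for downstream Glue sinks.
agg_dyf = DynamicFrame.fromDF(agg, glueContext, "agg_dyf")
```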