Amazon EMR

To create a denormalized table, we are going to run a job on Amazon EMR.

Amazon EMR is a managed cluster platform that you can run with a few machines, as in this workshop, or scale to tens or thousands of machines. Consider using Spot Instances for batch processing, and terminate your clusters when you are not using them. It is also recommended to store job results on Amazon S3, so that they remain available after the cluster is terminated.

Step 1: Go to the EMR console.

Step 2: Click on c360cluster.

Step 3: Click on the Steps tab.

Step 4: Add a step with the following settings.

  • Step type: Spark application
  • Name: denormalization
  • Deploy mode: Cluster
  • Spark-submit options: leave blank
  • Application location: s3://**your_stage_bucket**/library/c360_analytics.py

Use the bucket browser to select the application location.

  • Arguments: --BucketName **your analytics bucket**

Pick the bucket name from the Amazon S3 console. Leave a space between --BucketName and your bucket name, and do not include the s3:// prefix.

Then, click on Add.
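
For reference, the same step could also be submitted from a script rather than the console. The following is only a sketch, assuming boto3 is installed and AWS credentials are configured. The helper names, the cluster ID you pass in, and the ActionOnFailure value are assumptions for illustration and are not part of the workshop.

```python
def build_spark_step(stage_bucket: str, analytics_bucket: str) -> dict:
    """Build an EMR step definition that mirrors the console form in Step 4."""
    return {
        "Name": "denormalization",
        # Assumption: keep the cluster running if the step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets a step run spark-submit on the cluster.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                f"s3://{stage_bucket}/library/c360_analytics.py",
                # Bucket name only, without the s3:// prefix.
                "--BucketName", analytics_bucket,
            ],
        },
    }


def submit_step(cluster_id: str, step: dict) -> str:
    """Submit the step to a running cluster and return the new step ID."""
    import boto3  # imported here so the builder above works without boto3

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]
```

A call would look like submit_step("j-XXXXXXXXXXXXX", build_spark_step("your_stage_bucket", "your_analytics_bucket")), where the cluster ID is a placeholder for the ID of c360cluster shown in the EMR console.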

Step 5: Check the job status as it goes from Pending to Running.

Once the step completes, the job has created a denormalized table using PySpark.
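
If you prefer to watch the step from a script instead of refreshing the console, a polling loop along these lines may help. It is a sketch under assumptions: the function name, polling interval, and retry limit are illustrative, and emr_client is expected to be a boto3 EMR client such as boto3.client("emr").

```python
import time

# Step states after which EMR will not change the status again.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def wait_for_step(emr_client, cluster_id, step_id, poll_seconds=30, max_polls=120):
    """Poll describe_step until the step reaches a terminal state."""
    for _ in range(max_polls):
        response = emr_client.describe_step(ClusterId=cluster_id, StepId=step_id)
        state = response["Step"]["Status"]["State"]
        print(f"step {step_id}: {state}")
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"step {step_id} did not finish after {max_polls} polls")
```

A return value of COMPLETED corresponds to the completed status shown in the Steps tab. Any other terminal state means the step did not finish successfully, and the step logs are the place to look.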

Step 6: Go to the Lake Formation console and select the c360denormalized table from the c360view_analytics database.

Step 7: Grant access to the table to your user or role.
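
The grant can also be expressed programmatically. The sketch below assumes boto3 and that SELECT is the permission you want. The workshop does not say which permissions to choose, so treat that value, the helper names, and the principal ARN you pass in as assumptions.

```python
def build_grant_request(principal_arn: str) -> dict:
    """Build the Lake Formation grant request for the denormalized table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "Table": {
                "DatabaseName": "c360view_analytics",
                "Name": "c360denormalized",
            }
        },
        # Assumption: read-only access is enough to query the table in Athena.
        "Permissions": ["SELECT"],
    }


def grant_table_access(principal_arn: str) -> None:
    """Grant the permissions above to an IAM user or role."""
    import boto3  # imported here so the builder above works without boto3

    lakeformation = boto3.client("lakeformation")
    lakeformation.grant_permissions(**build_grant_request(principal_arn))
```

Here principal_arn is the ARN of the IAM user or role you are signed in with.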

Step 8: Go to the Athena console and check the new c360denormalized table in the c360view_analytics database.
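
As a quick check outside the console, a preview query could be started with boto3 as sketched below. The query text, the helper names, and the output_location argument (an S3 path where Athena writes query results) are assumptions for illustration.

```python
def build_preview_query(output_location: str) -> dict:
    """Build an Athena request that previews a few rows of the new table."""
    return {
        "QueryString": 'SELECT * FROM "c360denormalized" LIMIT 10',
        "QueryExecutionContext": {"Database": "c360view_analytics"},
        # Assumption: an S3 location you own for Athena query results.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def start_preview_query(output_location: str) -> str:
    """Start the query and return its execution ID."""
    import boto3  # imported here so the builder above works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(**build_preview_query(output_location))
    return response["QueryExecutionId"]
```

The returned execution ID can be used to look up the query in the Athena console history. The query only returns rows if the Lake Formation grant from Step 7 is in place for the calling user or role.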