Discover real-world use circumstances for Amazon CodeWhisperer powered by AWS Glue Studio notebooks

0
52


Many purchasers are considering boosting productiveness of their software program improvement lifecycle through the use of generative AI. Lately, AWS introduced the overall availability of Amazon CodeWhisperer, an AI coding companion that makes use of foundational fashions underneath the hood to enhance software program developer productiveness. With Amazon CodeWhisperer, you may rapidly settle for the highest suggestion, view extra strategies, or proceed writing your individual code. This integration reduces the general time spent in writing information integration and extract, rework, and cargo (ETL) logic. It additionally helps beginner-level programmers write their first traces of code. AWS Glue Studio notebooks permits you to writer information integration jobs with a web-based serverless pocket book interface.

On this put up, we focus on real-world use circumstances for CodeWhisperer powered by AWS Glue Studio notebooks.

Resolution overview

For this put up, you employ the CSV eSports Earnings dataset, out there to obtain through Kaggle. The info is scraped from eSportsEarnings.com, which gives data on earnings of eSports gamers and groups. The target is to carry out transformations utilizing an AWS Glue Studio pocket book with CodeWhisperer suggestions after which write the info again to Amazon Easy Storage Service (Amazon S3) in Parquet file format in addition to to Amazon Redshift.

Stipulations

Our answer has the next stipulations:

  1. Arrange AWS Glue Studio.
  2. Configure an AWS Id and Entry Administration (IAM) function to work together with CodeWhisperer. Connect the next coverage to your IAM function that’s hooked up to the AWS Glue Studio pocket book:
    {
        "Model": "2012-10-17",
        "Assertion": [{
            "Sid": "CodeWhispererPermissions",
            "Effect": "Allow",
            "Action": [
                "codewhisperer:GenerateRecommendations"
            ],
            "Useful resource": "*"
        }]
    }

  3. Obtain the CSV eSports Earnings dataset and add the CSV file highest_earning_players.csv to the S3 folder you can be utilizing on this use case.

Create an AWS Glue Studio pocket book

Let’s get began. Create a brand new AWS Glue Studio pocket book job by finishing the next steps:

  1. On the AWS Glue console, select Notebooks underneath ETL jobs within the navigation pane.
  2. Choose Jupyter Pocket book and select Create.
  3. For Job identify, enter CodeWhisperer-s3toJDBC.

A brand new pocket book will probably be created with the pattern cells as proven within the following screenshot.

We use the second cell for now, so you may take away all the opposite cells.

  1. Within the second cell, replace the interactive session configuration by setting the next:
    1. Employee sort to G.1X
    2. Variety of staff to three
    3. AWS Glue model to 4.0
  2. Furthermore, import the DynamicFrame module and current_timestamp perform as follows:
    from pyspark.sql.capabilities import current_timestamp
    from awsglue.dynamicframe import DynamicFrame

After you make these adjustments, the pocket book needs to be trying like the next screenshot.

Now, let’s guarantee CodeWhisperer is working as supposed. On the backside proper, one can find the CodeWhisperer possibility beside the Glue PySpark standing, as proven within the following screenshot.

You’ll be able to select CodeWhisperer to view the choices to make use of Auto-Options.

Develop your code utilizing CodeWhisperer in an AWS Glue Studio pocket book

On this part, we present easy methods to develop an AWS Glue pocket book job for Amazon S3 as an information supply and JDBC information sources as a goal. For our use case, we have to guarantee Auto-Options are enabled. Write your advice utilizing CodeWhisperer utilizing the next steps:

  1. Write a remark in pure language (in English) to learn Parquet recordsdata out of your S3 bucket:

After you enter the previous remark and press Enter, the CodeWhisperer button on the finish of the web page will present that it’s operating to write down the advice. The output of the CodeWhisperer advice will seem within the subsequent line and the code is chosen after you press Tab. You’ll be able to be taught extra in Consumer actions.

After you enter the previous remark, CodeWhisperer will generate a code snippet that’s much like the next:

df = (spark.learn.format("csv")
      .possibility("header", "true")
      .possibility("inferSchema", "true")
      .load("s3://<bucket>/<path>/highest_earning_players.csv"))

Word that that you must replace the paths to match the S3 bucket you’re utilizing as an alternative of the CodeWhisperer-generated bucket.

From the previous code snippet, CodeWhisperer used Spark DataFrames to learn the CSV recordsdata.

  1. Now you can strive some rephrasing to get a suggestion with DynamicFrame capabilities:
# Learn CSV file from S3 with the header format possibility utilizing DynamicFrame"

Now CodeWhisperer will generate a code snippet that’s near the next:

dyF = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://<bucket>/<path>/highest_earning_players.csv"],
        "recurse": True,
    },
    format="csv",
    format_options={
        "withHeader": True,
    },
    transformation_ctx="dyF")

Rephrasing the sentences written now has proved that after some modifications to the feedback we wrote, we bought the right advice from CodeWhisperer.

  1. Subsequent, use CodeWhisperer to print the schema of the previous AWS Glue DynamicFrame through the use of the next remark:
    # Print the schema of the above DynamicFrame

CodeWhisperer will generate a code snippet that’s near the next:

We get the next output.

Now we use CodeWhisperer to create some transformation capabilities that may manipulate the AWS Glue DynamicFrame learn earlier. We begin by coming into code in a brand new cell.

  1. First, check if CodeWhisperer can use the right AWS Glue context capabilities like ResolveChoice:
    # Convert the "PlayerId" sort from string to integer

CodeWhisperer has really useful a code snippet much like the next:

dyF = dyF.resolveChoice(specs=[('PlayerId', 'cast:long')])
dyF.printSchema()

The previous code snippet doesn’t precisely characterize the remark that we entered.

  1. You’ll be able to apply sentence paraphrasing and simplifying by offering the next three feedback. Each has totally different ask and we use the withColumn Spark Body technique, which is utilized in casting columns varieties:
    # Convert the DynamicFrame to spark information body
    # Forged the 'PlayerId' column from string to Integer utilizing WithColumn perform
     # Convert the spark body again to DynamicFrame and print the schema

CodeWhisperer will choose up the previous instructions and suggest the next code snippet in sequence:

df = dyF.toDF()
df = df.withColumn("PlayerId", df["PlayerId"].forged("integer"))
dyF = DynamicFrame.fromDF(df, glueContext, "dyF")
dyF.printSchema()

The next output confirms the PlayerId column is modified from string to integer.

  1. Apply the identical course of to the resultant AWS Glue DynamicFrame for the TotalUSDPrize column by casting it from string to lengthy utilizing the withColumn Spark Body capabilities by coming into the next feedback:
    # Convert the dynamicFrame to Spark Body
    # Forged the "TotalUSDPrize" column from String to lengthy
    # Convert the spark body again to dynamic body and print the schema

The really useful code snippet is much like the next:

df = dyF.toDF()
df = df.withColumn("TotalUSDPrize", df["TotalUSDPrize"].forged("lengthy"))
dyF = DynamicFrame.fromDF(df, glueContext, "dyF")
dyF.printSchema()

The output schema of the previous code snippet is as follows.

Now we’ll attempt to suggest a code snippet that displays the common prize for every participant in line with their nation code.

  1. To take action, begin by getting the rely of the participant per every nation:
    # Get the rely of every nation code

The really useful code snippet is much like the next:

country_code_count = df.groupBy("CountryCode").rely()
country_code_count.present()

We get the next output.

  1. Be part of the primary DataFrame with the nation code rely DataFrame after which add a brand new column calculating the common highest prize for every participant in line with their nation code:
    # Convert the DynamicFrame (dyF) to dataframe (df)
    # Be part of the dataframe (df) with country_code_count dataframe with respect to CountryCode column
    # Convert the spark body again to DynamicFrame and print the schema

The really useful code snippet is much like the next:

df = dyF.toDF()
df = df.be a part of(country_code_count, "CountryCode")
dyF = DynamicFrame.fromDF(df, glueContext, "dyF")
dyF.printSchema()

The output of the schema now confirms the each DataFrames the place appropriately joined and the Rely column is added to the primary DataFrame.

  1. Get the code advice on the code snippet to calculate the common TotalUSDPrize for every nation code and add it to a brand new column:
    # Get the sum of all of the TotalUSDPrize column per countrycode
    # Rename the sum column to be "SumPrizePerCountry" within the newly generated dataframe

The really useful code snippet is much like the next:

country_code_sum = df.groupBy("CountryCode").sum("TotalUSDPrize")
country_code_sum = country_code_sum.withColumnRenamed("sum(TotalUSDPrize)", "SumPrizePerCountry")
country_code_sum.present()

The output of the previous code ought to appear like the next.

  1. Be part of the country_code_sum DataFrame with the primary DataFrame from earlier and get the common of the prizes per participant per nation:
    # Be part of the above dataframe with the primary dataframe with respect to CountryCode
    # Get the common Complete prize in USD per participant per nation and add it to a brand new column referred to as "AveragePrizePerPlayerPerCountry"

The really useful code snippet is much like the next:

df = df.be a part of(country_code_sum, "CountryCode")
df = df.withColumn("AveragePrizePerPlayerPerCountry", df["SumPrizePerCountry"] / df["count"])

  1. The final half within the transformation part is to kind the info by the best common prize per participant per nation:
    # kind the above dataframe descendingly in line with the best Common Prize per participant nation
    # Present the highest 5 rows

The really useful code snippet is much like the next:

df = df.kind(df["AveragePrizePerPlayerPerCountry"].desc())
df.present(5)

The primary 5 rows will probably be much like the next.

For the final step, we write the DynamicFrame to Amazon S3 and to Amazon Redshift.

  1. Write the DynamicFrame to Amazon S3 with the next code:
    # Convert the info body to DynamicFrame
    # Write the DynamicFrame to S3 in glueparquet format

The CodeWhisperer advice is much like the next code snippet:

dyF = DynamicFrame.fromDF(df, glueContext, "dyF")

glueContext.write_dynamic_frame.from_options(
body=dyF,
connection_type="s3",
connection_options={
"path": "s3://<enter your individual>/",
"partitionKeys": [],
},
format = "glueparquet",
)

We have to right the code snippet generated after the advice as a result of it doesn’t comprise partition keys. As we identified, partitionkeys is empty, so we are able to have one other code block suggestion to set partitionkey after which write it to the goal Amazon S3 location. Additionally, in line with the most recent updates associated to writing DynamicFrames to Amazon S3 utilizing glueparquet, format = "glueparquet" is not used. As an alternative, that you must use the parquet sort with useGlueParquetWriter enabled.

After the updates, our code appears much like the next:

dyF = DynamicFrame.fromDF(df, glueContext, "dyF")

glueContext.write_dynamic_frame.from_options(
body=dyF,
connection_type="s3",
connection_options={
"path": "s3://<enter your individual>/",
},
format = "parquet",
 format_options={
 "useGlueParquetWriter": True,
    },
)

Another choice right here could be to write down the recordsdata to Amazon Redshift utilizing a JDBC connection.

  1. First, enter the next command to verify whether or not CodeWhisperer will perceive the remark in a single sentence and use the right capabilities or not:
    # Write the DynamicFrame to Redshift
    # Choose solely the next 3 columns within the DynamicFrame to write down to redshift: CountryCode, TotalUSDPrize, NameFirst, NameLast, AveragePrizePerPlayerPerCountr

The output of the remark is much like the next code snippet:

dyF = dyF.select_fields(["CountryCode", "TotalUSDPrize", "NameFirst","NameLast","AveragePrizePerPlayerPerCountry"])
glueContext.write_dynamic_frame.from_options(
    body=dyF,
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://omarglueblog.cjzxbkzmhwqm.us-east-1.redshift.amazonaws.com:5439/omarglueblog",
        "consumer": "omarglueblog",
        "password": "omarglueblog",
        "dbtable": "highest_earning_players_redshift",
        "driver": "com.amazon.redshift.jdbc42.Driver",
        "preactions": "truncate desk highest_earning_players_redshift",
        "redshiftTmpDir": "s3://awsgluetest/tmp/",
        "aws_iam_role": "0000000000000000000000000:function/glue-role",
    },
     format = "redshift",
     transformation_ctx="dyF"
    
)

As we are able to see, CodeWhisperer appropriately interpreted the remark by deciding on solely the desired columns to write down to Amazon Redshift.

  1. Now, use CodeWhisperer to write down the DynamicFrame to Amazon Redshift. We use the Preaction parameter to run a SQL question to pick out solely sure columns to be written to Amazon Redshift:
    # Write the resultant DynamicFrame to Redshift 
    # utilizing preaction that selects solely the next columns: CountryCode, TotalUSDPrize, NameFirst, NameLast, AveragePrizePerPlayerPerCountry

The CodeWhisperer advice is much like the next code snippet:

glueContext.write_dynamic_frame.from_options(
    body = dyf,
    connection_type = "redshift",
    connection_options = {
        "url": "jdbc:redshift://awsgluetest.cjw8y5zdqmhz.us-east-1.redshift.amazonaws.com:5439/dev",
        "consumer": "awsuser",
        "password": "awsuser",
        "dbtable": "gamers",
        "preactions": "SELECT CountryCode, TotalUSDPrize, NameFirst, NameLast, AveragePrizePerPlayerPerCountry FROM highest_earning_player",
        "redshiftTmpDir": "s3://awsgluetest/tmp/"
        },
    format = "glueparquet",
    transformation_ctx = "write_dynamic_frame")

After checking the previous code snippet, you may observe that there’s a misplaced format, which you’ll be able to take away. You can too add the iam_role as an enter in connection_options. You can too discover that CodeWhisperer has routinely assumed the Redshift URL to have the identical identify because the S3 folder that we used. Due to this fact, that you must change the URL and the S3 temp listing bucket to replicate your individual parameters and take away the password parameter. The ultimate code snippet needs to be much like the next:

glueContext.write_dynamic_frame.from_options(
body=dyF,
connection_type="redshift",
connection_options={
"url": "jdbc:redshift://<enter your individual>.cjwjn5pzxbhx.us-east-1.redshift.amazonaws.com:5439/<enter your individual>",
"consumer": "<enter your individual>",
"dbtable": "<enter your individual>",
"driver": "com.amazon.redshift.jdbc42.Driver",
"preactions": "SELECT CountryCode, TotalUSDPrize, NameFirst, NameLast, AveragePrizePerPlayerPerCountry FROM <enter your desk>",
"redshiftTmpDir": "<enter your individual>",
"aws_iam_role": "<enter your individual>",
}
)

The next is the entire code and remark snippets:

%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 3

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.capabilities import current_timestamp
from awsglue.DynamicFrame import DynamicFrame


sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)


# Learn CSV recordsdata from S3
dyF = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://<bucket>/<path>/highest_earning_players.csv"],
        "recurse": True,
    },
    format="csv",
    format_options={
        "withHeader": True,
    },
    transformation_ctx="dyF")
    
# Print the schema of the above DynamicFrame
dyF.printSchema()


# Convert the DynamicFrame to spark information body
# Forged the 'PlayerId' column from string to Integer utilizing WithColumn perform
# Convert the spark body again to DynamicFrame and print the schema
df = dyF.toDF()
df = df.withColumn("PlayerId", df["PlayerId"].forged("integer"))
dyF = DynamicFrame.fromDF(df, glueContext, "dyF")
dyF.printSchema()


# Convert the dynamicFrame to Spark Body
# Forged the "TotalUSDPrize" column from String to lengthy
# Convert the spark body again to dynamic body and print the schema
df = dyF.toDF()
df = df.withColumn("TotalUSDPrize", df["TotalUSDPrize"].forged("lengthy"))
dyF = DynamicFrame.fromDF(df, glueContext, "dyF")
dyF.printSchema()

# Get the rely of every nation code
country_code_count = df.groupBy("CountryCode").rely()
country_code_count.present()

# Convert the DynamicFrame (dyF) to dataframe (df)
# Be part of the dataframe (df) with country_code_count dataframe with respect to CountryCode column
# Convert the spark body again to DynamicFrame and print the schema
df = dyF.toDF()
df = df.be a part of(country_code_count, "CountryCode")
df.printSchema()

# Get the sum of all of the TotalUSDPrize column per countrycode
# Rename the sum column to be "SumPrizePerCountry"
country_code_sum = df.groupBy("CountryCode").sum("TotalUSDPrize")
country_code_sum = country_code_sum.withColumnRenamed("sum(TotalUSDPrize)", "SumPrizePerCountry")
country_code_sum.present()

# Be part of the above dataframe with the primary dataframe with respect to CountryCode
# Get the common Complete prize in USD per participant per nation and add it to a brand new column referred to as "AveragePrizePerPlayerPerCountry"
df.be a part of(country_code_sum, "CountryCode")
df = df.withColumn("AveragePrizePerPlayerPerCountry", df["SumPrizePerCountry"] / df["count"])

# kind the above dataframe descendingly in line with the best Common Prize per participant nation
# Present the highest 5 rows
df = df.kind(df["AveragePrizePerPlayerPerCountry"].desc())
df.present(5)

# Convert the info body to DynamicFrame
# Write the DynamicFrame to S3 in glueparquet format
dyF = DynamicFrame.fromDF(df, glueContext, "dyF")

glueContext.write_dynamic_frame.from_options(
body=dyF,
connection_type="s3",
connection_options={
"path": "s3://<enter your individual>/",
},
format = "parquet",
 format_options={
 "useGlueParquetWriter": True,
    },
)

# Write the resultant DynamicFrame to Redshift 
# utilizing preaction that selects solely the next columns: CountryCode, TotalUSDPrize, NameFirst, NameLast, AveragePrizePerPlayerPerCountry
glueContext.write_dynamic_frame.from_options(
body=dyF,
connection_type="redshift",
connection_options={
"url": "jdbc:redshift://<enter your individual>.cjwjn5pzxbhx.us-east-1.redshift.amazonaws.com:5439/<enter your individual>",
"consumer": "<enter your individual>",
"dbtable": "<enter your individual>",
"driver": "com.amazon.redshift.jdbc42.Driver",
"preactions": "SELECT CountryCode, TotalUSDPrize, NameFirst, NameLast, AveragePrizePerPlayerPerCountry FROM <enter your desk>",
"redshiftTmpDir": "<enter your individual>",
"aws_iam_role": "<enter your individual>",
}
)

Conclusion

On this put up, we demonstrated a real-world use case on how AWS Glue Studio pocket book integration with CodeWhisperer helps you construct information integration jobs sooner. You can begin utilizing the AWS Glue Studio pocket book with CodeWhisperer to speed up constructing your information integration jobs.

To be taught extra about utilizing AWS Glue Studio notebooks and CodeWhisperer, take a look at the next video.


In regards to the authors

Ishan Gaur works as Sr. Large Information Cloud Engineer ( ETL ) specialised in AWS Glue. He’s enthusiastic about serving to clients constructing out scalable distributed ETL workloads and analytics pipelines on AWS.

Omar Elkharbotly is a Glue SME who works as Large Information Cloud Help Engineer 2 (DIST). He’s devoted to helping clients in resolving points associated to their ETL workloads and creating scalable information processing and analytics pipelines on AWS.