Apply CI/CD DevOps principles to Amazon Redshift development

CI/CD in the context of application development is a well-understood topic, and developers can choose from numerous patterns and tools to build their pipelines to handle the build, test, and deploy cycle when a new commit gets into version control. For stored procedures or even schema changes that are directly related to the application, this is typically part of the code base and is included in the code repository of the application. These changes are then applied when the application gets deployed to the test or prod environment.

This post demonstrates how you can apply the same set of approaches to stored procedures, and even schema changes, to data warehouses like Amazon Redshift.

Stored procedures are considered code and as such should undergo the same rigor as application code. This means that the pipeline should involve running tests against changes to make sure that no regressions are introduced to the production environment. Because we automate the deployment of both stored procedures and schema changes, this significantly reduces inconsistencies between environments.

Solution overview

The following diagram illustrates our solution architecture. We use AWS CodeCommit to store our code, AWS CodeBuild to run the build process and test environment, and AWS CodePipeline to orchestrate the overall deployment, from source, to test, to production.

Database migrations and tests also require connection information to the relevant Amazon Redshift cluster; we demonstrate how to integrate this securely using AWS Secrets Manager.

We discuss each service component in more detail later in this post.

You can see how all these components work together by completing the following steps:

  1. Clone the GitHub repo.
  2. Deploy the AWS CloudFormation template.
  3. Push code to the CodeCommit repository.
  4. Run the CI/CD pipeline.

Clone the GitHub repository

The CloudFormation template and the source code for the example application are available in the GitHub repo. Before you get started, you need to clone the repository using the following command:

git clone https://github.com/aws-samples/amazon-redshift-devops-blog

This creates a new folder, amazon-redshift-devops-blog, with the files inside.

Deploy the CloudFormation template

The CloudFormation stack creates the VPC, Amazon Redshift clusters, CodeCommit repository, CodeBuild projects for both test and prod, and the pipeline using CodePipeline to orchestrate the change release process.

  1. On the AWS CloudFormation console, choose Create stack.
  2. Choose With new resources (standard).
  3. Select Upload a template file.
  4. Choose Choose file and locate the template file (<cloned_directory>/cloudformation_template.yml).
  5. Choose Next.
  6. For Stack name, enter a name.
  7. In the Parameters section, provide the primary user name and password for both the test and prod Amazon Redshift clusters.

The user name must be 1–128 alphanumeric characters, and it can't be a reserved word.

The password has the following criteria:

  • Must be 8–64 characters
  • Must contain at least one uppercase letter
  • Must contain at least one lowercase letter
  • Must contain at least one number
  • Can only contain ASCII characters (ASCII codes 33–126), except ' (single quotation mark), " (double quotation mark), \, /, or @

Please note that production credentials could be created separately by privileged admins, and you could pass in the ARN of a pre-existing secret instead of the actual password if you so choose.

  1. Choose Next.
  2. Leave the remaining settings at their default and choose Next.
  3. Select I acknowledge that AWS CloudFormation might create IAM resources.
  4. Choose Create stack.

You can choose the refresh icon on the stack's Events page to track the progress of the stack creation.

Push code to the CodeCommit repository

When stack creation is complete, go to the CodeCommit console. Locate the redshift-devops-repo repository that the stack created. Choose the repository to view its details.

Before you can push any code into this repo, you have to set up your Git credentials using the instructions here: Setup for HTTPS users using Git credentials. At Step 4 of the Setup for HTTPS users using Git credentials, copy the HTTPS URL, and instead of cloning, add the CodeCommit repo URL to the code that we cloned earlier:

git remote add codecommit <repo_https_url> 
git push codecommit main

The last step populates the repository; you can verify it by refreshing the CodeCommit console. If you get prompted for a user name and password, enter the Git credentials that you generated and downloaded from Step 3 of the Setup for HTTPS users using Git credentials.

Run the CI/CD pipeline

After you push the code to the CodeCommit repository, this triggers the pipeline to deploy the code into both the test and prod Amazon Redshift clusters. You can monitor the progress on the CodePipeline console.

To dive deeper into the progress of the build, choose Details.

You're redirected to the CodeBuild console, where you can see the run logs as well as the result of the tests.

Components and dependencies

Although from a high-level perspective the test and prod environments look the same, there are some nuances with regard to how these environments are configured. Before diving deeper into the code, let's look at the components first:

  • CodeCommit – This is the version control system where you store your code.
  • CodeBuild – This service runs the build process and tests using Maven.
    • Build – During the build process, Maven uses Flyway to connect to Amazon Redshift to determine the current version of the schema and what needs to be run to bring it up to the latest version (the sketch after this list illustrates this step).
    • Test – In the test environment, Maven runs JUnit test cases against the test Amazon Redshift cluster. These tests may involve loading data and testing the behavior of the stored procedures. The results of the unit tests are published into the CodeBuild test reports.
  • Secrets Manager – We use Secrets Manager to securely store connection information to the various Amazon Redshift clusters. This includes host name, port, user name, password, and database name. CodeBuild refers to Secrets Manager for the relevant connection information when a build gets triggered. The underlying CodeBuild service role needs to have the corresponding permission to access the relevant secrets.
  • CodePipeline – CodePipeline is responsible for the overall orchestration from source to test to production.
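
To make the Build step above more concrete, the following minimal sketch uses Flyway's Java API to illustrate what the Maven plugin's migrate goal does: inspect the version the target cluster is currently at and apply any pending migrations. This is an illustration only; the pipeline in this post drives Flyway through the Maven plugin rather than this API, and the environment variables referenced here are the ones CodeBuild later populates from Secrets Manager (described in the CodeBuild section).

import org.flywaydb.core.Flyway;
import org.flywaydb.core.api.MigrationInfo;

public class FlywayMigrateSketch {
    public static void main(String[] args) {
        // Connection details assumed to come from the CodeBuild environment variables
        // that are populated from Secrets Manager.
        String url = String.format("jdbc:redshift://%s:%s/%s",
                System.getenv("TEST_HOST"),
                System.getenv("TEST_PORT"),
                System.getenv("TEST_DB_NAME"));

        Flyway flyway = Flyway.configure()
                .dataSource(url, System.getenv("TEST_JDBC_USER"), System.getenv("TEST_JDBC_PASSWORD"))
                .locations("classpath:db/migration") // where the versioned and repeatable scripts live
                .load();

        // Determine the version the cluster is currently at (null if nothing has been applied yet)...
        MigrationInfo current = flyway.info().current();
        System.out.println("Current version: " + (current == null ? "none" : current.getVersion()));

        // ...and run whatever migrations are needed to bring it up to the latest version.
        flyway.migrate();
    }
}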

As referenced in the components, we also use some additional dependencies at the code level:

  • Flyway – This framework is responsible for keeping different Amazon Redshift clusters in different environments in sync as far as schema and stored procedures are concerned.
  • JUnit – Unit testing framework written in Java.
  • Apache Maven – A dependency management and build tool. Maven is where we integrate Flyway and JUnit.

In the following sections, we dive deeper into how these dependencies are integrated.

Apache Maven

For Maven, the configuration file is pom.xml. For an example, you can check out the pom file from our demo app. The pertinent part of the XML is the build section:

<build>
        <plugins>
            <plugin>
                <groupId>org.flywaydb</groupId>
                <artifactId>flyway-maven-plugin</artifactId>
                <version>${flyway.version}</version>
                <executions>
                    <execution>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>migrate</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>${surefire.version}</version>
            </plugin>
        </plugins>
    </build>

This section describes two things:

  • By default, the Surefire plugin triggers during the test phase of Maven. The plugin runs the unit tests and generates reports based on the results of those tests. These reports are stored in the target/surefire-reports folder. We reference this folder in the CodeBuild section.
  • Flyway is triggered during the process-resources phase of Maven, and it triggers the migrate goal of Flyway. Looking at Maven's lifecycle, this phase is always triggered first and deploys the latest version of stored procedures and schemas to the test environment before running test cases.

Flyway

Changes to the database are called migrations, and these can be either versioned or repeatable. Developers define which type of migration they are creating through the naming convention used by Flyway to determine which one is which. The following diagram illustrates the naming convention. In brief, versioned migration files are prefixed with V (for example, V1__initial_tables.sql), their optional undo scripts with U, and repeatable migrations with R (for example, R__MergeStagedProducts.sql), with a double underscore separating the prefix from the description.

A versioned migration consists of the regular SQL script that is run and an optional undo SQL script to reverse that specific version. You have to create this undo script in order to enable the undo functionality for a specific version. For example, if a regular SQL script consists of creating a new table, the corresponding undo script consists of dropping that table. Flyway is responsible for keeping track of which version a database is currently at, and runs N number of migrations depending on how far back the target database is compared to the latest version. Versioned migrations are the most common use of Flyway and are primarily used to maintain table schema and keep reference or lookup tables up to date by running data loads or updates via SQL statements. Versioned migrations are applied in order exactly one time.

Repeatable migrations don't have a version; instead they're rerun every time their checksum changes. They're useful for maintaining user-defined functions and stored procedures. Instead of having multiple files to track changes over time, we can just use a single file, and Flyway keeps track of when to rerun the statement to keep the target database up to date.

By default, these migration files are located in the classpath under db/migration, the full path being src/main/resources/db/migration. For our example application, you can find the source code on GitHub.

JUnit

When Flyway finishes running the migrations, the test cases are run. These test cases are under the folder src/test/java. You can find examples on GitHub that run a stored procedure via JDBC and validate the output or the impact.
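
As a rough sketch of what such a test can look like (not the repository's actual test code), a JDBC-based JUnit test connects to the test cluster, calls a stored procedure, and asserts on the result. The procedure and table names here are hypothetical, and the Amazon Redshift JDBC driver is assumed to be on the test classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertTrue;

public class StoredProcedureSmokeTest {

    // Connection details come from the environment variables that CodeBuild
    // populates from Secrets Manager (see the CodeBuild section).
    private Connection connect() throws Exception {
        String url = String.format("jdbc:redshift://%s:%s/%s",
                System.getenv("TEST_HOST"),
                System.getenv("TEST_PORT"),
                System.getenv("TEST_DB_NAME"));
        return DriverManager.getConnection(url,
                System.getenv("TEST_JDBC_USER"),
                System.getenv("TEST_JDBC_PASSWORD"));
    }

    @Test
    public void procedurePopulatesTargetTable() throws Exception {
        try (Connection conn = connect(); Statement stmt = conn.createStatement()) {
            // Call the (hypothetical) stored procedure under test.
            stmt.execute("call refresh_product_summary()");

            // Validate the impact: the (hypothetical) target table should now contain rows.
            try (ResultSet rs = stmt.executeQuery("select count(*) from product_summary")) {
                assertTrue(rs.next() && rs.getInt(1) > 0);
            }
        }
    }
}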

Another aspect of unit testing to consider is how the test data is loaded and maintained in the test Amazon Redshift cluster. There are a couple of approaches to consider:

  • As per our example, we're packaging the test data as part of our version control and loading the data when the first unit test is run. The advantage of this approach is that you get flexibility of when and where you run the test cases. You can start with either a completely empty or partially populated test cluster and you end up with the right environment for the test case to run. Other advantages are that you can test data loading queries and have more granular control over the datasets that are being loaded for each test. The downside of this approach is that, depending on how big your test data is, it may add additional time for your test cases to complete.
  • Using an Amazon Redshift snapshot dedicated to the test environment is another way to manage the test data. With this approach, you have a couple more options:
    • Transient cluster – You can provision a transient Amazon Redshift cluster based on the snapshot when the CI/CD pipeline gets triggered. This cluster stops after the pipeline completes to save cost. The downside of this approach is that you have to factor in Amazon Redshift provisioning time in your end-to-end runtime.
    • Long-running cluster – Your test cases can connect to an existing cluster that is dedicated to running test cases. The test cases are responsible for making sure that data-related setup and teardown are done accordingly depending on the nature of the test that's running. You can use @BeforeAll and @AfterAll JUnit annotations to trigger the setup and teardown, respectively, as shown in the sketch following this list.
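
A minimal sketch of that setup and teardown, assuming a hypothetical S3 key and CSV file format and reusing the TEST_* environment variables described in the CodeBuild section, could look like the following; the @Test methods themselves would sit alongside these lifecycle methods:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;

public class ProductsStagingLifecycle {

    private static Connection conn;

    @BeforeAll
    static void loadTestData() throws Exception {
        String url = String.format("jdbc:redshift://%s:%s/%s",
                System.getenv("TEST_HOST"),
                System.getenv("TEST_PORT"),
                System.getenv("TEST_DB_NAME"));
        conn = DriverManager.getConnection(url,
                System.getenv("TEST_JDBC_USER"),
                System.getenv("TEST_JDBC_PASSWORD"));
        try (Statement stmt = conn.createStatement()) {
            // Stage the test dataset from the bucket provisioned for the pipeline.
            // The S3 key and CSV format are assumptions for illustration.
            stmt.execute(String.format(
                    "copy products_staging from 's3://%s/test-data/products_staging.csv' iam_role '%s' csv",
                    System.getenv("TEST_DATA_S3_BUCKET"),
                    System.getenv("TEST_REDSHIFT_IAM_ROLE")));
        }
    }

    // ... @Test methods that exercise the stored procedures go here ...

    @AfterAll
    static void tearDownTestData() throws Exception {
        try (Statement stmt = conn.createStatement()) {
            // Leave the long-running cluster in a known state for the next pipeline run.
            stmt.execute("truncate products_staging");
        }
        conn.close();
    }
}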

CodeBuild

CodeBuild provides an environment where all of these dependencies run. As shown in our architecture diagram, we use CodeBuild for both test and prod. The differences are in the actual commands that run in each of those environments. These commands are stored in the buildspec.yml file. In our example, we provide a separate buildspec file for test and a different one for prod. During the creation of a CodeBuild project, we can specify which buildspec file to use.

There are a few differences between the test and prod CodeBuild projects, which we discuss in the following sections.

Buildspec commands

In the test environment, we use mvn clean test and package the Surefire reports so the test results can be displayed via the CodeBuild console. In the prod environment, we just run mvn clean process-resources. The reason for this is that in the prod environment, we only need to run the Flyway migrations, which are hooked up to the process-resources Maven lifecycle phase, whereas in the test environment, we not only run the Flyway migrations, but also make sure that they didn't introduce any regressions by running test cases. These test cases might have an effect on the underlying data, which is why we don't run them against the production Amazon Redshift cluster. If you want to run the test cases against production data, you can use an Amazon Redshift production cluster snapshot and run the test cases against that.

Secrets via Secrets Manager

Both Flyway and JUnit need information to identify and connect to Amazon Redshift. We store this information in Secrets Manager. Using Secrets Manager has several benefits:

  • Secrets are encrypted automatically
  • Access to secrets can be tightly controlled via fine-grained AWS Identity and Access Management (IAM) policies
  • All activity with secrets is recorded, which enables easy auditing and monitoring
  • You can rotate secrets securely and safely without impacting applications

For our example application, we define the secret as follows:

{
  "username": "<Redshift username>",
  "password": "<Redshift password>",
  "host": "<Redshift hostname>",
  "port": <Redshift port>,
  "dbName": "<Redshift DB Title>"
}

CodeBuild is integrated with Secrets Manager, so we define the following environment variables as part of the CodeBuild project:

  • TEST_HOST: arn:aws:secretsmanager:<region>:<AWS Account Id>:secret:<secret name>:host
  • TEST_JDBC_USER: arn:aws:secretsmanager:<region>:<AWS Account Id>:secret:<secret name>:username
  • TEST_JDBC_PASSWORD: arn:aws:secretsmanager:<region>:<AWS Account Id>:secret:<secret name>:password
  • TEST_PORT: arn:aws:secretsmanager:<region>:<AWS Account Id>:secret:<secret name>:port
  • TEST_DB_NAME: arn:aws:secretsmanager:<region>:<AWS Account Id>:secret:<secret name>:dbName
  • TEST_REDSHIFT_IAM_ROLE: <ARN of IAM role> (This can be in plaintext and should be attached to the Amazon Redshift cluster)
  • TEST_DATA_S3_BUCKET: <bucket name> (This is where the test data is staged)

CodeBuild automatically retrieves the parameters from Secrets Manager and they're available in the application as environment variables. If you look at the buildspec_prod.yml example, we use the preceding variables to populate the Flyway environment variables and JDBC connection URL.

VPC configuration

For CodeBuild to be able to connect to Amazon Redshift, you need to configure which VPC it runs in. This includes the subnets and security group that it uses. The Amazon Redshift cluster's security group also needs to allow access from the CodeBuild security group.

CodePipeline

To bring all these components together, we use CodePipeline to orchestrate the flow from the source code through prod deployment. CodePipeline also has additional capabilities. For example, you can add an approval step between test and prod so a release manager can review the results of the tests before releasing the changes to production.

Example scenario

You can use tests as a form of documentation of the expected behavior of a function. To further illustrate this point, let's look at a simple stored procedure:

create or replace procedure merge_staged_products()
as $$
BEGIN
    update products set status='CLOSED' where product_name in (select product_name from products_staging) and status='ACTIVE';
    insert into products(product_name,price) select product_name, price from products_staging;
END;
$$ LANGUAGE plpgsql;

If you deployed the example app from the previous section, you can follow along by copying the stored procedure code and pasting it into src/main/resources/db/migration/R__MergeStagedProducts.sql. Save it and push the change to the CodeCommit repository by issuing the following commands (assuming that you're at the top of the project folder):

git add src
git commit -m "<commit message>"
git push codecommit main

After you push the changes to the CodeCommit repository, you can follow the progress of the build and test stages on the CodePipeline console.

We implement a basic Slowly Changing Dimension Type 2 approach in which we mark old data as CLOSED and append newer versions of the data. Although the stored procedure works as is, our test has the following expectations (the sketch after this list shows how they could be expressed as JUnit assertions):

  • The number of rows with CLOSED status in the products table needs to correspond to the number of duplicate entries in the staging table.
  • The products table has a close_date column that needs to be populated so we know when it was deprecated.
  • At the end of the merge, the staging table needs to be cleared for subsequent ETL runs.
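
A hedged sketch of how these expectations could be checked with JUnit and JDBC follows; it is not the repository's actual test class, and it assumes a freshly loaded test dataset with no pre-existing CLOSED rows in the products table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

public class MergeStagedProductsTest {

    private int scalar(Statement stmt, String sql) throws Exception {
        try (ResultSet rs = stmt.executeQuery(sql)) {
            rs.next();
            return rs.getInt(1);
        }
    }

    @Test
    public void mergeBehavesLikeScdType2() throws Exception {
        String url = String.format("jdbc:redshift://%s:%s/%s",
                System.getenv("TEST_HOST"),
                System.getenv("TEST_PORT"),
                System.getenv("TEST_DB_NAME"));
        try (Connection conn = DriverManager.getConnection(url,
                     System.getenv("TEST_JDBC_USER"), System.getenv("TEST_JDBC_PASSWORD"));
             Statement stmt = conn.createStatement()) {

            // Number of active products that also appear in staging (the duplicates).
            int duplicates = scalar(stmt,
                    "select count(distinct s.product_name) from products_staging s "
                    + "join products p on p.product_name = s.product_name and p.status = 'ACTIVE'");

            stmt.execute("call merge_staged_products()");

            // Expectation 1: the duplicates should now be marked CLOSED.
            assertEquals(duplicates, scalar(stmt,
                    "select count(*) from products where status = 'CLOSED'"));

            // Expectation 2: every CLOSED row should carry a close_date.
            assertEquals(0, scalar(stmt,
                    "select count(*) from products where status = 'CLOSED' and close_date is null"));

            // Expectation 3: the staging table should be cleared for the next ETL run.
            assertEquals(0, scalar(stmt, "select count(*) from products_staging"));
        }
    }
}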

The stored procedure will pass the first test, but fail the later tests. When we push this change to CodeCommit and the CI/CD process runs, we can see results like in the following screenshot.

The reports show that the second and third tests failed. Failed tests result in the pipeline stopping, which means these bad changes don't end up in production.

We can update the stored procedure and push the change to CodeCommit to trigger the pipeline again. The updated stored procedure is as follows:

create or replace procedure merge_staged_products()
as $$
BEGIN
    update products set status='CLOSED', close_date=CURRENT_DATE where product_name in (select product_name from products_staging) and status='ACTIVE';
    insert into products(product_name,price) select product_name, price from products_staging;
    truncate products_staging;
END;
$$ LANGUAGE plpgsql;

All the tests passed this time, which allows CodePipeline to proceed with deployment to production.

We used Flyway's repeatable migrations to make the changes to the stored procedure. Code is stored in a single file, and Flyway verifies the checksum of the file to detect any changes and reapplies the migration if the checksum is different from the one that's already deployed.

Clean up

After you're done, you need to tear down the environment to avoid incurring additional charges beyond your testing. Before you delete the CloudFormation stack, go to the Resources tab of your stack and make sure the two buckets that were provisioned are empty. If they're not empty, delete all the contents of the buckets.

Now that the buckets are empty, you can go back to the AWS CloudFormation console and delete the stack to complete the cleanup of all the provisioned resources.

Conclusion

Applying CI/CD principles in the context of Amazon Redshift stored procedures and schema changes greatly improves confidence when updates are deployed to production environments. Similar to CI/CD in application development, proper test coverage of stored procedures is paramount to capturing potential regressions when changes are made. This includes testing both success paths as well as all possible failure modes.

In addition, versioning migrations enables consistency across multiple environments and prevents issues arising from schema changes that aren't applied properly. This increases confidence when changes are being made and improves development velocity as teams spend more time developing functionality rather than hunting for issues due to environment inconsistencies.

We encourage you to try building a CI/CD pipeline for Amazon Redshift using the steps described in this blog.


About the Authors

Ashok Srirama is a Senior Solutions Architect at Amazon Web Services, based in Washington Crossing, PA. He specializes in serverless applications, containers, devops, and architecting distributed systems. When he's not spending time with his family, he enjoys watching cricket, and driving his bimmer.

Indira Balakrishnan is a Senior Solutions Architect in the AWS Analytics Specialist SA Team. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems using data-driven decisions. Outside of work, she volunteers at her kids' activities and spends time with her family.

Vaibhav Agrawal is an Analytics Specialist Solutions Architect at AWS. Throughout his career, he has focused on helping customers design and build well-architected analytics and decision support platforms.

Rajesh Francis is a Sr. Analytics Customer Experience Specialist at AWS. He specializes in Amazon Redshift and works with customers to build scalable analytic solutions.

Jeetesh Srivastva is a Sr. Manager, Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to implement scalable solutions using Amazon Redshift and other AWS Analytics services. He has worked to deliver on-premises and cloud-based analytic solutions for customers in banking and finance and hospitality industry verticals.