Creating a serverless pipeline with AWS CDK and AWS Lambda in Python lets you build event-driven applications that scale easily, without worrying about the underlying infrastructure. This article describes, step by step, how to create and set up such a pipeline using AWS CDK and a Python Lambda function, with Visual Studio Code (VS Code) as the IDE.
Completing this guide enables the deployment of a fully working AWS Lambda function with AWS CDK.
Understanding Serverless Architecture and Its Benefits
A serverless architecture is a cloud computing paradigm in which developers write their code as functions that execute in response to an event or request, without any server provisioning or management. Execution and resource allocation are handled automatically by the cloud provider – in this instance, AWS.
Key Characteristics of Serverless Architecture:
- Event-Driven: Functions are triggered by events such as S3 uploads, API calls, or other AWS service actions.
- Automatic Scaling: The platform automatically scales based on workload, handling high traffic without requiring manual intervention.
- Cost Efficiency: Users pay only for the compute time used by the functions, making it cost-effective, especially for workloads with varying traffic.
Benefits:
Serverless architecture comes with numerous advantages that are beneficial for modern applications in the cloud. One of the most notable benefits of serverless architecture is improved operational efficiency due to the lack of server configuration and maintenance. Developers are free to focus on building and writing code instead of worrying about managing infrastructure.
Serverless architecture also enables better workload management: automatic scaling lets serverless platforms adjust to changing workloads without human intervention, so traffic spikes are absorbed effortlessly. This adaptability maintains high performance and efficiency while minimizing costs and resource waste.
In addition, serverless architecture has proven to be financially efficient, allowing users to pay solely for the computing resources they utilize rather than for pre-purchased server capacity. This flexibility is advantageous for workloads with unpredictable or fluctuating demand. Finally, the ease of use provided by serverless architecture accelerates time to market: developers can rapidly build, test, and deploy applications without the tedious task of configuring infrastructure, leading to faster development cycles.
Understanding ETL Pipelines and Their Benefits
ETL (Extract, Transform, Load) pipelines automate the movement and transformation of data between systems. In the context of serverless, AWS services like Lambda and S3 work together to build scalable, event-driven data pipelines.
Key Benefits of ETL Pipelines:
- Data Integration: Combines disparate data sources into a unified system.
- Scalability: Services like AWS Glue and S3 scale automatically to handle large datasets.
- Automation: Use AWS Step Functions or Python scripts to orchestrate tasks with minimal manual intervention.
- Cost Efficiency: Pay-as-you-go pricing models for services like Glue, Lambda, and S3 optimize costs.
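To make the pattern concrete, here is a minimal, illustrative sketch of the three ETL stages in plain Python. The function names, sample records, and the "amount" field are hypothetical stand-ins for whatever sources and targets your pipeline actually uses.

# Minimal illustration of the three ETL stages (hypothetical data and names).
def extract():
    # Pull raw records from a source system (e.g. a file, API, or database).
    return [{"id": 1, "amount": "12.50"}, {"id": 2, "amount": "7.99"}]

def transform(records):
    # Clean and reshape the raw records into the target format.
    return [{"id": r["id"], "amount_cents": int(float(r["amount"]) * 100)} for r in records]

def load(records):
    # Write the transformed records to the destination (e.g. S3 or a warehouse).
    for record in records:
        print("loading", record)

if __name__ == "__main__":
    load(transform(extract()))

In the serverless version built below, Lambda plays the transform role and S3 acts as both the source and the destination.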
Tech Stack Used in the Project
For this serverless ETL pipeline, Python is the programming language of choice and Visual Studio Code serves as the IDE. The architecture is built around AWS services: AWS CDK for resource definition and deployment, Amazon S3 as the storage service, and AWS Lambda for running serverless functions. Together, these components form the building blocks of a robust, cost-effective, and scalable serverless data pipeline.
- Language: Python
- IDE: Visual Studio Code
- AWS Services:
  - AWS CDK: Infrastructure as Code (IaC) tool to define and deploy resources.
  - Amazon S3: Object storage for raw and processed data.
  - AWS Lambda: Serverless compute service to transform data.
Brief Description of Tools and Technologies:
- Python: A versatile programming language favored for its simplicity and vast ecosystem of libraries, making it ideal for Lambda functions and serverless applications.
- AWS CDK (Cloud Development Kit): An open-source framework that allows you to define AWS infrastructure in code using languages like Python. It simplifies the deployment of cloud resources.
- AWS Lambda: A serverless compute service that runs code in response to events. Lambda automatically scales and charges you only for the execution time of your function.
- Amazon S3: A scalable object storage service for storing and retrieving large amounts of data. In serverless pipelines, it acts as both a staging and final storage location for processed data.
Building the Serverless ETL Pipeline – Step by Step
In this tutorial, we’ll guide you through setting up a serverless pipeline using AWS CDK and AWS Lambda in Python. We’ll also use Amazon S3 to store data.
Step 1: Prerequisites
To get started, ensure you have the following installed on your local machine:
- Node.js (v18 or later)
- AWS CLI (latest version)
- Python 3.x (v3.9 or later)
- AWS CDK (latest version, installed via npm as shown below)
- Visual Studio Code
- AWS Toolkit for VS Code (Optional, but recommended for easy interaction with AWS)
Configure AWS CLI
To configure AWS CLI, open a terminal and run:
aws configure
Enter your AWS Access Key, Secret Access Key, default region, and output format when prompted.
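For reference, the interactive prompts look like this (the values shown here are placeholders, not real credentials):

aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: ****************************************
Default region name [None]: us-east-1
Default output format [None]: json

The access keys are saved to ~/.aws/credentials, and the region and output format to ~/.aws/config.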
Install AWS CDK
To install AWS CDK globally, run:
npm install -g aws-cdk
Verify the installation by checking the version:
cdk --version
Log in to AWS from Visual Studio Code
Click the AWS logo in the left-hand activity bar; the toolkit will ask for credentials the first time.
Use your IAM user name as the profile name.
After signing in, the AWS Toolkit panel appears in the IDE.
Step 2: Create a New AWS CDK Project
Open Visual Studio Code and create a new project directory:
mkdir serverless_pipeline_project
cd serverless_pipeline_project
Initialize the AWS CDK project with Python:
cdk init app --language python
This sets up a Python-based AWS CDK project with the necessary files.
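The generated layout looks roughly like this (exact files vary slightly between CDK versions, and the package directory is named after your project folder):

serverless_pipeline_project/
├── app.py                                    # CDK app entry point
├── cdk.json                                  # Tells the CDK CLI how to run the app
├── requirements.txt                          # Python dependencies for the stack
├── requirements-dev.txt                      # Dependencies for tests
├── serverless_pipeline_project/
│   └── serverless_pipeline_project_stack.py  # Stack definition (edited in Step 5)
└── tests/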
Step 3: Set Up a Virtual Environment
Create and activate a virtual environment to manage your project’s dependencies:
python3 -m venv .venv
source .venv/bin/activate   # For macOS/Linux
# OR
.venv\Scripts\activate      # For Windows
Install the project dependencies:
pip install -r requirements.txt
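The generated requirements.txt typically lists the two core CDK libraries; the exact pinned versions depend on the CDK release you initialized with:

aws-cdk-lib>=2.0.0            # core CDK v2 library (the generated file pins an exact version)
constructs>=10.0.0,<11.0.0    # base construct classes used by CDK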
Step 4: Define the Lambda Function
Create a directory for the Lambda function:
mkdir lambda
Write your Lambda function in lambda/handler.py:
import boto3
import os

s3 = boto3.client('s3')
bucket_name = os.environ['BUCKET_NAME']

def handler(event, context):
    # Example: Upload processed data to S3
    s3.put_object(Bucket=bucket_name, Key='output/data.json', Body='{"result": "ETL complete"}')
    return {"statusCode": 200, "body": "Data successfully written to S3"}
Step 5: Define AWS Resources in AWS CDK
In serverless_pipeline_project/serverless_pipeline_project_stack.py, define the Lambda function and the S3 bucket for data storage:
from aws_cdk import (
    Stack,
    aws_lambda as _lambda,
    aws_s3 as s3
)
from constructs import Construct

class ServerlessPipelineProjectStack(Stack):

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Create an S3 bucket
        bucket = s3.Bucket(self, "ServerlessPipelineProjectS3Bucket")

        # Create a Lambda function
        lambda_function = _lambda.Function(
            self,
            "ServerlessPipelineProjectLambdaFunction",
            runtime=_lambda.Runtime.PYTHON_3_9,
            handler="handler.handler",
            code=_lambda.Code.from_asset("lambda"),
            environment={
                "BUCKET_NAME": bucket.bucket_name
            }
        )

        # Grant Lambda permissions to read/write to the S3 bucket
        bucket.grant_read_write(lambda_function)
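Two optional additions make the stack more pipeline-like; neither is part of the original stack above, so treat them as extensions. First, an S3 event notification can invoke the Lambda function whenever a raw object lands under an input/ prefix (the prefix is an assumption for illustration). Second, CfnOutput entries print the generated bucket and function names after deployment, which is handy because CDK appends a unique suffix to physical resource names. A sketch of both, with placement noted in the comments:

# Additional imports at the top of serverless_pipeline_project_stack.py:
from aws_cdk import CfnOutput
from aws_cdk import aws_s3_notifications as s3n

# Appended at the end of __init__, after bucket.grant_read_write(lambda_function):
bucket.add_event_notification(
    s3.EventType.OBJECT_CREATED,
    s3n.LambdaDestination(lambda_function),
    s3.NotificationKeyFilter(prefix="input/")
)

# Print the physical names after deployment so they can be used with the AWS CLI.
CfnOutput(self, "BucketName", value=bucket.bucket_name)
CfnOutput(self, "FunctionName", value=lambda_function.function_name)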
Step 6: Bootstrap and Deploy the AWS CDK Stack
Before deploying the stack, bootstrap your AWS environment:
cdk bootstrap
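Bootstrapping creates a small CDKToolkit CloudFormation stack containing the staging bucket and roles that CDK needs to deploy assets. If you work with multiple accounts or regions, you can bootstrap a specific environment explicitly (ACCOUNT-ID and REGION are placeholders):

cdk bootstrap aws://ACCOUNT-ID/REGION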
Then, synthesize and deploy the CDK stack:
cdk synth
cdk deploy
You’ll see a message confirming the deployment.
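The tail of a successful deploy looks roughly like this; the exact values are account-specific, and the Outputs section only appears if you added the CfnOutput entries from Step 5:

✅  ServerlessPipelineProjectStack

Outputs:
ServerlessPipelineProjectStack.BucketName = <generated-bucket-name>
ServerlessPipelineProjectStack.FunctionName = <generated-function-name>
Stack ARN:
arn:aws:cloudformation:<region>:<account-id>:stack/ServerlessPipelineProjectStack/<stack-id>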
Step 7: Test the Lambda Function
Once deployed, test the Lambda function using the AWS CLI. Because CDK derives the physical function name from the stack name, the construct ID, and a unique suffix, copy the exact name from the deploy output (for example the FunctionName output, if you added it) or from the Lambda console, and substitute it below:
aws lambda invoke --function-name <generated-function-name> output.txt
You should see a response like:
{
"StatusCode": 200,
"ExecutedVersion": "$LATEST"
}
Check the output.txt file; it will contain:
{"statusCode": 200, "body": "Data successfully written to S3"}
A folder named output will be created in the S3 bucket, containing a file data.json with:
{"result": "ETL complete"}
Step 8: Clean Up Resources (Optional)
To delete all deployed resources and avoid AWS charges, run:
cdk destroy
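Note that the S3 bucket is retained by default: CDK applies a RETAIN removal policy to buckets, and CloudFormation will not delete a non-empty bucket. If you want cdk destroy to remove the bucket and its contents as well, you could create it as sketched below; this is a convenience for demo stacks, not something to use with production data:

# Additional import at the top of serverless_pipeline_project_stack.py:
from aws_cdk import RemovalPolicy

# Replaces the plain bucket definition from Step 5:
bucket = s3.Bucket(
    self,
    "ServerlessPipelineProjectS3Bucket",
    removal_policy=RemovalPolicy.DESTROY,   # delete the bucket when the stack is destroyed
    auto_delete_objects=True                # empty the bucket first so deletion succeeds
)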
Summary of What We Built
For this project, we configured AWS CDK within a Python environment to create and manage the infrastructure needed for a serverless ETL pipeline. The processing unit of the pipeline is an AWS Lambda function that we developed to transform the data. We added Amazon S3 as a scalable, durable storage location for raw and processed data, and deployed the required AWS resources with AWS CDK, which automated the deployment process. Finally, we confirmed that the setup worked as expected by invoking the Lambda function and verifying that data flowed correctly through the pipeline.
Next Steps
In the future, I see multiple opportunities to improve and extend this serverless pipeline. One improvement would be to use AWS Glue for data transformation, since it can automate and scale complicated ETL processes. Integrating Amazon Athena would enable serverless querying of the processed data, allowing for efficient analytics and reporting. Amazon QuickSight could then be used for data visualization, enhancing the insights obtained from the data and letting users explore it on interactive dashboards. These steps build on the foundation laid here and would create a more comprehensive and sophisticated data pipeline.
By following this tutorial, you’ve laid the foundation for building a scalable, event-driven serverless pipeline in AWS using Python. Now, you can further expand the architecture based on your needs and integrate more services to automate and scale your workflows.

Author: Ashis Chowdhury, a Lead Software Engineer at Mastercard with over 22 years of experience designing and deploying data-driven IT solutions for top-tier firms including Tata, Accenture, Deloitte, Barclays Capital, Bupa, Cognizant, and Mastercard.