Introduction to Amazon Lambda, Layers and boto3 using Python3
A serverless approach for Data Scientists
Amazon Lambda is probably the most famous serverless service available today, offering low cost and requiring practically no infrastructure management. It provides a relatively simple and straightforward platform for implementing functions in different languages like Python, Node.js, Java, C# and many more.
Amazon Lambda can be tested through the AWS console or the AWS Command Line Interface. One of the main problems with Lambda is that it becomes tricky to set up as soon as your functions and triggers get more complex. The goal of this article is to present you with a digestible tutorial for configuring your first Amazon Lambda function with external libraries and doing something more useful than just printing "Hello world!".
We are going to use Python3, boto3 and a few more libraries loaded as Lambda Layers to help us achieve our goal: load a CSV file as a Pandas dataframe, do some data wrangling, and save the metrics and plots as report files in an S3 bucket. Although using the AWS console to configure your services is not the best practice on the cloud, we are going to show each step using the console, because it's more convenient for beginners to understand the basic structure of Amazon Lambda. I'm sure that after going through this tutorial you'll have a good idea of how to migrate part of your local data analysis pipelines to Amazon Lambda.
Setting our environment
Before we start messing around with Amazon Lambda, we should first set up our working environment. We first create a folder for the project (1) and a Python 3.7 environment using conda (you can also use pipenv) (2). Next, we create two folders, one to hold the Python scripts of your Lambda function, and one to build your Lambda Layers (3). We'll explain what Lambda Layers consist of later in the article. Finally, we create the folder structure to build Lambda Layers so they can be identified by Amazon Lambda (4). The folder structure we created is going to help you better understand the concept behind Amazon Lambda and also organize your functions and libraries.
# 1) Create project folder
mkdir medium-lambda-tutorial

# Change directory
cd medium-lambda-tutorial/

# 2) Create environment using conda
conda create --name lambda-tutorial python=3.7
conda activate lambda-tutorial

# 3) Create one folder for the layers and another for the
# lambda_function itself
mkdir lambda_function lambda_layers

# 4) Create the folder structure to build your lambda layer
mkdir -p lambda_layers/python/lib/python3.7/site-packages

tree .
.
├── lambda_function
└── lambda_layers
    └── python
        └── lib
            └── python3.7
                └── site-packages
Amazon Lambda Basic Structure
One of the main hurdles I encountered when implementing my first Lambda functions was understanding the file structure used by AWS to invoke scripts and load libraries. If you follow the default option 'Author from scratch' (Figure 1) when creating a Lambda function, you'll end up with a folder with the name of your function and a Python script named lambda_function.py inside it.
The lambda_function.py file has a very simple structure and the code is the following:
import json

def lambda_handler(event, context):
    # TODO implement
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
These 8 lines of code are key to understanding Amazon Lambda, so we are going through each line to explain it.
import json: You can import Python modules to use in your function, and AWS provides you with a list of Python libraries already available on Amazon Lambda, like json and many more. The problem starts when you need libraries that are not available (we will solve this problem later using Lambda Layers).

def lambda_handler(event, context): This is the main function Amazon Lambda is going to call when you run the service. It has two parameters, event and context. The first one is used to pass data that can be used by the function itself (more on this later), and the second provides runtime and metadata information.

# TODO implement: Here is where the magic happens! You can use the body of the lambda_handler function to implement any Python code you want.

return: This part of the function returns a default dictionary with statusCode equal to 200 and body with a "Hello from Lambda!" message. You can change this return later to any Python object that suits your needs.
Before running our first test, it's important to explain a key topic related to Amazon Lambda: triggers. Triggers are basically the ways in which you invoke your Lambda function. There are many ways to set up a trigger using events like adding a file to an S3 bucket, changing a value in a DynamoDB table or using an HTTP request through Amazon API Gateway. You can pretty much integrate your Lambda function to be invoked by a wide range of AWS services, and this is probably one of the main advantages offered by Lambda. One way to integrate Lambda with your Python code is to use boto3 to call your Lambda function, and that's the approach we are going to use later in this tutorial.
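As an illustration, if you chose an S3 trigger instead, the event object delivered to your function would carry the bucket and object key of the file that fired it. A minimal sketch of how you might read them (the structure below follows the standard S3 event notification):

def lambda_handler(event, context):
    # An S3 trigger delivers a list of records, one per object event
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    print(f'New object: s3://{bucket}/{key}')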
As you can see, the template structure offered by AWS is super simple, and you can test it by configuring a test event and running it (Figure 2).
As we didn't change anything in the code of the Lambda function, the test runs the process and we receive a green alert describing the successful event (Figure 3).
Figure 3 illustrates the layout of the Lambda invocation result. In the upper part you can see the dictionary contained in the return statement. Below that is the Summary section, where we can see some important metrics related to the Lambda function, like the Request ID, the duration of the function, the billed duration, and the amount of memory configured and used. We won't go deep into Amazon Lambda pricing, but it is important to know that it is charged based on:
- duration the function is running (rounded up to the nearest 100ms)
- the amount of memory/CPU used
- the number of requests (how many times you invoke your function)
- amount of data transferred in and out of Lambda
In general, it is really cheap to test and use it, so you probably won’t have billing problems when using Amazon Lambda for small workloads.
Another important detail related to pricing and performance is how CPU and memory are allocated: you choose the amount of memory for running your function, and "Lambda allocates CPU power linearly in proportion to the amount of memory configured".
At the bottom of Figure 3, you can see the Log output section, where you can check all the execution lines printed by your Lambda function. One great feature of Amazon Lambda is that it is integrated with Amazon CloudWatch, where you can find all the logs generated by your Lambda functions. For more details on monitoring execution and logs, please refer to Casey Dunham's great Lambda article.
We have covered the basic features of Amazon Lambda, so in the next sections we are going to increase the complexity of our task to show you a real-world use case, providing a few insights into how to run a serverless service on a daily basis.
Adding layers, expanding possibilities
One of the great things about using Python is the availability of a huge number of libraries that help you implement fast solutions without having to code all classes and functions from scratch. As mentioned before, Amazon Lambda offers a list of Python libraries that you can import into your function. The problem starts when you have to use libraries that are not available. One way to do it is to install the library locally inside the same folder as your lambda_function.py file, zip the files and upload them to your Amazon Lambda console. Installing libraries locally and uploading them every time you create a new Lambda function is a laborious and inconvenient process. To make your life easier, Amazon offers the possibility of uploading libraries as Lambda Layers: a file structure where you store your libraries, load them independently into Amazon Lambda, and use them in your code whenever needed. Once you create a Lambda Layer, it can be used by any other new Lambda function.
Going back to the first section, where we organized our working environment, we are going to use the folder structure created inside the lambda_layers folder to install one Python library locally: Pandas.
# Our current folder structure
.
├── lambda_function
└── lambda_layers
    └── python
        └── lib
            └── python3.7
                └── site-packages

# 1) Pip install Pandas locally
pip install pandas -t lambda_layers/python/lib/python3.7/site-packages/

# 2) Zip the lambda_layers folder
cd lambda_layers
zip -r pandas_lambda_layer.zip *
By using pip with the -t parameter we can specify where we want to install the libraries in our local folder (1). Next, we just need to zip the folder containing the libraries (2), and we have a file ready to be deployed as a Layer. It's important that you keep the folder structure we created at the beginning (python/lib/python3.7/site-packages/) so that Amazon Lambda can identify the libraries contained in your zipped package. Click on the option Layers on the left panel of your AWS Lambda console, then on the button 'Create Layer' to start a new one. Next we can specify the name, the description and the compatible runtimes (in our case, Python 3.7). Finally, we upload our zipped folder and create the Layer (Figure 4).
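If you prefer the command line over the console, the same Layer can be published with the AWS CLI; a sketch of the command (the layer name and description here are just examples):

aws lambda publish-layer-version \
    --layer-name pandas-lambda-layer \
    --description "Pandas for Python 3.7" \
    --zip-file fileb://pandas_lambda_layer.zip \
    --compatible-runtimes python3.7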
It takes less than a minute, and we have our Lambda Layer ready to be used in our code. Going back to the console of our Lambda function, we can specify which Layers we are going to use by clicking on the Layers icon, then on 'Add a layer' (Figure 5).

Next, we select the Layer we just created and its respective version (Figure 6). As you can see from Figure 6, AWS offers a Lambda Layer with Scipy and Numpy ready to be used, so you don't need to create new layers if the only libraries you need are these two.
After selecting our Pandas Layer, all we need to do is import it in our Lambda code as if it were an installed library.
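With the Layer attached, the import at the top of lambda_function.py looks no different from that of a locally installed package:

import pandas as pd  # resolved at runtime from the Pandas Lambda Layer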
Finally, let’s start coding!
Now that we have our environment and our Pandas Layer ready, we can start working on our code. As mentioned before, our goal is to create a Python3 local script (1) that can invoke a Lambda function using defined parameters (2) to perform a simple data analysis using Pandas on a CSV file located on S3 (3) and save the results back to the same bucket (4) (Figure 7).
To give Amazon Lambda access to our S3 buckets we can simply add a role to our function by going to the Execution role section on your console. Although AWS offers some role templates, my advice is to create a new role on the IAM console to specify exactly the permissions needed by your Lambda function (left panel of Figure 8).
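A minimal sketch of such a policy, granting only read and write access to the objects in the tutorial bucket (the bucket name below is a placeholder you should replace with your own):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::your-bucket-name/*"
        }
    ]
}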
We also changed the amount of memory available from 128MB to 1024MB, and the timeout from just 3 seconds to 5 minutes (right panel of Figure 8), to avoid running out of memory or hitting the timeout. Amazon Lambda limits the total amount of RAM to 3GB and the timeout to 15 minutes, so if you need to perform highly intensive tasks you might run into these limits. One solution is to chain multiple Lambdas with other AWS services to perform the steps of an analysis pipeline. Our idea is not to provide an exhaustive introduction to Amazon Lambda, so if you want to know more about it, please check out this article from Yi Ai.
Before showing the code, it's important to describe the dataset we are going to use in our small project. I chose the Fifa19 player dataset from Kaggle, a CSV file describing all the skills of the players present in the game (Table 1). It has 18,207 rows and 88 columns, and you can get information about the nationality, club, salary, skill level and many more features of each player. We downloaded the CSV file and uploaded it to our S3 bucket (renaming it fifa19_kaggle.csv).
So now we can focus on our code!
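Here is a minimal sketch of the Lambda function described in the walkthrough below. The helper write_dataframe_to_csv_on_s3 matches the name used in the text, while the exact event key names are assumptions for illustration:

import json
import boto3
import pandas as pd
from io import StringIO

def write_dataframe_to_csv_on_s3(dataframe, filename, bucket):
    # Write a Pandas DataFrame as a CSV file to an S3 bucket
    csv_buffer = StringIO()
    dataframe.to_csv(csv_buffer, index=False)
    s3_resource = boto3.resource('s3')
    s3_resource.Object(bucket, filename).put(Body=csv_buffer.getvalue())

def lambda_handler(event, context):
    try:
        # Variables passed in through the event object
        bucket_name = event.get('bucket_name')
        csv_name = event.get('csv_name')
        output_name = event.get('output_name')
        groupby_column = event.get('groupby_column')
        avg_column = event.get('avg_column')

        # Download the CSV file from S3 and load it as a DataFrame
        s3_client = boto3.client('s3')
        obj = s3_client.get_object(Bucket=bucket_name, Key=csv_name)
        df = pd.read_csv(obj['Body'])

        # Group by one column and take the mean of another
        df_groupby = (df.groupby(groupby_column)[avg_column]
                        .mean()
                        .reset_index()
                        .sort_values(by=avg_column, ascending=False))

        # Save the result back to the same bucket
        write_dataframe_to_csv_on_s3(df_groupby, output_name, bucket_name)

        return {'statusCode': 200,
                'body': json.dumps('Success!')}
    except Exception as e:
        print(e)
        return {'statusCode': 400,
                'body': json.dumps('Error, bad request!')}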
As we can see in the script above, the first lines just import libraries. With the exception of Pandas, all the other libraries are available for use without having to rely on Layers.
Next, we have a helper function called write_dataframe_to_csv_on_s3, used to save a Pandas DataFrame to a specific S3 bucket. We are going to use it to save the output DataFrame created during the analysis.
The other function in our code is the main lambda_handler, the one that is going to be called when we invoke the Lambda. The first assignments in lambda_handler are variables passed in through the event object.
Next, we use boto3 to download the CSV file from the S3 bucket and load it as a Pandas DataFrame.
We then use the groupby method on the DataFrame to aggregate by the grouping column and get the mean of the chosen numeric column.
Finally, we use the function write_dataframe_to_csv_on_s3 to save df_groupby to the specified S3 bucket, and return a dictionary with statusCode and body as keys.
As described before in the Amazon Lambda Basic Structure section, the event parameter is an object that carries the variables made available to the lambda_handler function, and we can define these variables when configuring the test event (Figure 9).
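A test event for this function might look like the following (the key names mirror the sketch above, and the bucket name is a placeholder):

{
    "bucket_name": "your-bucket-name",
    "csv_name": "fifa19_kaggle.csv",
    "output_name": "fifa19_report.csv",
    "groupby_column": "Club",
    "avg_column": "Overall"
}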
If we run the test using the correct values for the 5 keys of the test JSON, our Lambda function should process the CSV file from S3 and write the resulting CSV back to the bucket.
Although hardcoding the variables in the test event can demonstrate the concept of our Lambda code, it's not a practical way to invoke the function. To solve this, we are going to create a Python script (invoke_lambda.py) to invoke our Lambda function using boto3.
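Here is a minimal sketch of what invoke_lambda.py could look like; the argument order, the target function name and the credential key names are assumptions for illustration:

import sys
import json
import boto3

def invoke_lambda(payload, credentials):
    # Create a Lambda client using the credentials read from the JSON file
    client = boto3.client(
        'lambda',
        aws_access_key_id=credentials['aws_access_key_id'],
        aws_secret_access_key=credentials['aws_secret_access_key'],
        region_name=credentials['region_name'])

    # Invoke the function synchronously and wait for the response
    response = client.invoke(
        FunctionName='medium-lambda-tutorial',
        InvocationType='RequestResponse',
        Payload=json.dumps(payload))

    # The Payload field is a stream holding the dictionary returned by the Lambda
    print(json.loads(response['Payload'].read()))

if __name__ == '__main__':
    # Read the parameters from the command line
    bucket_name, csv_name, output_name, groupby_column, avg_column, credentials_file = sys.argv[1:7]

    # Variables made available to the Lambda through the event object
    payload = {'bucket_name': bucket_name,
               'csv_name': csv_name,
               'output_name': output_name,
               'groupby_column': groupby_column,
               'avg_column': avg_column}

    with open(credentials_file) as f:
        credentials = json.load(f)

    invoke_lambda(payload, credentials)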
We are going to use only three libraries: boto3, json and sys. At the top of the script we use sys.argv to access the parameters passed when running the script through the command line. Following the sketch above, an invocation looks like:

python3 invoke_lambda.py <bucket_name> <csv_name> <output_name> <groupby_column> <avg_column> aws_credentials.json
The last parameter (aws_credentials) we provide to invoke_lambda.py is a JSON file with our credentials to access AWS services. You can configure your credentials using the awscli or generate a secret key using IAM.
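Assuming the key names used in the sketch above, the credentials file would look like this (never commit this file to version control):

{
    "aws_access_key_id": "YOUR_ACCESS_KEY_ID",
    "aws_secret_access_key": "YOUR_SECRET_ACCESS_KEY",
    "region_name": "us-east-1"
}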
In our main function, invoke_lambda, we use the boto3 client to define access to Amazon Lambda. The next object, called payload, is a dictionary with all the variables we want to use inside our Lambda function. These are the Lambda variables that can be accessed using event.get('variable').
Finally, we simply call client.invoke() with the target Lambda function name, the invocation type, and the payload carrying the variables. The invocation type can be one of three: RequestResponse (the default), to "invoke the function synchronously. Keep the connection open until the function returns a response or times out"; Event, to call the Lambda asynchronously; or DryRun, when you need to validate user information. For our main purpose, we are going to use the default RequestResponse option to invoke our Lambda, as it waits for the Lambda process to return a response. As we defined a try/except structure in our Lambda function, if the process runs without errors it returns status code 200 with the message "Success!"; otherwise it returns status code 400 and the message "Error, bad request!".
When run with the right parameters, our local script invoke_lambda.py takes a few seconds to return a response. If the response is positive, with status code 200, you can check your S3 bucket for the report file generated by the Lambda function (Table 2). As we used the column "Club" to group by and "Overall" to get the mean, we are showing the 20 clubs with the highest average overall player skill level.
Final considerations
I hope this quick introduction (not so quick!) to Amazon Lambda helped you better understand the nuts and bolts of this serverless service, and that it helps you try different approaches in your Data Science projects. For more information about serverless architecture using AWS, please check out this great article from Eduardo Romero.