In the earlier posts we saw an overview of serverless computing with AWS and some tips on configuring Lambda. In this post, we move on to share some of the challenges that we faced and resolved. The reference application is a real-time data processing application that used serverless computing as part of its architecture; a brief overview of this architecture was given in the earlier blog post.
DEBUGGING/ERROR HANDLING
Debugging
Logging
Problem: Writing to logs is one of the most common ways of debugging. The first thing that comes to mind is to use the logger provided by AWS in the Lambda context. However, the challenge is that the Lambda context logger does not support log levels.
Solution: Switch to the log4j logger and set up its configuration file. Specify the log level when writing to the log. Build code with the log level set to INFO for dev and QA environments; when deploying to production, build with the log level set to ERROR.
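As a starting point, here is a minimal log4j2 configuration sketch for the approach above. The pattern and logger names are illustrative; Lambda captures anything written to standard output into CloudWatch Logs, so a plain console appender is enough (AWS also ships a dedicated Lambda appender in the aws-lambda-java-log4j2 artifact, if you prefer that).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
  <Appenders>
    <Console name="Console" target="SYSTEM_OUT">
      <PatternLayout pattern="%d{ISO8601} [%t] %-5level %c - %msg%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <!-- INFO for dev/QA builds; switch to ERROR for the production build -->
    <Root level="INFO">
      <AppenderRef ref="Console"/>
    </Root>
  </Loggers>
</Configuration>
```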
DLQ
Problem: If there is a lot of incoming data, there may be too many log groups and streams to check for errors.
Solution: Use a Dead Letter Queue for errors that need to be flagged and looked at immediately. Examine this queue regularly or set up an alarm to notify you whenever there are messages in this queue.
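As a configuration sketch, the DLQ and the alarm on it can be set up with the AWS CLI roughly as follows. The function name, queue name, ARNs, and account ID are illustrative.

```shell
# Attach an SQS queue as the Lambda's dead-letter queue
aws lambda update-function-configuration \
  --function-name my-processor \
  --dead-letter-config TargetArn=arn:aws:sqs:us-east-1:123456789012:my-dlq

# Alarm whenever the DLQ has any visible messages
aws cloudwatch put-metric-alarm \
  --alarm-name dlq-not-empty \
  --namespace AWS/SQS --metric-name ApproximateNumberOfMessagesVisible \
  --dimensions Name=QueueName,Value=my-dlq \
  --statistic Maximum --period 300 --evaluation-periods 1 \
  --threshold 0 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts
```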
Unhandled Errors
Problem: All errors unhandled by your code are passed to AWS for handling. This poses some questions: How do you find out about occurrences of these errors? What types of errors are these? And what does AWS do with them?
Some types of unhandled errors are:
Uncaught exceptions
If there are any uncaught exceptions in your code, these will be passed on to AWS for handling.
Solution: To avoid this, it is best to have a top-level try/catch block where you can decide what to do with the exception (log it, send it to a DLQ, etc.).
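A minimal sketch of this pattern is below. The real handler would implement the RequestHandler interface from aws-lambda-java-core; here a plain method and all the names stand in for it so the example is self-contained.

```java
import java.util.logging.Logger;

// Sketch: a top-level try/catch so no exception escapes to the AWS
// runtime as an unhandled error. Names are illustrative.
public class SafeHandler {
    private static final Logger LOG = Logger.getLogger(SafeHandler.class.getName());

    // Stand-in for the real handleRequest(input, context) entry point.
    public String handleRequest(String input) {
        try {
            return process(input);
        } catch (Exception e) {
            // Decide here what to do: log it, push it to a DLQ,
            // return a fallback value, etc.
            LOG.severe("Processing failed: " + e.getMessage());
            return "ERROR";
        }
    }

    // Stand-in for the real business logic, which may throw.
    private String process(String input) {
        if (input == null) {
            throw new IllegalArgumentException("null input");
        }
        return input.toUpperCase();
    }
}
```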
Timeout
Another error that you could run into is the Lambda timing out before your code finishes executing.
Solution: First, to catch these errors, check the “Errors” metric available in CloudWatch for Lambda functions. Next, analyze the code to see how to improve its performance. If the performance cannot be improved, then perhaps Lambda is not the right choice for this piece of code.
UPDATE: Recently, AWS has increased the Lambda timeout to 15 minutes (from the earlier 5 minutes).
AWS handling of Lambda errors varies depending on the source of the event that triggered the Lambda. Here are a couple of examples:
Kinesis as trigger
If Lambda encounters an unhandled error, the record is not removed from the Kinesis stream; Lambda keeps retrying it until it succeeds or the record reaches its expiry time. With stream-based polling like this, the other records in the stream will not get picked up by the Lambda until that first record is processed successfully. This means the stream stays backed up until the errant record expires and is discarded.
Solution: First, to detect this problem, check the “Get Records Iterator Age (Milliseconds)” metric in the Kinesis stream monitoring. This indicates the age of the record currently being processed (that is, how long it has been in the stream). A continuously high value here means that a record is failing to get processed and is blocking the rest of the records in the stream. Here are some possible options to handle this:
Solution: If possible, figure out why the record processing is failing (using the logs or DLQ that you have set up), fix it, and redeploy so that processing succeeds and the rest of the items in the stream get picked up.
Solution: Wait for the record to expire and get discarded. This can be a bad option if the retention period for the records is long – the stream will remain blocked for that long and no other records will get processed until then.
Solution: If you can afford to lose the records currently in the stream, reconfigure your Kinesis stream trigger with the “Starting Position” set to “Latest”. This skips everything that is currently in the stream and resumes processing from newly arriving records only.
SQS as trigger
Here, if the Lambda has an error, the message becomes visible on the queue again after the visibility timeout and is retried; once it exceeds the queue’s maximum receive count, it is discarded.
Solution: If your application cannot afford to lose this event, make sure to configure a dead-letter queue via a redrive policy on the SQS queue itself. The failed message will be moved there, and you can inspect it there.
PERFORMANCE
Problem: Avoid Lambda timeout errors, and process as many records in a single batch as fast as possible.
Solution:
- As a first step, review your code to optimize it with basic good coding principles. This includes looking at loops, data structures, etc.
- Making connections to a database, cache, or Elasticsearch cluster is generally an expensive operation. There are two options to consider: either move the database/cache-related code out to a server, or make the connection in the handler class constructor so that it is reused. However, a connection made in the constructor will be used across multiple executions of the Lambda – which means it needs to be thread safe.
- If the trigger supports handing multiple records to the Lambda, then it is essential to set an optimal batch size for the trigger. If the code has a thread-safe client for database/cache/ES connections initialized in the constructor, then it is ideal to process as many records as possible in a loop using this same client. To configure this, first monitor your Lambda to see how long one record takes to process. Then set the Lambda timeout to the maximum. Based on these two values, calculate how many records can be safely processed within this timeout period, and set the trigger batch size to this number.
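The two ideas above can be sketched as follows. The nested Client class stands in for a real database/cache/ES client (e.g. a Redis or Elasticsearch client), and all names and numbers are illustrative.

```java
import java.util.List;

// Sketch: a client created once in the constructor is reused across the
// Lambda's warm invocations, and the trigger batch size is derived from
// the measured per-record processing time.
public class BatchProcessor {
    // Stand-in for an expensive connection-backed client. Since it is
    // reused, it must be thread safe in real code.
    static class Client {
        Client() { /* expensive connection setup would happen here */ }
        boolean write(String record) { return record != null; }
    }

    private final Client client;

    public BatchProcessor() {
        this.client = new Client(); // created once per container
    }

    // Process a whole batch with the single shared client.
    public int handleBatch(List<String> records) {
        int processed = 0;
        for (String r : records) {
            if (client.write(r)) processed++;
        }
        return processed;
    }

    // Largest batch that fits in the timeout, with a safety margin.
    // timeoutMs: the configured Lambda timeout;
    // perRecordMs: measured time to process one record.
    public static int safeBatchSize(long timeoutMs, long perRecordMs, long marginMs) {
        long budget = timeoutMs - marginMs;
        if (budget <= 0 || perRecordMs <= 0) return 1;
        return (int) Math.max(1L, budget / perRecordMs);
    }
}
```

For example, with the 15-minute maximum timeout (900,000 ms), 250 ms per record, and a 30-second margin, this yields a batch size of 3480.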
COMPUTE
Concurrency – In the earlier post we saw an example of using the Lambda concurrency limit. Here is another example:
Problem: If there is a database connection being made in Lambda, upon scaling up, there might be too many simultaneous requests for connections to the database, which the database cannot handle.
Solution: Set the Lambda concurrency to X, where X is the number of simultaneous active connections that your database can handle. Do note, however, that although this ensures you will not run into database connection errors, this Lambda will be limited to scaling to X concurrent executions. Make sure this is enough to handle the incoming load to this Lambda.
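As a configuration sketch, the concurrency cap can be applied with the AWS CLI roughly as follows (function name and value are illustrative):

```shell
# Reserve 25 concurrent executions for the function, so it never opens
# more than 25 simultaneous database connections
aws lambda put-function-concurrency \
  --function-name my-db-writer \
  --reserved-concurrent-executions 25
```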
Maintaining state
Problem: Lambda is stateless.
Solution: If your code requires you to maintain state across multiple executions of this Lambda, or even across other Lambda functions, you can achieve this using a cache.
Configuration
Problem: To have an easily editable configuration file that does not require recompilation/deployment of code.
Solution: In an earlier post we saw that environment variables can be used to store some configuration parameters. If these do not suffice, consider keeping text files in S3 buckets. These can be loaded into a cache for faster lookup. You can also write a separate Lambda function that is triggered by changes to these S3 files and updates the configuration in the cache.
The cache implementation can be of your choice. However, the ElastiCache (Redis) service provided by AWS is a good choice, considering that it integrates most easily with your Lambda. Also, AWS has designed ElastiCache to be accessible only from within your VPC. This makes it a secure implementation, as only resources within your VPC can access this cache.
Duplicates on SQS
Problem: Even after you read a record from SQS, process it, and delete it, it may be delivered again. This at-least-once behavior of standard queues is documented by AWS.
Solution: Your code must be able to handle duplicate messages. You cannot rely on the message ID given by AWS, as it changes when the same message goes back onto the queue. This means you must create a way to uniquely identify each of your records, and then make sure you do not process the same one twice. One easy way to track the records is to use the cache: create an entry in the cache for each record processed, with a suitable expiry/eviction time. Before processing a record, check the cache for its entry – if it is there, discard the record, as it has been recently processed.
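The dedup check can be sketched as below. A map of key to timestamp stands in for a cache entry with a TTL (e.g. Redis SET with EX); time is passed in explicitly so the logic is easy to follow, and all names are illustrative.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: duplicate detection using a cache with per-entry expiry.
public class Deduplicator {
    private final Map<String, Long> seen = new ConcurrentHashMap<>();
    private final long ttlMs; // how long a processed record is remembered

    public Deduplicator(long ttlMs) { this.ttlMs = ttlMs; }

    // Returns true if the record should be processed, false if it is a
    // recently-seen duplicate. recordKey must be an ID your producer
    // assigns, not the SQS message ID.
    public boolean shouldProcess(String recordKey, long nowMs) {
        Long firstSeen = seen.get(recordKey);
        if (firstSeen != null && nowMs - firstSeen < ttlMs) {
            return false; // duplicate within the TTL window
        }
        seen.put(recordKey, nowMs);
        return true;
    }
}
```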
HIGH COSTS
Some areas where costs can be reduced/optimized:
CloudWatch Logs
This has two cost factors:
- The amount of data stored – Reduce the period for which you store the logs. By default, the Lambda log groups are created with the “Never Expire” option – which means that all your logs are stored indefinitely. Change this to a value that works for you.
- The amount of data ingested – This cost is much higher than the one above, so do not write too much to the logs. To limit this, reduce the amount of data written by using the log4j logger with the level set to ERROR for production environments, as discussed above.
CloudWatch alarms
Setting up alarms with monitoring at 1-minute intervals is three times more costly than monitoring at 5-minute intervals, so stick to the default of 5 minutes unless required otherwise.
Kinesis shards
With a large incoming load of data, the first reaction can be to increase the number of Kinesis shards, because a Lambda triggered by Kinesis scales only up to the number of shards. However, more shards mean more cost, as you pay by the shard-hour. To reduce the number of Kinesis shards you need, your Lambda code must be performance efficient. This was discussed in the Performance section earlier. To recap:
- One, improve the efficiency/performance of your Lambda function as much as possible by inspecting/monitoring the code.
- Two, increase the timeout of the Lambda function to the maximum, and
- Three, increase the batch size of the Kinesis trigger to the maximum possible that can be handled by your code in this maximum timeout period.
SQS
- FIFO queues are slightly costlier than the standard ones, so use these only if there is a requirement for FIFO.
- Limit the number of messages on SQS. If there are a lot of errors in your DLQ, fix the underlying problems ASAP so fewer messages land on the queue.
Lambda cost
This is calculated by multiplying the following two values; keeping each to a minimum will bring your cost down.
- The actual time taken for your function to execute, rounded up to the next 100 milliseconds. To reduce this, make sure your code is performance efficient.
- The memory you have allocated to your Lambda in its configuration. This cannot be reduced arbitrarily, as it also determines the CPU power allocated to your Lambda function. It is possible that reducing it will make your code take longer to execute, thereby not really reducing the cost. To find the optimal value, reduce it a little at a time, and at every step monitor the execution time and the total Lambda cost. You may have to iterate several times to arrive at the right value.
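The cost formula above can be sketched as follows. The per-GB-second price is illustrative (check current AWS pricing), and the 100 ms rounding reflects the billing granularity discussed here.

```java
// Sketch of the Lambda cost calculation: GB-seconds = memory (in GB)
// multiplied by the billed duration, where the duration is rounded UP
// to the next 100 ms.
public class LambdaCost {
    // Illustrative rate; verify against the current AWS price list.
    static final double PRICE_PER_GB_SECOND = 0.0000166667;

    // Billed duration: actual time rounded up to the next 100 ms.
    public static long billedMs(long actualMs) {
        return ((actualMs + 99) / 100) * 100;
    }

    public static double costPerInvocation(long actualMs, int memoryMb) {
        double gbSeconds = (memoryMb / 1024.0) * (billedMs(actualMs) / 1000.0);
        return gbSeconds * PRICE_PER_GB_SECOND;
    }
}
```

For example, a 101 ms execution is billed as 200 ms, so shaving a few milliseconds can drop an invocation into a cheaper 100 ms bucket.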
Goodbye
Hope you enjoyed reading this blog series. If you have any comments, suggestions, or questions, please get in touch at blogs@indexnine.com.
Further reading
https://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html
http://blog.epsagon.com/how-to-handle-aws-lambda-errors-like-a-pro
https://hackernoon.com/tips-and-tricks-for-logging-and-monitoring-aws-lambda-functions-885af6da29a5
https://dashbird.io/blog/how-to-optimize-aws-lambda-cost-with-examples/