This story is about how to make your service reliable and cheap. It’s not going to be simple, but trust me, it’s worth your effort.

What does profiling mean?

So I hope that the next question just popped up…

Why do we profile?

Cost-optimal solution

  1. How long does our Lambda run, and how much memory does it need?
  2. Are we making unnecessary or expensive requests?
  3. How much does our infrastructure cost per month?
  4. How much does one request cost?

Service resilience

  1. Are there any memory leaks?
  2. Is the service scaling up and down?
  3. How quickly does it scale?
  4. Does the container work under load?
  5. Do we protect our legacy systems?
  6. How heavy a load do we put on the database?

Design your service like you’re on call: you don’t want your phone ringing at 2:30 am. You just don’t want that.

When should I profile my service?

  1. Will it cope with the load?
  2. What will be the cost of running your system?
  3. How will it work and communicate?

Here we come to the 3 pillars, the fundamentals for the code you’ll write very shortly.

Sequence diagram

The sequence diagram should also expose expensive requests, for example calls to external APIs such as Google Maps.

Architecture diagram

Cost calculation

The question is how small, not big, your infrastructure can be.

Example

First idea

Looks simple, but let’s think about the consequences:

  1. A customer has to wait for the request to be processed by the current system. This can result in a bad experience and a long response time.
  2. There’s no way to control the pressure on the current system. The more customers try to store their readings, the more requests go to the API, the current system and the database.
  3. Points of failure with an impact on the client: DB, Current System, API, Frontend applications.

Second idea

Looks better. What improvements can we see here?

  1. Response time: service doesn’t depend on the current “legacy” system. The customer gets a response just after publishing the message in the queue.
  2. If the Current System is down, we have a retry mechanism out of the box at the infrastructure level. When the Lambda fails to store a message it exits with a non-zero status, the message goes back to the queue, and the Lambda retries. One problem less to solve in the code! Even better, we can have a Dead Letter Queue and nice notifications.
  3. We have a funnel which gives us a way to control the pressure on our legacy system. By setting, for example, the Lambda concurrency we can allow more or fewer requests per second to reach our system.
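
To make the retry point more concrete, here is a minimal sketch of what such a Lambda could look like in Python. It assumes the standard SQS event shape (a `Records` list whose items carry the message in `body`). `make_handler` and `store_reading` are hypothetical names, and `store_reading` stands in for whatever call pushes the reading to the Current System. The only trick is that the handler does not catch the failure: the invocation fails, and the queue makes the message available again.

```python
import json
from typing import Any, Callable, Dict


def make_handler(store_reading: Callable[[Dict[str, Any]], None]):
    """Build a Lambda-style handler around a (hypothetical) store function."""

    def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
        stored = 0
        for record in event["Records"]:
            reading = json.loads(record["body"])
            # No try/except on purpose: if the Current System is down,
            # the exception fails the invocation and the message returns
            # to the queue, so the retry is handled by the infrastructure.
            store_reading(reading)
            stored += 1
        return {"stored": stored}

    return handler
```

With a Dead Letter Queue configured on the queue, messages that keep failing end up there instead of being retried forever.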

Can we improve something here? Of course! In this solution our clients are blind. Our solution is eventually consistent and they have no idea what’s going on with their request. Is it stored? Was there any issue? Let’s try to solve this issue.

Third approach

Here we go 😄

We’ve added additional persistence for the API (ReadingsDB). It stores information about the meter reading request. So when a user makes request 002, a document like this can be stored in ReadingsDB:

{
  "uuid": "4ce614d8-f215-4de5-8cd0-62e46df6e3b1",
  "customerId": "c0b141d9-2485-4e54-a3bf-783cbb53a903",
  "reading": 234,
  "status": {
    "state": "PENDING"
  }
}

This allows the Frontend application to present the current status of the request to the client. The Lambda can change that status using request 015. So after a while, the document can look like this:

{
  "uuid": "4ce614d8-f215-4de5-8cd0-62e46df6e3b1",
  "customerId": "c0b141d9-2485-4e54-a3bf-783cbb53a903",
  "reading": 234,
  "status": {
    "state": "STORED"
  }
}

or:

{
  "uuid": "4ce614d8-f215-4de5-8cd0-62e46df6e3b1",
  "customerId": "c0b141d9-2485-4e54-a3bf-783cbb53a903",
  "reading": 234,
  "status": {
    "state": "ERROR",
    "code": "E001",
    "reason": "Reading lower than the previous."
  }
}

And of course, the result can be presented to the Customer. Looks awesome, doesn’t it?
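
The status handling above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the rule that an `ERROR` status must carry a `code` and a `reason` is an assumption drawn from the example documents, not a documented contract.

```python
from typing import Any, Dict, Optional

# The three states that appear in the example documents.
ALLOWED_STATES = {"PENDING", "STORED", "ERROR"}


def new_reading_document(uuid: str, customer_id: str, reading: int) -> Dict[str, Any]:
    """Document stored by the API when the customer submits a reading."""
    return {
        "uuid": uuid,
        "customerId": customer_id,
        "reading": reading,
        "status": {"state": "PENDING"},
    }


def with_status(
    document: Dict[str, Any],
    state: str,
    code: Optional[str] = None,
    reason: Optional[str] = None,
) -> Dict[str, Any]:
    """Return a copy of the document with a new status (input is not mutated)."""
    if state not in ALLOWED_STATES:
        raise ValueError(f"unknown state: {state}")
    status: Dict[str, Any] = {"state": state}
    if state == "ERROR":
        # Assumption: an error always explains itself to the customer.
        if code is None or reason is None:
            raise ValueError("ERROR status needs a code and a reason")
        status["code"] = code
        status["reason"] = reason
    return {**document, "status": status}
```

The Lambda would call something like `with_status(document, "STORED")` after a successful write, and the Frontend only ever reads the `status` field.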

Benefits of the final approach

  1. Retry mechanism out of the box.
  2. Information about the status of the request is available to the Customer at any time.

And the final one: take a look at the sequence diagram and the number next to each request. It makes communication inside the team so much easier. In Jira tickets, bugs and issues, team members can refer directly to the request numbers.

Summary

The price of reliability is the pursuit of the utmost simplicity.
C.A.R. Hoare, Turing Award lecture

Architecture Diagram / Costs

Based on this diagram we can see that there is an additional component. For pricing, we need to take into consideration:

  1. ALB pricing
  2. ECS / Fargate task pricing
  3. Database (ReadingsDB): MongoDB Atlas pricing
  4. API Gateway pricing
  5. SQS pricing
  6. Lambda pricing

With these components in place, we also need to make some assumptions: the number of Fargate tasks that will be constantly running (minimum 2), vCPU/memory for the tasks, the number of LCUs for the ALB, memory and duration for Lambda, etc. There will also be costs related to CloudWatch, ECR, etc. However, once you have prepared the spreadsheet you can nicely see what the monthly cost of your infrastructure is and how it changes with the number of requests per month. It’s really worth doing!
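
If you prefer code to a spreadsheet, the same calculation can be sketched in Python. This is a simplified model under one assumption: each component's cost splits into a fixed monthly part (for example always-on Fargate tasks) and a part that grows linearly with the number of requests. The `Component` class and both functions are hypothetical, and the numbers you feed in have to come from the real pricing pages of the services you use.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Component:
    name: str
    fixed_per_month: float        # cost you pay even with zero traffic
    per_million_requests: float   # cost that scales with traffic


def monthly_cost(components: Iterable[Component], requests_per_month: int) -> float:
    """Total monthly cost of all components for a given traffic level."""
    return sum(
        c.fixed_per_month + c.per_million_requests * requests_per_month / 1_000_000
        for c in components
    )


def cost_per_request(components: Iterable[Component], requests_per_month: int) -> float:
    """Average cost of a single request at the given traffic level."""
    if requests_per_month <= 0:
        raise ValueError("requests_per_month must be positive")
    return monthly_cost(list(components), requests_per_month) / requests_per_month
```

Calling `monthly_cost` for a few traffic levels (say 1, 10 and 100 million requests) shows how quickly the fixed part stops mattering and the per-request part takes over.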

Be aware of the pricing of each component you use and each request you make. Check everywhere.

What next?

Wait, what? Do I need to profile tickets?

During this test, you should also observe the logs to identify any exceptions, suspicious log messages, lost requests or strange behaviours. The better you test it now, the fewer problems you will have in the future.

You can also take a look at the memory metrics. Are there any suspicious patterns? Is your container leaking?
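
One rough way to put a number on "is it leaking?" is to export the memory samples from your metrics and fit a straight line through them. The sketch below uses a plain least-squares slope. The function names and the threshold are hypothetical, and a real service with garbage collection produces a sawtooth pattern, so treat the result as a hint to look closer rather than a verdict.

```python
from typing import Sequence


def memory_growth_per_sample(samples: Sequence[float]) -> float:
    """Least-squares slope of memory usage over equally spaced samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    numerator = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_leak(samples: Sequence[float], threshold: float = 1.0) -> bool:
    """True when memory grows faster than `threshold` units per sample.

    The default threshold is an arbitrary placeholder; pick one that
    matches your sampling interval and container size.
    """
    return memory_growth_per_sample(samples) > threshold
```

Run it over samples taken during a long, steady load test: a flat or sawtooth series should give a slope close to zero, while a steadily climbing one will not.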

And most importantly, monitor the legacy systems during these tests. You have ways to control the pressure on them:

We have to protect our legacy systems.
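
As a back-of-the-envelope check for the Lambda concurrency funnel, you can estimate how many requests per second reach the legacy system. The sketch assumes each Lambda invocation handles one message and makes one call to the legacy system, and that invocations take roughly the same time. Both function names are hypothetical, and the real throughput should always be confirmed with a load test.

```python
import math


def max_requests_per_second(concurrency: int, avg_duration_s: float) -> float:
    """Upper bound on requests per second reaching the legacy system."""
    if concurrency < 0 or avg_duration_s <= 0:
        raise ValueError("concurrency must be >= 0 and duration must be > 0")
    # Each concurrent Lambda finishes 1 / avg_duration_s invocations per second.
    return concurrency / avg_duration_s


def concurrency_for_limit(max_rps: float, avg_duration_s: float) -> int:
    """Largest concurrency that keeps the estimate at or below max_rps."""
    if max_rps <= 0 or avg_duration_s <= 0:
        raise ValueError("max_rps and duration must be > 0")
    # At least 1, otherwise nothing would ever be processed.
    return max(1, math.floor(max_rps * avg_duration_s))
```

For example, if the legacy system copes with 10 requests per second and one invocation takes about half a second, `concurrency_for_limit(10, 0.5)` suggests starting with a concurrency of 5 and watching the legacy system during the test.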

Tools

  1. JMeter
  2. ab
  3. PlantUML
  4. Draw.io
  5. AWS Lambda Power Tuning
  6. CloudWatch (Dashboards, Logs, Insights)

What to read?

  1. Serverless Lens
  2. Understanding AWS Lambda behaviour using Amazon CloudWatch Logs Insights

Summary

Tech Lead / Senior Software Engineer @ Zoopla
