Building massively scalable serverless Telegram bots

Telegram as a store

Telegram is an under-appreciated application store. It has a large, active user base and a great API with large file uploads, payments, and many other useful functions. You get a battle-tested, multi-functional chat app frontend for free; all you have to do is provide value to its users. Even levelsio, one of the most prolific solopreneurs, migrated his community to Telegram and is building his latest startup as a Telegram bot.

Hosting Telegram bots

Okay, building Telegram bots is a good idea, but how can we deploy them? As we aim to build and validate many ideas, our deployment options need to satisfy a few basic requirements:

  1. Be low cost, preferably free until they hit virality

  2. Allow us to create many different bots to try them out at once

  3. Allow frictionless development, with infrastructure that does not get in our way

  4. Be massively scalable, if we do hit virality

The method that best satisfies all these requirements is AWS serverless offerings. We can deploy our bot as an AWS Lambda, and only pay for the compute time. Even if we simultaneously test hundreds of different bots, we won’t be charged unless they are actually used. AWS provides a generous free tier - each month you get 1 million invocations of Lambda, and 400,000 GB-second of compute time for free. Hitting these limits means your bot is actively being used, which is an excellent problem to have.
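To put those limits in perspective, here is a quick back-of-the-envelope calculation, assuming the minimum 128 MB Lambda memory setting:

```typescript
// AWS Lambda free tier: 400,000 GB-seconds of compute per month.
// At the minimum memory setting of 128 MB (0.125 GB), that works out to:
const freeGbSeconds = 400_000;
const memoryGb = 128 / 1024; // 0.125 GB
const freeSeconds = freeGbSeconds / memoryGb;

console.log(freeSeconds); // 3,200,000 seconds of compute per month
```

That is roughly 37 days of continuous execution, so a small bot effectively runs for free every month.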

Leveraging Infrastructure as Code

Setting up AWS resources manually is time-consuming and error-prone. One way around this is to apply Infrastructure as Code principles: declaratively specify resources in code and commit them to a repository. Your infrastructure becomes version-controlled and repeatable. If something goes wrong, you can easily revert the change. If you need another environment like staging, you can deploy the exact same infrastructure again.

However, there are levels to Infrastructure as Code. The first level is to version-control your infrastructure and make it repeatable, but it remains detached from your application code and evolves separately. Usually, even in companies that practice IaC, deploying new resources is still tedious and annoying. True IaC is when infrastructure is a first-class citizen in your code, and spinning up a Lambda is as easy as adding a new route to your Express app.

SST

SST solves the exact problem I described above. It allows you to deploy many types of serverless offerings - including Lambdas, SQS queues, EventBridge buses, databases, S3 buckets, and more. And it does this in a way that does not get in your way, but instead helps you achieve your goals faster. We will use SST for our Telegram bot-building meta-framework.

Set up an SST app

SST has an excellent guide to help you get started with the basics of IaC and SST. Following the steps from the guide and their docs, we can set up an SST app using this snippet. Feel free to use whatever package manager you like best; I use pnpm, as its shared content-addressed store lets us deploy many Telegram apps without them taking up much disk space.

pnpm create sst telegram-bot
cd telegram-bot
pnpm install

The default app, as of the writing of this article, has an Event Bus and 3 Lambdas configured out of the box. We won't need most of these. Feel free to delete the EventBus construct from stacks/MyStack.ts and remove the 2 /todo endpoints passed to the Api construct. Make sure you have a configured AWS account. If you don't, you can follow the SST guide to get everything you need. Then, run

pnpm sst dev --stage dev

This will set up an AWS environment with the name dev, and deploy all the necessary CloudFormation and adjacent resources on which SST depends. Afterwards, the magic of SST begins - you will get a "live" Lambda function which will rebuild whenever you change the code. This is great because you get to test your app directly against the infrastructure you will be using in the future. It's amazingly fast as well!

You can curl the URL output in your terminal and you will get a Hello World response. Congratulations! You set up your massively scalable AWS Lambda deployment, spending little to no time to get most of the sensible best-practice defaults.

Set up a Telegram bot with BotFather

Now that we have our deployment environment, we need to set up our bot. Telegram has a bot to rule them all, with an offer you can't refuse - the @BotFather. Search for this username in the Telegram search bar, and follow the straightforward directions to register your bot. Take note of the token it gives you, as that token is what we will use to control the bot.

Telegraf

There are many bot libraries for Telegram, and the one we will use in this guide is Telegraf. This library allows us to use Telegram bot features in TypeScript. Install this library in the packages/functions directory of your app.

cd packages/functions
pnpm add telegraf

Now we can use Telegraf to create a convenient interface to the Telegram API.

import { Telegraf } from "telegraf";

const bot = new Telegraf("123456789:AbCdefGhIJKlmNoPQRsTUVwxyZ");

But wait a second! We can't just leave our bot token in code in plaintext. Fortunately, SST has a great way of handling secrets. Let's set up an SST secret. This command will create a parameter in SSM Parameter Store and encrypt it with a key from KMS.

pnpm sst secrets set TELEGRAM_BOT_TOKEN "123456789:AbCdefGhIJKlmNoPQRsTUVwxyZ" --stage dev

Now, let's add the secret to our stack.

const TELEGRAM_BOT_TOKEN = new Config.Secret(stack, "TELEGRAM_BOT_TOKEN");

Don't forget to import the Config construct and bind it to the API like so.

import { StackContext, Api, Config } from "sst/constructs";

...

const webhook = new Api(stack, "api", {
  routes: {
    "POST /": "packages/functions/src/lambda.handler",
  },
  defaults: {
    function: {
      bind: [TELEGRAM_BOT_TOKEN],
    },
  },
});

We can now easily use the secret in our lambda function.

import { Config } from "sst/node/config";

...

const bot = new Telegraf(Config.TELEGRAM_BOT_TOKEN);

Let's set up a simple command to say Hello to all users who "start" our bot.

bot.start((ctx) => ctx.reply("Hello"));

Starting the bot

Okay, now that we have the bot set up, we should just call bot.launch(), right? Well, not quite.

Polling vs Webhook

There are 2 ways of setting up Telegram bots - polling and webhooks.

The easy method is polling: your app periodically queries Telegram for new messages. This is simple to set up on a server and works great for bots without scalability requirements. However, we do not have a server, and we do want scalability.

So, our solution is to use Webhooks. You specify an endpoint, and Telegram sends a request to that URL whenever a new message comes in. With this method, your bot can "sleep" when it's not in use. This means we can safely let the Lambda turn off and not worry about missing any messages. Whenever our users message us, the Lambda will wake up, handle that one message, and go back to sleep.

Instead of launching the bot or attaching it to an existing server, we can manually handle incoming messages using the bot.handleUpdate method.

export const handler = ApiHandler(async (_evt) => {
  const update = JSON.parse(_evt.body ?? "{}");
  await bot.handleUpdate(update);
  return { statusCode: 200, body: "" };
});

The last step is to let Telegram know about our webhook's endpoint. This is, of course, possible to do programmatically, but as it's a one-time call, I find it easier to run the curl command directly.

curl "https://api.telegram.org/bot{bot_token}/setWebhook?url={lambda_url}"

And that's that! Now you can send the /start command to your bot and get back a message.
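If you ever want to script this step instead, the setWebhook URL is straightforward to assemble. A minimal sketch; the token and endpoint below are made-up placeholders:

```typescript
// Build the Bot API setWebhook URL. The token and endpoint here are
// made-up placeholders; substitute your real values.
const buildSetWebhookUrl = (botToken: string, webhookUrl: string): string =>
  `https://api.telegram.org/bot${botToken}/setWebhook?url=${encodeURIComponent(webhookUrl)}`;

const url = buildSetWebhookUrl(
  "123456789:AbCdefGhIJKlmNoPQRsTUVwxyZ",
  "https://abc123.execute-api.us-east-1.amazonaws.com"
);

console.log(url);
```

Note the encodeURIComponent call: your Lambda URL contains characters that must be escaped when passed as a query parameter.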

Pitfalls and Edge Cases

Unfortunately, the bot is not ready yet. There are still changes we need to make to handle messages in a more semantically correct way. To see a common pitfall, make the bot's response take longer to compute: for example, send a request to an external API like OpenAI (foreshadowing) and reply with the AI's response. In some cases, the webhook will time out, and Telegram will re-send the original message. To handle this, we need a more robust infrastructure and a better strategy for handling messages.

Telegram expects a timely response from our webhook. If we do our calculations and respond too late, Telegram times out and retries. A naive strategy might be to always respond immediately, and only then handle the message. This is easily achieved by removing the await from the bot.handleUpdate() call. However, this leads to message-ordering issues. Telegram delivers messages in order, and only delivers the next message once you acknowledge receipt of the previous one by responding to the webhook call. So, if you always respond early and then handle the messages, you will receive all pending messages at once, which inevitably results in race conditions, ordering issues, and parallelism where we don't want it. To avoid data loss and corruption, we need a strategy that handles messages in order, but without timing out.
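To see the ordering hazard concretely, here is a minimal simulation in plain TypeScript, with no Telegram involved: three messages arrive in order, but their handlers take different amounts of time, as real API calls would. (The snippet uses top-level await, so run it as an ES module.)

```typescript
// Simulate "respond first, handle later": nothing serializes the
// handlers, so completion order follows handler duration, not arrival.
const completed: string[] = [];

const handle = (msg: string, ms: number): Promise<void> =>
  new Promise((resolve) =>
    setTimeout(() => {
      completed.push(msg);
      resolve();
    }, ms)
  );

// All three handlers run concurrently, as if we acknowledged every
// webhook call immediately instead of awaiting bot.handleUpdate().
await Promise.all([
  handle("message 1", 30), // slow, e.g. waits on an external API
  handle("message 2", 10),
  handle("message 3", 20),
]);

console.log(completed); // completes as: message 2, message 3, message 1
```

With the await kept in place, each webhook call would finish before the next message was delivered, and the completion order would match the arrival order.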

SQS Queue

The semantically correct way to achieve this is to use SQS queues. We could also use EventBridge, but events don't guarantee ordering, while SQS FIFO queues do. So, in our architecture, we receive a message at the webhook, push it onto a queue, and immediately respond to the webhook request. A second Lambda function then handles the messages from the queue. Telegram could still retry a webhook call because of unexpected intermittent network failures, but by setting a MessageDeduplicationId on each queue message we make sure every Telegram update is handled only once.
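The deduplication logic itself is handled by SQS, but the idea is simple enough to sketch in memory. In this illustrative sketch (with made-up ids), a retried delivery carrying an already-seen deduplication id is dropped; SQS FIFO queues apply the same rule within a five-minute window:

```typescript
// In-memory sketch of FIFO-style deduplication: a retry carrying the
// same deduplication id is ignored instead of being handled twice.
const seenIds = new Set<string>();
const handled: string[] = [];

function enqueueOnce(dedupId: string, body: string): boolean {
  if (seenIds.has(dedupId)) return false; // duplicate delivery, drop it
  seenIds.add(dedupId);
  handled.push(body);
  return true;
}

enqueueOnce("41", "/start");
enqueueOnce("41", "/start"); // Telegram re-sent the same update
enqueueOnce("42", "hello");

console.log(handled); // the retry was dropped: [ '/start', 'hello' ]
```

In our setup, the Telegram message_id plays the role of the deduplication id, since a retried webhook call carries the same update.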

This strategy is not foolproof - at least not yet. If we have 100 users using our bot simultaneously, and we push all their messages into one queue, the users will block each other. Our Lambda will handle the messages one by one, and the users who send their messages later will need to wait until all others are handled. This is not what we want - we essentially want one queue per user. SQS has an excellent mechanism for exactly this: FIFO queues with MessageGroupIds. We can push messages into the queue with the sender's userId as the MessageGroupId, which gives us strict ordering per user and parallelism across users.

Add this snippet of code to set up the queue in the stack.

const messageQueue = new Queue(stack, "messageQueue", {
  cdk: {
    queue: {
      fifo: true,
    },
  },
  consumer: {
    function: {
      bind: [TELEGRAM_BOT_TOKEN],
      handler: "packages/functions/src/messageConsumer.handler",
    },
  },
});

And make sure to bind the queue to the webhook, so it has the necessary permissions to push to the queue. Also notice that we removed the TELEGRAM_BOT_TOKEN binding, as the webhook no longer needs to talk to Telegram. It only pushes messages to the queue and responds to the Telegram API with a 200 status code.

const webhook = new Api(stack, "api", {
  routes: {
    "POST /": "packages/functions/src/webhook.handler",
  },
  defaults: {
    function: {
      bind: [messageQueue],
    },
  },
});

In the webhook.ts file, change the handler function.

import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
import { ApiHandler } from "sst/node/api";
import { Queue } from "sst/node/queue";

const sqs = new SQSClient({});

export const handler = ApiHandler(async (_evt) => {
  const update = JSON.parse(_evt.body ?? "{}");
  const messageId: string = `${update.message.message_id}`;
  const userId: string = `${update.message.from.id}`;

  const command = new SendMessageCommand({
    QueueUrl: Queue.messageQueue.queueUrl,
    MessageBody: _evt.body ?? "",
    MessageDeduplicationId: messageId,
    MessageGroupId: userId,
  });

  await sqs.send(command);

  return { statusCode: 200 };
});

And create a packages/functions/src/messageConsumer.ts file with this handler function in it.

import { SQSEvent } from "aws-lambda";
import { Telegraf } from "telegraf";
import { Config } from "sst/node/config";

export const handler = async (_evt: SQSEvent) => {
  const bot = new Telegraf(Config.TELEGRAM_BOT_TOKEN);
  bot.start((ctx) => ctx.reply("Hello"));

  for (const record of _evt.Records) {
    const update = JSON.parse(record.body);
    await bot.handleUpdate(update);
  }

  return {};
};

Make sure to install the necessary packages in packages/functions. Note that SQSEvent is only a type, so it comes from the @types/aws-lambda package rather than from aws-lambda itself.

pnpm add -D @types/aws-lambda

Conclusion and Next Steps

We have added a queue to our infrastructure and push all incoming messages to it. This way, we can quickly respond to the Telegram API, while still making sure we keep the ordering of the messages coming in. Also, we parallelize the handling of messages for multiple users.

In the next article, we will work on integrating Langchain into this serverless environment to enhance the abilities of our bot. We will discuss the differences between RAG and plain embedding search for information retrieval, and use agentic action-taking to automatically reach out to our users with useful information without them pinging the bot.