Trello Migration: From RabbitMQ to Kafka for Better Scalability and Lower Costs

In the world of real-time collaborative tools, Trello stands out as a popular to-do application that allows users to manage tasks across boards and columns. One of its key features is real-time updates, where changes made by one user are instantly reflected for all others on the same board. To achieve this, Trello relies heavily on WebSocket communication, which requires a robust and scalable messaging system.

In this post, we’ll explore Trello’s journey from RabbitMQ to Kafka, the challenges the team faced, and the improvements they achieved in scalability, reliability, and cost efficiency.

The Problem: Real-Time Updates at Scale

Trello’s core functionality revolves around real-time updates. For example, if two users are collaborating on a board and one moves a card, the other user should see the change instantly. This requires a system that can handle a large number of concurrent WebSocket connections and deliver messages with low latency.

Initially, Trello used Redis for this purpose, but as the platform grew, they moved to RabbitMQ, a message broker designed for handling message queues. However, as Trello continued to scale, they encountered several challenges with RabbitMQ.

The RabbitMQ Architecture

Trello’s RabbitMQ setup consisted of 15 nodes, split between one inbound cluster and four outbound clusters:

  1. Inbound Cluster: A single RabbitMQ cluster with one exchange that received all updates from Trello’s API server.

  2. Outbound Clusters: Four RabbitMQ clusters, each with three nodes, responsible for distributing messages to WebSocket servers.

How It Worked:

  • When a user performed an action (e.g., moving a card), the API server pushed an event to the RabbitMQ inbound cluster.

  • The inbound cluster routed the message to the appropriate outbound cluster based on a sharding key (e.g., board ID).

  • WebSocket servers pulled messages from the outbound clusters and pushed updates to the connected users.
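The routing step above can be sketched as a stable hash of the sharding key. This is a minimal illustration, not Trello's actual routing code: the cluster names and the function `pick_outbound_cluster` are hypothetical, and the choice of SHA-256 modulo the cluster count is an assumption. The source only states that there were four outbound clusters and that messages were routed by a sharding key such as the board ID.

```python
import hashlib

# Hypothetical names: the source only says there were four outbound clusters.
OUTBOUND_CLUSTERS = ["outbound-0", "outbound-1", "outbound-2", "outbound-3"]


def pick_outbound_cluster(board_id: str, clusters=OUTBOUND_CLUSTERS) -> str:
    """Map a board ID (the sharding key) to one outbound cluster."""
    # hashlib gives the same digest in every process, unlike the built-in
    # hash(), which is randomized per interpreter run for strings.
    digest = hashlib.sha256(board_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(clusters)
    return clusters[index]


if __name__ == "__main__":
    # The same board always lands on the same cluster.
    for board in ("board-123", "board-456", "board-123"):
        print(board, "->", pick_outbound_cluster(board))
```

Because every event for a given board maps to the same cluster, a WebSocket server only needs to listen to one outbound cluster per board it serves.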

Challenges with RabbitMQ:

  1. Partition Handling: Detecting and recovering from network partitions in a RabbitMQ cluster was complex and error-prone.

  2. Cluster Availability: The system suffered from split-brain issues, where multiple nodes believed they were the master, leading to conflicts and requiring a full cluster reset to recover.

  3. Transient Queues: Creating and deleting transient queues for WebSocket connections was slow and resource-intensive, especially during reconnections.

The Shift to Kafka

To address these challenges, Trello decided to migrate to Apache Kafka, a distributed streaming platform known for its fault tolerance, low latency, and high availability.

New Architecture with Kafka:

  1. Event Publishing: All updates from Trello’s API server are pushed to a Kafka topic.

  2. Socket Master: A dedicated process reads events from Kafka, filters them, and forwards them to the appropriate WebSocket servers.

  3. WebSocket Servers: These servers receive updates from the Socket Master and push them to the connected users.
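The filter-and-forward role of the Socket Master can be sketched with an in-memory model. This is an illustrative sketch under stated assumptions, not Trello's implementation: the classes `Event`, `WebSocketServer`, and `SocketMaster`, the per-board subscription map, and the plain Python list standing in for a Kafka topic are all hypothetical. The source does not describe how the Socket Master decides which servers are "appropriate".

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Event:
    board_id: str
    payload: str


@dataclass
class WebSocketServer:
    name: str
    delivered: list = field(default_factory=list)

    def push(self, event: Event) -> None:
        # A real server would write to its open client connections here.
        self.delivered.append(event)


class SocketMaster:
    def __init__(self) -> None:
        # board_id -> names of servers with at least one client on that board
        self._subscriptions = defaultdict(set)
        self._servers = {}

    def subscribe(self, server: WebSocketServer, board_id: str) -> None:
        self._servers[server.name] = server
        self._subscriptions[board_id].add(server.name)

    def handle(self, event: Event) -> int:
        """Forward one event; return how many servers received it."""
        # Events for boards nobody is watching are filtered out (0 targets).
        targets = self._subscriptions.get(event.board_id, ())
        for name in targets:
            self._servers[name].push(event)
        return len(targets)


if __name__ == "__main__":
    master = SocketMaster()
    ws_a, ws_b = WebSocketServer("ws-a"), WebSocketServer("ws-b")
    master.subscribe(ws_a, "board-1")
    master.subscribe(ws_b, "board-2")
    # This list stands in for records consumed from a Kafka topic.
    stream = [Event("board-1", "card moved"), Event("board-3", "card added")]
    for event in stream:
        print(event.board_id, "forwarded to", master.handle(event), "server(s)")
```

In this model the broker only has to deliver one ordered stream of events; the knowledge of who is watching which board lives in the Socket Master rather than in per-connection queues.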

Benefits of Kafka

  • Simplified Architecture: Kafka eliminated the need for complex RabbitMQ clusters and routing logic.

  • Scalability: Adding more WebSocket servers and Kafka consumers is straightforward, making it easier to handle growing traffic.

  • Cost Efficiency: Trello achieved a 33% reduction in memory usage and a 5x reduction in infrastructure costs.

  • Improved Reliability: With Kafka, Trello experienced only one outage compared to four outages with RabbitMQ.

Key Takeaways

  1. Infrastructure Costs Matter: As your application scales, infrastructure costs can escalate quickly. It’s crucial to monitor and optimize these costs regularly.

  2. Batching is Key: Whenever possible, batch operations to minimize the number of API calls and improve efficiency.

  3. Choose the Right Tool: While RabbitMQ is a great tool for certain use cases, Kafka’s distributed nature and fault tolerance make it a better fit for high-scale, real-time systems like Trello.
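As a small illustration of the batching takeaway, the helper below groups a stream of items into fixed-size batches so that each batch can be sent in one call instead of one call per item. The function name `batched` and the batch size are assumptions made for this sketch; the source does not say which operations Trello batched or how.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    updates = [f"update-{i}" for i in range(5)]
    for batch in batched(updates, 2):
        # One call per batch instead of one call per update.
        print("sending", batch)
```

With five updates and a batch size of two, this makes three calls instead of five; the saving grows with the batch size, at the cost of slightly delaying the first item in each batch.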

Conclusion

Trello’s migration from RabbitMQ to Kafka is a great example of how choosing the right technology can significantly improve system performance, scalability, and cost efficiency. By simplifying their architecture and leveraging Kafka’s strengths, Trello was able to provide a seamless real-time experience for millions of users while reducing operational overhead.