Why Eventbrite runs a 700-node Kube cluster just for development
Published on Aug 19, 2020
In Part 1 of this interview with Remy DeWolf, a principal engineer on the DevTools team at Eventbrite, we discussed what information factored into Eventbrite’s decision to move their development environment to the cloud.
The DevTools team at Eventbrite set out to build yak because they had too many services to run locally. They knew they wanted to move their development environment to the cloud, but yak turned out to have additional benefits that ranged from sharing environments to easing the transition to remote work during COVID.
In this post, we dig into how yak works, what it’s like for devs to use it, and how it’s been received.
How is the Eventbrite application architected?
This is a common story that you will find in a lot of startups. The founding engineers built a monolith and the strategy was to build features fast and capture the market. It was a very successful approach.
As the company grew over time, having a large team working on the monolith became challenging. When reaching a certain size, it was also harder to keep scaling vertically.
Over time, some of the monolith was migrated over to microservices. Now, new services are generally containerized, and the monolith is containerized in dev but not in production.
What prompted you to overhaul your dev environments and what problems did you set out to solve?
See “How did you decide it was time to build yak?” in Part 1 of the interview.
How did you convince your company?
In the beginning, we partnered with developers to help us focus on the most important features and also to keep them excited about the technology. Specifically, there was great interest in learning more about Kubernetes. We also added instrumentation to our developer tools so we could measure how much time developers were wasting.
Once we understood how much time the developers spent waiting or dealing with issues, we had to make a call between spending money on cloud computing or wasting engineer time. We presented our plan to the CTO and we got the green light to move forward.
What’s the developer workflow like with yak?
Every morning, a developer has two options:
- Reconnect to their previous session: this takes a few seconds and they can resume their work from where they left it the previous day.
- Update their local branch and their remote Docker images: this takes 5-7 minutes to get the environment updated.
From this point, all their containers are running remotely and they can work through the day. Here are some of the common operations:
Change code locally: Changed files are automatically synced to their remote containers. It usually takes a few seconds for the changes to be available. We use rsync for this, which is very efficient. To keep it simple, we do a one-way sync (from laptop to remote container).
Note: this flow is much faster than the standard flow of building/pushing/deploying images, which in practice is hard to get under a minute for large applications.
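As a rough illustration of the one-way sync described above, here is a minimal Python sketch that assembles an rsync invocation. It is not yak's actual implementation: the function name, the `dev-pod:/app/src` destination, the exclude pattern, and the use of `--delete` to mirror deletions are all assumptions made for the example. The interview only states that rsync is used and that the sync goes from laptop to container.

```python
from typing import List, Sequence


def build_sync_command(
    local_dir: str,
    remote_target: str,
    excludes: Sequence[str] = (".git/",),
) -> List[str]:
    """Build an rsync argv for a one-way sync (laptop -> remote).

    All names here are hypothetical; only standard rsync flags are used.
    """
    # A trailing slash tells rsync to copy the directory's contents
    # rather than the directory itself.
    source = local_dir.rstrip("/") + "/"
    cmd = [
        "rsync",
        "--archive",   # preserve permissions, timestamps, symlinks
        "--compress",  # compress data in transit
        "--delete",    # assumption: mirror local deletions on the remote side
    ]
    for pattern in excludes:
        cmd.append(f"--exclude={pattern}")
    cmd += [source, remote_target]
    return cmd


if __name__ == "__main__":
    # Hypothetical destination; how the container is reached is not
    # described in the interview.
    print(" ".join(build_sync_command("./src", "dev-pod:/app/src")))
```

Because the sync is one-way, the local checkout stays the single source of truth, and nothing written inside the container is ever copied back.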
Debug code: Developers can add breakpoints in their code and attach to a running container to get a live debugging session. We provided a command that wrapped kubectl attach under the hood.
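A wrapper of this kind could look roughly like the sketch below. This is an assumption about its shape, not the real yak command: the function name and the namespace, pod, and container names are hypothetical. The `kubectl attach` flags used (`--stdin`, `--tty`, `--container`, `--namespace`) are standard.

```python
import shlex
from typing import List, Optional


def build_attach_command(
    namespace: str,
    pod: str,
    container: Optional[str] = None,
) -> List[str]:
    """Build a kubectl attach argv for an interactive debugging session."""
    cmd = [
        "kubectl", "attach", pod,
        "--namespace", namespace,
        "--stdin",  # forward the developer's keyboard input
        "--tty",    # allocate a terminal so the debugger prompt works
    ]
    if container:
        # Needed when the pod runs more than one container.
        cmd += ["--container", container]
    return cmd


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(shlex.join(build_attach_command("dev-jane-doe", "monolith-0", "web")))
```

A real wrapper would hand this argv to `subprocess.run`, sparing developers from remembering which namespace and pod belong to them.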
Run tests: Developers can run unit tests locally, but any tests that require dependencies (such as a DB or Redis) can be run remotely in a pod. For integration tests, they can run tests in a specific pod and connect to the other services directly.
Most of the frontend changes are done locally and don’t require the cloud. The cloud is very useful for backend development and running various tests.
How does it work?
Every developer has their own namespace where they manage their remote developer environment. Kubernetes does the heavy lifting and yak simplifies the management of their containers.
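To make the namespace-per-developer idea concrete, here is a small Python sketch that derives a namespace name from a username. The `dev-` prefix and the function name are hypothetical; the interview does not say how yak names namespaces. The one hard constraint is Kubernetes' own: a namespace name must be a DNS label (lowercase alphanumerics and hyphens, starting and ending with an alphanumeric, at most 63 characters).

```python
import re

MAX_LABEL_LENGTH = 63  # Kubernetes limit for DNS label names


def namespace_for(username: str, prefix: str = "dev") -> str:
    """Derive a valid Kubernetes namespace name from a username.

    The naming scheme is an assumption made for this example.
    """
    # Replace every run of disallowed characters with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", username.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a namespace from {username!r}")
    # Truncate to the label limit and make sure we do not end on a hyphen.
    return f"{prefix}-{slug}"[:MAX_LABEL_LENGTH].rstrip("-")


if __name__ == "__main__":
    print(namespace_for("Jane.Doe"))
```

Isolating each developer in a namespace means ordinary Kubernetes tooling can be scoped to one person's environment with a single `--namespace` flag.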
How does the DevTools team interact with the development environments?
Usually, we don’t need to interact with the dev environments and can focus on the big picture, such as managing the clusters and adding more features to the tools.
We do support the developers, so sometimes we connect directly to a namespace to troubleshoot issues using standard Kubernetes commands.
What’s the ongoing maintenance burden?
In the beginning, it was a lot of work because we had a few issues where we didn’t understand the root cause, and our user documentation was also weak.
Over time, we got this under control. The documentation was revamped and we built a good knowledge base about the most common errors.
How has the environment changed from when you first designed it?
Our approach was always to deliver incremental value by first focusing on a minimum viable product (MVP) and adding features over time. For these reasons, we made many changes from the original design.
Here are a few interesting changes:
- Our infrastructure was running on one EKS cluster originally. At one point, we had 700 worker nodes, and 14,000 pods running. We ran into performance and rate-limiting issues that made us reconsider this single-cluster approach. Over time, we switched to a multi-cluster architecture where each cluster had no more than 200 nodes.
- Syncing the code directly into running containers could sometimes cause the container to crash if the changes made the application fail the probe check. After iterating a few times on how to solve this problem, we decided to set up a sidecar container that is responsible for syncing the code.
- To persist data over time (for example, to save a developer’s MySQL database files) we use StatefulSets backed by EBS volumes. However, AWS has a limitation around EBS volumes: an application running in a pod on EKS must be on a node in the same availability zone (AZ) as the EBS volume. To solve this problem, we partitioned our EKS nodes per availability zone and used taints to make sure that each StatefulSet is scheduled in the same AZ as its volume.
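The AZ pinning in the last item can be sketched as follows. This is an illustration under stated assumptions, not Eventbrite's configuration: the taint key `example.com/az` is hypothetical, and pairing the toleration with a `nodeSelector` is our assumption (the interview mentions only taints). `topology.kubernetes.io/zone` is the well-known Kubernetes node label for the zone.

```python
from typing import Any, Dict

# Well-known Kubernetes node label carrying the availability zone.
ZONE_LABEL = "topology.kubernetes.io/zone"
# Hypothetical taint key; the real key is not given in the interview.
AZ_TAINT_KEY = "example.com/az"


def az_scheduling_fields(zone: str) -> Dict[str, Any]:
    """Pod spec fields that keep a StatefulSet's pods in one AZ.

    The toleration lets pods run on nodes tainted for this AZ; the
    nodeSelector (an assumption here) makes sure they land only there.
    """
    return {
        "nodeSelector": {ZONE_LABEL: zone},
        "tolerations": [
            {
                "key": AZ_TAINT_KEY,
                "operator": "Equal",
                "value": zone,
                "effect": "NoSchedule",
            }
        ],
    }


if __name__ == "__main__":
    # Merge into spec.template.spec of a StatefulSet manifest.
    print(az_scheduling_fields("us-east-1a"))
```

With the pod and its EBS volume guaranteed to share a zone, a rescheduled pod can always reattach its existing volume.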
Have there been any unexpected benefits?
Sharing environments: Have you ever heard a developer say “but it worked for me locally when I ran the tests”? Running in the cloud improves consistency. The ability to share developer environments proved to be very helpful when trying to understand test failures or work on issues that were hard to reproduce.
Working globally: We have a globally distributed team but most of the test/QA infrastructure is in the US. Simple operations like resolving the application dependencies or downloading a Docker image require a lot of network round trips. If the network latency is poor, these operations are slow.
By running on the cloud, the developer opens a connection to some container (with port forwarding or by getting a shell) and then they can run their commands from the same AWS region where the rest of the infrastructure is located. For our engineers based outside of the US, being able to develop on the cloud has been a huge improvement.
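The two ways of connecting mentioned above map onto standard kubectl commands. The Python sketch below builds both; the function names, the namespace and pod names, the ports, and the `/bin/sh` shell are hypothetical examples rather than details from the interview.

```python
from typing import List


def build_port_forward_command(
    namespace: str, pod: str, local_port: int, remote_port: int
) -> List[str]:
    """kubectl argv that forwards a local port to a port in a remote pod."""
    return [
        "kubectl", "port-forward",
        "--namespace", namespace,
        f"pod/{pod}",
        f"{local_port}:{remote_port}",
    ]


def build_shell_command(
    namespace: str, pod: str, shell: str = "/bin/sh"
) -> List[str]:
    """kubectl argv that opens an interactive shell inside a remote pod."""
    return [
        "kubectl", "exec",
        "--namespace", namespace,
        "--stdin", "--tty",
        pod,
        "--",  # everything after this runs inside the container
        shell,
    ]


if __name__ == "__main__":
    # Hypothetical names and ports, for illustration only.
    print(" ".join(build_port_forward_command("dev-jane-doe", "web-0", 8080, 80)))
    print(" ".join(build_shell_command("dev-jane-doe", "web-0")))
```

Either way, only a thin stream of terminal or HTTP traffic crosses the developer's own connection, while the heavy image pulls and dependency downloads stay inside the AWS region.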
Transitioning during COVID: When COVID happened, all developers switched to working remotely. For some, this meant sharing a home internet connection with others in their household or moving back in with their family. It would have been extremely difficult or impossible for some of them to run the developer environment locally. Operations such as pulling Docker images or resolving application dependencies would require gigabytes of data on a daily basis. By developing on the cloud, the transition to remote work was fairly seamless and the developers were able to continue their work from home.
Blimp and Eventbrite
Blimp has collaborated with Eventbrite for a long time. We first met when we were building the predecessor to Blimp, which moves your Docker Compose development environment into the cloud. Eventbrite had already built yak internally, and we were trying to make a general solution. We’ve been trading ideas ever since.
Check out Blimp to get the benefits of yak without having to build it yourself!
Part 1 of the interview: Why managing dev environments is a full time job at Eventbrite