We are the devs from Team Netlify, Ask us anything!
What are the traits that Netlify looks for when hiring engineers? What are the biggest challenges that Netlify faces in terms of infrastructure?
Re: Challenges in terms of Infrastructure, from our Head of Infrastructure, Ryan Neal:
A lot of our problems fall into three big concerns:
- scale
- availability
- developer productivity
Scale: We are doing tens of thousands of requests per second near constantly across our network. That value is doing the nice "up and to the right" that makes business happy and engineers excited. That means that any service that wants to operate on that firehose of data, or be involved in the request chain, has to be able to handle that throughput. Want to know something about the traffic being served out of Singapore? The service needs to handle billions of events. Want to break that down by a few facets? It is going to be tens of millions of unique pairs, so you have to consider disk/memory/time very carefully. This makes for some fun challenges on how you write horizontally scalable services that are elastic enough to come up and down quickly.
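To make the facet arithmetic concrete, here is a minimal sketch in Python. It is not Netlify's actual pipeline: the `Event` fields, the function names, and the cardinality numbers are all hypothetical, chosen only to show why the number of unique keys (and so memory) grows with the product of the facet cardinalities.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple, Tuple


class Event(NamedTuple):
    # Hypothetical request-log fields, for illustration only.
    region: str
    site_id: str
    status: int


def count_by_facets(events: Iterable[Event], facets: Tuple[str, ...]) -> Counter:
    """Stream events once, keeping one counter per unique facet combination."""
    counts: Counter = Counter()
    for event in events:
        key = tuple(getattr(event, name) for name in facets)
        counts[key] += 1
    return counts


def worst_case_keys(cardinalities: Dict[str, int]) -> int:
    """Upper bound on unique keys: the product of each facet's cardinality."""
    total = 1
    for n in cardinalities.values():
        total *= n
    return total


events = [
    Event("sin", "site-a", 200),
    Event("sin", "site-a", 404),
    Event("sin", "site-b", 200),
    Event("fra", "site-a", 200),
]
by_region_status = count_by_facets(events, ("region", "status"))

# Made-up cardinalities: 20 regions x 500,000 sites x 5 status classes.
upper_bound = worst_case_keys({"region": 20, "site_id": 500_000, "status": 5})
```

With those made-up numbers the upper bound is 50 million keys, which is why each extra facet forces a decision about what lives in memory, what goes to disk, and how long a query is allowed to take.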
Availability: Our customers are trusting us with their web presence, and we take that very seriously. To that point we design with failure in mind all the time ("How does the system work if this fails", "How will we know it failed", etc). All of our services, from edge to origin, are built to answer these questions. They're made to be immutable, disposable, and fault tolerant.
In order to provide more 9s we also focus a lot on automation. People aren't good at managing systems; systems are good at managing systems. We are constantly looking at the system and trying to find more ways to keep it stable. In the near future we are going to be re-examining some of our core design decisions to see if we can find better ways for it to gracefully degrade and automatically repair itself. Of course, we're an always-on system too - meaning we have to change the tires on the car as we are driving it.
We are in 6 cloud providers right now, growing to 8 in the next few months. Being cloud agnostic adds a lot of complexity, from simple things like "how do we authorize into the server" to "how do we fail between clouds in a reliable way". Not being able to say things like "Just use this service" means that we have to have the ownership and knowledge to effectively run a lot of core services ourselves.
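As a rough illustration of the "fail between clouds" question, here is a minimal sketch of ordered failover in Python. It is an assumption-laden toy, not a description of how Netlify does it: the provider names, `pick_provider`, and the `is_healthy` callback are all hypothetical, and a real system would also need to handle flapping, partial failures, and traffic draining.

```python
from typing import Callable, Optional, Sequence


def pick_provider(
    providers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy provider in priority order, or None if all fail."""
    for name in providers:
        if is_healthy(name):
            return name
    return None


# Hypothetical provider names in priority order.
providers = ["cloud-a", "cloud-b", "cloud-c"]

# Pretend health check: "cloud-a" is down, everything else is up.
down = {"cloud-a"}
chosen = pick_provider(providers, lambda name: name not in down)
```

The point of keeping the health check as a plain callback is that the selection logic never needs to know which provider-specific API answered it, which is one way to keep cloud-specific code at the edges.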
Productivity: We are constantly updating the platform for both product reasons and for scale considerations. These changes need to be applied quickly and safely by our developers. That means that on the infrastructure side we need to build out the tooling that will let us operate, observe, and modify the system in a deterministic way. We are always finding more things we can monitor (imho: you can never have enough insight into your system).
We also have a unique problem in that our infrastructure is split: some of it is containerized and some of it is raw boxes. This split is because we sometimes need the performance of the raw boxes (e.g. the edge nodes) and sometimes the flexibility of something like Kubernetes. This means that we have to figure out a way to apply changes to both and give our developers the same experience but handle the differences behind the scenes.
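One common way to give developers the same experience over two kinds of hosts is a shared interface with a backend per host type. The sketch below is only an illustration under that assumption, not Netlify's tooling: `DeployTarget`, `ContainerTarget`, `BareMetalTarget`, and `roll_out` are hypothetical names, and each backend just returns a description of the step it would take instead of touching any real system.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class DeployTarget(ABC):
    """What the developer sees: one call, regardless of where the service runs."""

    @abstractmethod
    def apply(self, service: str, version: str) -> str:
        """Return a description of the change that would be applied."""


class ContainerTarget(DeployTarget):
    # Hypothetical backend for containerized services.
    def apply(self, service: str, version: str) -> str:
        return f"container: set image {service}:{version}"


class BareMetalTarget(DeployTarget):
    # Hypothetical backend for raw boxes such as edge nodes.
    def apply(self, service: str, version: str) -> str:
        return f"bare-metal: install package {service}={version}"


def roll_out(targets: Sequence[DeployTarget], service: str, version: str) -> List[str]:
    """Apply the same change to every target; differences stay inside each backend."""
    return [target.apply(service, version) for target in targets]


plan = roll_out([ContainerTarget(), BareMetalTarget()], "edge-proxy", "1.4.2")
```

The caller only ever deals with `roll_out`, so adding a third kind of host would mean adding one more `DeployTarget` subclass without changing what developers run.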