Based on his years of experience, BBD CEO Kevin Staples shared an insightful look into performance, scalability and stability during his talk at our international tech conference esc@pe this year. Here’s what he had to say.
I’ve spent much of the past 18 years of my career facing the most complex performance and scalability problems that I could find across our client base. As humans, we’re wired to learn through pain, and my mother says that as a child I sat on the far end of that pendulum swing, often needing a lot of pain to learn. She eventually got tired of telling me not to touch the hot stove plate and had to let me touch it, in a controlled fashion, to learn my lesson. After that, I learnt quickly.
I wish I could say that, when I was younger in my career, I learnt about the scalability and performance of systems from a talk much like those given at esc@pe – but that’s not where my journey into it all began.
Contrary to popular belief, most of our modern technologies don’t degrade gracefully, with performance slowly worsening while your user base carries on working on the system, just with a degraded experience. In reality, they tend to behave more like a cliff that you fall off. And unfortunately, this is where my journey into performance and scalability started.
About 18 years ago, we were building a very large CRM-type system for one of our large telco clients in South Africa. This system was big: by the time we were done, we had well over 1 000 web-based screens that needed to scale to tens of thousands of very active users. After resolving numerous challenges along the way, our server-side service cluster was handling roughly 6 000 requests per second, while our primary database was processing approximately 30 000 SQL requests per second. This gives you a clear picture of the tremendous load we needed to support on this system.
Throughout my career I’ve often been that typical engineer called into the “war room” when systems went down, and I’ve also been that senior engineer in these war rooms trying to diagnose and restore service. Oddly enough, I’ve always enjoyed these experiences. There’s something about chasing down the problem, trying to restore service and getting it done. I can promise you, though, that you don’t want to be that senior engineer in the war room when your system’s gone down because it can’t handle the load. It tends to create a condition I call “the Raging Herd”: your system has gone down while all of that load keeps arriving, so the gates have swung closed. This leaves you with a raging herd getting agitated at the gate, trying to log into the system. If your system couldn’t handle the load in your steady peak state, you’re definitely not going to handle the raging-herd load when you restore service. The gate swings open, the herd comes through very aggressively at much higher loads and you go down again…
At that client in those days, we faced prolonged outages: going down during peak hours and only recovering when the load naturally decreased in the afternoon. Those sleepless nights haunted us because we knew we hadn’t truly solved the problem, and the load was coming back hot the next morning.
Performance and scalability are quite broad topics, and the nuances depend on the type of system – from large data-processing systems to streaming, TPU-intensive systems. I’m limiting the discussion here to large front-end and transaction-processing systems, as they behave similarly when scaling to high server-side loads.
Myths to dispel
Myth 1: We need more hardware!
“We’ve got a well-structured project, really good people working on it and we’re going to develop this system according to best practice. Then we’ll throw it over the fence to a performance testing person, and if it doesn’t scale to the right load, we’ll just add more server instances and all will be good.”
Myth 2: We need to upgrade!
“We’re just dependent on that upgrade. If we get that upgrade to the new version of the database or operating system, there’ll be a whole bunch of performance optimisations that’ll mean we’re all good.”
Myth 3: Java is too slow!
In the old days it was: “Java is too slow – if we just use C++ we would’ve been okay.”
And these days: “If we were just running JavaScript in Node it would be much faster!”
Myth 4: Autoscaling is magic!
And then the new kid on the block: “If we had a microservices architecture running in the cloud with EKS autoscaling, it would automatically scale up and that would solve all our problems.”
In the telco story, we’d already attempted all these solutions – adding more server instances, throwing a huge amount more CPU at the problem, and increasing the memory and the network capacity. On top of that, we were already running on the most expensive server hardware that money could buy in those days and had the fastest network topology available. It was obvious to us that when we were going down under load, the server wasn’t even blinking; so much so that we used to joke it didn’t even know our application was running. The CPU was also idling along nicely, and we had plenty of spare network capacity. What we learnt in those days, and I’ve seen it numerous times in the decades since, is that the vast majority of performance and scalability problems stem from design and code.
How to go about performance and scalability
I’ve realised that there are two repeating formulas that tend to guide our efforts in ensuring performance and scalability.
The first is the inverse relationship between per-request execution time and scalability. Put simply: if the code is executing lightning fast, it will scale. If it’s not, it won’t. There’s no way around this. As a rough illustration, a request that needs 10 ms of processing lets a single CPU core handle around 100 requests per second; at 100 ms per request, that drops to around 10.
The second is the Pareto Principle, or more commonly the 80/20 Principle: if you build very large-scale front-ends with maybe 1 000 screens, 20% of them will account for almost all the active daily usage of that front-end system. The other 80% of the screens are hardly ever used. The same applies to the server-side clusters for those systems. I’ve noticed that 80% of the services running server-side carry an almost irrelevant load, while the remaining 20% of the services on your cluster make up just about all the load for that system. Focus on that 20% and forget about the 80%, because they won’t be material to your scalability.
As a final note on the 80/20 Principle, remember that it tends to be recursive. Focus on the initial 20%, then apply the principle again: you’ll see the pattern repeat, with the next 20% carrying a disproportionately higher load. Hold these to even higher performance criteria.
The combination of these two formulas brings focus to what we need to worry about when looking at performance and scalability.
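As a rough sketch of how you might find that critical 20% in practice – the service names and request counts below are entirely made up for illustration – this snippet ranks services by load and flags the smallest set that carries roughly 80% of the traffic:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * Illustrative only: given hypothetical per-service request counts, rank the
 * services by load and report the smallest set carrying ~80% of the traffic.
 */
public class ParetoLoadAnalysis {
    public static void main(String[] args) {
        // Hypothetical daily request counts per server-side service.
        Map<String, Long> requestsPerService = Map.of(
                "customer-lookup", 4_200_000L,
                "billing-summary", 2_900_000L,
                "order-status",    1_800_000L,
                "address-update",     90_000L,
                "preferences",        40_000L,
                "report-export",       8_000L);

        long total = requestsPerService.values().stream().mapToLong(Long::longValue).sum();

        // Sort services by descending load.
        Map<String, Long> sorted = requestsPerService.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));

        // Walk down the ranking until the cumulative share reaches 80%.
        long cumulative = 0;
        for (Map.Entry<String, Long> e : sorted.entrySet()) {
            cumulative += e.getValue();
            double share = 100.0 * cumulative / total;
            System.out.printf("%-16s %,12d requests  (cumulative %.1f%%)%n",
                    e.getKey(), e.getValue(), share);
            if (share >= 80.0) {
                System.out.println("-> focus performance work on the services above this line");
                break;
            }
        }
    }
}
```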
So how do we fix the issue?
I’ve noticed that in fixing the problems, we tend to use four common patterns. Interestingly, they tend to behave a lot like plumbing.
Picture a cross-section of pipe with a big load of water coming through, but the pipe isn’t wide enough – in system terms, you don’t have enough of some limited resource, whether that’s CPU, memory or network capacity. These four plumbing concepts are how we go about solving the issues, starting with the pattern that’s least likely to come into play.
1. Slow down and smooth out
This pattern is where you intentionally apply a bottleneck to manage overwhelming loads, but it’s seldom applicable to the type of systems we’re talking about here. It mainly comes into play when dealing with the proverbial raging herd.
2. Scale up
In plumbing terms, this pattern would be “get a bigger, thicker pipe” to handle the volume of water. In server terms, that means getting a bigger server or a faster CPU. It’s sometimes appropriate, but with our modern technologies the scale-out pattern is usually the better fit.
3. Scale out
In server terms, this pattern means scaling out the number of servers and balancing the load across them. In cloud terms, you would spin up more pods for particular services according to the load on each. In other resource contexts we sometimes call it parallel processing, or in certain cases, multiplexing. All of these are examples of the same pattern of processing in parallel – or, in the plumbing metaphor, adding thinner pipes alongside your main pipe so that water can be channelled through them concurrently.
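As a minimal sketch of the same idea at code level – an illustration, not the telco system’s actual design – the snippet below spreads a batch of hypothetical, independent requests across a fixed pool of worker threads; the same shape applies when a load balancer spreads requests across server instances or pods:

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/**
 * Scale-out in miniature: instead of pushing all work through one "pipe",
 * independent requests are spread across a pool of parallel workers.
 */
public class ScaleOutSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService workers = Executors.newFixedThreadPool(8); // 8 parallel "pipes"

        // Hypothetical batch of independent requests.
        List<Callable<String>> requests = IntStream.rangeClosed(1, 100)
                .mapToObj(i -> (Callable<String>) () -> handleRequest(i))
                .collect(Collectors.toList());

        // invokeAll distributes the requests across the worker threads.
        List<Future<String>> results = workers.invokeAll(requests);
        for (Future<String> f : results) {
            System.out.println(f.get());
        }
        workers.shutdown();
    }

    private static String handleRequest(int id) throws InterruptedException {
        Thread.sleep(10); // stand-in for real per-request work
        return "request " + id + " handled by " + Thread.currentThread().getName();
    }
}
```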
4. More efficiency
This final pattern is often the most applicable: make it more efficient. Improve your code so that it’s less onerous on the resources it uses, and design it to run faster. In my experience, this is the pattern to reach for in about 95% of the systems that aren’t scaling.
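To make the efficiency pattern concrete, here’s a small, hypothetical Java example (the MSISDN format and numbers are made up): both methods validate the same input, but one recompiles the regular expression on every call while the other compiles it once and reuses it – same behaviour, far less work per request:

```java
import java.util.regex.Pattern;

/**
 * Hypothetical illustration of the "more efficiency" pattern: identical
 * behaviour, but the fast version does far less work per call.
 */
public class EfficiencySketch {
    // Compiled once, reused for every request.
    private static final Pattern MSISDN = Pattern.compile("^27\\d{9}$");

    static boolean isValidSlow(String number) {
        // Recompiles the pattern on every single call.
        return Pattern.compile("^27\\d{9}$").matcher(number).matches();
    }

    static boolean isValidFast(String number) {
        return MSISDN.matcher(number).matches();
    }

    public static void main(String[] args) {
        String sample = "27821234567";
        int iterations = 1_000_000;

        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) isValidSlow(sample);
        long t2 = System.nanoTime();
        for (int i = 0; i < iterations; i++) isValidFast(sample);
        long t3 = System.nanoTime();

        System.out.printf("slow: %d ms, fast: %d ms%n",
                (t2 - t1) / 1_000_000, (t3 - t2) / 1_000_000);
    }
}
```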
It’s pretty clear to me now that I’ve spent the last 18 years becoming an exceptionally well-paid plumber…
Design principles
Here are some of the design principles I’ve learnt, through pain, to use when scaling systems to very high concurrent loads:
- Stick to standard technologies – it’s called the bleeding edge for a reason
- Cache aggressively (see the caching sketch after this list)
- Stateless: minimise your use of state
- Don’t use disk – it won’t scale
- Be mindful of the cascade of your service aggregation pattern
- Avoid old things – don’t integrate with legacy systems if you need to scale
  - Protect that legacy system from your own raging herd – legacy systems are often fragile and unlikely to scale as you’d need
  - If there’s no alternative, use the Circuit Breaker Pattern (see the circuit-breaker sketch after this list)
- Software engineers, pay attention to your SQL performance
  - Many are misguided by what good performance actually looks like
- Pay attention to the pooling patterns in your technology
  - Don’t think of your pool as unlimited
  - Don’t fetch a resource from a pool unless you’re going to use it immediately and release it quickly
- Don’t process large amounts of data server-side
- Exception handling:
  - Avoid what I like to call “exception floods” from users
  - Avoid auto retries
- Blocking and locking contention – be thread safe, always
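On caching aggressively, here is a minimal, illustrative sketch (the tariff-lookup example and names are hypothetical, not from the telco system): slow-changing reference data is loaded once and then served from memory instead of hitting the database on every request. A real implementation would add expiry, size limits and eviction on top of this.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Minimal memoising cache: the loader runs only on the first request for a
 * key; every later request for that key is a cheap in-memory lookup.
 */
public class ReferenceDataCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    public ReferenceDataCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        return cache.computeIfAbsent(key, loader);
    }

    public static void main(String[] args) {
        ReferenceDataCache<String, String> tariffs =
                new ReferenceDataCache<>(code -> expensiveDatabaseLookup(code));

        System.out.println(tariffs.get("PREPAID-10")); // hits the "database"
        System.out.println(tariffs.get("PREPAID-10")); // served from memory
    }

    // Stand-in for a slow database or remote call.
    private static String expensiveDatabaseLookup(String code) {
        System.out.println("loading " + code + " from the database...");
        return "Tariff[" + code + "]";
    }
}
```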
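And on the Circuit Breaker Pattern, here is a deliberately simplified sketch rather than a production library: after a number of consecutive failures the breaker opens and calls to the fragile legacy system fail fast, sparing it from the raging herd; after a cool-down, one trial call is let through and the breaker closes again if it succeeds.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/**
 * Simplified circuit breaker: fail fast while "open", let a trial call
 * through after the cool-down, and close again on success.
 */
public class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final Duration coolDown;
    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    public SimpleCircuitBreaker(int failureThreshold, Duration coolDown) {
        this.failureThreshold = failureThreshold;
        this.coolDown = coolDown;
    }

    public synchronized <T> T call(Supplier<T> protectedCall, Supplier<T> fallback) {
        if (openedAt != null) {
            boolean coolDownOver = Instant.now().isAfter(openedAt.plus(coolDown));
            if (!coolDownOver) {
                return fallback.get();           // open: fail fast, spare the legacy system
            }
            // cool-down over: allow one trial call through (half-open)
        }
        try {
            T result = protectedCall.get();
            consecutiveFailures = 0;             // success: close the breaker
            openedAt = null;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now();        // too many failures: open the breaker
            }
            return fallback.get();
        }
    }
}
```

A hypothetical call site might look like `new SimpleCircuitBreaker(5, Duration.ofSeconds(30)).call(() -> legacyClient.fetchBalance(accountId), () -> cachedBalance(accountId))`, where both lambdas are stand-ins for whatever your real legacy integration and fallback look like.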
To watch Kevin’s full talk featuring more insights, principles, golden metrics and examples, click here.