
Performance, scalability and the ugly stepsister: stability

Based on his years of experience, BBD CEO Kevin Staples shared an insightful look into performance, scalability and stability during his talk at our international tech conference esc@pe this year. Here’s what he had to say.

I’ve spent much of the past 18 years of my career facing the most complex performance and scalability problems that I could find across our client base. As humans, we’re wired to learn through pain, and my mother says that as a child I sat on the far end of that pendulum swing, often needing a lot of pain to learn. She eventually got tired of telling me not to touch the hot stove plate and had to let me touch it in a controlled fashion to learn my lesson. After that I learnt my lesson – quickly.

I wish I could say that earlier in my career I learnt about the scalability and performance of systems from a talk much like those given at esc@pe – but that’s not where my journey into it all began.

Contrary to popular belief, most of our modern technologies don’t tend to behave in a way where your performance degrades gradually and your user base can carry on working on the system, just with a degraded experience. Actually, most of our modern technologies tend to behave more like a cliff that you fall off of. And unfortunately, this is where my journey into performance and scalability started.

About 18 years ago, we were building a very large CRM-type system for one of our large telco clients in South Africa. This system was big: by the time we were done, we had well over 1 000 web-based screens in the system that needed to scale to tens of thousands of very active users. After resolving numerous challenges along the way, our server-side service cluster was handling roughly 6 000 requests per second, while our primary database was processing approximately 30 000 SQL requests per second. This gives you a clear picture of the tremendous load we needed to support on this system.

Throughout my career I’ve often been that typical engineer called into the “war room” when systems went down. And I’ve also been that senior engineer in these war rooms trying to diagnose and restore service. Oddly enough, I’ve always enjoyed these experiences. There’s something about chasing down the problem, trying to restore service and getting it done. I can promise you though that you don’t want to be that senior engineer in the war room when your system’s gone down because it can’t handle the load. It tends to create a condition I call “the Raging Herd” – a concept where there’s all of this load coming and your system’s gone down, and so the gates have swung closed. This leaves you with a raging herd getting agitated at the gate, trying to log into the system. If your system couldn’t handle the load in your steady peak state, you’re definitely not going to handle the raging herd load when you restore service. The gate swings open, the herd comes through very aggressively in much higher loads and you go down again…

At that client in those days, we faced prolonged outages, going down during peak hours and only recovering when the load naturally decreased in the afternoon. Those sleepless nights haunted us because we knew we hadn’t truly solved the problem, and the load was coming back hot the next morning.

Performance and scalability are quite broad topics, and the nuances differ with the type of system – from large data processing systems to streaming, compute-intensive ones. I’m limiting the discussion here to large front-end and transaction processing systems, as they behave similarly when scaling to high server-side loads.

Myths to dispel

Myth 1: We need more hardware!

“We’ve got a well-structured project, really good people working on it and we’re going to develop this system according to best practice. Then we’ll throw it over the fence to a performance testing person, and if it doesn’t scale to the right load, we’ll just add more server instances and all will be good.”

Myth 2: We need to upgrade!

“We’re just dependent on that upgrade. If we get that upgrade to the new version of the database or operating system, there’ll be a whole bunch of performance optimisations that’ll mean we’re all good.”

Myth 3: Java is too slow!

In the old days it was: “Java is too slow – if we just use C++ we would’ve been okay.”

And these days: “If we were just running JavaScript in Node it would be much faster!”

Myth 4: Autoscaling is magic!

And then the new kid on the block: “If we had a microservices architecture running in the cloud with EKS autoscaling, it would automatically scale up and that would solve all our problems.”

In the telco story, we’d already attempted all these solutions – adding more server instances, throwing a huge amount more CPU at the problem, and increasing the memory and the network. On top of that, we were already running on the most expensive server hardware that money could buy in those days and had the fastest network topology available. It was obvious to us that when we were going down under load, the server wasn’t even blinking; so much so we used to joke that it didn’t even know our application was running. The CPU was also idling along nicely, and we had plenty of spare network capacity. What we learnt in those days, and I’ve seen it numerous times in the decades since, is that the vast majority of performance and scalability problems stem from design and code.

How to go about performance and scalability

I’ve realised that there are two repeating formulas that tend to guide our efforts in ensuring performance and scalability.

The first is the inverse relationship between performance and scalability. Put simply: if the code is executing lightning fast, it will scale. If it’s not, it won’t. There’s no way around this.
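
As a rough, hypothetical illustration of why: with a fixed pool of, say, 200 worker threads, throughput is roughly concurrency divided by response time. Code that answers in 20 ms can, in principle, push around 10 000 requests per second through that pool; let the same code slow to 500 ms per call and the ceiling drops to about 400 requests per second – long before CPU, memory or network become the constraint.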

The second is the Pareto Principle, or more commonly the 80/20 Principle: if you build very large-scale front-ends with, say, 1 000 screens, 20% of them will account for almost all the active daily usage of that front-end system. The other 80% of the screens are hardly ever used. The same applies to the server-side clusters for those systems: 80% of the services running server-side carry an almost irrelevant load, while the remaining 20% of the services on your cluster make up just about all the load for that system. Focus on that 20% and forget about the 80%, because they won’t be material to your scalability.

As a final note on the 80/20 Principle, remember that it tends to be recursive. Focus on the initial 20%, then apply the principle again. You’ll see the pattern repeating with the next 20% showing a disproportionately higher load. You need to subscribe to higher performance criteria for these.
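
To make the idea concrete, here’s a minimal, hypothetical sketch of how you might rank services by observed load to find that vital 20%. The service names and request counts are invented; in practice the numbers would come from your access logs or metrics platform.

```java
// Hypothetical sketch: rank services by observed load to find the "vital 20%".
import java.util.*;
import java.util.stream.*;

public class ParetoReport {
    public static void main(String[] args) {
        // Invented example data: service name -> requests observed over some window
        Map<String, Long> requestCounts = Map.of(
                "getCustomer", 4_100_000L,
                "searchAccounts", 2_600_000L,
                "updateCase", 900_000L,
                "printStatement", 12_000L,
                "adminConfig", 800L);

        long total = requestCounts.values().stream().mapToLong(Long::longValue).sum();
        long running = 0;

        // Walk the services from busiest to quietest, reporting each one's share
        // of the total load and the cumulative share so far.
        List<Map.Entry<String, Long>> byLoad = requestCounts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());

        for (Map.Entry<String, Long> e : byLoad) {
            running += e.getValue();
            System.out.printf("%-16s %5.1f%% of load (cumulative %5.1f%%)%n",
                    e.getKey(), 100.0 * e.getValue() / total, 100.0 * running / total);
        }
    }
}
```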

The combination of these two formulas brings focus to what we need to worry about when looking at performance and scalability.

So how do we fix the issue?

I’ve noticed that in fixing the problems, we tend to use four common patterns. Interestingly, they tend to behave a lot like plumbing.

Think of a cross-section of pipe with a big load of water coming through, but the pipe isn’t wide enough – you don’t have enough of some limited resource to push it all through. These four plumbing concepts are how we go about solving the issues, starting with the least likely pattern to come into play.

1. Slow down and smooth out

This pattern is where you intentionally apply a bottleneck to manage overwhelming loads. It’s the pattern for dealing with the proverbial raging herd, though it’s seldom applicable to the type of systems we’re talking about here.
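
To give a feel for what that intentional gate can look like, here’s a minimal, hypothetical sketch – the concurrency limit and timeout are invented numbers. It admits only a bounded number of requests at a time and sheds the excess quickly, rather than letting the full herd pile up on the back end.

```java
// Hypothetical sketch of "slow down and smooth out": admit a bounded number of
// requests at once and fail the rest fast, instead of letting a raging herd
// stampede the back end.
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class AdmissionGate {
    private final Semaphore permits = new Semaphore(500); // assumed safe concurrency limit

    public String handle(RequestHandler handler) throws InterruptedException {
        // Wait briefly for a slot; if none frees up, shed the request rather than queue forever.
        if (!permits.tryAcquire(200, TimeUnit.MILLISECONDS)) {
            return "BUSY - please retry later";
        }
        try {
            return handler.process();
        } finally {
            permits.release();
        }
    }

    public interface RequestHandler {
        String process();
    }
}
```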

2. Scale up

In plumbing terms, this pattern would be “get a bigger, thicker pipe” to handle the volume of water. In server terms, that means getting a bigger server or a faster CPU. It’s sometimes applicable, but with our modern technologies the scale-out pattern is usually the better fit.

3. Scale out

In server terms, this pattern means scaling out the number of servers and balancing the load across them. In cloud terms, you would spin up more pods for particular services according to the load on each. In other contexts we sometimes call it parallel processing or, in certain cases, multiplexing. All of these are the same pattern of processing in parallel – or, in the plumbing metaphor, adding thinner pipes alongside your main pipe so that water can flow through them concurrently.
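
In miniature, and purely as a hypothetical sketch, the idea looks something like this: independent work items are fanned out across a pool of workers rather than pushed through one at a time. The pool size and the “processing” itself are placeholders.

```java
// Hypothetical sketch of scale-out in miniature: fan independent work items
// out across a pool of workers ("thinner pipes in parallel").
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;

public class FanOut {
    public static List<String> processAll(List<String> items) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8); // assumed worker count
        try {
            // Each item becomes an independent task; the pool spreads them across workers.
            List<Callable<String>> tasks = items.stream()
                    .map(item -> (Callable<String>) () -> "processed:" + item)
                    .collect(Collectors.toList());

            // invokeAll blocks until every task has completed, then the results are gathered.
            return pool.invokeAll(tasks).stream()
                    .map(FanOut::resultOf)
                    .collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    private static String resultOf(Future<String> future) {
        try {
            return future.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        }
    }
}
```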

4. More efficiency

This final pattern is the one that applies most often: make it more efficient. Improve your code so that it’s less onerous on the resources it uses, and design it to run faster. In my experience, in about 95% of systems that aren’t scaling, this is the pattern to reach for.
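
To give a flavour of the kind of change this usually means – purely a hypothetical example, with invented table and column names – consider replacing a per-row database round trip with a single set-based query:

```java
// Hypothetical illustration of the "more efficiency" pattern: the same result,
// but one set-based query instead of N round trips. Table and column names are invented.
import java.sql.*;
import java.util.*;

public class BalanceLookup {

    // Inefficient: one SQL round trip per account - classic N+1 behaviour that melts under load.
    static Map<Long, Long> balancesSlow(Connection con, List<Long> accountIds) throws SQLException {
        Map<Long, Long> out = new HashMap<>();
        for (Long id : accountIds) {
            try (PreparedStatement ps = con.prepareStatement(
                    "SELECT balance FROM account WHERE id = ?")) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) out.put(id, rs.getLong(1));
                }
            }
        }
        return out;
    }

    // More efficient: one query fetches the whole batch in a single round trip.
    static Map<Long, Long> balancesFast(Connection con, List<Long> accountIds) throws SQLException {
        String placeholders = String.join(",", Collections.nCopies(accountIds.size(), "?"));
        Map<Long, Long> out = new HashMap<>();
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT id, balance FROM account WHERE id IN (" + placeholders + ")")) {
            for (int i = 0; i < accountIds.size(); i++) {
                ps.setLong(i + 1, accountIds.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) out.put(rs.getLong(1), rs.getLong(2));
            }
        }
        return out;
    }
}
```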

It’s pretty clear to me now that I’ve spent the last 18 years becoming an exceptionally well-paid plumber…

Design principles

Here are some of the design principles I’ve learnt, through pain, to use when scaling systems to very high concurrent loads:

  • Stick to standard technologies – it’s called the bleeding edge for a reason
  • Cache aggressively
  • Stateless: minimise your use of state
  • Don’t use disk – it won’t scale
  • Be mindful of the cascade of your service aggregation pattern
  • Avoid old things – don’t integrate to legacy if you need to scale
    • Protect that legacy system from yourself as a raging herd – they’re often fragile and are unlikely to scale as you’d need
    • If there’s no alternative, use the Circuit Breaker Pattern (see the sketch after this list)
  • Software engineers, pay attention to your SQL performance
    • Many are misguided by what good performance actually looks like
  • Pay attention to the pooling patterns in your technology
    • Don’t think of your pool as unlimited
    • Don’t fetch a resource from a pool unless you’re going to use it quickly and then release it
  • Don’t process large amounts of data server-side
  • Exception handling:
    • Avoid what I like to call “exception floods” from users
    • Avoid auto retries
  • Blocking and locking contention – thread safe, always
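
For the Circuit Breaker Pattern mentioned in the list above, here’s a minimal, hypothetical sketch of the idea – the failure threshold and cooldown are invented parameters. After a run of failures against a fragile legacy system, the breaker opens and fails fast, giving that system room to recover before calls are allowed through again.

```java
// Hypothetical sketch of the Circuit Breaker Pattern for calls into a fragile
// legacy system: after repeated failures the breaker "opens" and fails fast.
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

public class CircuitBreaker {
    private final int failureThreshold;   // consecutive failures before opening
    private final Duration cooldown;      // how long to stay open before retrying
    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    public CircuitBreaker(int failureThreshold, Duration cooldown) {
        this.failureThreshold = failureThreshold;
        this.cooldown = cooldown;
    }

    public synchronized <T> T call(Supplier<T> legacyCall, Supplier<T> fallback) {
        // While open and still cooling down, don't touch the legacy system at all.
        if (openedAt != null && Instant.now().isBefore(openedAt.plus(cooldown))) {
            return fallback.get();
        }
        try {
            T result = legacyCall.get();
            consecutiveFailures = 0;      // a success closes the breaker again
            openedAt = null;
            return result;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now(); // trip the breaker
            }
            return fallback.get();
        }
    }
}
```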

To watch Kevin’s full talk featuring more insights, principles, golden metrics and examples, click here.
