How to Design a Scalable Web Platform

12 Jan 2024 - 09 min read

What "scalable" actually means

"Scalable" gets used as a synonym for "good", which makes the word almost useless. For a web platform, scalability is something narrower and more measurable: the system can absorb more of whatever it is asked to absorb (more requests, more users, more data, more concurrent writes) without the user noticing, without the team firefighting, and without rewriting the architecture from scratch.

That definition is operational, not theoretical. A platform is scalable when the team knows what breaks first under load, has already removed or moved that bottleneck once or twice, and has a plan for the next one. A platform is not scalable when the answer to "what happens at three times today's load" is "we will see". Scaling is not making things bigger; it is knowing what breaks first and moving the bottleneck before users notice.

This article is the operational layer of the architecture conversation. The whole-system context for the choices below is in the architecture-choice article; here, the focus is what makes a chosen architecture hold up when load grows.

Start with the load shape, not the architecture

The most useful capacity-planning artefact is not a target number ("we need to support a million users"), it is a load shape: how the actual traffic is distributed in time, geography and request type. Every scalable design follows from this.

Three properties matter more than any peak number. The first is the peak-to-average ratio. A site that handles a thousand requests per second on average and ten thousand at peak is a different design problem from one that handles two thousand on average and three thousand at peak. The first one needs an architecture that can absorb a 10x burst; the second needs one that runs hot and steady. Pretending they are the same problem is how systems get over-engineered for bursts that never come or under-engineered for bursts that do.

The second is the request mix. A site that is 90% read and 10% write is solved mostly with caching. A site that is 50/50 is solved mostly with database tuning. A site that is 10% read and 90% write is solved mostly with queues, batches, and async writes. Knowing the mix is what tells you which lever to pull first.

The third is geographic distribution. A national platform with users in a single time zone has different cache and CDN choices from a global platform with traffic spread across continents. The CDN strategy that works for one is wasteful for the other.

Capacity planning that starts with these three properties consistently produces architectures that hold. Capacity planning that starts with a target user number consistently produces architectures that are wrong by an order of magnitude in some direction.
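As a rough illustration of the three properties above, the sketch below summarises a request log into a peak-to-average ratio, a read fraction and a per-region share. The `Request` fields and the `load_shape` function are hypothetical names chosen for this example, and treating GET, HEAD and OPTIONS as "reads" is an assumption that will not fit every API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    second: int   # timestamp bucketed to whole seconds (hypothetical log field)
    method: str   # HTTP method, e.g. "GET" or "POST"
    region: str   # coarse region label, e.g. "eu" or "us"


# Assumption: these methods are side-effect free and count as reads.
READ_METHODS = {"GET", "HEAD", "OPTIONS"}


def load_shape(requests: list[Request]) -> dict:
    """Summarise a request log into the three load-shape properties."""
    if not requests:
        raise ValueError("need at least one request")
    per_second = Counter(r.second for r in requests)
    # Average over the whole observed window, idle seconds included.
    window = max(per_second) - min(per_second) + 1
    average_rps = len(requests) / window
    peak_rps = max(per_second.values())
    reads = sum(1 for r in requests if r.method in READ_METHODS)
    regions = Counter(r.region for r in requests)
    return {
        "peak_to_average": peak_rps / average_rps,
        "read_fraction": reads / len(requests),
        "region_share": {k: v / len(requests) for k, v in regions.items()},
    }
```

Run against a real access log, the same three numbers give a first hint about whether bursts, the write path or geography deserves attention first.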

The caching hierarchy that does most of the work

Most production scalability comes from caching, and most caching mistakes come from caching at the wrong layer. A useful platform has caches at four layers, each doing a specific job.

Browser cache: the user's browser holds static assets (images, scripts, stylesheets) and revalidates them with conditional GETs instead of downloading them again. Tuning cache headers correctly is the cheapest scalability gain available, and most sites get it wrong by either caching too aggressively (stale content for hours) or not caching at all (every navigation re-downloads everything).
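A minimal sketch of what "tuning cache headers" can look like, written as plain functions rather than for any particular framework. The `/static/` prefix, the assumption that those files carry a content fingerprint in their name, and the specific `max-age` values are all illustrative choices, not recommendations from the article. The ETag comparison is deliberately simplified: a real `If-None-Match` header can carry several values and weak validators.

```python
import hashlib
from typing import Optional


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control value from the kind of resource being served."""
    if path.startswith("/static/"):
        # Assumption: fingerprinted file names, so a URL never changes content.
        return "public, max-age=31536000, immutable"
    if path == "/" or path.endswith(".html"):
        # "no-cache" means: may be stored, but must be revalidated before reuse.
        return "no-cache"
    # Hypothetical default for everything else: short, per-user caching.
    return "private, max-age=60"


def make_etag(body: bytes) -> str:
    """Derive a strong validator from the response body."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(path: str, body: bytes, if_none_match: Optional[str] = None):
    """Return (status, headers, body), answering 304 when the ETag matches."""
    etag = make_etag(body)
    headers = {"Cache-Control": cache_control_for(path), "ETag": etag}
    if if_none_match == etag:
        # Conditional GET hit: the browser's copy is still valid.
        return 304, headers, b""
    return 200, headers, body
```

The first request for a page returns the full body with an `ETag`; a repeat request that sends the same value back in `If-None-Match` gets an empty 304 instead.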

CDN cache: a content delivery network holds entire HTML pages and assets close to the user. For static or rarely-changing pages, this turns most of the traffic into a problem the CDN solves before the request ever reaches the application. For dynamic pages, the CDN can still hold for short windows (seconds to minutes) and absorb traffic spikes that would otherwise destroy the origin.

Application cache: an in-memory or shared cache (Redis, Memcached, or in-process LRU) holds the results of expensive computations and database queries. The discipline here is to cache the right thing at the right key with the right TTL: too short, and the cache does nothing; too long, and users see stale data. Cache invalidation is one of the famous "two hard problems" in computer science, alongside naming things, and it is arguably the harder of the two.
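The "right key, right TTL" discipline can be sketched as a small in-process cache using the cache-aside pattern. `TTLCache` and its methods are hypothetical names for this example. The sketch is not thread-safe and has no size bound, both of which a production cache (or Redis/Memcached) would need to address.

```python
import time
from typing import Any, Callable, Hashable


class TTLCache:
    """In-process cache-aside helper with a fixed time-to-live per entry."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[Hashable, tuple[float, Any]] = {}

    def get_or_load(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached value, calling `loader` only on a miss or expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader()  # the expensive query or computation
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop an entry explicitly, e.g. right after the underlying row changes."""
        self._entries.pop(key, None)
```

Explicit invalidation on write plus a TTL as a safety net is one common combination; the TTL then bounds how long a missed invalidation can keep stale data alive.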

Database cache and query plan cache: the database itself holds frequently read pages in memory and, in many engines, reuses cached query plans. Sizing the database server correctly so the working set fits in RAM is often a bigger lever than any application-side optimisation, because a query that hits memory is orders of magnitude faster than one that hits disk.

When all four layers work together, the application server only sees the requests it actually has to handle, not the ones the cache could have absorbed. When one of them is missing or misconfigured, scaling the application by adding instances is solving the wrong problem.

Database scaling: where most platforms actually break

For most growing web platforms, the database is the first thing to break under sustained load, and the last thing the team thinks about when planning. A platform that scales makes five changes, roughly in this order.

Indexing. The single highest-leverage performance work, and the most often skipped. Slow queries become fast queries because the right index exists; without it, no amount of horizontal scaling helps.
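One way to see the effect of an index is to ask the database for its query plan before and after creating it. The sketch below uses SQLite from the Python standard library purely because it runs anywhere; the `users` table, the `idx_users_email` index and the `query_plan` helper are made up for the example, and the exact plan wording differs between SQLite versions and between database engines.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's plan for a statement as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

lookup = "SELECT name FROM users WHERE email = ?"
args = ("user500@example.com",)

# Without an index the plan is a full table scan.
plan_before = query_plan(conn, lookup, args)

conn.execute("CREATE INDEX idx_users_email ON users (email)")

# With the index the plan becomes a search that names idx_users_email.
plan_after = query_plan(conn, lookup, args)
```

The same habit carries over to other engines through their own `EXPLAIN` variants: check the plan for the slow query first, then decide which index it is missing.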

Connection pooling. The application opens connections to the database. Without a pool, every request opens a new connection, which is expensive and fragile. With a pool, the application reuses connections and the database's connection count stays under control. This is a 20-line change that prevents most "database is at 100%" incidents.
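To make the idea concrete, here is a minimal sketch of a fixed-size pool built on a thread-safe queue. `ConnectionPool` is a hypothetical class for illustration; most database drivers and frameworks ship their own pool, which should normally be preferred because it also handles broken connections, transaction cleanup and health checks, none of which this sketch does.

```python
import queue
import sqlite3
from contextlib import contextmanager
from typing import Callable


class ConnectionPool:
    """Fixed-size pool: connections are opened once and then reused."""

    def __init__(self, factory: Callable[[], object], size: int,
                 timeout: float = 5.0):
        self._timeout = timeout
        self._idle: queue.Queue = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self):
        # Waits up to `timeout` seconds for a free connection and raises
        # queue.Empty if none is returned in time, which caps the number of
        # connections the database ever sees from this process.
        conn = self._idle.get(timeout=self._timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)


# Usage with SQLite as a stand-in for a real database server.
pool = ConnectionPool(
    lambda: sqlite3.connect(":memory:", check_same_thread=False), size=2
)
with pool.connection() as conn:
    value = conn.execute("SELECT 1").fetchone()[0]
```

The important property is the upper bound: under a traffic spike, requests wait briefly for a connection instead of opening hundreds of new ones against the database.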

Read replicas. Most web platforms are read-heavy. Routing reads to replicas while writes go to the primary multiplies the read capacity of the system without changing the application meaningfully. The honest catch is replication lag: a write that just happened on the primary may not be visible on the replica yet, which can produce surprising bugs in workflows that read immediately after a write.
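One common mitigation for the read-after-write problem is to pin a session to the primary for a short window after it writes. The sketch below shows that routing decision only; `ReplicaRouter`, the session identifier and the two-second pin window are assumptions for the example, and the window only helps if it is longer than the replication lag actually observed in the system.

```python
import time
from typing import Callable, Sequence


class ReplicaRouter:
    """Send writes to the primary and reads to replicas, except just after a write."""

    def __init__(self, primary: str, replicas: Sequence[str],
                 pin_seconds: float = 2.0,
                 clock: Callable[[], float] = time.monotonic):
        self._primary = primary
        self._replicas = list(replicas)
        self._pin = pin_seconds  # assumed to exceed typical replication lag
        self._clock = clock
        self._last_write: dict[str, float] = {}  # session id -> time of last write
        self._next = 0

    def for_write(self, session_id: str) -> str:
        self._last_write[session_id] = self._clock()
        return self._primary

    def for_read(self, session_id: str) -> str:
        wrote_at = self._last_write.get(session_id)
        recently_wrote = (wrote_at is not None
                          and self._clock() - wrote_at < self._pin)
        if recently_wrote or not self._replicas:
            # Read-your-own-writes: this session must see what it just saved.
            return self._primary
        # Plain round-robin across replicas for everyone else.
        replica = self._replicas[self._next % len(self._replicas)]
        self._next += 1
        return replica
```

Only the session that wrote is pinned, so the bulk of read traffic still lands on the replicas.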

Sharding by tenant or by domain key. When a single primary cannot handle the write load, the data is split across multiple databases by a key that the application can route on. This is heavy engineering with real failure modes; it is the right answer when reads-from-replicas is no longer enough, and the wrong answer earlier than that.
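The routing step itself can be very small. The sketch below maps a tenant identifier to a shard number with a stable hash; `shard_for` is a hypothetical helper, and hash-modulo routing is only one option. Its known weakness is that changing the shard count remaps most tenants, which is why a lookup table or consistent hashing is often used once shards need to be added or rebalanced.

```python
import hashlib


def shard_for(tenant_id: str, shard_count: int) -> int:
    """Map a tenant to a shard index in [0, shard_count) deterministically."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # A cryptographic hash is used for its stability and even spread;
    # Python's built-in hash() is randomised per process and would not do.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Every query for a tenant then goes to the database at that index, which is also why cross-tenant queries become the expensive case after sharding.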

Caching at the database boundary. Many "database is the bottleneck" problems are actually "we are asking the database the same question 50 times a second" problems. An application cache in front of the read path absorbs that, freeing the database for the queries that genuinely need it.

None of these steps requires microservices; they are the boring, well-known ladder that every growing platform climbs. Skipping rungs is how teams discover, the expensive way, why the rungs are there.

Async boundaries and the queue architecture

A scalable platform aggressively separates "the user is waiting" work from "the user is not waiting" work. The first has to be fast; the second can be slow as long as it is reliable.

User-facing requests do the smallest possible amount of work synchronously: validate, persist a small record, enqueue a job, return. Background workers pick up the job and do the heavy work (sending emails, generating reports, calling third-party APIs, processing uploads, recomputing aggregates). The queue between them absorbs spikes: a hundred jobs queued does not slow down the user, and the workers process at their own pace.
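The "validate, persist a small record, enqueue a job, return" sequence can be sketched as a single handler. Everything here is a stand-in: the in-memory `reports` dict plays the role of a database table, `queue.Queue` plays the role of a real job queue, and `request_report` and its payload shape are invented for the example.

```python
import queue
import uuid

jobs: queue.Queue = queue.Queue()   # stand-in for a real job queue
reports: dict[str, dict] = {}       # stand-in for a database table


def request_report(payload: dict) -> tuple[int, dict]:
    """Handle a 'generate report' request without doing the heavy work inline."""
    # 1. Validate.
    customer_id = payload.get("customer_id")
    if not isinstance(customer_id, int):
        return 400, {"error": "customer_id must be an integer"}

    # 2. Persist a small record so the job has something to update later.
    report_id = str(uuid.uuid4())
    reports[report_id] = {"customer_id": customer_id, "status": "pending"}

    # 3. Enqueue the heavy work for a background worker.
    jobs.put({"type": "generate_report", "report_id": report_id})

    # 4. Return immediately; 202 Accepted signals "received, not finished".
    return 202, {"report_id": report_id, "status": "pending"}
```

The client can poll or be notified when the record's status changes, and a slow report generator no longer affects the response time of the request that asked for it.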

Three queue properties make this work in practice. Idempotency: a job has to be safe to run twice, because retries will happen. Backoff and dead-letter queues: when a job keeps failing, it stops blocking the queue and lands somewhere a human can investigate. Visibility: the queue depth, the worker throughput and the failed-job count are part of the dashboard, not a black box.
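The three properties can be sketched together in one small worker. `Job`, `Worker` and the retry limits are hypothetical. Idempotency is approximated here by remembering completed job ids in memory, whereas a real system would need to record completion durably and atomically with the job's side effect. Retries are recorded as a schedule rather than actually slept on.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Job:
    job_id: str
    payload: dict
    attempts: int = 0


class Worker:
    def __init__(self, handler: Callable[[dict], None],
                 max_attempts: int = 3, base_delay: float = 1.0):
        self.handler = handler
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.completed: set[str] = set()                    # idempotency record
        self.retry_schedule: list[tuple[float, Job]] = []   # (delay seconds, job)
        self.dead_letter: list[Job] = []                    # for human investigation

    def process(self, job: Job) -> str:
        if job.job_id in self.completed:
            return "skipped"  # duplicate delivery of a job that already ran
        job.attempts += 1
        try:
            self.handler(job.payload)
        except Exception:
            if job.attempts >= self.max_attempts:
                self.dead_letter.append(job)
                return "dead-lettered"
            # Exponential backoff: base, 2x base, 4x base, ...
            delay = self.base_delay * 2 ** (job.attempts - 1)
            self.retry_schedule.append((delay, job))
            return "retry"
        self.completed.add(job.job_id)
        return "done"

    def metrics(self) -> dict:
        """Counts that belong on the dashboard."""
        return {
            "completed": len(self.completed),
            "pending_retries": len(self.retry_schedule),
            "dead_lettered": len(self.dead_letter),
        }
```

A job that keeps failing ends up in `dead_letter` after `max_attempts` tries instead of being retried forever, and `metrics()` exposes the counts that make the queue visible.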

The platforms that hold under spikes are the ones where the synchronous path stays under control because most of the work happens asynchronously, with the queue acting as the shock absorber. The platforms that buckle under spikes are the ones where every request triggers everything inline, and a third-party API slowdown takes down the whole site.

Observability under load: the early warning system

A scalable platform is not one that never breaks; it is one that signals it is about to break, in time for the team to do something. That requires observability designed for load, not just for debugging.

Three things have to be visible at all times. Latency percentiles, not averages. The p99 of response time tells you what your worst-served users see; the average tells you very little. A site whose p50 is 200 ms and p99 is 8 seconds has a problem that an average of 400 ms will hide.
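A small worked example of why the average hides the tail. The `percentile` helper uses the simple nearest-rank definition (monitoring tools often interpolate or use histograms instead), and the latency sample is synthetic: 98 requests at 200 ms and 2 at 8,000 ms, chosen to resemble the figures above.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: mostly fast, with a slow tail.
latencies_ms = [200] * 98 + [8000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)   # 356.0
p50_ms = percentile(latencies_ms, 50)             # 200
p99_ms = percentile(latencies_ms, 99)             # 8000
```

The mean of 356 ms looks acceptable, while the p99 shows that some users are waiting 8 seconds.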

Saturation indicators. CPU utilisation, memory pressure, database connection count, queue depth, cache hit rate. These say how close the system is to its limits, not just how it is performing today. A system at 80% saturation today is one bad release away from a bad afternoon.

Per-route or per-tenant breakdowns. The aggregate p99 might be fine while a specific endpoint or a specific customer is in pain. Breakdowns surface that before it becomes a support ticket storm.

The discipline that makes observability useful is alerting on saturation and latency before user-visible failure, with the threshold tuned by experience rather than copied from a tutorial. Teams that get this right see incidents coming. Teams that get it wrong learn about incidents from customer support.
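Alerting on saturation before failure can be sketched as a two-level threshold check. The metric names and every number in `THRESHOLDS` are placeholders invented for this example; as the paragraph above says, real thresholds have to be tuned from experience with the specific system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warn: float                   # early warning: look at it during working hours
    page: float                   # urgent: wake someone up
    higher_is_worse: bool = True  # False for metrics like cache hit rate


# Placeholder values only; tune these per system.
THRESHOLDS = {
    "cpu_utilisation": Threshold(warn=0.70, page=0.85),
    "db_connections_used": Threshold(warn=0.75, page=0.90),
    "queue_depth": Threshold(warn=1000, page=5000),
    "cache_hit_rate": Threshold(warn=0.90, page=0.80, higher_is_worse=False),
}


def evaluate(metrics: dict[str, float]) -> dict[str, str]:
    """Return the alert level ("warn" or "page") for each metric over its limit."""
    alerts = {}
    for name, value in metrics.items():
        threshold = THRESHOLDS.get(name)
        if threshold is None:
            continue  # unknown metrics are ignored rather than guessed at

        def breached(limit: float) -> bool:
            if threshold.higher_is_worse:
                return value >= limit
            return value <= limit

        if breached(threshold.page):
            alerts[name] = "page"
        elif breached(threshold.warn):
            alerts[name] = "warn"
    return alerts
```

The "warn" level is what gives the team time to act: it fires while the system is still serving users normally.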

Final takeaway

A scalable web platform is not the one with the biggest servers; it is the one whose architecture, caching, database, async path and observability all hold under more load than the current load by a deliberate margin. Scaling is bottleneck migration: every solved bottleneck reveals the next one, and the platforms that scale well are the ones whose teams expect that and have a habit of moving the next bottleneck before it becomes an incident. Building for a hypothetical million users is usually a mistake; building for the next plausible step, with the operational tools to see the step after that coming, is usually the right move.

The wider context, including how scalability connects to the rest of the architecture decision, is collected in the web architecture and platforms insights cluster. And when the question moves from "is the platform scalable" to "we have a real load shape, a real database under pressure, and we need someone to design the indexes, the cache, the read replicas and the async path", that is exactly what my database architecture and performance practice is built around.

- Haja Faniry

