Last Black Friday, at 14:23 GMT, our payment platform processed 847 transactions per second. We handled £4.2M in transactions in 40 minutes. Nothing broke.
This is how we built it.
I'm not going to give you the fluffy "use microservices and you'll be fine" advice. I'm going to walk you through the actual architecture decisions, the mistakes we made in v1, and the specific patterns that got us from "this thing falls over at 500 requests per minute" to handling 10 million transactions per day.
Start With Idempotency or Don't Start
Every payment system discussion should begin here, and almost none of them do. Idempotency is the property that applying the same operation multiple times produces the same result. Without it, network retries create duplicate charges. Users get billed twice. You get chargebacks. Your CEO gets a very unpleasant phone call.
The pattern is simple but needs to be consistent across every layer:
```typescript
// Every payment request gets a client-generated idempotency key.
// `redis` is a connected client and `executePayment` the call to the
// payment processor; both are defined elsewhere in the service.
interface PaymentRequest {
  idempotencyKey: string; // UUID, generated by client
  amount: number;
  currency: string;
  customerId: string;
  paymentMethodId: string;
}

async function processPayment(request: PaymentRequest) {
  const existing = await redis.get(`idempotency:${request.idempotencyKey}`);
  if (existing) {
    return JSON.parse(existing); // Return cached result, no duplicate charge
  }

  const result = await executePayment(request);
  await redis.setex(
    `idempotency:${request.idempotencyKey}`,
    86400, // 24-hour TTL
    JSON.stringify(result)
  );
  return result;
}
```
The key sits in Redis for 24 hours. If the client retries—due to a timeout, a network blip, anything—they get back the same result. The payment is not processed twice.
This sounds simple. Applying it consistently across a team of engineers building 30+ services over 18 months is not. We made it a requirement at the API boundary, enforced it in code review, and built monitoring to catch any endpoint that didn't implement it.
The Architecture: Events Over RPC
V1 of this platform was synchronous. A payment request would: validate → charge card → update ledger → trigger fulfilment → send confirmation email. All in one HTTP request, all in one database transaction.
This works fine until your email service has a 3-second outage, or your fulfilment service is slow during peak, or your ledger update hits a database lock. Any one of those creates a cascading failure that impacts every transaction in flight.
V2 used an event-driven architecture. The payment request does three things synchronously: validate, charge the card, write an event to Kafka. Everything else is downstream.
```
Payment API → [validate + charge] → Kafka topic: payment.processed
                                        ↓
                      Ledger Service      (consumer)
                      Fulfilment Service  (consumer)
                      Email Service       (consumer)
                      Analytics Service   (consumer)
```
Each consumer processes events independently. If email is down, payments still process. The email service catches up when it recovers, because the event is durably stored in Kafka with a 72-hour retention window.
The trade-off: the system becomes eventually consistent. The ledger update happens milliseconds after the payment, not in the same transaction. For most financial applications, this is acceptable. For some—particularly those with strict balance enforcement—it's not, and you need compensating transactions.
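A compensating transaction for this case can be sketched as follows. This is an illustrative handler, not our production code: the event shape, the `Ledger` interface, and the `payment.refund_requested` topic name are all hypothetical.

```typescript
// If the strict-balance ledger write rejects a payment whose charge
// already succeeded upstream, we can't roll back the charge in the same
// transaction. Instead, emit a compensating refund event.
type PaymentEvent = { paymentId: string; amount: number };

interface Ledger {
  // Throws if applying the event would breach a balance constraint.
  apply(event: PaymentEvent): void;
}

function handlePaymentProcessed(
  event: PaymentEvent,
  ledger: Ledger,
  publish: (topic: string, event: PaymentEvent) => void
): "applied" | "compensated" {
  try {
    ledger.apply(event);
    return "applied";
  } catch {
    // The charge is real; reverse it rather than retrying forever.
    publish("payment.refund_requested", event);
    return "compensated";
  }
}
```

The key property is that the compensation is itself an event, so it flows through the same durable, retryable pipeline as everything else.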
Handling Failure: Circuit Breakers and Dead Letter Queues
The third payment processor we integrated had a particularly unreliable production environment. Without circuit breaking, a timeout on their API would propagate back to our callers and eventually exhaust our connection pool.
We implemented a circuit breaker per payment processor: if 5 consecutive requests fail within a 30-second window, the circuit opens and we route to a fallback processor. The circuit half-opens after 60 seconds to check if the processor has recovered.
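The state machine behind that is small. Here's a minimal sketch of the closed → open → half-open cycle; it uses the thresholds described above but omits the 30-second failure window for brevity, and the class itself is illustrative rather than our production implementation:

```typescript
// Circuit breaker per payment processor: opens after N consecutive
// failures, half-opens after a cooldown to let a probe request through.
type State = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 5,
    private readonly cooldownMs = 60_000,
    private readonly now: () => number = Date.now // injectable for tests
  ) {}

  state(): State {
    if (this.failures < this.maxFailures) return "closed";
    return this.now() - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  allowRequest(): boolean {
    return this.state() !== "open"; // half-open lets a probe through
  }

  recordSuccess(): void {
    this.failures = 0; // probe succeeded: close the circuit
  }

  recordFailure(): void {
    this.failures++;
    if (this.failures === this.maxFailures) this.openedAt = this.now();
  }
}
```

When `allowRequest()` returns false, the caller routes to the fallback processor instead of waiting on a timeout.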
Failed events go to dead letter queues—a separate Kafka topic per service for messages that failed processing after 3 retries. Our ops tooling monitors DLQ depth as a leading indicator of service health. A growing DLQ means something is failing silently.
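The retry-then-park flow looks roughly like this. The DLQ is modelled as an array here purely for illustration; in production it's a per-service Kafka topic:

```typescript
// Retry a message handler up to `maxRetries` times; on exhaustion,
// park the message on the dead letter queue instead of blocking the
// rest of the partition.
async function consumeWithDlq<T>(
  message: T,
  handler: (msg: T) => Promise<void>,
  dlq: T[],
  maxRetries = 3
): Promise<"ok" | "dead-lettered"> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      await handler(message);
      return "ok";
    } catch {
      // swallow and retry; backoff between attempts elided
    }
  }
  dlq.push(message); // ops tooling alerts on DLQ depth
  return "dead-lettered";
}
```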
The Database Layer
PostgreSQL. Not a NoSQL store. Not NewSQL. Plain PostgreSQL 16, with very careful attention to schema design.
The main payments table has row-level locking for balance operations. We use `SELECT ... FOR UPDATE` when reading a balance before updating it. Yes, this creates contention under high load. No, it's not a problem at our scale because we shard by customer ID—so locks only contend within a single customer's transactions.
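The lock-then-update sequence looks roughly like this. The `Query` type stands in for a node-postgres client's query method, and the table and column names are illustrative, not our actual schema:

```typescript
// Debit a balance under a row lock. Concurrent debits for the same
// customer serialise on the FOR UPDATE lock; different customers
// (different shards) never contend.
type Query = (sql: string, params: unknown[]) => Promise<{ rows: any[] }>;

async function debitBalance(query: Query, customerId: string, amount: number) {
  await query("BEGIN", []);
  try {
    // Row lock held until COMMIT/ROLLBACK.
    const { rows } = await query(
      "SELECT balance FROM balances WHERE customer_id = $1 FOR UPDATE",
      [customerId]
    );
    if (rows[0].balance < amount) throw new Error("insufficient funds");
    await query(
      "UPDATE balances SET balance = balance - $1 WHERE customer_id = $2",
      [amount, customerId]
    );
    await query("COMMIT", []);
  } catch (err) {
    await query("ROLLBACK", []);
    throw err;
  }
}
```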
Read replicas for everything that doesn't need to see the absolute latest data: reporting queries, analytics, dashboard endpoints. Primary database handles only write-path operations.
PgBouncer for connection pooling. Without it, 500 concurrent API workers attempting to connect directly to PostgreSQL would exhaust its connection limit. With PgBouncer in transaction pooling mode, we handle 10K concurrent API connections against a PostgreSQL instance configured for 200 database connections.
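The relevant `pgbouncer.ini` settings are a handful of lines. The values below are illustrative of the shape described above, not our exact production config (note that `default_pool_size` applies per database/user pair):

```ini
[pgbouncer]
pool_mode = transaction        ; release the server connection at transaction end
max_client_conn = 10000        ; API-side connections PgBouncer will accept
default_pool_size = 200        ; actual PostgreSQL connections per db/user pair
```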
Performance Under Load: The Numbers
At peak (Black Friday), we were running:

- 847 transactions per second
- P50 latency: 180ms (card charge + event publish)
- P99 latency: 620ms
- Error rate: 0.003%
The P99 was the number we optimised hardest. Long-tail latency in payment processing destroys UX—users think their card has been declined when it's actually just slow.
The optimisations that moved the needle: connection pooling (took P99 from 2.1s to 900ms), async email and fulfilment (took P99 from 900ms to 620ms), and pre-warming instances before peak events (eliminated cold start latency).
PCI DSS Is a Constraint, Not a Feature
We don't store card numbers. We don't process raw PANs. We tokenise at the edge—Stripe or Braintree handles sensitive card data, we handle tokens. This drops us to PCI SAQ A scope, which is vastly simpler than full PCI DSS Level 1 compliance.
Network segmentation: payment processing services run in a dedicated VPC with no internet access except outbound to payment processor endpoints. Everything else communicates via private endpoints. Audit logging on every transaction with immutable storage.
Penetration testing twice a year, with finding remediation tracked through SLA-based escalation. Not because our clients demanded it—because we demanded it of ourselves.
The Part Nobody Writes About: Reconciliation
Every night, we reconcile our internal ledger against our payment processor's settlement reports. Discrepancies are flagged, investigated, and resolved before the next business day. This process, which nobody finds exciting, has caught three data integrity issues that would have been material misstatements in our clients' financials.
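The core of that nightly job is a diff by payment ID and amount. A minimal sketch, with illustrative record shapes (amounts in minor units to avoid float arithmetic):

```typescript
// Compare the internal ledger against the processor's settlement report.
// Anything present on one side but not the other, or with a differing
// amount, is a discrepancy to investigate before the next business day.
type Entry = { paymentId: string; amountMinor: number };

function reconcile(ledger: Entry[], settlement: Entry[]): string[] {
  const discrepancies: string[] = [];
  const settled = new Map(settlement.map((e) => [e.paymentId, e.amountMinor]));
  for (const entry of ledger) {
    const amount = settled.get(entry.paymentId);
    if (amount === undefined) {
      discrepancies.push(`${entry.paymentId}: missing from settlement`);
    } else if (amount !== entry.amountMinor) {
      discrepancies.push(`${entry.paymentId}: amount mismatch`);
    }
    settled.delete(entry.paymentId);
  }
  for (const paymentId of settled.keys()) {
    discrepancies.push(`${paymentId}: missing from ledger`);
  }
  return discrepancies;
}
```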
Soft deletes everywhere. In a financial system, you never delete a transaction record. You mark it as voided, refunded, or cancelled. The audit trail is non-negotiable.
The monitoring that actually matters: not just availability and latency, but business metrics. Transaction volume relative to historical baseline, refund rate, chargeback rate, payment method decline rate. Anomalies in these numbers often indicate fraud or integration failures before they show up in technical monitoring.
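The baseline comparison itself is simple; the value is in wiring it to the right metrics. A sketch, with an illustrative tolerance:

```typescript
// Flag a business metric (refund rate, decline rate, volume) that
// deviates from its historical baseline by more than a tolerance.
// The 25% default is illustrative; real thresholds are per-metric.
function isAnomalous(current: number, baseline: number, tolerance = 0.25): boolean {
  if (baseline === 0) return current !== 0;
  return Math.abs(current - baseline) / baseline > tolerance;
}
```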
Building a payment platform that handles scale is a solvable engineering problem. The answers are well-known: idempotency, events, circuit breakers, appropriate database choices. What makes it hard is applying these patterns consistently, under business pressure, across an entire team. That's an organisational problem as much as a technical one.