Webhook Notification Platform for Payment Transactions

Apr 7, 2024

Introduction:

The ever-growing e-commerce landscape demands robust payment processing systems that can handle surging transaction volumes. A critical component of this infrastructure is the delivery of real-time payment notifications to merchants. These webhooks empower merchants to efficiently manage transactions, enabling functionalities like reconciliation, bookkeeping, and reporting. This blog post dives into the architectural considerations and design iterations for crafting a scalable and reliable webhook notification platform specifically for payment transactions.

Key Characteristics of the Platform

Unwavering Reliability: Data loss is unacceptable. The platform must guarantee eventual delivery of all webhook events, ensuring merchants receive crucial payment updates.
Effortless Scalability: The platform should seamlessly handle millions of daily transactions without compromising performance or responsiveness.
High Availability (Desirable): While near real-time delivery is ideal, availability requirements can be slightly relaxed as long as events are ultimately delivered reliably.
In-depth Observability and Monitoring: Monitoring the platform’s health is paramount. We need to understand how different components interact, identify performance bottlenecks, and pinpoint failing elements to ensure smooth operation.

Design Iteration #1: Pub-Sub Architecture (A Starting Point)

Our initial design leverages a publish-subscribe (pub-sub) architecture. In this setup, the payment service acts as the publisher, broadcasting payment status change events to a message broker. Subscribers then consume these messages and trigger follow-up actions such as:

Retrieving merchant details (endpoint URL, credentials) from a merchant info service.
Sending the notification payload to the merchant’s designated endpoint.
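
The two subscriber steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the `MerchantInfo` fields, the in-memory `MERCHANTS` lookup (standing in for the merchant info service), the bearer-token header, and the injected `send` callable are all assumptions made so the example runs without a broker or a network.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class MerchantInfo:
    endpoint_url: str
    api_key: str


# Hypothetical stand-in for the merchant info service.
MERCHANTS: Dict[str, MerchantInfo] = {
    "m-1": MerchantInfo("https://merchant.example/webhooks", "secret-1"),
}


def build_request(event: dict, merchant: MerchantInfo) -> dict:
    """Build the HTTP request a subscriber would send to the merchant."""
    return {
        "url": merchant.endpoint_url,
        "headers": {
            "Content-Type": "application/json",
            # Assumed auth scheme; real merchants may expect a signature instead.
            "Authorization": f"Bearer {merchant.api_key}",
        },
        "body": json.dumps(event, sort_keys=True),
    }


def handle_event(event: dict, send: Callable[[dict], int]) -> bool:
    """Subscriber callback: look up the merchant, then deliver the payload.

    `send` performs the HTTP POST and returns the status code. It is
    injected so the sketch can be exercised without a network.
    """
    merchant = MERCHANTS[event["merchant_id"]]
    status = send(build_request(event, merchant))
    return 200 <= status < 300
```

Note that nothing here handles a failed lookup or a failed POST, which is exactly the gap the next iterations try to close.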

This architecture addresses scalability, latency, and availability to a reasonable extent, but it does not guarantee reliability. The main shortcomings are:

Single Point of Failure: If any service in the chain (notification, merchant info, or the merchant’s own endpoint) becomes unavailable, event delivery stalls.
Message Broker Issues: Broker downtime or network problems can disrupt message delivery or cause events to be lost.

Design Iteration #2: Pub-Sub with Fault Tolerance Mechanisms

We can enhance reliability by introducing fault tolerance mechanisms within each component:

1. Payment Service: To guarantee delivery and to decouple message publishing from the core payment logic, we can implement the Transactional Outbox Pattern. With this pattern, the transaction record and the event payload are written to the database in the same database transaction, so neither can be persisted without the other. A separate process can then read the stored events and publish them to the broker, so events survive even if the broker is temporarily unavailable.
2. Broker and Subscriber: We can introduce retries on both the broker and subscriber sides for enhanced resilience:

Retry Queue: An additional retry queue is added to the broker to hold events that the subscriber could not process because of transient failures such as network errors or temporary service unavailability. In such cases, the subscriber manages event-level retries and re-queues the events.
Exception Queue: Similarly, an exception queue is added to the broker to hold events that could not be processed even after retries. This queue can be monitored, debugged, and processed separately.

3. Notification and Merchant Info Services: These services can also benefit from retry mechanisms for handling potential downstream service failures.

4. Merchant Info Service: Considering the read-heavy nature of merchant data access, utilizing a highly available database with caching can further improve performance and reliability.
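
To make the Transactional Outbox Pattern from the first item more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`payments`, `outbox`), the `published` flag, and the `publish` callable that stands in for the broker client are assumptions for illustration only; a production system would use its own schema and database.

```python
import json
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )


def record_payment(conn: sqlite3.Connection, payment_id: str, status: str) -> None:
    """Write the payment row and its event in one atomic transaction."""
    event = {"payment_id": payment_id, "status": status}
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR REPLACE INTO payments (id, status) VALUES (?, ?)",
            (payment_id, status),
        )
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def relay_outbox(conn: sqlite3.Connection, publish: Callable[[dict], None]) -> int:
    """Publish pending events in order; mark each one only after publish succeeds."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        try:
            publish(json.loads(payload))
        except Exception:
            break  # broker unavailable: leave the rest for the next run
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

One consequence of this design: if the process crashes after `publish` succeeds but before the row is marked, the event is published again on the next run. The outbox therefore gives at-least-once delivery, not exactly-once.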

This iteration strengthens the design by introducing retries and message persistence, significantly improving reliability. However, there are still edge cases to consider:

Timeouts and Duplicates: Retries can lead to duplicate event processing. For example, if a connection times out after the merchant has already received the event, the retry sends the same event to the merchant again.
Retry Storming: Retries can have a cascading effect and overload downstream services.
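
For reference, the retry queue and exception queue routing used in this iteration can be sketched with in-memory queues. Everything named here is hypothetical: `Envelope`, `MAX_ATTEMPTS`, and the `deliver` callable are illustration-only, and a real broker would provide the queues. The sketch also retries immediately in a tight loop, which keeps it short but is precisely the behavior that makes retry storms possible.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque

MAX_ATTEMPTS = 3  # assumed retry budget per event


@dataclass
class Envelope:
    event: dict
    attempts: int = 0


def process(
    envelope: Envelope,
    deliver: Callable[[dict], None],
    retry_queue: Deque[Envelope],
    exception_queue: Deque[Envelope],
) -> bool:
    """Try one delivery; route a failure to the retry or exception queue."""
    try:
        deliver(envelope.event)
        return True
    except Exception:
        envelope.attempts += 1
        if envelope.attempts < MAX_ATTEMPTS:
            retry_queue.append(envelope)      # transient failure: try again later
        else:
            exception_queue.append(envelope)  # retries exhausted: park for inspection
        return False


def drain(main_queue: Deque[Envelope], deliver: Callable[[dict], None]) -> Deque[Envelope]:
    """Process the main queue, then the retry queue, and return the exception queue."""
    retry_queue: Deque[Envelope] = deque()
    exception_queue: Deque[Envelope] = deque()
    while main_queue:
        process(main_queue.popleft(), deliver, retry_queue, exception_queue)
    while retry_queue:
        process(retry_queue.popleft(), deliver, retry_queue, exception_queue)
    return exception_queue
```

Events that end up in the returned exception queue have used their full retry budget and are the ones to monitor and debug separately.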

Design Iteration #3: State Management and Idempotency for Enhanced Reliability

To further bolster reliability and address the remaining edge cases, we can introduce the following mechanisms:

State Management: By storing the delivery state of each webhook event (delivered/undelivered), we can optimize retries: only undelivered events are retried, preventing unnecessary processing.
Idempotency with Message IDs: Assigning each event a unique message ID that serves as an idempotency key lets the receiver recognize a repeated delivery, so retries don’t introduce unintended side effects like duplicate transactions.
Circuit Breakers: Circuit breakers short-circuit traffic to failing components and help avoid cascading effects such as retry storms.
Observability: Reports based on delivery states can help pinpoint problematic merchants with unstable APIs.
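
A rough sketch of how delivery state, idempotency keys, and a circuit breaker could fit together is shown below. All class and function names are hypothetical, the UUID-based message ID is an assumption, and the breaker is a deliberately simple variant (consecutive-failure threshold plus a cooldown) rather than any particular library's implementation.

```python
import time
import uuid
from typing import Callable, Dict, List, Optional, Set


class DeliveryLog:
    """Sender side: tracks delivery state per message ID."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}

    def new_message(self) -> str:
        message_id = str(uuid.uuid4())  # doubles as the idempotency key
        self.state[message_id] = "undelivered"
        return message_id

    def needs_delivery(self, message_id: str) -> bool:
        return self.state.get(message_id) == "undelivered"

    def mark_delivered(self, message_id: str) -> None:
        self.state[message_id] = "delivered"


class MerchantReceiver:
    """Receiver side: applies each message ID at most once."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()
        self.applied: List[dict] = []

    def receive(self, message_id: str, payload: dict) -> int:
        if message_id not in self.seen:
            self.seen.add(message_id)
            self.applied.append(payload)
        return 200  # duplicates are acknowledged without side effects


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; allows a trial call after `cooldown_s`."""

    def __init__(self, threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()


def deliver(message_id: str, payload: dict, log: DeliveryLog,
            breaker: CircuitBreaker, send: Callable[[str, dict], int]) -> str:
    if not log.needs_delivery(message_id):
        return "skipped"   # already delivered: nothing to retry
    if not breaker.allow():
        return "deferred"  # circuit open: leave the event for a later retry
    try:
        ok = 200 <= send(message_id, payload) < 300
    except Exception:
        ok = False
    breaker.record(ok)
    if ok:
        log.mark_delivered(message_id)
    return "delivered" if ok else "failed"
```

The sender-side log prevents retries of events that are already delivered, while the receiver-side check covers the timeout case where the merchant processed an event but the sender never saw the response. In practice one breaker per merchant endpoint is a reasonable choice, so that one unstable merchant does not block deliveries to the others.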

Conclusion:

Building a scalable and reliable webhook notification platform requires careful consideration of architectural choices and implementation details. This blog post explored a three-step design iteration process, starting with a basic pub-sub architecture and progressively introducing fault tolerance mechanisms, state management, and idempotency to achieve the desired level of reliability.

In order to improve further, please provide your valuable feedback and share the techniques that you are using for similar use cases. 🙏