STLM Meetup: Design Pattern


This is loosely based on the STLM meetup about the Design Pattern on July 18, 2019.

When I sit on the other side of the table, I do not probe candidates' specialization in data structures or algorithms; that can be done with much cheaper alternatives. I evaluate candidates against the following rubric:

  • Problem-solving skills. How does the candidate explore the problem domain? Can the candidate clarify ambiguity to consolidate the requirements?
  • Communication and collaboration. Can the candidate work with the interviewer, listen to feedback, and adapt to changes?

When I sit on this side of the table, I adopt the 4S framework for design interviews:

  • Scenario: understand the users (internal or external?) and use cases. This leads to the scope, SLA, scalability, and future optimization opportunities.
  • Services: break down the system into small components, and understand how they interact.
  • Storage: most systems are stateful; how are the states persisted?
  • Scale: how do we scale for performance and throughput while keeping latency low?


In the Scenario phase, we try to answer the question:

What is the right thing to build?

More concretely:

  1. Who are the users? What problems are we trying to solve?
  2. If the system is part of a workflow, what does the big picture look like?
  3. What is the service-level agreement?

We can also explore from the engineering perspective:

  1. What are the requirements for data consistency?
  2. Is the system skewed towards read-heavy or write-heavy workloads, OLTP or OLAP?


In the Services phase, we try to break down the system into multiple services, each focused on a domain-specific problem. We will also discuss how these services interact: through synchronous APIs or an asynchronous message queue.

We need to handle the sorry path and operational challenges regardless:

  1. How could we make requests idempotent? With idempotency ensured, we can safely retry a request if it fails.
  2. How could we enforce QoS and circuit breaking? It is usually a good idea to assign a unique API key to each API consumer, so we can reject nonessential requests when the system is saturated.
  3. How could we apply distributed tracing to track a request across services throughout its life cycle?

Microservice Patterns summarizes common patterns in microservice design; highly recommended. I will also use its terminology in the following discussion.

Saga and Event Bus

In the microservice era, states are scattered across various services. It is prohibitively expensive to enforce distributed transactions, such as two-phase commit (2PC). The Saga pattern decomposes the workflow into several self-contained services, chained with an event bus. To roll back, compensation events are emitted to trigger cascading actions that reverse previous side effects.
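A minimal Python sketch of the Saga idea, assuming a hypothetical two-step order workflow (reserve inventory, then charge payment) where each action is paired with a compensation; a real system would run the steps in separate services and emit the compensations as events on the bus:

```python
# Each saga step pairs an action with a compensation that reverses it.
def reserve_inventory(order): order["reserved"] = True
def release_inventory(order): order["reserved"] = False

def charge_payment(order):
    if order["amount"] > 50:                # hypothetical failure condition
        raise RuntimeError("payment declined")
    order["paid"] = True

def refund_payment(order): order["paid"] = False

STEPS = [
    (reserve_inventory, release_inventory),
    (charge_payment, refund_payment),
]

def run_saga(order) -> bool:
    done = []
    try:
        for action, compensate in STEPS:
            action(order)
            done.append(compensate)
        return True
    except Exception:
        # Roll back: run compensations for completed steps, in reverse order.
        for compensate in reversed(done):
            compensate(order)
        return False

order = {"amount": 100, "reserved": False, "paid": False}
ok = run_saga(order)  # payment fails, so the reservation is released
```

Note that only the steps that actually completed are compensated, which is what keeps each step self-contained.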

The event bus, or message queue, is sometimes perceived as the silver bullet for distributed systems. It shines at decoupling the caller and the callee: the availability of a dependency is less impactful to the downstream services.

On the other hand, we should be aware of the limitations of the message queue:

at most once vs. at least once

It is impossible to deliver a message exactly once without sacrificing availability. We have to choose between at most once and at least once. If the latter is chosen, we have to dedup messages based on idempotency keys, as Segment did.
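A toy Python consumer illustrating the at-least-once dedup idea; the `DedupConsumer` class and its in-memory `seen` set are hypothetical (a production system would persist the seen message IDs durably, with expiry):

```python
class DedupConsumer:
    """Absorb redelivered messages under at-least-once semantics."""

    def __init__(self):
        self.seen: set[str] = set()   # would be a durable store in production
        self.handled: list[str] = []

    def consume(self, message_id: str, payload: str) -> bool:
        if message_id in self.seen:
            return False              # duplicate delivery: drop it
        self.seen.add(message_id)
        self.handled.append(payload)  # process exactly once per message id
        return True

c = DedupConsumer()
c.consume("m1", "order created")
c.consume("m1", "order created")  # broker redelivery is silently absorbed
```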

Handle race conditions correctly

A single worker automatically avoids race conditions, but this time bomb will go off when the system scales out. Please do NOT solve race conditions with a message queue; use optimistic locking instead.
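A minimal Python sketch of optimistic locking with a version counter; `VersionedRow` and `compare_and_update` are hypothetical names, standing in for a table row with a version column and a conditional UPDATE (`... WHERE version = expected`):

```python
class VersionedRow:
    """A row guarded by a version number (optimistic lock)."""

    def __init__(self, value: int):
        self.value, self.version = value, 0

    def read(self):
        return self.value, self.version

    def compare_and_update(self, new_value: int, expected_version: int) -> bool:
        if self.version != expected_version:
            return False              # a concurrent writer won; re-read and retry
        self.value, self.version = new_value, self.version + 1
        return True

row = VersionedRow(10)
value, version = row.read()
row.compare_and_update(value + 1, version)          # succeeds, bumps the version
stale = row.compare_and_update(value + 5, version)  # fails: version has moved on
```

The losing writer gets an explicit failure instead of silently clobbering the other update, which is exactly what a queue-based "fix" would hide.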


In the Storage phase, use an RDBMS as the baseline and migrate to NoSQL for optimization. For example, most relational databases use B+ trees for fast insertion, update, and retrieval. If the data is append-only, such as logs and metrics, SSTables and Log-Structured Merge (LSM) trees are more performant data structures for this use case.
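To illustrate the idea, here is a toy Python sketch of an LSM-style store: writes land in an in-memory memtable and are periodically flushed as sorted runs (SSTables), so appends never seek on disk. The `TinyLSM` class and its tiny flush threshold are purely illustrative; real engines also compact runs and use bloom filters:

```python
import bisect

class TinyLSM:
    """Toy LSM tree: buffer writes in memory, flush sorted runs (SSTables)."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable: dict[str, str] = {}
        self.sstables: list[list[tuple[str, str]]] = []  # newest run last
        self.limit = memtable_limit

    def put(self, key: str, value: str):
        self.memtable[key] = value                # cheap: no disk seek
        if len(self.memtable) >= self.limit:      # flush as a sorted run
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:                  # freshest data first
            return self.memtable[key]
        for table in reversed(self.sstables):     # then newest run wins
            i = bisect.bisect_left(table, (key, ""))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

db = TinyLSM()
db.put("a", "1"); db.put("b", "2")  # second put triggers a flush
db.put("a", "3")                    # newer value shadows the flushed one
```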


Whether we tackle I/O-bound or CPU-bound problems, the strategies for scalability are:

  1. Reuse artifacts to avoid expensive computation or I/O, such as caching.
  2. Distribute the workload across multiple machines, such as sharding and Command and Query Responsibility Segregation (CQRS).
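Both strategies can be sketched in a few lines of Python; `expensive_report`, the node names, and the hashing scheme are hypothetical placeholders:

```python
import hashlib
from functools import lru_cache

# Strategy 1: reuse artifacts -- memoize an expensive computation.
@lru_cache(maxsize=1024)
def expensive_report(day: str) -> str:
    return f"report for {day}"   # imagine heavy I/O or computation here

# Strategy 2: distribute the workload -- hash keys onto shards so every
# caller deterministically routes a given key to the same machine.
NODES = ["node-a", "node-b", "node-c"]   # hypothetical shard servers

def shard_for(key: str) -> str:
    digest = hashlib.sha1(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

expensive_report("2019-07-18")
expensive_report("2019-07-18")   # second call is served from the cache
```

Note that plain modulo hashing reshuffles most keys when `NODES` changes; consistent hashing is the usual refinement.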