Flowchart for Choosing gRPC Method Types
Zürich, Schweiz
Scenario
Let’s suppose your boss tasks you with the problem of building a data upload service for blobs. The only requirement is the system has to be built on gRPC. How do you go about doing this, and, in the end, does your solution look more like no. 1 or no. 2 below?
Prototypical Unary RPC:
```proto
message UploadRequest {
  string file_name = 1;
  bytes content = 2;
}

message UploadResponse { ... }

service UploadService {
  rpc Upload(UploadRequest) returns (UploadResponse) {}
}
```

Prototypical Streaming RPC:

```proto
message UploadFragment {
  string file_name = 1;
  bytes content = 2;
}

message UploadResponse { ... }

service UploadService {
  rpc Upload(stream UploadFragment) returns (UploadResponse) {}
}
```
These are two approaches to building the solution. But they have different costs to build, maintain, and reason about, and each has different scenarios in which one form would be more suitable than the other.
These are tradeoffs I want to explore with you.
Opening
Remote Procedure Call (RPC) infrastructures are a godsend in the space of distributed systems. One of the infrastructures in that ecosystem that I value for its versatility and extensibility is gRPC. This appeal has several ingredients:
- Cross-platform libraries that are generally high-performance
- Native mature wire format that is easy to version
- Request interception both client- and server-side
- Multiple structures for expressing RPC service methods: unary and streaming
Challenge: Choosing Which Method Structure
But, in spite of those things I like, a challenge lies in that last bullet point:
The reality is that there are not two but rather four main structural choices, which per the gRPC documentation are:
- unary methods: “[T]he simplest type of RPC where the client sends a single request and gets back a single response.”
- streaming method family
- server streaming: “A server-streaming RPC is similar to a unary RPC, except that the server returns a stream of messages in response to a client’s request.”
- client streaming: “A client-streaming RPC is similar to a unary RPC, except that the client sends a stream of messages to the server instead of a single message.”
- bidirectional (BiDi) streaming: A combination of both server and client streaming.
How do you know which one to choose? My observation is that the choice is not really all that clear, especially if you are a beginner in the ecosystem. To make matters worse, the choice marries a combination of considerations from domain-specific requirements to local practical matters.
One of the realities I’ve seen in my career is a preferential bias from less experienced engineers toward modeling a lot of operations as streaming RPC. I am not sure exactly what motivates it. I think curiosity and novelty and a desire to master something complex are certainly ingredients.
For me, I have nothing against streaming RPC, but it tends to be significantly more complicated to implement correctly both client- and server-side. And that’s not including the impact on adjacent ecosystem components when they are used in a streaming system (e.g., implementing an interceptor correctly for it).
Summary: I tend to think most things should be modeled with unary methods unless there is a strong, compelling reason not to do so.
This is informed by a set of values that I prioritize (in order):
simplicity: choose the least complex solution for the job and its requirements
All of my career experience with maintaining and extending legacy systems essentially leads me to some of the same conclusions that Google’s Go Style Guide reaches around simplicity and least mechanism.
Use the most clear and least magic solution for the problem that you can get away with.
correctness: don’t sacrifice it
This is a fundamental invariant of design. What I will note is that there is seldom one correct solution to a problem. There can be multiple, and there are tradeoffs.
Enormity of the Decision Space
As a thought exercise, I attempted to make a flow chart of the high-level considerations I would apply when making a decision:

As you can see, it’s a bit of a monster. How do we make sense of this?
Tip: You can click on the image to see a larger version.
The edges are colored and sized differently according to the vertical and horizontal position of their origin.
As I undertook this exercise, I did not want to approach this problem exhaustively but rather comprehensively. Namely, how would I go about giving advice to myself as a junior engineer with everything I have learned today?
I tried to make each node in the graph self-standing in terms of the text; but if something doesn’t make sense, you should consult the flowchart explanation in the appendix.
Closing
I set off writing this article, thinking it was going to be a short exercise. It turned out I was wrong. What I wanted to write was a message in a bottle back to my past self. It’s not that I made a mistake with system design, but rather I’d have found the questions posed instructive.
As you can see, there are a lot of considerations that flow into this. My goal was never to be exhaustive in covering all of them but rather to give you a smattering of what I would consider myself. There are plenty of more advanced topics to consider that I did not have the energy to write about, so here is a taster:
mobile clients: how you manage energy and connection persistence is tricky.
backend session affinity: each streaming RPC is sticky to the backend it is operating with, whereas load balancing in front of unary methods may spray requests across different backends.
If you’re looking for something more official and canonical, my colleague Eric Anderson from the gRPC team gave a talk about this at the 2018 CloudNativeCon entitled Using gRPC for Long-lived and Streaming RPCs:
There is admittedly a bit of a bias in my examination above: use the simplest solution that will deliver your goals, which most of the time is unary RPC methods. I hope that is something that Eric would agree with; he knows far more than I!
Aside: As I was labeling the diagram, I mostly did so in breadth-first order; whereas when I wrote out the prose above I did so in a more depth-first manner. Whoops. I’ve spent so much time futzing with the labeling of content, identifiers, and such that I need a break.
I hope this exercise was as useful to you as it was to me.
Appendix: Walking the Flowchart
This part is optional to read. It provides an extended explanation of what you are seeing in the chart itself. Without further ado, let’s walk through the chart.
Step: Start Designing the RPC Service (Node 1)
As long as we are considering modeling, we are kind of in this activity of designing. Let’s for the sake of argument consider one RPC method we want to design and work through it iteratively. You’ll work through this chart essentially for each RPC method you design.
Tip: If that sounds taxing, this may be a good reason to settle on certain common service API verbs like CRUDL with a common contract.
Question: Shortly-Lived Requests (Node 2)
The question to ask here is whether the wall time for receiving, processing, and replying to the request is short. What does short mean? That’s left as an exercise to the reader:
If the caller is a human interface, you might want to consider “short” from the perspective of how a human perceives it. Consider Response Times: The 3 Important Limits as a good starting point.
If the caller is a machine, processing time requirements and resource efficiency might matter. Consider Computer Latency at a Human Scale.
My gut tells me that 30 seconds is a good default upper limit for a shortly-lived request. That is already a long time, but you can see that complexity ratchets up if the operation is longer-lived (think: durability in case of interruption).
The other question to ask is not whether the operation could take 30 seconds but rather what typical successful operation durations look like. I’d consider the median duration and the 95th percentile duration. These metrics help you figure out how common a given duration is.
If you don’t know what the expected duration is, consider modeling the operation in an isolated function in a program instead of as an RPC method and seeing how it performs.
Try something like this using Go’s benchmark framework:
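A sketch of such a measurement, using `testing.Benchmark` so it can run outside of `go test` (here `processBlob` is a hypothetical stand-in for the would-be RPC method’s core logic):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// processBlob is a hypothetical stand-in for the core of the would-be RPC
// method; here it simply hashes the payload.
func processBlob(data []byte) [sha256.Size]byte {
	return sha256.Sum256(data)
}

func main() {
	payload := make([]byte, 1<<20) // A 1 MiB payload.
	// testing.Benchmark measures the function programmatically, without a
	// full `go test` invocation.
	result := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			processBlob(payload)
		}
	})
	fmt.Printf("approx. %d ns per operation\n", result.NsPerOp())
}
```

The per-operation figure gives a first-order sense of whether the work fits comfortably inside a shortly-lived request budget.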
Step: Use Unary RPC Method (Node 3)
Here the case is rather easy: just build your unary RPC method per the gRPC instructions. You might end up with something like this prototypical echo service in your Protocol Buffer IDL definition.
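A minimal sketch of what that definition might look like (package and message names illustrative):

```proto
syntax = "proto3";

package echo;

message EchoRequest {
  string message = 1;
}

message EchoResponse {
  string message = 1;
}

service EchoService {
  rpc Echo(EchoRequest) returns (EchoResponse) {}
}
```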
All that remains is creating the server implementation:
Question: Do Other Clients Want to Watch (Node 4)
Technically this question can apply with shortly-lived requests (see previous question), but that is rare. The main idea behind it is this:
a client sends a request to your server to perform some operation that will take a long time
other clients or parties (or potentially the original client at a point later) want to observe the state of the operation after it has begun
You can typically model operations that other parties can watch with some sort of ID that names the operation: either the client provides it to the server, or the server hands it back to the client. Either way, there is another service API mechanism to look up the status of the operation based on that ID.
Some of the later steps demonstrate a few patterns for doing this.
Question: Are Trailing Operations Opaque (Node 5)
This question flows from the previous about other watchers. The way I would think about this question is whether the lingering operation is expected to be deeply modified at any point later. If an ongoing operation is opaque, the most you’ll know about it is whether it is still running, not really anything deeply about its progress or its make-up. A good analogy for this is a simple tracking number for a piece of mail; that is a relatively opaque way of tracking progress.
Aside: In truth, what you can do with a simple piece of mail and a tracking number can vary significantly from locality to locality. In Switzerland, I can do a lot (e.g., request the post to hold the mail or deliver it to another address). Back in the United States with USPS, relatively little.
If an operation is not opaque, perhaps you have the ability to modify it later (think: mutating an ongoing workflow). A good analogy for such an operation is hiring a moving company to move your possessions to a new home. If the company is of repute, they’ll give you a tracking number for your move, and you can contact the company with that number to amend the delivery of your possessions to the new home (e.g., the contractors who are tearing out the old carpet are going to take two extra weeks, and the flooring can’t be covered in boxes).
Step: Use Long Running Operations (Node 12)
A Long-Running Operation (LRO) is a special form of a unary RPC. The principle behind it is that you have an ordinary unary RPC method, and it returns a google.longrunning.Operation message:
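A sketch of such a method for a hypothetical data-compaction operation (service and message names illustrative; the `google.longrunning.Operation` type is real):

```proto
syntax = "proto3";

package storage;

import "google/longrunning/operations.proto";

message CompactDataRequest {
  // The resource whose data should be compacted.
  string name = 1;
}

service StorageService {
  // CompactData is an ordinary unary method; the returned Operation can be
  // polled through the google.longrunning.Operations service.
  rpc CompactData(CompactDataRequest) returns (google.longrunning.Operation) {}
}
```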
The LRO pattern is common in intent-oriented management planes.
I highly recommend any user of LRO consult the embedded documentation in the IDL source code. Additional semantics for LRO are defined in the API Improvement Proposals (AIP), particularly AIP-151.
The google.longrunning.Operations RPC service interface is also something your server should implement and expose for management and interrogation of LRO status. Its API is rather crude, so it doesn’t really give you an ability to deeply inspect an operation or mutate one (outside of requesting cancellation).
In the case of data compaction in the example above, an LRO seems reasonable: it’s kind of a one-shot operation. It takes a long time to run (potentially). You tend to ignore it once you request it. You might cancel it if need be, but you seldom have the need to modify an outstanding compaction operation. Moreover, one would expect that a proper data storage system would handle compaction idempotently and transactionally should two compaction operations happen in rapid succession or race each other.
Step: Use Job API Convention (Node 13)
There is a more advanced form of LRO management where the system designer creates a custom service definition that conforms to the AIP-152 Job specification. You might also find this pattern in intent-oriented management planes.
We can adapt the example for Node 12 above using RunCompactionJob as a resource around which a standard CRUDL contract is built, including an AIP-152-prescribed Run method. This CRUDL contract enables RunCompactionJob to be modifiable, as opposed to the one-shot approach above. This has a lot of power if the operation might need reconsideration or be run with a separate authority:
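A heavily elided sketch of what this could look like (AIP field-behavior annotations and the standard List/Delete methods are omitted; names illustrative):

```proto
syntax = "proto3";

package storage;

import "google/longrunning/operations.proto";

// The Job resource per the AIP-152 pattern.
message RunCompactionJob {
  string name = 1;
}

message CreateRunCompactionJobRequest {
  string parent = 1;
  RunCompactionJob run_compaction_job = 2;
}

message GetRunCompactionJobRequest {
  string name = 1;
}

message RunCompactionJobRequest {
  string name = 1;
}

service RunCompactionJobService {
  // Standard CRUDL methods around the resource (List and Delete elided).
  rpc CreateRunCompactionJob(CreateRunCompactionJobRequest) returns (RunCompactionJob) {}
  rpc GetRunCompactionJob(GetRunCompactionJobRequest) returns (RunCompactionJob) {}

  // The AIP-152-prescribed Run method executes the job.
  rpc RunCompactionJob(RunCompactionJobRequest) returns (google.longrunning.Operation) {}
}
```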
Aside: I’ve gone to rather significant lengths to attempt to model the Job API sketch in as AIP-correct of a way as possible.
If you are not terribly familiar with gRPC or Google’s APIs for its Cloud products, you might come away thinking: this is a lot of ceremony and complexity, and you might be right — to some extent. But let’s consider what AIP gives us:
Well-known conventions and structures for both API producer and consumer alike.
Producer and consumer do not need to consider and design many vagaries for common cases.
One critique I’ll levy against AIP is that I think some of the official examples could be improved (e.g., focus on the most minimally-viable correct exemplar and then show some advanced models). Far too many of the base case examples in the documentation are demonstrated with nested sub-resources in mind, an advanced design topic, which adds extra noise for the reader.
It’s left as an exercise to the implementer whether the RunCompactionJob is mutable after the RPC method /RunCompactionJobService.RunCompactionJob has been called. Supporting such mutation could certainly be a use case for the Job API.
Question: Does the Client Send a Lot of Data (Node 6)
The crux of this question orients itself around how much data the client needs to send to accomplish what it wants. There’s no one-size-fits-all approach, as many of the questions are answered in application- and technology stack-specific contexts.
Let’s think about the basic scenario of an RPC service that offers a method for file uploads:
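A sketch of such a contract (names illustrative, mirroring the unary form from the opening):

```proto
syntax = "proto3";

package upload;

message UploadRequest {
  string file_name = 1;
  bytes content = 2;
}

message UploadResponse {}

service UploadService {
  rpc Upload(UploadRequest) returns (UploadResponse) {}
}
```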
If /UploadService.Upload is called exclusively on small text files like FILE_ID.DIZ, it’s easy to imagine this model working reasonably well.
Note: I just picked that file designation out of a hat since I expect the file definition to be commonly understood and also understood to be small.
But suppose /UploadService.Upload is to be general purpose and potentially handle large files sent to cold storage. In the current model, effectively the entire file needs to be sent by the client to the server before the server can even start processing it (this is how unary RPC methods work in gRPC). With large files, this means both client and server need to contend with buffering issues (kernel and hardware) and avoid memory or disk space exhaustion in user space code. There may even be network protocol-specific considerations, too.
So one could make the argument that large amounts of data need to be handled in a piecemeal way. Some of the solutions below explore this further.
Note: A well-formed API contract found in the wild for the scenario described above is the google.bytestream.ByteStream service.
Question: Does the Server Need to Periodically Communicate with Client (Node 7)
This question orients itself around application- or protocol-level needs for the server to send periodic communications to the client while the client is sending a large amount of data to the server.
Let’s consider a scenario like this: the server reports back available quota budget to the client as data is sent to it. Node 14 provides an example of this.
Step: Use Bidirectional Streaming (Node 14)
In this case, you’ll be building a service definition using bidirectional (BiDi) streaming. Both client and server send fragments to each other over the course of a long-running RPC session.
We’ll build an example of that quota-reporting scenario in the Protocol Buffer IDL below:
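A sketch of such a contract (names illustrative; the fields correspond to the quota semantics discussed next):

```proto
syntax = "proto3";

package upload;

message UploadFragment {
  string file_name = 1;
  bytes content = 2;
  // Names the quota budget this upload draws against.
  string quota_id = 3;
  // The client's estimates of how much work remains.
  int64 estimated_remaining_file_size_bytes = 4;
  int64 estimated_remaining_files_count = 5;
}

message UploadStatus {
  // Remaining quota budget, reported periodically by the server.
  int64 remaining_quota_bytes = 1;
}

service UploadService {
  rpc Upload(stream UploadFragment) returns (stream UploadStatus) {}
}
```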
Based on the user’s available quota (known by the server and bookkept under UploadFragment.quota_id), weighed against the remaining file size (known by the client under UploadFragment.estimated_remaining_file_size_bytes) and the number of remaining files (known by the client under UploadFragment.estimated_remaining_files_count), the server will periodically report quota status to the client so that the upload session can gracefully upload the most important files and terminate before quota is exhausted. This semantic might seem strange, but there could be multiple operations happening against a given quota ID in the production system at any given time.
Question: Can the Client Easily Chunk Work (Node 15)
By chunking work, we mean to understand the feasibility of the client breaking what it wants to send into batches to send to the server piecemeal. Here are some considerations:
Several things speak to batching:
The client has a resource budget to tolerate local accumulation of data to send (e.g., on disk or in memory).
The data is already durably stored locally, or the data has no durability requirements.
The data is provided to the server on a best-effort basis, and the service-level objective (SLO) tolerates delays and loss.
As do several things against it:
The client has insufficient resource budget to locally accumulate data to send (e.g., small embedded device with limited memory).
The data is very important to send off and is not stored durably locally.
Something about the workflow of batching is just fundamentally toilsome for clients (e.g., just the act of requiring the client to think about batching from a design perspective might be too much).
Some examples:
Sampled environmental observations that are held for historical archiving but that drive no real-time action (e.g., an air temperature probe) could be stored locally and sent at periodic intervals or when local buffer capacity is reached.
Patient vital signs used to drive hospital staff paging may have near real-time delivery requirements and should be sent as soon as they are available. Chunked delivery is likely inappropriate, but durable backup buffering in case of network unavailability is probably required.
Question: Does Each Request Entail High Fixed Costs Server-Side (Nodes 18 and 20)
There are two types of costs:
fixed costs (§c_f§): independent of the amount of work to be done
variable costs (§c_v§): proportionate to the amount of work to be done (§u§)
They are not mutually exclusive in a given situation but rather coexist in the context of a system: §c = c_f + c_v \cdot u§. While each is worth considering on its own, we want to consider the fixed part here. There are situations where §c_f§ crowds out the effects of §c_v§ for many plausible sizes of the unit of work §u§: §c_f > c_v \cdot u§.
Let’s consider one of the unary upload solutions from Node 6, except that it saves data to slower, cold storage. It could look something like this:
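As a sketch, with the human-operator phases reduced to prints so the example is self-contained (the phase structure and the hypothetical `writeData` helper are illustrative):

```go
package main

import "fmt"

// writeData is a stand-in for streaming the payload to the cold-storage
// medium: a variable cost proportional to len(data).
func writeData(data []byte) error {
	fmt.Printf("phase 2: wrote %d bytes\n", len(data))
	return nil
}

// handleUpload sketches the server-side phases of servicing one upload.
func handleUpload(data []byte) error {
	// Phase 1: an operator retrieves and mounts the storage medium (e.g., a
	// tape): a large fixed cost that is independent of len(data).
	fmt.Println("phase 1: operator mounts medium")

	// Phase 2: the payload is written: the variable cost.
	if err := writeData(data); err != nil {
		return err
	}

	// Phase 3: the operator unmounts and archives the medium: another fixed
	// cost.
	fmt.Println("phase 3: operator unmounts and archives medium")
	return nil
}

func main() {
	if err := handleUpload([]byte("example payload")); err != nil {
		panic(err)
	}
}
```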
Looking at the code above, can you guess which phases probably take the most time? writeData certainly won’t be fast, but we’re having to wait on a human to do some rather menial tasks in Phases 1 and 3. So in such a situation we certainly could have a large §c_f§. Obviously there are other cases where the fixed cost of setting up the operation is large, too, and these don’t involve human operator intervention.
Question: Does the Client or Server Depend on Complex Interception? (Nodes 21 and 22)
gRPC offers a powerful feature for creation of various forms of behavior-altering middleware through something called interception APIs. I really encourage you to give the link a look if you are unfamiliar with this concept. Interception can power a whole bunch of things:
- Metrics and Telemetry
- Authorization, Authentication, and Auditing (AAA)
- Tracing
- Caching
- Load Reporting and Shedding
Moreover, interception is built through user-implemented interceptors. Interceptors can be run either client- or server-side. And most interestingly, interceptors can be chained.
If you are familiar with Go’s common http.Handler API, it can be used through the process of wrapping and even currying to form behavioral chains server-side. Consider an example that provides elementary timing telemetry to a request handler:
What I described above for Go’s net/http is very simple compared to gRPC interception. Recall the dimensions I mentioned above; there’s another one: unary or streaming! So suddenly this situation is a lot more complex:
- client- and server-side
- can be chained
- unary and streaming
It’s not terribly uncommon for larger organizations to use interception to achieve middleware goals, and often the behaviors are extremely load-bearing.
Let’s look at the prototypical forms of the interception API.
Unary client-side implements grpc.UnaryClientInterceptor, which appears as this:
```go
type UnaryClientInterceptor func(ctx context.Context, method string, req, reply any,
	cc *ClientConn, invoker UnaryInvoker, opts ...CallOption) error
```
Here’s an example:
There are three phases:
- pre-processing: interception provides this logic, if any
- running the underlying call: the interceptor arranges this
- post-processing: interception provides this, if any
Now let’s look at the streaming client interceptor signature grpc.StreamClientInterceptor:
```go
type StreamClientInterceptor func(ctx context.Context, desc *StreamDesc, cc *ClientConn,
	method string, streamer Streamer, opts ...CallOption) (ClientStream, error)
```
Here’s an implementation:
grpc.StreamClientInterceptor is a lot more complex than the unary in terms of phases:
- pre-processing: interception provides this logic, if any
- wrap the streamer, if at all
- sending, if at all
- receiving, if at all
- running the underlying stream: the interceptor arranges this
- for each send and receive of stream messages:
- running the send (or its wrapper) if at all
- running the receive (or its wrapper) if at all
Let’s suppose we are interested in reporting telemetry about operation successes and failures. With unary, this is relatively straightforward: the call succeeds or fails. There is a 1:1 mapping between the call and an outcome; increment the respective counter. But with streaming we suddenly have to consider multiplicity of sends and receives. The streaming call may have §n§ sends and §m§ receives. How do you report that in terms of telemetry?
Do you report a running sum of all successful and all erroneous sends and receives (four metrics in total)?
Do you report success if all sends and receives work, but a single failing send or receive then makes the operation fail (two metrics in total)?
Telemetry might sound like an easy problem to solve, but it’s only the tip of the iceberg. To understand why, let’s look at server-side interception.
Server-side interception is achieved with unary methods using the grpc.UnaryServerInterceptor API:
```go
type UnaryServerInterceptor func(ctx context.Context, req any, info *UnaryServerInfo,
	handler UnaryHandler) (resp any, err error)
```
An example implementation:
And here again is what server-side streaming looks like with grpc.StreamServerInterceptor:
```go
type StreamServerInterceptor func(srv any, ss ServerStream, info *StreamServerInfo,
	handler StreamHandler) error
```
And an implementation:
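An illustrative implementation that wraps the server stream to count message traffic, once more using simplified local mirrors of the grpc-go types:

```go
package main

import "fmt"

// Simplified local mirrors of the grpc-go types.
type (
	StreamServerInfo struct{ FullMethod string }

	ServerStream interface {
		SendMsg(m any) error
		RecvMsg(m any) error
	}
	StreamHandler func(srv any, stream ServerStream) error

	StreamServerInterceptor func(srv any, ss ServerStream, info *StreamServerInfo, handler StreamHandler) error
)

// countingServerStream wraps a ServerStream to count message traffic.
type countingServerStream struct {
	ServerStream
	sends, recvs int
}

func (s *countingServerStream) SendMsg(m any) error { s.sends++; return s.ServerStream.SendMsg(m) }
func (s *countingServerStream) RecvMsg(m any) error { s.recvs++; return s.ServerStream.RecvMsg(m) }

// countingInterceptor wraps the stream before handing it to the handler and
// reports totals once the handler returns.
func countingInterceptor(srv any, ss ServerStream, info *StreamServerInfo, handler StreamHandler) error {
	wrapped := &countingServerStream{ServerStream: ss}
	err := handler(srv, wrapped) // Run the actual streaming handler.
	fmt.Printf("%s: %d sends, %d recvs\n", info.FullMethod, wrapped.sends, wrapped.recvs)
	return err
}

var _ StreamServerInterceptor = countingInterceptor

// fakeServerStream is a stub transport for demonstration purposes.
type fakeServerStream struct{}

func (fakeServerStream) SendMsg(m any) error { return nil }
func (fakeServerStream) RecvMsg(m any) error { return nil }

func main() {
	handler := func(srv any, stream ServerStream) error {
		_ = stream.RecvMsg(nil)
		return stream.SendMsg("done")
	}
	_ = countingInterceptor(nil, fakeServerStream{}, &StreamServerInfo{FullMethod: "/upload.UploadService/Upload"}, handler)
}
```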
I alluded to there being a tip of the iceberg with complexity and that it could lie server-side. The truth is that it’s not that server-side is more complex; it’s that the middleware that’s often used there is even more complex.
Let’s consider a likely middleware to be universally used in production-grade systems: query cost reporting. How you estimate query cost for unary versus streaming operations is very different. Take a look at the specified information that is supposed to be returned for cost estimation: Open Request Cost Aggregation (ORCA):
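A subset of the fields from the ORCA load report message (sketched from `xds.data.orca.v3.OrcaLoadReport`; consult the specification for the authoritative and complete definition):

```proto
message OrcaLoadReport {
  // CPU utilization expressed as a fraction of available resources.
  double cpu_utilization = 1;
  // Memory utilization expressed as a fraction of available resources.
  double mem_utilization = 2;
  // Application-specific costs incurred by the request, keyed by name.
  map<string, double> request_cost = 4;
  // Resource utilization values, keyed by resource name.
  map<string, double> utilization = 5;
}
```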
That looks somewhat easy to compute unary-wise, but streaming? Get out of here. You have to sit down and think about application-level utilization semantics. Yes, that’s doable, but will your guess of how to do it work with other services served by your binary? I can’t say, but this sounds precarious when building a system with multiple engineers.
And this leads to the central observation of this question: your systems may rely on powerful interceptors (even unbeknownst to you). Imagine if your organization has a platform team. Are you confident about the following:
- The interceptors exist with parity across unary and streaming forms?
- The interceptors are implemented correctly in both forms?
- The interceptors are semantically expressible in both forms?
Before you say “yes” without thinking about it, you need to deeply understand two things:
- The semantic intended behavior of the interceptor.
- The full complement of lifecycle phases for the given interceptor type.
I’ve worked with RPC systems like this for almost 20 years, including with gRPC’s predecessor, which shares many conceptually similar models to what I’ve just enumerated. I don’t think I could grok no. 2 (lifecycle phases) without building a hello world server and client and testing things for myself just to remind myself how everything works in real life.
All I am going to say is this: I do not consider using streaming lightly, especially with load-bearing interception.
Step: Use Client-Side Streaming (Node 19)
This is essentially a trimmed down version of Node 14, where instead of having the server provide periodic updates to the client about the status, only a terminal result is shared with the client once it all finishes.
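A sketch of such a contract (names illustrative):

```proto
syntax = "proto3";

package upload;

message UploadFragment {
  string file_name = 1;
  bytes content = 2;
}

message UploadResponse {
  // The terminal result, shared once the client closes the stream.
}

service UploadService {
  rpc Upload(stream UploadFragment) returns (UploadResponse) {}
}
```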
A more realistic case of this is clients pushing telemetry to a central collection agent:
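A sketch of such a collection contract (names illustrative):

```proto
syntax = "proto3";

package telemetry;

message MetricSample {
  string name = 1;
  double value = 2;
  int64 timestamp_epoch_millis = 3;
}

message PushResponse {}

service CollectionService {
  rpc Push(stream MetricSample) returns (PushResponse) {}
}
```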
We could also imagine this situation reversed whereby the collection service polls this data from the clients instead, but that’s not the topic of this section.
Question: Does the Client Receive a Lot of Data (Node 8)
This question is effectively a variation of Node 6, except that instead of the client sending a lot of data to the server it receives a lot of data from the server (e.g., instead of an upload service, we have a download service). I’d apply a similar mental rubric here.
Question: Does the Client Need to Periodically Communicate with Server (Node 9)
This question is a variation of Node 7, except that the client needs to inform the server of something periodically. We could imagine a scenario in which a client receives units of work from the server, performs them, and then tells the server it is ready for the next unit of work once the current ones are done.
This would not be too different in principle from how SETI@home worked way back in the day. If you don’t remember, SETI@home was an early public distributed computing activity, wherein clients would download signal data from the research project, perform expensive analysis on the data, and then send results back to the server.
A flow might look like this:
- Client advertises itself with SETI@home.
- SETI@home assigns a unit of work to the client.
- Client downloads the work.
- Client processes the work over the next minutes to hours (depending on client’s processing power).
- Client uploads work.
- Client receives new unit of work, and the process continues until the heat death of the universe.
This could be modeled with either streaming or unary operations, to be sure, but we’re imagining a long-term session being opened between client and server. We could also imagine the SETI@home story including flow control information to the server: Hey, I’m not ready yet for the next chunk of data; hold off!
Question: Can the Client Easily Use Pagination (Node 16)
Pagination’s considerations are complementary to Node 15:
Perhaps the durability and storage requirements exist on the server instead of the client.
But that’s not to say such requirements can’t also exist on the client. Fetching large result ranges can constitute significant resource usage on the client as it unmarshals data and operates on it.
Perhaps the SLO of data processing is delivered through expeditious processing by the client?
Perhaps also the ergonomics of pagination aren’t all that nice for clients to think about: page sizes, pagination token, client-side lifecycle management with loops, etc. To be clear, I don’t think this reason is super compelling in and of itself, but it’s worth thinking about.
There are other cases where the server-side data fundamentally cannot be paginated without the server itself undertaking effort to make it so (e.g., real-time signal data that the server buffers).
Step: Use Server-Side Streaming (Node 17)
Server-side streaming complements Node 19. Let’s reformulate it as a download service, instead:
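A sketch of the reformulated contract (names illustrative):

```proto
syntax = "proto3";

package download;

message DownloadRequest {
  string file_name = 1;
}

message DownloadFragment {
  bytes content = 1;
}

service DownloadService {
  rpc Download(DownloadRequest) returns (stream DownloadFragment) {}
}
```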
There are plenty of other twists on this, too: streaming real-time sensor readings, etc.
Question: Does the Server Model Something Event-Driven (Node 10)
By this point in the diagram, we are dealing with a system that sends and receives little data for some definition of “little.” We are dealing with something that could be moderately long-running. Something that might fall into that space is an event-driven system from real-time sources that could need relatively expedient processing. A few examples:
A sensor that only sends data when a condition is met, as opposed to periodic transmission of observations. A fire alarm system is one such case.
A stream of lifecycle change events for a CRUDL resource (e.g., resource Foo created at 2025-03-24 at 20:03, resource Bar modified at 2025-03-24 at 20:04).
Diagnostic telemetry being sent from some agent in a user surface (e.g., Google Analytics).
Question: Would Periodic Polling be Incorrect or Unergonomic for Clients (Node 11)
This question is very tightly coupled with Node 10. When faced with how to pass event-driven data to clients, you need to think of the requirements:
Is there a SLO for processing involved? See Nodes 15 and 16 for a discussion of this. The SLO could be driven by regulatory, product, or safety reasons.
Imagine the case of a fire detection system: you would probably want an interested party to know right away — right?
However, lifecycle change events for CRUDL resources might not have a near-real time notification requirement. They could, to be sure.
Is buffering possible or not? See Node 15.
Are events important, and can they not be stored durably server-side? See Node 15.
It might seem weird to frame these considerations under the rubric of client polling, but they help inform whether the option is even on the table or not.
And then even if the technical requirements are forgiving, how easily can the client manage polling the server for updates? Streaming for both clients and servers entails a bit of coding overhead. Periodic polling has some coding overhead of its own (e.g., determining interval or poll policy, retry policy, etc.). These aren’t really hard problems, but perhaps they are enough to just bite the bullet and consider streaming.