Protocol Buffers are Google’s implementation of an improved data serialization format. Improved, that is, in comparison to JSON — which is after all what the world is mostly running on at this time. After reading various hype posts about how cool and efficient the format is, I was curious to find out how much of it is true.
In case you’re not interested in the details, and because I started my little test with a pretty good idea of what the results would be, let me summarize my findings up front:
Protocol Buffers are more efficient than JSON
This is true for exactly one reason: typing. .proto files define the schema of data in the buffers, and having this information makes data transfers more efficient because there's no need for (long) field names, and because certain data types can be encoded to save volume compared to their string representations. The gains you can expect are exactly what you would expect, if you do the maths correctly.
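To make that maths concrete, here is a minimal sketch of my own (not taken from any of the articles) that compares how a single float field travels as JSON text and as Protocol Buffers wire data. It assumes the low-level protowire package from the official Go protobuf module, which lets you write wire-format bytes by hand without any generated code.

package main

import (
	"encoding/json"
	"fmt"
	"math"

	"google.golang.org/protobuf/encoding/protowire"
)

func main() {
	value := float32(0.4601365)

	// JSON carries the field name plus the decimal text of the number.
	jsonBytes, _ := json.Marshal(map[string]float32{"value": value})

	// The Protocol Buffers wire format carries a one-byte tag (field number 1,
	// fixed32 wire type) plus four payload bytes; the field name never travels.
	var protoBytes []byte
	protoBytes = protowire.AppendTag(protoBytes, 1, protowire.Fixed32Type)
	protoBytes = protowire.AppendFixed32(protoBytes, math.Float32bits(value))

	fmt.Printf("JSON:  %2d bytes: %s\n", len(jsonBytes), jsonBytes)
	fmt.Printf("proto: %2d bytes\n", len(protoBytes))
}

Five bytes against roughly twenty: that's the entire trick, and it follows directly from the schema being known on both sides.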
That’s what I thought
That was in fact exactly what I thought, and I’m not even saying it was a secret. The reason I was looking into this was that some articles out there seemed to describe enormous gains through Protocol Buffers, and I had a hard time believing them. On the other hand, these articles never detailed the exact tests used, or did anything to put their findings in an objective context.
My test
The test I used is actually not my own; I found it here, in an interesting article written by Nils Larsgård. He wrote a little Go program to find out what advantages Protocol Buffers have over JSON, also taking gzipped content into account (which is something that's often ignored elsewhere). His finding was that for gzipped content, Protocol Buffers held an advantage of 17% over JSON for his "200 tickers" test case.
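The shape of the comparison is simple enough. Here is a rough sketch of the gzip step (mine, not Nils' code), assuming the two serialized payloads already exist as byte slices; in the real test they come from marshalling the same 200 tickers with encoding/json and with the generated Protocol Buffers code.

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// gzipSize returns the length of data after gzip compression.
func gzipSize(data []byte) int {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(data)
	zw.Close()
	return buf.Len()
}

func main() {
	// Tiny placeholder payloads so the sketch runs on its own; real payloads
	// are a few kilobytes per 200 tickers.
	jsonData := []byte(`[{"name":"abc","value":0.4601365}]`)
	protoData := []byte{0x0a, 0x03, 'a', 'b', 'c', 0x15, 0x1f, 0x85, 0xeb, 0x3e} // hand-written bytes in the shape of one serialized ticker

	fmt.Printf("JSON  raw: %d bytes, gzipped: %d bytes\n", len(jsonData), gzipSize(jsonData))
	fmt.Printf("proto raw: %d bytes, gzipped: %d bytes\n", len(protoData), gzipSize(protoData))
}

(With placeholder payloads this tiny, the gzip header overhead dominates; the interesting numbers only appear with full-sized payloads.)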
I was suspicious that these results were still biased. Why? Well, mainly due to the structure of the data. Nils was using this as his main data item:
message Ticker {
  string name = 1;
  float value = 2;
}
In the test code, he assigned a random string of three characters to the name field. Looking at a snippet of the generated JSON, it was obvious to me that these three characters made up only a very small part of the data, compared to the average of nine characters taken up by the random floating point number in the value field.
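To illustrate the proportions, here is a quick sketch with a plain Go struct standing in for the generated Ticker type and a hard-coded three-letter name:

package main

import (
	"encoding/json"
	"fmt"
	"math/rand"
)

type Ticker struct {
	Name  string  `json:"name"`
	Value float32 `json:"value"`
}

func main() {
	record, _ := json.Marshal(Ticker{Name: "xqj", Value: rand.Float32()})
	// Prints one record along the lines of {"name":"xqj","value":0.73265195}:
	// three characters of name against typically nine characters of number.
	fmt.Printf("%s (%d bytes)\n", record, len(record))
}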
So I did two basic tests. First, I removed the random numbers and used static ones instead. Of course this was just a test to help me understand what was going on! The effect was that gzip got rid of all the duplicated numbers, and the "gzipped proto" advantage shrank to 0.7%.
Based on that, finding number 1: clearly, JSON is not an efficient format for encoding numeric values.
Second, I brought back the random numbers but also extended the random string generated for the name field. I ran several tests using string lengths between 20 and 100 characters, and the "gzipped proto" advantage was somewhere between 2% and 5% for the "200 tickers" test case.
Finding number 2: Protocol Buffers don’t do anything magic with string content.
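For reference, this is roughly the kind of data variation behind the two checks; it is a sketch using a plain Go struct in place of the generated Ticker type, not the actual test code. The resulting records were serialized both ways and the gzipped sizes compared, as outlined above.

package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"math/rand"
)

type Ticker struct {
	Name  string  `json:"name"`
	Value float32 `json:"value"`
}

const letters = "abcdefghijklmnopqrstuvwxyz"

func randomName(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = letters[rand.Intn(len(letters))]
	}
	return string(b)
}

// makeTickers builds the test data. staticValue mimics the first check
// (the same number everywhere), nameLen the second one (longer names).
func makeTickers(count, nameLen int, staticValue bool) []Ticker {
	tickers := make([]Ticker, count)
	for i := range tickers {
		value := rand.Float32()
		if staticValue {
			value = 0.4601365
		}
		tickers[i] = Ticker{Name: randomName(nameLen), Value: value}
	}
	return tickers
}

// gzippedJSONSize marshals v to JSON and returns the gzipped length; the real
// test did the same with the Protocol Buffers serialization and compared both.
func gzippedJSONSize(v any) int {
	raw, _ := json.Marshal(v)
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(raw)
	zw.Close()
	return buf.Len()
}

func main() {
	for _, nameLen := range []int{3, 20, 100} {
		size := gzippedJSONSize(makeTickers(200, nameLen, false))
		fmt.Printf("200 tickers, name length %3d: %6d bytes of gzipped JSON\n", nameLen, size)
	}
}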
What does this mean?
There are several points I’m taking away from this:
- Compressing your JSON is important if you want to compete on data volume, and it's easily done (see the minimal handler sketch after this list). Without compression, JSON is obviously inefficient just because it keeps repeating field names endlessly. All further points in this list assume you have activated compression!
- Real-world gains of Protocol Buffers vs JSON are possibly as high as 15%, but anything beyond that is hype.
- Achieving any gains above 8% or so requires the "right" data structure, i.e. one that is heavy on type-encodable information such as numbers or dates. (8% was the gain in my test when I gave the string field the same nine-character length as the random number field.)
- Typing is the reason for all of the gains I looked at. I have not checked out whatever support Protocol Buffers have for dynamic data, but without type information the gains I found would not be possible. If you don't like the idea of typing your messages, Protocol Buffers and gRPC are not for you.
- As you can see in Nils’ original test, for small payloads Protocol Buffer gains are higher. Of course the actual payload sizes are also small in these cases, so any real gain for your application system will only become relevant with a large number of messages. When payload sizes grow (even if we’re still just talking 20KB or so), Protocol Buffer gains are much smaller and less relevant.
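Since the first point above is the one with the biggest practical effect, here is a minimal sketch of gzipping a JSON response by hand in a Go HTTP handler. The path, port and payload are just placeholders, and in practice you would typically leave this to a reverse proxy or middleware and check the client's Accept-Encoding header first.

package main

import (
	"compress/gzip"
	"encoding/json"
	"log"
	"net/http"
)

func tickerHandler(w http.ResponseWriter, r *http.Request) {
	payload := []map[string]any{{"name": "abc", "value": 0.4601365}}

	w.Header().Set("Content-Type", "application/json")
	w.Header().Set("Content-Encoding", "gzip")

	// Wrap the response writer in a gzip writer; everything the JSON encoder
	// emits goes out compressed.
	zw := gzip.NewWriter(w)
	defer zw.Close()
	json.NewEncoder(zw).Encode(payload)
}

func main() {
	http.HandleFunc("/tickers", tickerHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}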
Additional thoughts
I personally find it strange how much value many of the people I've had conversations with attach to the size benefits of Protocol Buffers. I've worked with remote call technologies for more than twenty years and I've seen data transfer formats come and go. For years now, I've seen broad acceptance of the idea that readable text-based formats are preferable to encoded binary ones, even if they come with a volume penalty. For a long time, binary formats (which have always been around, of course!) have been regarded as a special solution for special requirements.
To me, Protocol Buffers are a solution I'll consider when faced with the right problem. There's nothing wrong with using them if your environment supports that decision, but I don't see them replacing JSON and other text-based formats; for most application systems there isn't any such need either. Maybe I'll be wrong, who knows!