Filtering Protocol Buffers files; PowerShell’s Unicode tendency

I am using Google’s Protocol Buffers to serialize and deserialize some data structures. Protocol Buffers uses field numbering, so your definitions look like this:

message Foo
{
	required int32 bar = 1;
	optional BazStruct baz = 2;
	repeated string words = 3;
	...
}

The numbers are good because they make your classes extensible – you can add new fields without breaking code that expects the old version. But they are annoying while you write the first version. Every time you delete a field or move fields around, you have to redo the numbering.

I wrote a Python script that adds numbers to a numberless .proto file. Now I can leave the numbers out and use the script to insert them when I’m ready to compile:

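# Simple text processor that adds numbers to the fields
# in Protocol Buffers message definitions.
# Reads a numberless .proto from stdin, writes the numbered version to stdout.

import sys

inbraces = False
count = 1
c = sys.stdin.read(1)

while c:  # read(1) returns '' at end of input
    if c == '{':
        # a new message body starts: restart the numbering
        inbraces = True
        count = 1
    elif c == '}':
        inbraces = False
    elif c == ';' and inbraces:
        # end of a field declaration: append its number
        sys.stdout.write(' = ' + str(count))
        count += 1

    sys.stdout.write(c)
    c = sys.stdin.read(1)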
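
To see what the script does to a definition, here is a minimal sketch of the same logic wrapped in a function and run on the numberless version of the Foo message from above. The name number_fields and the sample string are only for illustration; they are not part of the original script.

```python
import io


def number_fields(text):
    """Return text with ' = N' inserted before each ';' found inside braces."""
    out = io.StringIO()
    in_braces = False
    count = 1
    for c in text:
        if c == '{':
            # a new message body starts: restart the numbering
            in_braces = True
            count = 1
        elif c == '}':
            in_braces = False
        elif c == ';' and in_braces:
            out.write(' = ' + str(count))
            count += 1
        out.write(c)
    return out.getvalue()


# Illustrative input: the Foo message without field numbers.
sample = (
    "message Foo\n"
    "{\n"
    "\trequired int32 bar;\n"
    "\toptional BazStruct baz;\n"
    "\trepeated string words;\n"
    "}\n"
)

numbered = number_fields(sample)
print(numbered)
```

Like the script, this sketch numbers every semicolon it finds between braces, so it only suits files that contain nothing but simple, non-nested message definitions.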

Since I’m developing for Windows, I used PowerShell to redirect the output of my Python script into a file:

cat data_nonumbers.proto | python number_proto.py > data.proto

But I encountered a nasty surprise when I ran the Protocol Buffers compiler on data.proto:

data.proto:1:4: Invalid control characters encountered in text.
data.proto:1:6: Invalid control characters encountered in text.
data.proto:1:8: Invalid control characters encountered in text.
...

There were hundreds of lines with this error message.

I was immediately suspicious of text encoding issues, so I opened data.proto in Notepad and re-saved it with ANSI encoding. Sure enough, the problem disappeared.

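I tried changing my Python code to ensure that it was outputting ASCII, but this didn’t help. It turns out that the culprit was PowerShell. PowerShell uses Unicode for all its strings internally, and when Windows PowerShell redirects output with >, it writes the file in what Windows calls “Unicode” encoding, which is UTF-16. The Protocol Buffers compiler was choking on the extra bytes. The fix was a simple modification to my PowerShell command: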

cat data_nonumbers.proto | python number_proto.py | out-file -encoding ascii data.proto
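
For the curious, here is a rough sketch of what seems to have gone wrong at the byte level. It assumes the “Unicode” file was little-endian UTF-16 with a byte order mark, which is what Windows usually means by “Unicode”, and it uses a made-up first line, since the real contents of data.proto aren’t shown here. The helper names looks_like_utf16 and to_ascii are hypothetical.

```python
import codecs

# Hypothetical first line of data.proto; the real file's contents may differ.
first_line = "message Foo"

# What a UTF-16 LE writer emits: a 2-byte byte order mark, then 2 bytes per character.
utf16 = codecs.BOM_UTF16_LE + first_line.encode("utf-16-le")

# 1-based positions of the first few NUL bytes an ASCII-only parser would trip over.
nul_columns = [i + 1 for i, b in enumerate(utf16) if b == 0][:3]
print(nul_columns)  # [4, 6, 8]


def looks_like_utf16(data):
    """Return True if the bytes start with a UTF-16 byte order mark."""
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


def to_ascii(data):
    """Re-encode UTF-16 bytes as ASCII; leave any other input untouched."""
    if looks_like_utf16(data):
        # The "utf-16" codec reads the byte order mark and strips it.
        return data.decode("utf-16").encode("ascii")
    return data


print(to_ascii(utf16))  # b'message Foo'
```

Every ASCII character picks up a trailing zero byte in UTF-16, and those zero bytes land at positions 4, 6, 8 and so on. That lines up with the columns in the error messages, although I haven’t checked exactly how the compiler counts columns.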