#59 Haystack Str escape characters for Zinc encoding

Matthew Giannini Fri 5 Oct 2012

According to the BNF for Zinc, it looks like Str and Uri require escaping $ and `, but from a quick look at the Java reference implementation, this does not appear to be enforced or checked. The test suite does not appear to reflect this requirement either.

Can you clarify the requirements for character escaping in Str and Uri?

Thanks, matthew

Brian Frank Sat 6 Oct 2012

I copied that from Fantom. I think we should do the same thing in the Java implementation too, so I will take a look. Thanks for pointing that out.

Brian Frank Mon 15 Oct 2012

Actually what I was doing in the Java code was just disallowing any character that required escaping in HUri. But that causes some round trip problems for values you might read from a server. This is what I propose as the new BNF for Str/Uri:

<str>         := """ <strChar>* """
<uri>         := "`" <uriChar>* "`"
<strChar>     := <unicodeChar> | <strEscChar>
<uriChar>     := <unicodeChar> | <uriEscChar>
<unicodeChar> := any 16-bit Unicode char >= 0x20 (except str/uri quote)
<strEscChar>  := "\b" | "\f" | "\n" | "\r" | "\t" | "\"" | "\\" | "\$" | <uEscChar>
<uriEscChar>  := "\:" | "\/" | "\?" | "\#" | "\[" | "\]" | "\@" | "\\" | "\&" | "\=" | "\;" | <uEscChar>
<uEscChar>    := "\u" <hexDigit> <hexDigit> <hexDigit> <hexDigit>

The Str escapes are pretty much the standard C-lang escapes plus "$". The URI escapes are the special characters that have meaning in the URI structure that you might want to use without their special behavior. For example, a file name like "File#2.txt" is something you run across, but you don't want the "#" to be interpreted as the fragment identifier. We allow \uxxxx escapes in both Str and Uri, but any Unicode char at 0x20 or above (other than the quote char) can also just be written directly.
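As a rough illustration of the proposed grammar, here is a minimal Python sketch of an encoder for both literals. The function names (encode_str, encode_uri_literal) are hypothetical and not part of any Haystack library. It assumes the URI value is a single literal name such as "File#2.txt", so every structural character gets escaped, and it assumes a backtick inside a Uri has to be written as \u0060 because the grammar lists no backslash escape for it.

```python
# Sketch of the proposed Zinc Str/Uri escaping. All names are hypothetical.
STR_ESCAPES = {
    "\b": "\\b", "\f": "\\f", "\n": "\\n", "\r": "\\r", "\t": "\\t",
    '"': '\\"', "\\": "\\\\", "$": "\\$",
}
# Characters listed in <uriEscChar> of the proposed BNF
URI_ESCAPE_CHARS = set(":/?#[]@\\&=;")


def _u_escape(ch):
    # <uEscChar> := "\u" followed by four hex digits
    return "\\u%04x" % ord(ch)


def encode_str(s):
    """Encode s as a Zinc Str literal, including the surrounding quotes."""
    out = ['"']
    for ch in s:
        if ch in STR_ESCAPES:
            out.append(STR_ESCAPES[ch])
        elif ord(ch) < 0x20:
            # Control chars without a named escape fall back to \uXXXX
            out.append(_u_escape(ch))
        else:
            out.append(ch)
    out.append('"')
    return "".join(out)


def encode_uri_literal(s):
    """Encode s as a Zinc Uri literal, treating every char as literal text."""
    out = ["`"]
    for ch in s:
        if ch in URI_ESCAPE_CHARS:
            out.append("\\" + ch)
        elif ord(ch) < 0x20 or ch == "`":
            # Assumption: no backslash escape exists for these, so use \uXXXX
            out.append(_u_escape(ch))
        else:
            out.append(ch)
    out.append("`")
    return "".join(out)
```

For example, encode_uri_literal("File#2.txt") produces the literal with "\#" in place of "#", while spaces and non-ASCII characters pass through unchanged. A full URL would need a different policy, since its ":" and "/" are meant to keep their structural meaning.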

Matthew Giannini Mon 15 Oct 2012

Why not use RFC 2396 percent escaping (a "%" followed by two hex digits) for URIs (similar to java.net.URI)?

file#1.txt would encode to file%231.txt

Using the suggested escaping means that decoded haystack URIs might not be usable without another level of encoding. For example,

http://www.haystack.com/file#1.txt

would need to be encoded in haystack as

http://www.haystack.com/file\#1.txt.

Decoding would yield the same result. But that representation is not actually resolvable if you paste it into a web browser. You'd have to do another round of encoding to change all haystack uri escape sequences into percent escapes.

http://www.haystack.com/file%231.txt

So why not just use percent escaping in the first place?
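The extra round of encoding described above could look roughly like the following Python sketch. The helper name zinc_uri_to_percent is hypothetical. It only rewrites the single-character backslash escapes from the proposed grammar and leaves \uXXXX sequences untouched, which a real converter would also have to handle.

```python
from urllib.parse import quote, unquote

# Characters that the proposed BNF allows after a backslash in a Uri
ZINC_URI_ESCAPABLE = set(":/?#[]@\\&=;")


def zinc_uri_to_percent(body):
    """Rewrite backslash-escaped chars in a Zinc Uri body as %XX escapes.

    body is the text between the backticks. Hypothetical helper that
    ignores \\uXXXX escapes.
    """
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body) and body[i + 1] in ZINC_URI_ESCAPABLE:
            # safe="" forces even "/" and ":" to be percent-encoded
            out.append(quote(body[i + 1], safe=""))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Standard-library percent-encoding of the file name from the example
encoded = quote("file#1.txt")   # "#" becomes %23
decoded = unquote(encoded)      # back to the original name
```

Applied to the Uri body from the example, zinc_uri_to_percent turns "file\#1.txt" into "file%231.txt" and leaves the unescaped "://" and "/" of the URL alone.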

Brian Frank Mon 15 Oct 2012

I debated that too originally. But the reason I personally prefer backslash escapes:

  • common chars in filenames like space don't need to be escaped
  • Unicode chars (also commonly found in file systems) don't need to be escaped (which I've found to be especially buggy in % encoding libraries)
  • seems better to be consistent with String and C-like languages that already use backslash escapes
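To make the trade-off in the list above concrete, this small Python sketch encodes one made-up file name both ways. The function names are hypothetical, the backslash form follows the proposed uriEscChar set, and the percent form uses the standard library's UTF-8 based quoting, which is only one of several possible % encoding conventions.

```python
from urllib.parse import quote

# Characters listed in <uriEscChar> of the proposed BNF
URI_ESCAPE_CHARS = set(":/?#[]@\\&=;")


def backslash_form(name):
    """Escape only the URI-structural chars with a backslash."""
    return "".join("\\" + c if c in URI_ESCAPE_CHARS else c for c in name)


def percent_form(name):
    """Percent-encode everything outside the unreserved set (UTF-8 bytes)."""
    return quote(name, safe="")


# Made-up file name with accented letters (U+00C9, U+00E9), spaces and "#"
name = "\u00c9t\u00e9 report #2.txt"

print(backslash_form(name))  # only the "#" gains a backslash
print(percent_form(name))    # spaces, "#" and each accented letter become %XX
```

In the backslash form the name stays readable and only one character changes. In the percent form each accented letter expands to two %XX groups and every space to %20.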
