#898 Community Feedback: making Trio a proper subset of YAML

Matthew Giannini Thu 25 Mar 2021

We are making this post to request community feedback on interest in making the Trio format compatible with a subset of YAML so that YAML parsers are able to read Trio-formatted Haystack data. In order to do this we will require some modifications to the Trio format (some of which are breaking), so what follows is a brief analysis of issues with the existing Trio format and an initial proposal of how to fix those issues.

If you would like to see this feature, please respond with your feedback to this proposal.

When Trio was first designed, it was intended to be a proper subset of YAML. However, there are a few issues with the current format that make it invalid YAML, and some new issues were introduced when nested data structures were added to Haystack. The following sections detail these issues with potential design solutions.

Strings

Multiline strings behave differently in Trio and YAML. Because of how trailing whitespace is handled in Trio, there is no equivalent "block indentation/chomping indicator" option in YAML to achieve equivalent meaning. The example below shows the difference between Trio and YAML:

---
text:  
  This is a string  
  that spans many lines.

  New Paragraph.

---  
  Trio => "This is a string\nthat spans many lines.\n\nNew Paragraph.\n"  
  YAML => "This is a string that spans many lines.\nNew Paragraph."

The proposed fix would be to generate Trio using the YAML |- block scalar header (the literal style indicator | plus the "strip" chomping indicator -), which says to treat the string as a literal string and to strip trailing line breaks. We would have to change the behavior of the Trio reader to also strip trailing whitespace (which it currently preserves). We could also potentially add support for the |+ "keep" chomping indicator to preserve trailing whitespace (or make this the default instead).
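To make the difference concrete, here is a minimal Python sketch that emulates the YAML behaviors mentioned above on the example text. It is not a YAML parser and not part of any Haystack library: the function names are hypothetical, and plain_fold only covers the simple case shown (single line breaks fold to a space, blank lines become line breaks).

```python
import re


def literal_strip(block: str) -> str:
    """Emulates '|-': keep inner line breaks, drop all trailing line breaks."""
    return block.rstrip("\n")


def literal_keep(block: str) -> str:
    """Emulates '|+': keep every line break, including trailing ones."""
    return block


def plain_fold(block: str) -> str:
    """Emulates folding of a multi-line plain (unquoted) YAML scalar."""
    text = "\n".join(line.strip() for line in block.strip().split("\n"))
    # A run of k line breaks folds to k-1 line breaks, or a space when k == 1.
    return re.sub(r"\n+", lambda m: "\n" * (len(m.group()) - 1) or " ", text)


# The text block from the example, followed by one blank line.
block = "This is a string\nthat spans many lines.\n\nNew Paragraph.\n\n"

print(repr(literal_strip(block)))
print(repr(literal_keep(block)))
print(repr(plain_fold(block)))
```

With |- the inner line breaks survive and only the trailing ones are dropped, whereas the plain scalar collapses the first line break into a space, which is the YAML result shown in the example.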

Markers

---
site
---  
  Trio => marker tag  
  YAML => by itself interpreted as a string, 
          when part of a Dict, it is invalid syntax

The proposed fix for markers would be to generate the following Trio. We use the Unicode check mark (\u2713) to denote a marker. It must be unquoted to indicate a marker. If quoted, then it is a string containing the check mark. This way Trio readers can still distinguish between a marker and a string with the check mark.

---
site: ✓    # marker tag
dis: "✓"   # string
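As a rough illustration of how a reader could apply this rule, here is a small Python sketch. MARKER, parse_scalar and parse_tag_line are hypothetical names, not an existing Haystack API, and the comment stripping is deliberately naive (it would break on a '#' inside a quoted string).

```python
MARKER = object()  # hypothetical sentinel standing in for the Haystack marker value


def parse_scalar(raw: str):
    """Unquoted check mark -> marker; double-quoted text -> string."""
    raw = raw.strip()
    if raw == "\u2713":
        return MARKER
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return raw[1:-1]  # no escape handling in this sketch
    return raw  # anything else is left as raw text for a fuller reader


def parse_tag_line(line: str):
    """Split a 'name: value  # comment' line into (name, decoded value)."""
    body = line.split("#", 1)[0]  # naive comment stripping
    name, _, value = body.partition(":")
    return name.strip(), parse_scalar(value)


name, val = parse_tag_line("site: \u2713    # marker tag")
print(name, val is MARKER)

name, val = parse_tag_line('dis: "\u2713"   # string')
print(name, repr(val))
```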

URI and Ref

Both of these encodings in Trio are invalid YAML because they begin with reserved characters in the YAML spec (` and @). There is not a very satisfying solution to this problem, but we propose one of the following solutions. Any solution is essentially a special prefix to denote a URI or Ref, so it is a matter of picking one (or coming up with something better).

# double-quoted string. Trio reader would check if string 
# begins and ends with '"`' and if so treat it as a 
# URI instead of a string
uri: "`https://project-haystack.org`"

# using '___`' (3 underscores) prefix to denote a URI since it 
# is an unlikely prefix for a string
uri: ___`https://project-haystack.org`

# only use the '___@' prefix for Refs
ref:  ___@abc-123 "Display Name"
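A sketch of how a reader might decode the prefixed variants, written in Python purely for illustration. The Uri and Ref classes and the decode_prefixed function are hypothetical and only model the '___`' and '___@' options above, under the assumption that the value arrives as an unquoted scalar.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Uri:
    val: str


@dataclass(frozen=True)
class Ref:
    id: str
    dis: Optional[str] = None


def decode_prefixed(raw: str) -> Union[Uri, Ref, str]:
    """Decode the proposed '___`...`' and '___@id "dis"' encodings."""
    raw = raw.strip()
    if raw.startswith("___`") and raw.endswith("`") and len(raw) >= 5:
        return Uri(raw[4:-1])
    if raw.startswith("___@"):
        ident, _, rest = raw[4:].partition(" ")
        dis = rest.strip()
        if len(dis) >= 2 and dis[0] == '"' and dis[-1] == '"':
            return Ref(ident, dis[1:-1])
        return Ref(ident)
    return raw  # not a URI or Ref; left for the rest of the reader


print(decode_prefixed("___`https://project-haystack.org`"))
print(decode_prefixed('___@abc-123 "Display Name"'))
```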

List

Trio lists are currently parsable as flow-style YAML sequences. However, the problem is that items in the List are written with their Zinc encoding instead of as Trio. To get around this and maintain backwards compatibility, we'd have to support reading/writing lists using YAML block-style sequences:

["one", "two", "three"] =>

# Trio
- "one"
- "two"
- "three"

See also the discussion below on Dicts and Nested Collections.

Dict

Trio today writes Dicts in something resembling a flow-style YAML map; however, it is not syntactically valid YAML (this is particularly true when nested) because:

  1. markers are not key/value - but this can be addressed with the proposed change to markers  
    • similar for URI and Ref
  2. key/value pairs are not separated by commas (which is required by YAML)
  3. Values are encoded as Zinc instead of Trio/YAML
  4. Values are not separated by a space from their key: (required by YAML)

In order to maintain backwards compatibility, Trio reader/writer would need to use block-style map:

dict := 
Etc.makeDict([  
  "a": "A",  
  "number": Number(3),  
  "list": [true, false, "a", "b"],  
  "site": Marker.val,  
  "bool": true]) =>

# TRIO
a: "A"
number: 3
list:  
  - true  
  - false  
  - "a"  
  - "b"
site: ✓
bool: true

Grid

We will not support encoding Grids to YAML-parseable Trio.

Nested Collections

We need to make all collections encode/decode recursively as Trio:

# Lists with Nested Collections
[ [1,2,3], Etc.makeDict(["a": "A", "number": Number(3)]) ] =>

# Trio
-
  - 1  
  - 2  
  - 3
-  
  a: "A"  
  number: 3

# Dict with Nested Collections
Etc.makeDict(["list": [1, 2, 3], "dict": Etc.makeDict(["dis": "Foo", "on": true])]) =>

# Trio
list:   
  - 1  
  - 2  
  - 3
dict:  
  dis: "Foo"  
  on: true
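
The recursive encoding could be sketched as follows. This is Python rather than Fantom and only an illustration under simplifying assumptions: plain Python dicts, lists, bools, numbers and strings stand in for Haystack values, MARKER is a hypothetical sentinel, and units, empty collections and the other scalar kinds are not handled.

```python
MARKER = object()  # hypothetical stand-in for Marker.val


def encode_scalar(val) -> str:
    if val is MARKER:
        return "\u2713"
    if isinstance(val, bool):  # check bool before int: bool is a subclass of int
        return "true" if val else "false"
    if isinstance(val, (int, float)):
        return str(val)
    if isinstance(val, str):
        return '"' + val.replace("\\", "\\\\").replace('"', '\\"') + '"'
    raise TypeError(f"unsupported value: {val!r}")


def encode(val, indent: int = 0) -> list:
    """Return the block-style lines for val, nesting collections recursively."""
    pad = " " * indent
    lines = []
    if isinstance(val, dict):
        for key, item in val.items():
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}{key}:")
                lines.extend(encode(item, indent + 2))
            else:
                lines.append(f"{pad}{key}: {encode_scalar(item)}")
    elif isinstance(val, list):
        for item in val:
            if isinstance(item, (dict, list)):
                lines.append(f"{pad}-")
                lines.extend(encode(item, indent + 2))
            else:
                lines.append(f"{pad}- {encode_scalar(item)}")
    else:
        lines.append(pad + encode_scalar(val))
    return lines


print("\n".join(encode([[1, 2, 3], {"a": "A", "number": 3}])))
print("\n".join(encode({"list": [1, 2, 3], "dict": {"dis": "Foo", "on": True}})))
```

The two calls correspond to the list and dict examples above.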

Gareth David Johnson Sat 27 Mar 2021

The new Haystack encoding format (a.k.a Hayson) encodes to YAML very well.

Most of the world's IDEs support the auto-complete and syntax checking for Hayson when you rename a file to fileNameGoesHere.hayson.yaml.

Why not just use this? I think there are bigger fish to fry rather than inventing yet another YAML/JSON format.

Richard McElhinney Sun 28 Mar 2021

I think I tend to agree with Gareth on this one. I'm not sure what problem is being solved here?

Matt - can you articulate the need for another "official" encoding format? We already have Zinc, Hayson, Trio, and CSV. Do we really need another one?

Gareth - I'm curious to know what your "bigger fish" are that need frying. Do you have a shopping list of things you would like to see? Or perhaps start another thread for that rather than hijacking this one.

Matthew Giannini Mon 29 Mar 2021

This isn't about adding a new format per se; it's about making an existing format (Trio) conform to YAML so that Trio can be parsed as YAML. Also, I'm not suggesting we absolutely need this; this thread is more to explore whether people perceive a need for this or would like it.

However, since YAML is a superset of JSON, I do like the idea of just using the Haystack 4 JSON format as the way to get YAML.

Gareth David Johnson Wed 31 Mar 2021

We have to remember that with Haystack 4, the standard is a lot more complex to consume than it used to be. In order for it to be a real success that's adopted by everyone everywhere, it's critical it stays focused on its core strengths (tagging, defs). It needs to lose the parts that don't add any value but increase complexity unnecessarily and confuse newly interested parties who wonder why they exist (I'm looking at you, Zinc and Trio).

What I'm trying to say is I don't think we need Zinc or Trio at all. BOOM! (Gareth drops mic and walks off...)

I will start another thread of discussion as this one should be aimed towards Matthew's requested feedback.

Brian Frank Wed 31 Mar 2021

As Matthew said, the goal was not to create an alternate format, but only to evaluate making Trio easier to parse at the syntactic level.

With regards to "just use the JSON format", it's hard to see how that would be practical. For the most part Zinc and JSON fill the same role - a standard way to exchange Haystack data with 100% full fidelity between software applications. They are both human readable, but really designed for machines to generate and consume.

Trio's role, on the other hand, is for data which is hand edited - it's a source language used by humans. As such, JSON is a pretty awful human language, as strict JSON disallows comments, trailing commas, etc. And even if we use the Hayson format in YAML guise, it's still incredibly verbose for humans to read/write.

Consider the coolingCoil def just discussed in the previous post:

// Trio
def: ^coolingCoil
is: ^coil
cools: ^air
coolingProcess: ^coolingProcessType
childrenFlatten: [^ductDeckType, ^ductSectionType]
doc: 
  Coil used to cool air.
children: [
   {stage:1 cool run cmd point},
   {chilled water cool valve cmd point},
]  

// Hayson flavored YAML
def: 
  _kind: symbol 
  val: coolingCoil
is: 
  _kind: symbol 
  val: coil
cools: 
  _kind: symbol 
  val: air
coolingProcess: 
  _kind: symbol 
  val: coolingProcessType
childrenFlatten: 
  - _kind: symbol  
    val: ductDeckType
  - _kind: symbol 
    val: ductSectionType
doc: |+
  Coil used to cool air. 
children: 
  - stage: 1
    cool: 
      _kind: marker
    run: 
      _kind: marker
    cmd: 
      _kind: marker
    point: 
      _kind: marker
  - chilled:
      _kind: marker
    water: 
      _kind: marker
    cool: 
      _kind: marker
    valve: 
      _kind: marker
    cmd: 
      _kind: marker
    point:
      _kind: marker

Trio has become a lot more important lately because it's the source of all the defs, which is why this topic came up. It's really just a trade-off between the most optimal source format to use vs trying to make it easier to use off-the-shelf tools. Although in the case of the def sources, if you really want to work with them directly, then parsing the source files is just a fraction of the code required for a full-on def compiler. I would expect most everybody would work with the normalized def files which we provide in all the formats (zinc, json, csv, turtle, etc).

Plus we should enhance some of the existing software stacks to provide easy command lines to convert from any format to any other format (that has been on my todo list for a while now).

Gareth David Johnson Thu 1 Apr 2021

YAML has a similar purpose to Trio with some advantages. Firstly though let's tweak some of the above code to make it more palatable as I don't think what was presented previously was a fair comparison...

// Trio
def: ^coolingCoil
is: ^coil
cools: ^air
coolingProcess: ^coolingProcessType
childrenFlatten: [^ductDeckType, ^ductSectionType]
doc: 
  Coil used to cool air.
children: [
   {stage:1 cool run cmd point},
   {chilled water cool valve cmd point},
]

// Hayson YAML using inline objects, anchors and aliases
def: { _kind: symbol, val: coolingCoil }
is: { _kind: symbol, val: coil }
cools: {_kind: symbol, val: air }
coolingProcess: {_kind: symbol, val: coolingProcessType }
childrenFlatten: 
  - { _kind: symbol, val: ductDeckType }
  - { _kind: symbol, val: ductSectionType }
doc: Coil used to cool air. 
children: 
  - { stage: 1, cool: &m { _kind: marker }, run: *m, cmd: *m, point: *m }
  - { chilled: *m, water: *m, cool: *m, valve: *m, cmd: *m, point: *m }

YAML has some distinct advantages over Trio. This becomes even more pronounced when working with larger documents...

  • YAML linters and code auto-formatters help with code inconsistencies and general maintenance.
  • Hayson JSON schema provides error checking and autocomplete. Even granular values such as dates are checked and highlighted if not formatted correctly. Without this you need to compile your Trio code each time to detect errors. It's a lot nicer to have them highlighted in the IDE as you type!
  • YAML provides different ways of handling multiline strings in regards to new lines.
  • YAML collections (arrays and objects) can be split into multiple lines or inline as demonstrated above. In a lot of cases, it's preferable to split something like a large array into multiple lines.
  • YAML anchors and aliases provide code reuse between sections of the document. In the above example I'm showing how one can define a marker tag just once and then reuse it as a symbol across a document. This idea can be extended to handle entire object blocks that can be reused in a document.

There's a lot more to YAML than I've mentioned here...

https://learnxinyminutes.com/docs/yaml/

One area of interest would be to see how custom YAML tags could be utilized with haystack types to make the syntax even less verbose and easier to work with. It would be great if we could do something like this (I really don't know if this would work - it needs more research)...

// YAML with possible custom defined tags...
def: !sym coolingCoil
is: !sym coil
cools: !sym air
coolingProcess: !sym coolingProcessType
childrenFlatten: [!sym ductDeckType, !sym ductSectionType]
doc: Coil used to cool air.
children: 
  - { stage: 1, cool: !m, run: !m, cmd: !m, point: !m }
  - { chilled: !m, water: !m, cool: !m, valve: !m, cmd: !m, point: !m }

The Python community has already done something like this...

https://pypi.org/project/PyYAML/

When comparing the above two examples, I admit the Trio document does look a little bit nicer. IMO it's not a big enough difference to warrant the disadvantages of having custom data formats in the haystack standard.

Jay Herron Thu 1 Apr 2021

I think that the Trio improvements Matthew suggests would improve the usability of the Trio format and aren't very invasive, so I'm in support.

As for the "Trio vs Hayson YAML" discussion, it's been super interesting to me as someone who isn't intimately familiar with YAML. However, it also seems like a false choice. Having Trio as a YAML subset sounds useful, and the fact that Hayson translates to YAML also sounds useful. I'm not sure why we would have to choose one or the other.

Correct me if I'm wrong Gareth, but wouldn't making Trio a YAML subset give it some of the advantages that you list above?

Steve Eynon Thu 1 Apr 2021

YAML is indeed a powerful and comprehensive documentation format. But what else would you expect from an 84 page specification!

Gareth, when you list all the cool macro, multi-line, and alias'n'anchor features; while all this is great for the experienced author, I think you forget that not all languages have a fully featured YAML parser.

And those parsers that do exist, with the YAML spec being as extensive / expressive as it is, tend to be either incomplete, or pretty heavyweight.

So while SkyFoundry may agree that full on YAML would be the way to go, I doubt they're interested in developing the accompanying full-on Fantom YAML parser! Hence the question, would updating Trio to conform to a subset of features be acceptable to users?

Myself, I think the current Trio format is pretty, simple, and succinct. The datatypes follow Haystack standards, and it's easily understandable.

To change it (and incur backwards incompatibilities) may make files easier to parse in other languages (that already have a YAML parser), and perhaps have better syntax highlighting support - but myself... I use Fantom and my own .trio files rarely get that complicated.

So I'm leaning towards, if it ain't broke, don't fix it, but I'm struggling to argue strongly in either direction.


Disclaimer: I've been contemplating writing a YAML parser in Fantom for the longest time - so the idea of someone writing it for me really piqued my interest!

Andy Frank Thu 1 Apr 2021

Not sure I have a horse in this race...

And maybe we should break this out into another thread - but as a data point from someone who just wrote a Haystack implementation...

Having to write a Zinc parser/writer was a big pain (the authentication was too but that's another discussion). And the whole time I kept feeling it should be unnecessary...

There is a lot all tied up in "Haystack" - but the most valuable thing is of course the model. So does having these tech decisions woven in cause unnecessary confusion/hindrances to adoption/participation?

Gareth David Johnson Sat 3 Apr 2021

I won’t comment on my aforementioned discussion in this thread anymore. It should be moved to another thread.

In regards to the original discussion, I agree with Steve. For Trio, if it ain’t broke then don’t fix it. Any new version of Trio means any Haystack implementations also need to support the new and old versions.
