#601 Haystack 3.0 data type queries

Stuart Longland Thu 29 Mar 2018

Hi all,

I have a silly question regarding the collection types. Recently, I added list support to hszinc so that it could parse and generate lists. I have an issue in progress to add support for the other data types as well.

I'm also adding unit tests as I go. Some of my test cases have been taken from node-haystack; others (particularly the list tests) are home-grown.

I haven't, as yet, bothered to look into Trio. I'd argue that building on YAML for representing Project Haystack types would be a more productive pursuit, but that's just my opinion. :-)

My intent is to make hszinc a robust, independent implementation of the Project Haystack grid serialisation format. For anyone who has used a recent pyhaystack: hszinc is the low-level serialisation/deserialisation guts.

In light of this goal, and having read the specification, I'm left with a few small queries.

Lists

Implicit NULLs

Regarding lists, one of the things I struggled with was implicit NULLs and trailing commas.

The spec says that the following two lists are equivalent.

[1,2,3]
[1,2,3,]

No problem there. But how about these?

[N,N,N]
[,,]

I'm using pyparsing to implement the grammar parsing, and I found it very difficult to get it to handle that case properly. Previous versions of hszinc used a PEG parser called parsimonious, which really couldn't handle the left-recursive nature of ZINC 3.0 well at all.

I wanted to avoid the DIY parser route, as it makes the code harder for people to understand and maintain.

I think for the benefit of those who are implementing parsers, we should formally declare what the status is on implicit NULLs in lists.

Order of elements

When comparing lists, is the order of elements important? This is particularly thorny when we consider lists in filter strings.

Dicts

These appear to be a common building block of grids, with dict objects appearing in the grid meta, the metadata for each column, and the rows.

In this context, it's pretty clear that the keys are not just strings, they are actually tag names, with the values being any valid Haystack type.

However, we are now going to have the situation where a dict may appear as the value for a given tag. In this case, what data type is used for the keys? Are they likely to be tag names, or is it possible we'll see a dict that is keyed by numbers, for example?

XStr

If I understand this correctly, this is effectively an enumeration type, with the XStr's type field working sort of like a "class name" and the string value within the brackets giving the value of that specific tag.

Is there going to be a registry somewhere of standard XStr enumerations for us to validate against? How does a Haystack client know what the valid enumeration values are?

Regards, Stuart Longland

Brian Frank Thu 29 Mar 2018

I haven't as yet, bothered to look into Trio. I'd argue that building on YAML

Trio is YAML. We are just using the Zinc format for typing instead of YAML "!!" explicit typing.

[N,N,N] vs [,,]

You are not allowed to omit values in a list. That syntax is only supported at the grid level (to make it like CSV). So it must be [N,N,N] or [N,N,N,]
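To make that rule concrete, here is a minimal sketch of how a list body could be checked. It is not how hszinc does it (that uses pyparsing); `split_list_items` is a hypothetical helper, and it assumes flat lists of simple scalars with no nesting and no commas inside quoted strings.

```python
def split_list_items(text):
    """Split a flat Zinc-style list literal into its raw item strings.

    Assumption: items are simple scalars; nested lists, dicts and quoted
    strings containing commas are out of scope for this sketch.
    """
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("not a list literal: %r" % text)
    body = text[1:-1].strip()
    if not body:
        return []  # "[]" is an empty list
    items = [item.strip() for item in body.split(",")]
    if items[-1] == "":
        items.pop()  # a single trailing comma is allowed
    if any(item == "" for item in items):
        # Omitted values are only legal at the grid level, not in lists
        raise ValueError("omitted value in list: %r" % text)
    return items
```

With this, `[1,2,3]` and `[1,2,3,]` give the same three items, `[N,N,N]` gives three explicit nulls, and `[,,]` is rejected.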

When comparing lists, is the order of elements important

List ordering is semantically significant. It is not for Dicts.
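For a Python implementation, the built-in `list` and `dict` types already compare this way, so a sketch that maps Haystack lists and dicts onto them gets the semantics for free. The tag names and values below are made up for illustration.

```python
# Lists: same elements in a different order are NOT equal
list_a = [1, 2, 3]
list_b = [3, 2, 1]
lists_equal = list_a == list_b

# Dicts: same key/value pairs in a different order ARE equal
dict_a = {"dis": "Site A", "area": 100}
dict_b = {"area": 100, "dis": "Site A"}
dicts_equal = dict_a == dict_b
```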

In this context, it's pretty clear that the keys are not just strings, they are actually tag names, with the values being any valid Haystack type.

The keys of any Dict are strings with a restricted char set (the restriction for tag names). So in the generic sense they are "tags", but not always what we would consider formal Haystack ontology tags. Sometimes they are just arbitrary keys to a map/object.

In this case, what data type is used for the keys? Are they likely to be tag names, or is it possible we'll see a dict that is keyed by numbers, for example?

Dicts are always keyed by a string restricted to the tag syntax (starting with a lowercase ASCII letter and containing only ASCII letters, digits, or underbar).
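As a rough sketch, that key rule can be written as a regular expression. The pattern and the `is_valid_dict_key` helper are assumptions derived only from the description above, not taken from any existing library.

```python
import re

# Assumed pattern: a lowercase ASCII letter, then ASCII letters,
# digits or underscores
TAG_NAME_RE = re.compile(r"[a-z][A-Za-z0-9_]*")


def is_valid_dict_key(key):
    """Return True if key is a string that follows the tag-name syntax."""
    return isinstance(key, str) and TAG_NAME_RE.fullmatch(key) is not None
```

So `siteRef` and `a_b2` pass, while `SiteRef`, `1abc`, the empty string and a numeric key such as `42` do not.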

If I understand this correctly, this is effectively an enumeration type.

It's not necessarily an enumerated type with a restricted range. It could be something like Color("#ef7").

Is there going to be a registry somewhere of standard XStr enumerations for us to validate against?

Maybe eventually, but in practice it is really just a tuple of a type name and value string which can be passed around generically. At the parser level I wouldn't make any assumptions.
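In Python terms, "a tuple of a type name and value string" could look something like the sketch below. The `XStr` container and `parse_xstr` are hypothetical names, and the regular expression assumes the value contains no escaped characters, which a real parser would have to handle.

```python
import re
from collections import namedtuple

# Hypothetical container: the parser keeps both parts verbatim and
# makes no attempt to validate the value against the type name
XStr = namedtuple("XStr", ["type_name", "value"])

# Simplifying assumption: no quotes or backslashes inside the value
_XSTR_RE = re.compile(r'([A-Za-z][A-Za-z0-9_]*)\("([^"\\]*)"\)')


def parse_xstr(text):
    """Parse text such as Color("#ef7") into an XStr tuple."""
    match = _XSTR_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError("not an XStr literal: %r" % text)
    return XStr(match.group(1), match.group(2))
```

Calling `parse_xstr('Color("#ef7")')` yields `XStr(type_name='Color', value='#ef7')`, which can be handed to higher layers untouched.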

Stuart Longland Thu 3 May 2018

Trio is YAML. We are just using the Zinc format for typing instead of YAML "!!" explicit typing.

I do note there are some differences that would probably trip up a YAML parser, notably the use of // instead of # as the comment indicator. I'll have to experiment a bit to see what the implications are for other variations, but by the looks of things, some pre-processing will be needed before parsers like ruamel.yaml will successfully parse Trio.
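The comment part of that pre-processing might be as small as the sketch below. It assumes Trio comments only ever appear as whole lines starting with //, which I have not verified against the spec, and `trio_comments_to_yaml` is a hypothetical helper name. Other differences may well remain after this step.

```python
def trio_comments_to_yaml(text):
    """Rewrite whole-line '//' comments as YAML '#' comments.

    Assumption: a comment occupies a full line; '//' appearing later
    in a line (e.g. inside a URL value) is left untouched.
    """
    out = []
    for line in text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("//"):
            indent = line[: len(line) - len(stripped)]
            line = indent + "#" + stripped[2:]
        out.append(line)
    return "\n".join(out)
```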

Are there any test Trio files I can use for parser testing?

You are not allowed to omit values in a list. That syntax is only supported at the grid level (to make it like CSV).

Ahh okay, so it sounds like my approach regarding list parsing is on track then. :-)

One thing to consider with lists is whether some set-like operators might be helpful in performing queries, for example: myRef->someListOfRefs ${SUPERSET_OF} [@ref1, @ref2, @ref3] or anotherList ${CONTAINS} "a value".

That would allow some quite powerful relationships to be represented and queried. I had thought that maybe < and > could be overloaded to mean "superset" and "subset", as "less than"/"greater than" have no real meaning for lists.

== though will have to respect order; it might be handy to have something that says "contains all these elements" without implying "superset"; although you could kludge it by saying (tag ${SUPERSET_OR_EQ} [1,2,3,]) and (tag ${SUBSET_OR_EQ} [1,2,3,]).
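To illustrate what I mean, here is a rough Python sketch of those operators. The function names are placeholders standing in for the `${...}` operators above, not proposed syntax, and converting to `set` assumes the elements are hashable and that duplicates don't matter.

```python
def superset_or_eq(tag_value, wanted):
    """True if tag_value contains every element of wanted."""
    return set(tag_value) >= set(wanted)


def subset_or_eq(tag_value, wanted):
    """True if every element of tag_value appears in wanted."""
    return set(tag_value) <= set(wanted)


def contains(tag_value, item):
    """True if item is one of the list's elements."""
    return item in tag_value


def same_elements(tag_value, wanted):
    """The 'kludge': both directions together ignore ordering."""
    return superset_or_eq(tag_value, wanted) and subset_or_eq(tag_value, wanted)
```

Here `same_elements([3, 1, 2], [1, 2, 3])` is true even though `[3, 1, 2] == [1, 2, 3]` is not, because == respects order.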

I'll leave the above thought for WG 551 to consider.

The keys of any Dict are strings with a restricted char set (the restriction for tag names). So in the generic sense they are "tags", but not always what we would consider formal Haystack ontology tags. Sometimes they are just arbitrary keys to a map/object.

Ahh, so beyond validating that it's a string matching the same rules as tags, I shouldn't validate those further. No problems there. :-)

It's not necessarily an enumerated type with a restricted range. It could be something like Color("#ef7") … At the parser level I wouldn't make any assumptions.

Right, so just expose those as given, and something higher-up can do the validation if needed. The above should help greatly in plugging up the gaps. :-)
