Working Group

#792 Haystack JSON Encoding

Gareth David Johnson Fri 28 Feb

This is a proposal for an additional JSON encoding format to the Haystack standard...

Haystack data has its own type system. It has these granular types…

  • String
  • Number (with units)
  • Bool
  • Uri
  • Ref
  • Def
  • Date
  • Time
  • Date time
  • Coord
  • XStr

It also includes these collection types…

  • List: an array
  • Dict: a map
  • Grid: a table

The default encoding used with these values is Zinc.

Here’s an example of a Zinc grid…

ver:"3.0" projName:"test"
dis dis:"Equip Name",equip,siteRef,installed
"RTU-1",M,@153c-699a "HQ",2005-06-01
"RTU-2",M,@153c-699a "HQ",1999-07-12

Zinc can also be encoded to a JSON format. For example…

{
  "meta": {"ver":"3.0", "projName":"test"},
  "cols":[
    {"name":"dis", "dis":"Equip Name"},
    {"name":"equip"},
    {"name":"siteRef"},
    {"name":"installed"}
  ],
  "rows":[
    {"dis":"RTU-1", "equip":"m:", "siteRef":"r:153c-699a HQ", "installed":"d:2005-06-01"},
    {"dis":"RTU-2", "equip":"m:", "siteRef":"r:153c-699a HQ", "installed":"d:1999-07-12"}
  ]
}

Problems with Zinc

Web Developers are used to working with JSON. Both servers (Node) and browsers include native JSON parsing libraries. In order to work with Zinc encoded data using the standard encoding, the browser’s JavaScript engine has to parse a lot of very large strings. Tests show that parsing JSON is faster than parsing large Zinc encoded strings in these environments.

Even the JSON version of Zinc still has a lot of string parsing involved for the granular types. For example, in the above table a site ref is ‘r:153c-699a HQ’. This string still has to be parsed to get the actual reference value. Therefore, whether we use the standard or the JSON encoding for Zinc, a client still has a lot of work to do parsing strings.
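
As a rough illustration of that extra step, here is a minimal TypeScript sketch of what a client has to do after JSON.parse(…) when it receives Zinc encoded JSON. It only handles the three prefixes that appear in the example above (m:, r: and d:); the function name and the result shape are hypothetical and not part of any Haystack library.

```typescript
// Sketch of the extra decoding step Zinc encoded JSON needs after JSON.parse().
// Only the prefixes seen in the example above are handled. The names and the
// result shape are hypothetical.
type DecodedZincJson =
  | { kind: "Marker" }
  | { kind: "Ref"; id: string; dis?: string }
  | { kind: "Date"; val: string }
  | { kind: "Str"; val: string };

function decodeZincJsonString(raw: string): DecodedZincJson {
  if (raw === "m:") {
    return { kind: "Marker" };
  }
  if (raw.indexOf("r:") === 0) {
    // "r:<id> <display name>": the display name is optional.
    const body = raw.slice(2);
    const space = body.indexOf(" ");
    return space < 0
      ? { kind: "Ref", id: body }
      : { kind: "Ref", id: body.slice(0, space), dis: body.slice(space + 1) };
  }
  if (raw.indexOf("d:") === 0) {
    return { kind: "Date", val: raw.slice(2) };
  }
  // Anything else is treated as a plain string in this sketch.
  return { kind: "Str", val: raw };
}

// JSON.parse() alone leaves the site ref as an opaque string...
const zincJsonRow = JSON.parse(
  '{"dis":"RTU-1","equip":"m:","siteRef":"r:153c-699a HQ","installed":"d:2005-06-01"}'
);
// ...so a second pass is needed to get at the id and display name.
console.log(decodeZincJsonString(zincJsonRow.siteRef));
```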

As well as performance issues, there’s also a data accessibility issue. Since no primitive JSON values are being used, a consumer of any REST APIs has to have some form of client library. This prevents customers using a lot of the standard tools and libraries out there for working with this style of JSON. One example is using JSON schema to validate scalar values automatically.

Zinc always requires a sophisticated client library to parse and work with Haystack data. Not requiring a client library (or a far lighter one) would make it easier for Developers to work with Haystack data.

Hayson

A new encoding format for Haystack related data is required that’s JSON based. It’s nicknamed Hayson, but it should be noted this is just a JSON encoding schema.

The goal of this format is…

  • All types shouldn’t require additional parsing by a client beyond JSON.parse(…) where possible.
  • It should be simple for a developer or system integrator to read and work with.
  • Hayson should look just like standard JSON wherever it possibly can.
  • High fidelity: there is no loss of data in the encoding. For instance…
    • A Hayson dict can never be confused by a client with a Hayson grid.
    • A Haystack Ref should never be accidentally confused with a plain string that happens to start with @.
  • All JSON data is valid Haystack data.
    • A string is a haystack string, a boolean is a haystack boolean, an object a dict, an array a list etc.

Format

The encoding capitalizes on the extremely fast native JSON parsers all modern web browsers have.

Notes

Kind

Why use _kind below? An underscore is invalid as the first character of a tag name, which makes it a safe symbol to use because a tag can never collide with it. The text for kind is directly mapped back to the Kind enumeration used in SkySpark.

A dict is the only object that doesn’t specify a kind. Therefore all JSON objects without a kind are dicts. This is very useful when a grid has to list its rows as dicts.

Value

The ‘val’ is the heart of the encoded value. It never requires further parsing to read the value.
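
To illustrate the two notes above, here is a small TypeScript sketch that works out the kind of a value once it has been through JSON.parse(…). The helper name and the kind names returned for plain JSON values ("Str", "Bool", "List" and so on) are assumptions made for this example; only the _kind values themselves come from the proposal.

```typescript
// Hypothetical helper: works out the kind of an already parsed Hayson value.
// The names returned for plain JSON values are assumptions for this sketch.
type HaysonValue =
  | null
  | string
  | number
  | boolean
  | HaysonValue[]
  | { [key: string]: HaysonValue };

function kindOf(value: HaysonValue): string {
  if (value === null) return "Null";
  if (typeof value === "string") return "Str";
  if (typeof value === "number") return "Num";
  if (typeof value === "boolean") return "Bool";
  if (Array.isArray(value)) return "List";
  // Any object without a string "_kind" is a dict.
  const kind = value["_kind"];
  return typeof kind === "string" ? kind : "Dict";
}

console.log(kindOf(JSON.parse('{"_kind":"Marker"}'))); // Marker
console.log(kindOf(JSON.parse('{"dis":"Hall"}'))); // Dict
```

No string parsing is involved: the check only looks at the JSON type and, for objects, at the _kind property.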

Haystack Types

Each Haystack Type encoded as Hayson...

String

A JSON string.

"a string"

Number

A JSON number. If the number has units, it’s an object with both val and unit.

The kind can be determined via the use of ‘unit’.

123

// or if the number has units...

{
  "_kind": "Num",
  "val": 123,
  "unit": "m"
}

// Handle infinity...
{
  "_kind": "Num",
  "val": "Inf"
}
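
A client reading these three forms could look something like the TypeScript sketch below. The type and function names are hypothetical. "Inf" comes from the example above; "-Inf" and "NaN" are assumptions added for this sketch and are not defined by the proposal.

```typescript
// Sketch: turn a Hayson number (plain or object form) into a value/unit pair.
// "Inf" is taken from the example above. "-Inf" and "NaN" are assumptions.
interface HaysonNumObject {
  _kind: "Num";
  val: number | string;
  unit?: string;
}

interface Quantity {
  value: number;
  unit?: string;
}

function readNum(input: number | HaysonNumObject): Quantity {
  if (typeof input === "number") {
    return { value: input };
  }
  let value: number;
  if (typeof input.val === "number") {
    value = input.val;
  } else if (input.val === "Inf") {
    value = Infinity;
  } else if (input.val === "-Inf") {
    value = -Infinity;
  } else if (input.val === "NaN") {
    value = NaN;
  } else {
    throw new Error(`Unsupported number value: ${input.val}`);
  }
  return input.unit === undefined ? { value } : { value, unit: input.unit };
}

console.log(readNum(123)); // plain JSON number, no unit
console.log(readNum({ _kind: "Num", val: 123, unit: "m" }));
console.log(readNum({ _kind: "Num", val: "Inf" }));
```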

Bool

A JSON boolean

true
// or
false

List

A JSON array.

[]

// A list with some values...
[ true, 123, "a string" ]

Dict

A JSON object.

If it’s an object and no kind is specified, it’s assumed to be a dict.

{}

// A dict with some values...
{
  "site": "A site",
  "num": 123,
  "bool": true
}

Grid

A grid object specifies a kind.

A column has a name and an optional meta dict.

Rows and columns are optional if the grid is empty.

Each row is encoded as a dict.

{
  "_kind": "Grid",
  "meta": { "ver":"3.0", "foo": "bar" },
  "cols": [
    {
      "name": "id",
      "meta": { "size": 123 }
    },
    {
      "name": "dis"
    }   
  ],
  "rows": [
    { "id": 1, "dis": "Hall" },
    { "id": 2, "dis": "Bedroom" }
  ]
}

Marker

A marker object.

{
  "_kind": "Marker"
}

Null

A JSON null.

null

Remove

A remove object.

{
  "_kind": "Remove"
}

NA

A not available object.

{
  "_kind": "NA"
}

Ref

An object for a reference with optional display name.

Since the dis is optional, always specify the kind.

{
  "_kind": "Ref",
  "val": "/foo",
  "dis": "Links to foo"
}

Date

A date object.

{
  "_kind": "Date",
  "val": "2015-06-08"
}

Time

A time object.

{
  "_kind": "Time",
  "val": "15:47:41"
}

DateTime

An object with a date, time and timezone value.

The tz parameter is optional. It defaults to ‘GMT’.

The val is a standard ISO 8601 formatted date time.

{
  "_kind": "DateTime",
  "val": "2015-06-08T15:47:41-04:00",
  "tz": "New_York"
}
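
Because val is a standard ISO 8601 string, the built-in JavaScript Date constructor can read it directly. The TypeScript sketch below shows one way a client might do that while applying the ‘GMT’ default for tz. The interface and function names are hypothetical.

```typescript
// Sketch: read a Hayson DateTime into a JavaScript Date plus a timezone name.
interface HaysonDateTime {
  _kind: "DateTime";
  val: string; // ISO 8601 date time
  tz?: string; // optional, defaults to "GMT" as described above
}

function readDateTime(dt: HaysonDateTime): { date: Date; tz: string } {
  const date = new Date(dt.val);
  if (isNaN(date.getTime())) {
    throw new Error(`Invalid ISO 8601 date time: ${dt.val}`);
  }
  return { date, tz: dt.tz === undefined ? "GMT" : dt.tz };
}

const read = readDateTime({
  _kind: "DateTime",
  val: "2015-06-08T15:47:41-04:00",
  tz: "New_York"
});
console.log(read.date.toISOString()); // 2015-06-08T19:47:41.000Z
console.log(read.tz); // New_York
```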

Uri

A URI object.

{
  "_kind": "Uri",
  "val": "https://j2inn.com"
}

Coord

A co-ordinate object.

{
  "_kind": "Coord",
  "lat": 51.019371,
  "lng": -0.453980
}

XStr

An XStr object.

{
  "_kind": "XStr",
  "type": "Type",
  "val": "value"
}

Example

Here's a simple grid of sites encoded using Hayson...

{
  "_kind": "Grid",
  "meta": { "ver":"3.0" },
  "cols": [ "id", "area", "dis", "geoAddr", "geoCity", "geoCoord", "geoCountry",
    "geoPostalCode", "geoState", "geoStreet", "hq", "metro", "occupiedEnd", "occupiedStart",
    "primaryFunction", "regionRef", "site", "store", "storeNum", "tz", "weatherRef",
    "yearBuilt", "mod"
  ],
  "rows": [
    {
      "id": {
        "_kind": "Ref",
        "val": "p:demo:r:25aa2abd-c365ce5b",
        "dis": "Headquarters"
      },
      "area": {
        "_kind": "Number",
        "val": 140797,
        "unit": "ft²"
      },
      "dis": "Headquarters",
      "geoAddr": "600 W Main St, Richmond, VA",
      "geoCity": "Richmond",
      "geoCoord": {
        "_kind": "Coord",
        "lat": 37.545826,
        "lng": -77.449188
      },
      "geoCountry": "US",
      "geoPostalCode": "23220",
      "geoState": "VA",
      "geoStreet": "600 W Main St",
      "hq": {
        "_kind": "Marker"
      },
      "metro": "Richmond",
      "occupiedEnd": {
        "_kind": "Time",
        "val": "18:00:00"
      },
      "occupiedStart": {
        "_kind": "Time",
        "val": "08:00:00"
      },
      "primaryFunction": "Office",
      "regionRef": {
        "_kind": "Ref",
        "val": "p:demo:r:25aa2abd-5c556aba",
        "dis": "Richmond"
      },
      "site": {
        "_kind": "Marker"
      },
      "tz": "New_York",
      "weatherRef": {
        "_kind": "Ref",
        "val": "p:demo:r:25aa2abd-a02bf086",
        "dis": "Richmond, VA"
      },
      "yearBuilt": 1999,
      "mod": {
        "_kind": "DateTime",
        "val": "2020-01-09T18:17:34.232Z",
        "tz": "UTC"
      }
    },
    {
      "id": {
        "_kind": "Ref",
        "val": "p:demo:r:25aa2abd-96516c18",
        "dis": "Short Pump"
      },
      "area": {
        "_kind": "Number",
        "val": 17122,
        "unit": "ft²"
      },
      "dis": "Short Pump",
      "geoAddr": "11282 W Broad St, Richmond, VA",
      "geoCity": "Glen Allen",
      "geoCoord": {
        "_kind": "Coord",
        "lat": 37.650338,
        "lng": -77.606105
      },
      "geoCountry": "US",
      "geoPostalCode": "23060",
      "geoState": "VA",
      "geoStreet": "11282 W Broad St",
      "metro": "Richmond",
      "occupiedEnd": {
        "_kind": "Time",
        "val": "21:00:00"
      },
      "occupiedStart": {
        "_kind": "Time",
        "val": "10:00:00"
      },
      "primaryFunction": "Retail Store",
      "regionRef": {
        "_kind": "Ref",
        "val": "p:demo:r:25aa2abd-5c556aba",
        "dis": "Richmond"
      },
      "site": {
        "_kind": "Marker"
      },
      "store": {
        "_kind": "Marker"
      },
      "storeNum": 3,
      "tz": "New_York",
      "weatherRef": {
        "_kind": "Ref",
        "val": "p:demo:r:25aa2abd-a02bf086",
        "dis": "Richmond, VA"
      },
      "yearBuilt": 1999,
      "mod": {
        "_kind": "DateTime",
        "val": "2020-01-09T18:17:34.323Z",
        "tz": "UTC"
      }
    }
  ]
}

MIME Type

Hayson has its own MIME type that can be used in HTTP requests and responses for content negotiation…

application/vnd.haystack.v1+json

For example, when an HTTP request is made with this MIME type specified, a server should respond with Hayson encoded data and this content type.

Since Zinc has been around for a while, the standard application/json MIME type should return Zinc JSON encoded data.
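
On the server side, the negotiation described above could be sketched in TypeScript as follows. This is a simplified illustration only: the function name and the "hayson" / "zinc-json" labels are hypothetical, quality values (q=…) in the Accept header are ignored, and anything other than the Hayson MIME type falls back to Zinc JSON.

```typescript
// Sketch of server-side content negotiation between the two JSON encodings.
// Names and labels are hypothetical. q-values are ignored for brevity.
const HAYSON_MIME = "application/vnd.haystack.v1+json";

function chooseJsonEncoding(acceptHeader: string | undefined): "hayson" | "zinc-json" {
  const accepted = (acceptHeader === undefined ? "" : acceptHeader)
    .split(",")
    // Drop any parameters such as ";q=0.9" and normalize the case.
    .map((part) => (part.split(";")[0] || "").trim().toLowerCase());
  return accepted.indexOf(HAYSON_MIME) >= 0 ? "hayson" : "zinc-json";
}

console.log(chooseJsonEncoding("application/vnd.haystack.v1+json")); // hayson
console.log(chooseJsonEncoding("application/json")); // zinc-json
```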

Conclusion

Using Hayson instead of Zinc for all server and client communication provides the following benefits…

  • Less of a learning curve for Developers.
  • Uses a globally accepted encoding format.
  • Simple to understand.
  • No additional client libraries required to work with the data.
  • High performance. Tests show it’s between 5.5 and 10 times faster than parsing standard Zinc encoded data in a browser.

Steve Eynon Fri 28 Feb

Hi Gareth, this reads very well as I have my own gripes with Haystack encoded JSON; please forgive me for not knowing who you are but...

  • is this a proposal for a new standard
  • or are you a Project Haystack representative and this is a new standard?

Many thanks,

Steve. (Project Haystack User / a nobody)

Gareth David Johnson Fri 28 Feb

I'm proposing this as an addition to the Haystack standard.

I originally raised this with Brian. He mentioned creating a Working Group so we could get input from across the industry.

If ratified, this would be an alternative to the existing JSON Zinc encoding and would be part of the Haystack standard.

Cheers, Gareth (Software Architect for J2/Siemens. Former Niagara Core Architect. Still considers himself a nobody).

Kevin Smith Fri 28 Feb

Good proposal. Can you provide additional data on your findings of the parsing performance tests that you have done?

Richard McElhinney Sat 29 Feb

Gareth..great work and I agree with your sentiment on developer adoption.

As an industry we have a long way to go in terms of breaking down the barriers for developer adoption and new additions to the standard like you are proposing are a large step in that direction.

So thank you for the time and effort you've taken to propose this.

Steve and Gareth...knowing each of you as I do..can I suggest this become a UK working group specialty.

Looking at Google Maps I think a meeting place in the Cotswolds would be a suitable mid-point location for you guys to get together and have a hackathon to come up with a reference implementation! ;)

andreas hennig Sat 29 Feb

YES !!!

when I implemented the ZINC-via-JSON format I rolled my eyes. This is finally a clean and pragmatic JSON.

The justification for the "_" sounds a bit rough. I am not familiar with JSON5, maybe the NaN / +infinity / -infinity problem could have been solved from there. But only if supported well in browsers and libraries.

Andreas

Gareth David Johnson Mon 2 Mar

Thank you everyone for your kind words. Richard you made me laugh out loud this morning.

In regards to Kevin's performance question, I have a TypeScript library that parses Zinc or Hayson. The Zinc decoding uses a classic recursive descent parser design. For each test, I created three different files in both formats. Obviously there's bias in the fact that I wrote the zinc parser.

  • sites.zinc (2 kb), sites.json (5 kb) - Contains the results of a simple site query.
  • defs.zinc (122 kb), defs.json (168 kb) - The defs database.
  • points.zinc (773 kb), points.json (1,298 kb) - Contains the results of a project wide query for all points.

The results are as follows...

*** Profile test read of sites.zinc: 8.57ms ***
*** Profile test read of sites.json: 0.865999ms ***
*** Profile test read of defs.zinc: 30.534799ms ***
*** Profile test read of defs.json: 7.4754ms ***
*** Profile test read of points.zinc: 79.7598ms ***
*** Profile test read of points.json: 58.1584ms ***

I ran the tests using NodeJS version 12.6.1. Chrome has similar results.

I believe the performance differences are down to the native JSON parsing these environments have.
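
To put those timings in relative terms, the ratio of Zinc read time to Hayson read time can be worked out per file. The TypeScript sketch below does only that arithmetic on the figures listed above; the variable and function names are made up for the example.

```typescript
// Timings copied from the results above (milliseconds).
const timings = [
  { file: "sites", zincMs: 8.57, jsonMs: 0.865999 },
  { file: "defs", zincMs: 30.534799, jsonMs: 7.4754 },
  { file: "points", zincMs: 79.7598, jsonMs: 58.1584 }
];

// How many times faster the Hayson read was than the Zinc read.
function speedup(zincMs: number, jsonMs: number): number {
  return zincMs / jsonMs;
}

for (const t of timings) {
  console.log(`${t.file}: ${speedup(t.zincMs, t.jsonMs).toFixed(1)}x`);
}
// sites: 9.9x
// defs: 4.1x
// points: 1.4x
```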

Now you'll notice the Zinc files aren't as big. However if we gzip these files (like most web servers do)...

  • sites.zinc.gz (1 kb), sites.json.gz (1 kb)
  • defs.zinc.gz (32 kb), defs.json.gz (34 kb)
  • points.zinc.gz (64 kb), points.json.gz (87 kb)

When they're gzipped, I would argue there isn't really too much difference.

BTW, I'd appreciate it if people could also join this Working Group to give it a little bit more momentum in the community. Thanks!

Gareth David Johnson Mon 2 Mar

In regards to the underscore for kind. I use an underscore because it can never be the first letter of a tag in Haystack. Therefore a tag can't overwrite it accidentally.

Since a dict is a very commonly used type, it's the only object that doesn't require a _kind field. This has the added advantage of saving space when using it in a grid (which is basically an array of dicts++). It also means that any JSON object is a dict which is very powerful.
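
The reasoning about the underscore can be checked with a small TypeScript sketch. The regular expression below encodes an assumed tag naming rule (first character a lower case ASCII letter, followed by ASCII letters, digits or underscores), so treat it as an illustration rather than the normative definition.

```typescript
// Assumed tag name rule for this sketch: the first character is a lower case
// ASCII letter, the rest are ASCII letters, digits or underscores.
const TAG_NAME = /^[a-z][a-zA-Z0-9_]*$/;

function isTagName(name: string): boolean {
  return TAG_NAME.test(name);
}

console.log(isTagName("siteRef")); // true
console.log(isTagName("_kind")); // false, so a tag can never overwrite it
```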

Gareth David Johnson Mon 2 Mar

A few other interesting things we can do with Hayson.

  • OpenAPI: we can have an OpenAPI document that can be used to provide support for Haystack based web services. The OpenAPI document provides documentation and data validation (since it can be used as JSON schema). This is exceptionally powerful. I can create this document and open source it accordingly. The tools for this are great.
  • YAML: I don't normally code JSON documents. I use YAML. YAML has everything Trio has (comments, multi-line string support) and more including great tool support. Here's my original example in YAML...
_kind: Grid
meta: {}
cols:
- name: id
- name: area
- name: dis
- name: geoAddr
- name: geoCity
- name: geoCoord
- name: geoCountry
- name: geoPostalCode
- name: geoState
- name: geoStreet
- name: hq
- name: metro
- name: occupiedEnd
- name: occupiedStart
- name: primaryFunction
- name: regionRef
- name: site
- name: store
- name: storeNum
- name: tz
- name: weatherRef
- name: yearBuilt
- name: mod
rows:
- id:
    _kind: Ref
    val: p:demo:r:25aa2abd-c365ce5b
    dis: Headquarters
  area:
    _kind: Number
    val: 140797
    unit: ft²
  dis: Headquarters
  geoAddr: 600 W Main St, Richmond, VA
  geoCity: Richmond
  geoCoord:
    _kind: Coord
    lat: 37.545826
    lng: -77.449188
  geoCountry: US
  geoPostalCode: '23220'
  geoState: VA
  geoStreet: 600 W Main St
  hq:
    _kind: Marker
  metro: Richmond
  occupiedEnd:
    _kind: Time
    val: '18:00:00'
  occupiedStart:
    _kind: Time
    val: '08:00:00'
  primaryFunction: Office
  regionRef:
    _kind: Ref
    val: p:demo:r:25aa2abd-5c556aba
    dis: Richmond
  site:
    _kind: Marker
  tz: New_York
  weatherRef:
    _kind: Ref
    val: p:demo:r:25aa2abd-a02bf086
    dis: Richmond, VA
  yearBuilt: 1999
  mod:
    _kind: DateTime
    val: '2020-01-09T18:17:34.232Z'
    tz: UTC
- id:
    _kind: Ref
    val: p:demo:r:25aa2abd-96516c18
    dis: Short Pump
  area:
    _kind: Number
    val: 17122
    unit: ft²
  dis: Short Pump
  geoAddr: 11282 W Broad St, Richmond, VA
  geoCity: Glen Allen
  geoCoord:
    _kind: Coord
    lat: 37.650338
    lng: -77.606105
  geoCountry: US
  geoPostalCode: '23060'
  geoState: VA
  geoStreet: 11282 W Broad St
  metro: Richmond
  occupiedEnd:
    _kind: Time
    val: '21:00:00'
  occupiedStart:
    _kind: Time
    val: '10:00:00'
  primaryFunction: Retail Store
  regionRef:
    _kind: Ref
    val: p:demo:r:25aa2abd-5c556aba
    dis: Richmond
  site:
    _kind: Marker
  store:
    _kind: Marker
  storeNum: 3
  tz: New_York
  weatherRef:
    _kind: Ref
    val: p:demo:r:25aa2abd-a02bf086
    dis: Richmond, VA
  yearBuilt: 1999
  mod:
    _kind: DateTime
    val: '2020-01-09T18:17:34.323Z'
    tz: UTC

Samuel Toh Tue 3 Mar

Hi, interesting work there. Always good to have options on the table.

Any chance you could share with the community the use case that led you to Hayson?

Also, in your problem statement why are we comparing zinc with Hayson? I thought we should be looking at comparing the current zinc like JSON format vs Hayson?

I'm not sure how Hayson is an extension to Haystack's JSON format for now. Based on your proposal I think it will break existing JSON implementations, as the current one requires the data type prefix to be in the data.

I think there are benefits to the existing JSON implementation even though the prefix can be seen as redundant at times. For instance fooBar vs s:fooBar. However, there are advantages here as well. Code can quickly identify what a piece of data is based on its prefix, without blindly treating it as a string. E.g. dates and geo-coordinates.

I believe there are alternatives to boost the performance of the current JSON implementation. E.g. stripping away the whitespaces. If concern is due to bloated JSON payload, you can also look into HTTP compression options.

Gareth David Johnson Tue 3 Mar

My initial primary use case for Hayson was dealing with Haystack data when building web applications using React and Mobx as well as working with Haystack data in AWS. I've created a TypeScript library that parses Zinc, handles filters etc but even so I think there's a better way of handling this in general that doesn't require a client library as a dependency.

As stated, Hayson is an alternative JSON format. It has its own MIME Type so nothing will be broken.

The Zinc JSON encoding still has a lot of the disadvantages of Zinc encoding, whereby a client library is required to make sense of the data. There's still a performance hit when parsing it regardless of white space. Integration with other libraries that expect already decoded JSON data is still an issue. Overall I think Hayson makes Haystack data more approachable for new developers outside of the Haystack community and provides better integration with modern tools and web services.

Samuel Toh Wed 4 Mar

Hayson is an alternative JSON format. It has its own MIME Type so nothing will be broken.

Sounds like if this is adopted then we may have to maintain another standard. Looking at the bright side, I think you are saying JSON.parse(...) would be good enough for JavaScript fans. Have you looked into C# and Java? Do they have an equivalent that will just work?

There's still a performance hit.

I think typically for web apps the JSON we deal with won't be too huge in size (like 100MB), because even if we can parse it, a modern browser can have issues rendering it. So if the JSON is small, the performance benefit can be negligible.

In general, I think Hayson would be OK if we did not have the current JSON standard yet. One thing I kind of dislike is the _kind key. It looks like not every {} object comes with it. For the ones that do, Hayson seems to have gone from a simple r: prefix to a JavaScript object. The good thing about this kind of practice is that we can store a lot more metadata about the tag, which would be impossible to encode into a string.

Steve Eynon Thu 5 Mar

For the benefit of others on the forum, here is a link to the existing Project Haystack specification for Zinc encoded JSON for comparison to this proposal.

The big take-away I see from Gareth's proposal is that once the JSON has been parsed into an object graph, all the string literals can be used as are. Unlike the current specification where the string values need to be parsed and de-constructed again to make sense of them.

It's also nice that numbers are stored as, well, numbers!

True, the new proposal is a lot more verbose than the current one, but the extra meta is mostly constants, so it gzips very nicely and the string values are easily interned.

I don't mind the name _kind; it is common in many tech arenas to prefix system names with an underscore. For example, MongoDB do it with their _id primary key and I occasionally follow suit in code. I see the current one char prefix as a little cryptic so welcome the explicitness of a meta value.

The current standard is really only concerned with transporting Grids as an alternative to Zinc and (as far I can tell) it's nearly impossible to spot nested grids. The new syntax makes it easy to spot any data-type at any nested level.

In all to me, it looks like a much cleaner specification than the current. The big question is then, is it worth maintaining two standards? To that end, it would be really useful to know how many people actively use / have implemented the current JSON standard?

Two points I'd like to make:

1. Infinite numeric values could be serialised like this:

{
  "_kind" : "Num",
  "val"   : "Inf"
}

It would mean parsers need to inspect the type of val (is it a number or is it a string?) but I think it is more acceptable than introducing a new explicit tag.

To address andreas hennig, because JSON5 is not natively implemented, and this format is largely for computers not humans, I would stick to pure JSON formats.

2. The format of the cols tag looks very superfluous. I understand the need to state columns before we start adding rows, but could there not be a simple rule that states that if a col element is a string, it is taken to be the name?

"cols": [ "id", "dis", "area", "geoAddr", "geoCoord", ... ]

Meta could still be used as usual, or mixed amongst the list:

"cols": [
  "id",
  "dis",
  {
    "name" : "area",
    "meta" : { "size" : 123 }
  },
  "geoAddr", "geoCoord", ... ]

In all I quite like this new Hayson / JZON (JSON Zinc Object Notation?) idea, so to recap my thoughts:

  1. Who uses the current Project Haystack JSON standard?
  2. Infinite numeric values (and other unseen edge cases) still need to be fleshed out
  3. A simple rule would allow Grid cols to be more succinct
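
The rule in point 3 is cheap for a client to apply. As a TypeScript sketch (the interface and function names are hypothetical), a mixed cols list can be expanded back into the long form like this:

```typescript
// Sketch: expand the succinct cols form (plain strings) into the long form.
interface GridColumn {
  name: string;
  meta?: { [key: string]: unknown };
}

function normalizeCols(cols: Array<string | GridColumn>): GridColumn[] {
  // A string element is taken to be the column name.
  return cols.map((col) => (typeof col === "string" ? { name: col } : col));
}

console.log(
  normalizeCols(["id", "dis", { name: "area", meta: { size: 123 } }])
);
```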

Cheers,

Steve.

P.S. Gareth, I note your examples are all Javascript objects not JSON.

Josiah Johnston Thu 5 Mar

I quite like this concept.

Could _kind move from cell values into the column definitions?

Similarly, could units move into the column definitions?

I can't see any technical problems with moving _kind. units could present problems if different rows needed to use different units, which I don't expect to happen often in practice.

Best regards,

Josiah

Jason Briggs Thu 5 Mar

_kind couldn't go in the columns because each row could have a different kind. Also it happens all the time that units are different per point.

Let's say I hit the read op, and got back all number points. Each point could have a different unit.

Gareth David Johnson Fri 6 Mar

Thanks for the feedback.

Steve...

I agree with your suggestions regarding infinity and columns. In practice, I imagine meta will be defined a lot of the time due to localisation of a column's display name. I also like the idea for handling infinity. I've also made the grid's meta optional if there's nothing in it.

I've also updated the code to all be JSON and not JavaScript. I had that highlighted in my original document but didn't add it to this post.

I've edited my original post to reflect these improvements.

Josiah Johnston Tue 10 Mar

Jason Briggs, What are some examples of different rows having different kinds?

Jason Briggs Tue 10 Mar

A point could be a Boolean or Number as an example. Here are the first 10 points I queried; you can see that they have different kinds and units on each row.

curVal:false
kind:Bool
navName:Cool-2
---
curVal:25.311142478477333%
kind:Number
navName:OutsideDamper
unit:"%"
---
curVal:825.0391835106967kW
kind:Number
navName:kW
unit:kW
---
curVal:69.54355994698683°F
kind:Number
navName:ZoneTemp
unit:°F
---
curVal:12.701860966837174gal
kind:Number
navName:Consumption
unit:gal
---
curVal:63.25752247497029°F
kind:Number
navName:ReturnTemp
unit:°F
---
curVal:13.56478516137203%
kind:Number
navName:Heat
unit:"%"
---
curVal:false
kind:Bool
navName:Occupancy
---
curVal:770.2421831230303kWh
kind:Number
navName:kWh
unit:kWh
---
curVal:64°F
kind:Number
navName:Temp
unit:°F

Josiah Johnston Wed 11 Mar

Thanks Jason, very helpful!

I'd mostly been thinking of "tidy format" with one point per column rather than 1 point per row after looking at an example exported from SkySpark. Good to know that data can have this shape as well.

Normally, I'd work through a few diverse examples to validate a design proposal like Hayson, but I haven't found illustrative Haystack examples.

Verification of this proposal looks good since it's isomorphic to established formats.

-Josiah

Stuart Longland Fri 3 Apr

I had a read through this (thanks to Christian Tremblay for pointing this thread out… I find this forum isn't the most easy to track so I easily can miss things like this).

Definitely a step in the right direction in my opinion. The concept of using objects to represent the more complex types (e.g. "numbers with units"… what I'd call a "quantity") is a big improvement over the "just stuff it in a string" approach that we see with ZINC/JSON (that is, our existing "JSON" grid format).

ZINC I think was different enough to standard CSV that it stymied the idea of using an off-the-shelf CSV parser from the start: the format was "different" enough that a standard CSV parser would trip up. On top of this, you were then still parsing strings in whatever language you were working with. I tried two different ways in Python, originally using parsimonious (which is a PEG parser) and the other using pyparsing, and neither is particularly swift.

ZINC/JSON did improve this, in that it used JSON as its base. So things like objects and arrays were parsed out for you, using code that has literally seen decades of development (the first JSON-compatible parsing code would have existed in Netscape Navigator from the mid 90s). It absolutely makes sense to throw the bulk of the parsing effort at this.

I particularly welcome the idea of using plain JSON types for simple values (e.g. booleans, plain numbers, and strings) with JSON objects used to represent more complex types ("quantities", Refs, etc…).

The use of a different Content-Type is a good one. Legacy nHaystack used its own JSON format, but still described it as application/json like the current standard. Coupled with the requirement to be authenticated (which differed between Haystack implementations -- and still does) you basically had to know up-front what system you were going to talk to. Today, if we see application/vnd.haystack.v1+json in request or response headers, I as a server or client can know _exactly_ what format is being used.

Some have raised the concern about the payload sizes. The good news with this format is that it actually translates across to binary formats like CBOR rather well, so IoT devices that otherwise couldn't handle JSON easily (this seems to be my day job lately) can potentially directly communicate with a Haystack-like API.

One thing about both the JSON proposals is they use an object to represent the rows. This, coupled with the cols array, carries a bit of redundant information. The following two JSON objects could actually convey the same information:

Proposed format:

{
  "_kind": "Grid",
  "cols": ["id", "dis", "area"],
  "rows": [
    { "id": 1, "dis": "Hall", "area": 20 },
    { "id": 2, "dis": "Bedroom", "area": 10 },
    { "id": 3, "dis": "Patio" }
  ]
}

Compact format:

{
  "_kind": "Grid",
  "cols": ["id", "dis", "area"],
  "rows": [
    [1, "Hall", 20],
    [2, "Bedroom", 10],
    [3, "Patio", null]
  ]
}

The latter would make a very compact object in CBOR (64 bytes):

00000000  a3 65 5f 6b 69 6e 64 64  47 72 69 64 64 63 6f 6c  |.e_kinddGriddcol|
00000010  73 83 62 69 64 63 64 69  73 64 61 72 65 61 64 72  |s.bidcdisdareadr|
00000020  6f 77 73 83 83 01 64 48  61 6c 6c 14 83 02 67 42  |ows...dHall...gB|
00000030  65 64 72 6f 6f 6d 0a 83  03 65 50 61 74 69 6f f6  |edroom...ePatio.|
00000040

In CBOR, I'd probably change _kind to maybe K… since tags cannot start with a capital letter, and in that context you _really_ do want things to be as short as possible. I'd almost suggest doing the same with cols and rows too; capitalise those (and shorten to first letter if using CBOR).

There's sufficient detail in cols to reconstruct which element of a row array belongs with which tag in a dict, and does not reduce readability much more than what using ZINC would. We're a little bit more verbose, but hopefully the overheads would be small.
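
As a sketch of that reconstruction in TypeScript (the function name is hypothetical, and it assumes cols is a plain list of names as in the compact example), each row array can be zipped with cols to get the dict form back. A null cell is treated as an absent tag, matching the "Patio" row in the two examples.

```typescript
// Sketch: rebuild row dicts from the compact array form using the cols list.
type Cell = unknown;

function rowsToDicts(cols: string[], rows: Cell[][]): Array<{ [name: string]: Cell }> {
  return rows.map((row) => {
    const dict: { [name: string]: Cell } = {};
    cols.forEach((name, index) => {
      const cell = row[index];
      // A null (or missing) cell means the tag is absent from the dict.
      if (cell !== null && cell !== undefined) {
        dict[name] = cell;
      }
    });
    return dict;
  });
}

const compact = {
  _kind: "Grid",
  cols: ["id", "dis", "area"],
  rows: [
    [1, "Hall", 20],
    [2, "Bedroom", 10],
    [3, "Patio", null]
  ]
};
console.log(rowsToDicts(compact.cols, compact.rows));
```

One consequence of this choice is that a compact grid cannot distinguish an explicit null value from a missing tag.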

As I say, I think by far the bigger overhead felt is the processing time parsing and dumping these objects though, and this should make a big difference. I say, definitely a step in the right direction.

Chris Breederveld Wed 8 Apr

I like this format so much, as a new API to be used by a third party I am using this (currently generated by my own code) as the preliminary data format.

As I was looking for a nice example of the logic discussed so far and did not find one, here is a live example of one of our reports in the new format (don't mind the Dutch values). It does not showcase everything though:

{
  "_kind": "Grid",
  "meta": {},
  "cols": [
    {
      "name": "ts",
      "meta": {}
    },
    {
      "name": "degreeDays",
      "meta": {
        "dis": "Graaddagen koeling",
        "disAxis": "Graaddagen",
        "referenceData": {
          "_kind": "Marker"
        }
      }
    },
    {
      "name": "generation",
      "meta": {
        "dis": "Opwekking",
        "unit": "kWh",
        "disAxis": "Elektriciteitsgebruik"
      }
    },
    {
      "name": "generation_perc",
      "meta": {
        "unit": "%",
        "tooltipOnly": {
          "_kind": "Marker"
        }
      }
    },
    {
      "name": "purchased",
      "meta": {
        "dis": "Inkoop",
        "unit": "kWh",
        "disAxis": "Elektriciteitsgebruik"
      }
    },
    {
      "name": "purchased_perc",
      "meta": {
        "unit": "%",
        "tooltipOnly": {
          "_kind": "Marker"
        }
      }
    },
    {
      "name": "total",
      "meta": {
        "dis": "Totaal",
        "unit": "kWh",
        "disAxis": "Elektriciteitsgebruik",
        "tooltipOnly": {
          "_kind": "Marker"
        }
      }
    },
    {
      "name": "total_perc",
      "meta": {
        "unit": "%",
        "tooltipOnly": {
          "_kind": "Marker"
        }
      }
    },
    {
      "name": "surplus",
      "meta": {
        "dis": "Teruglevering",
        "unit": "kWh",
        "disAxis": "Elektriciteitsgebruik",
        "hideFromTooltip": {
          "_kind": "Marker"
        }
      }
    },
    {
      "name": "surplus_tooltip",
      "meta": {
        "unit": "kWh",
        "tooltipOnly": {
          "_kind": "Marker"
        }
      }
    },
    {
      "name": "surplus_perc",
      "meta": {
        "unit": "%",
        "tooltipOnly": {
          "_kind": "Marker"
        }
      }
    }
  ],
  "rows": [
    [
      {
        "_kind": "Date",
        "val": "2020-02-02"
      },
      {
        "_kind": "Num",
        "val": 0.0,
        "unit": "°daysC"
      },
      0.0,
      {
        "_kind": "Num",
        "val": 0.0,
        "unit": "%"
      },
      {
        "_kind": "Num",
        "val": 12119.498300075531,
        "unit": "kWh"
      },
      {
        "_kind": "Num",
        "val": 100.0,
        "unit": "%"
      },
      {
        "_kind": "Num",
        "val": 12119.498300075531,
        "unit": "kWh"
      },
      {
        "_kind": "Num",
        "val": 100.0,
        "unit": "%"
      },
      null,
      null,
      null
    ],
    [
      {
        "_kind": "Date",
        "val": "2020-02-09"
      },
      {
        "_kind": "Num",
        "val": 0.0,
        "unit": "°daysC"
      },
      0.0,
      {
        "_kind": "Num",
        "val": 0.0,
        "unit": "%"
      },
      {
        "_kind": "Num",
        "val": 12128.10887336731,
        "unit": "kWh"
      },
      {
        "_kind": "Num",
        "val": 100.0,
        "unit": "%"
      },
      {
        "_kind": "Num",
        "val": 12128.10887336731,
        "unit": "kWh"
      },
      {
        "_kind": "Num",
        "val": 100.0,
        "unit": "%"
      },
      null,
      null,
      null
    ],
    [
      {
        "_kind": "Date",
        "val": "2020-03-22"
      },
      {
        "_kind": "Num",
        "val": 0.0,
        "unit": "°daysC"
      },
      0.0,
      {
        "_kind": "Num",
        "val": 0.0,
        "unit": "%"
      },
      {
        "_kind": "Num",
        "val": 10476.8685836792,
        "unit": "kWh"
      },
      {
        "_kind": "Num",
        "val": 100.0,
        "unit": "%"
      },
      {
        "_kind": "Num",
        "val": 10476.8685836792,
        "unit": "kWh"
      },
      {
        "_kind": "Num",
        "val": 100.0,
        "unit": "%"
      },
      null,
      null,
      null
    ]
  ]
}

Gareth David Johnson Wed 22 Apr

Thanks for the feedback. With regard to some of the comments...

I think using Kind instead of _kind is a sensible suggestion. The advantage of the underscore is it stands out as something that's not a tag in a dict.

Indeed the rows are an array of objects instead of an array of arrays. There's good reason for this when it comes to working with these data structures in code...

  • I find working with an array of objects is easier. One doesn't have to reference the column name or even use a client library to find some data.
  • The data structure isn't flooded with null values when multiple records don't contain a particular column value - something that I see quite a lot.
  • Conceptually I prefer to think of a grid as an array of objects with some extra metadata. The columns are merely a set of all the keys of the row objects.
  • In a grid, each row is a dict. A dict is defined as an object with no kind specified. Hence it works out nicely.
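
A minimal sketch of the points above, in Python. The grid contents and the helper name column_names are invented for illustration and are not part of the proposal:

```python
# Hypothetical grid using rows-as-objects; the values are illustrative only.
grid = {
    "_kind": "Grid",
    "rows": [
        {"dis": "RTU-1", "installed": {"_kind": "Date", "val": "2005-06-01"}},
        {"dis": "RTU-2", "power": {"_kind": "Num", "val": 12, "unit": "kW"}},
    ],
}


def column_names(rows):
    """Return every key used by any row, in first-seen order."""
    names = {}  # dicts keep insertion order, so this acts as an ordered set
    for row in rows:
        for key in row:
            names.setdefault(key, None)
    return list(names)


# Cells are looked up by name; a missing column is simply absent, not null.
first_dis = grid["rows"][0]["dis"]
second_installed = grid["rows"][1].get("installed")
cols = column_names(grid["rows"])
```

No client library is involved here: the rows are plain dictionaries, and the column list can be rebuilt from the row keys whenever it is needed.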

Gareth David Johnson Wed 22 Apr

Chris Breederveld, you don't need to specify null if a column doesn't exist in a row. You can just leave it out :).

Chris Breederveld Thu 23 Apr

I think currently there are two broadly suggested options:

  1. Have col names and values in each row (dict)
  2. Have only values in each row (array)

Your suggestion opts for option 1 (and then I wouldn't need to add the nulls), but my example used option 2 (and then I do need to add the nulls).

Personally I prefer option 2: it is much less verbose and results in a smaller payload, even though you might have repeating nulls, and it does not add much work in processing the data. I think this is something we might want to put to a vote?
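
To make the two options concrete, here is a small Python sketch that turns option 2 rows into option 1 rows. The helper name rows_to_dicts and the sample values are assumptions made for this example only:

```python
def rows_to_dicts(cols, rows):
    """Convert array rows (option 2) to dict rows (option 1), dropping nulls."""
    names = [col["name"] for col in cols]
    return [
        {name: cell for name, cell in zip(names, row) if cell is not None}
        for row in rows
    ]


# Illustrative data: three columns, the last one empty in this row.
cols = [{"name": "ts"}, {"name": "purchased_perc"}, {"name": "surplus"}]
array_rows = [
    [
        {"_kind": "Date", "val": "2020-02-02"},
        {"_kind": "Num", "val": 100.0, "unit": "%"},
        None,
    ],
]
dict_rows = rows_to_dicts(cols, array_rows)
```

With option 2 the position in the array carries the column name, so every row needs one entry per column. With option 1 the name travels with the value and the empty surplus cell disappears from the row.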

@Everyone: How do you feel about the options presented? Would you prefer option 1 (dict) or 2 (array)? Please provide your arguments :-)

Steve Eynon Thu 23 Apr

I'm all for using _kind over Kind as using _kind should result in far fewer errors and typos.

The Haystack standard is based on camelCase so one would expect HSON to be also, so rules to the contrary are easily overlooked. To the casual reader, seeing a few property names of Kind interspersed with the rest of the data will simply look like typos. And then guaranteed, some good natured newbie will go in and manually groom the data to correct the errors! In code too, every HSON parser / writer will need a comment to say:

// The key 'Kind' needs to start with a capital K
// DO NOT CHANGE! (see spec for details why)

Using _kind on the other hand shows intention. That leading underscore does not get typed by accident so readers immediately know it is different by design. Keeping the camelCase complements Project Haystack and gives a natural fit with the rest of the data. Because it looks so different, users will generally be more accepting of it and the reasoning won't have to be explained as often.

Also, as mentioned, a leading underscore for system properties already has precedent in other data formats.

Steve Eynon Thu 23 Apr

On the topic of rows being represented as Objects vs Arrays, I personally prefer objects, for pretty much all the reasons Gareth gave - which, to reiterate a couple, are:

  • With arrays you need to know the magic index number to reference the geoStreet column, which may change in any given HSON document. With objects, it's just row["geoStreet"] - always.
  • The underlying Grid data structure that HSON represents is essentially a List of Dicts (not a List of Lists) so why not represent it as such?

The other plus for objects is self contained data. Looking at this item in a row...

{
  "totalSurplus" : {
    "_kind": "Num",
    "val": 1228,
    "unit": "kWh"
  }
}

...tells me a lot more than just:

{
  "_kind": "Num",
  "val": 1228,
  "unit": "kWh"
}

And having an array of nulls looks like Zinc to me. Which is fine, but then why not just use Zinc?

(Of course, there is also the option to have both objects AND arrays! The parser could simply detect the type of each row value and parse it accordingly. But that then complicates the parser / writer and bloats the standard. So probably best if we stick to just the one.)

I don't see the payload size of HSON being an important issue. If a few hundred extra bytes here and there are so important, then why would you even consider using JSON? JSON is terribly bloated; I mean, it's a string format to start with, and then it has repeated key names, excessive use of double quotes, numbers written out in base 10 ASCII, etc. That's why we gzip these suckers!

What JSON does give us though, being a verbose textual format, is understandable and human-readable data; so I believe it is these traits that HSON should embrace.
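
One way to check the size argument on your own data is a quick measurement like the Python sketch below. The column names and values are made up, and the numbers will vary with the data, so no particular result is claimed here:

```python
import gzip
import json

# Hypothetical column names; rows hold illustrative values only.
names = ["ts", "purchased", "purchased_perc", "total"]


def make_row(i):
    return [
        {"_kind": "Date", "val": f"2020-02-{i % 28 + 1:02d}"},
        {"_kind": "Num", "val": 1000.0 + i, "unit": "kWh"},
        {"_kind": "Num", "val": 100.0, "unit": "%"},
        {"_kind": "Num", "val": 1000.0 + i, "unit": "kWh"},
    ]


array_rows = [make_row(i) for i in range(500)]             # rows as arrays
dict_rows = [dict(zip(names, row)) for row in array_rows]  # rows as objects


def sizes(rows):
    """Return (raw bytes, gzipped bytes) for the rows encoded as JSON."""
    raw = json.dumps({"rows": rows}).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


array_raw, array_gz = sizes(array_rows)
dict_raw, dict_gz = sizes(dict_rows)
print(f"arrays: {array_raw} raw, {array_gz} gzipped")
print(f"dicts:  {dict_raw} raw, {dict_gz} gzipped")
```

Repeated key names are exactly the kind of redundancy gzip is built to squeeze out, which is why comparing the gzipped sizes is more meaningful than comparing the raw ones.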

Brian Frank Thu 23 Apr

How do you feel about the options presented? Would you prefer option 1 (dict) or 2 (array)? Please provide your arguments :-)

Without a doubt I think it should be option 1 as a dict. You work with rows as Dicts and want an easy way in code to get cells with a name. It would be very awkward to look up cells with an index.

Three other ideas I'd like to add to the discussion...

First, how do people feel about having two different JSON formats? I'm inclined to say that if the majority of people prefer this new JSON format, then we should deprecate the old format and plan on this new format to replace it completely after X years.

Second, I'm interested to hear what people think about using an underbar prefix for all Hayson keys to make a clear distinction between keys related to the encoding and keys related to the data:

// current proposal
{"_kind":"Number", "val":123, "unit":"kWh"}
{"_kind":"Date", "val":"2020-04-23"}

// encoding keys all use underbar prefix
{"_kind":"Number", "_val":123, "_unit":"kWh"}
{"_kind":"Date", "_val":"2020-04-23"}

The format is a little more ugly, but it's crystal clear which are data keys vs encoding keys:

{"_kind":"Dict", "val":123, "unit":"kWh"}
{"_kind":"Number", "_val":123, "_unit":"kWh"}

Third, this might be an opportunity to consider whether we design the JSON format for streaming. The current Zinc and JSON formats require columns to be pre-computed, which means you have to know all your rows and their tags ahead of time before serialization starts. If we relax the requirement that you even have a "cols" list, or allow you to put it at the end, then you can start writing rows lazily in streaming fashion. Of course it puts more burden on the client because they might have to buffer the rows if they want to optimize internal data structures. But for JSON you are almost always going to be using a built-in library to parse it anyways.
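
A rough Python sketch of what such a streaming writer could look like. The function name write_grid_streaming and the layout with "cols" trailing the rows are assumptions used to illustrate the idea, not an agreed format, and grid meta such as ver is left out for brevity:

```python
import io
import json


def write_grid_streaming(out, rows):
    """Write rows as they arrive; emit "cols" last, once all keys are known."""
    seen = {}  # insertion-ordered set of column names
    out.write('{"_kind":"Grid","rows":[')
    for i, row in enumerate(rows):
        if i:
            out.write(",")
        out.write(json.dumps(row))
        for name in row:
            seen.setdefault(name, None)
    cols = [{"name": name} for name in seen]
    out.write('],"cols":' + json.dumps(cols) + "}")


def lazy_rows():
    # Stand-in for a source that produces rows one at a time.
    yield {"dis": "RTU-1"}
    yield {"dis": "RTU-2", "equip": {"_kind": "Marker"}}


buf = io.StringIO()
write_grid_streaming(buf, lazy_rows())
grid = json.loads(buf.getvalue())
```

The writer never holds more than one row in memory. A reader that uses a whole-document JSON parser is unaffected by the key order, while a streaming reader would only learn the columns after the last row.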

Steve Eynon Thu 30 Apr

we should deprecate the old format and plan on this new format to replace it

Yes, I'd be happy to just have the one format - makes life easier!

what people think about using underbar prefix for all Hayson keys

I really liked your thinking on this... but on closer inspection I'm not too keen. As Grids are of Haystack scalar values, most keys in the JSON would then have the underscore prefix:

"_kind" : "Grid",
"_rows" : [
  {
    "weatherRef": {
      "_kind": "Ref",
      "_val" : "p:demo:r:25aa2abd-a02bf086",
      "_dis" : "Richmond, VA"
    },
    "mod": {
      "_kind": "DateTime",
      "_val" : "2020-01-09T18:17:34.323Z",
      "_tz"  : "UTC"
    },
    "geoCoord": {
      "_kind": "Coord",
      "_lat" : 37.650338,
      "_lng" : -77.606105
    },
    ...

When writing HSON (by hand) I'll be thinking things like, "I have a dis tag, but is it a Haystack dis tag or a HSON _dis tag...?".

And when parsing I'd then be tempted to check if any keys have an underscore prefix as opposed to just checking for the existence of _kind.

Although my opinion on this is not particularly strong.
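
For comparison, here is a minimal Python sketch of a decoder for the current proposal (only _kind is prefixed) that dispatches purely on the existence of _kind. The function name decode and the tuple shapes it returns are arbitrary choices for this example:

```python
def decode(value):
    """Decode a parsed-JSON value by looking only at the `_kind` key."""
    if isinstance(value, list):
        return [decode(item) for item in value]
    if not isinstance(value, dict):
        return value  # plain JSON string, number, bool or null
    kind = value.get("_kind")
    if kind is None or kind == "Dict":
        # No `_kind` means a Dict: every other key is a data tag.
        return {k: decode(v) for k, v in value.items() if k != "_kind"}
    if kind == "Num":
        return (value["val"], value.get("unit"))
    if kind == "Ref":
        return ("ref", value["val"], value.get("dis"))
    # Other kinds are passed through untouched in this sketch.
    return value


row = {
    "dis": "Site",
    "energy": {"_kind": "Num", "val": 123, "unit": "kWh"},
    "weatherRef": {
        "_kind": "Ref",
        "val": "p:demo:r:25aa2abd-a02bf086",
        "dis": "Richmond, VA",
    },
}
decoded = decode(row)
```

The dis inside the Ref and the dis tag on the row never collide, because the decoder only treats val, unit and dis as encoding keys once it has seen a _kind next to them.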

if we design the JSON format for streaming

In my experience you either optimise to read streams or you optimise to write streams - it sounds like you're proposing we optimise HSON for writing streams.

I'm not a heavy user of Haystack data so I don't know of any use cases where it'd be beneficial to suddenly append extra columns mid-way through writing a data grid.

Column definitions up front is the expected format, so I'd be inclined to keep it that way unless there's a burning use-case for anything different.

Gareth David Johnson Wed 5 Aug

Today I gave a webcast on the new Haystack JSON encoding standard. Here's a link to the video...

https://drive.google.com/file/d/1_1KloXtNe7I2WZr4Ag_l_HYC-zagXRNA/view?usp=sharing

Here's a link to the presentation...

https://drive.google.com/file/d/1MuYAgRUUbs16Z38rhfw5NZQOks5zda7_/view?usp=sharing

I recommend downloading the presentation and opening it in PowerPoint instead of using the Google Slides viewer.

The feedback from today's presentation is also listed on the last slide of the presentation...

  • Kind to match defs. For instance, ‘Num’ becomes ‘number’ to match https://project-haystack.dev/doc/lib-ph/number.
  • Kind is optional for Dict.
  • JSON objects could technically have non-valid tag names. This should be noted in the docs.
  • ver is required in a grid.
  • Keep objects for cols. Not an array of strings.
  • Use V4 in MIME type for Hayson or application/json. Use V3 for old JSON standard. This will eventually be removed.
  • Post consensus to Working Group for a vote.
  • Gareth to write up Hayson proposal as a markdown document. To be made available in github.

Gareth David Johnson Wed 5 Aug

There is one topic I need to post to the wider community to get some feedback on.

Should Hayson replace the existing Haystack JSON encoding? Or should Haystack support both the older and newer encoding?

Replacing the encoding will mean that HTTP requests made with the MIME type of application/json will return Hayson instead of the old encoding.

If we support both then we could deprecate the old encoding. The old encoding could use this MIME type application/vnd.haystack.v3+json.

The new MIME type would use application/json or application/vnd.haystack.v4+json. This would of course still break some clients.

After today's presentation, Brian, Steve and I would like to outright replace the JSON encoding with Hayson. It keeps it simple and removes legacy.

Does everyone agree to outright replace it? Or should we support both?

Chris Breederveld Thu 6 Aug

Hi Gareth,

Personally I do not use the JSON format, but would like to use the new format.

However I'm usually hesitant to just switch over, breaking existing systems. What do you think about using the application/vnd.haystack.v3+json and application/vnd.haystack.v4+json and gracefully deprecate application/json by using the old format for it, but making it obsolete? Perhaps later (in several months) replacing it with the v4 format.

Gareth David Johnson Tue 1 Sep

Chris, I think that's a very reasonable approach that stops systems from breaking. I agree with a more guarded approach for now. We can always re-evaluate this and do a hard switch at a later time.
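
As a sketch of how a server might apply this guarded approach, the Python below picks an encoding from an Accept header. The function name pick_encoding and the plain_json_means switch are hypothetical, and q-value ordering is deliberately ignored to keep the example short:

```python
V3_JSON = "application/vnd.haystack.v3+json"  # old JSON encoding
V4_JSON = "application/vnd.haystack.v4+json"  # Hayson
PLAIN_JSON = "application/json"


def pick_encoding(accept, plain_json_means="v3"):
    """Return "v3", "v4" or None for the first JSON type in an Accept header."""
    for part in accept.split(","):
        mime = part.split(";")[0].strip().lower()  # drop parameters such as q=
        if mime == V4_JSON:
            return "v4"
        if mime == V3_JSON:
            return "v3"
        if mime == PLAIN_JSON:
            # Deprecated alias: old format for now, switchable later.
            return plain_json_means
    return None
```

Existing clients that send application/json keep getting the old format, new clients opt in with the v4 type, and a later hard switch only means changing the default of plain_json_means to "v4".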

Gareth David Johnson Tue 1 Sep

The proposed standard with all feedback incorporated for the first version has been added to this repo...

https://bitbucket.org/finproducts/hayson/src/master/
