Kinds – Project Haystack

Kinds

Overview

Haystack defines a fixed set of data types we call kinds.

There are three special singleton types:

  • Marker: label for a "type" or "is-a" relationship
  • NA: singleton value that represents not available, missing, or invalid data
  • Remove: singleton value that represents a removal operation

There are eleven scalar atomic types:

  • Bool: boolean "true" or "false"
  • Number: floating point number annotated with an optional unit of measurement
  • Str: string of Unicode characters
  • Uri: Uniform Resource Identifier
  • Ref: reference used to identify an entity instance
  • Symbol: name constant used to identify a def
  • Date: an ISO 8601 date as year, month, day: 2011-06-07.
  • Time: an ISO 8601 time as hour, minute, seconds: 09:51:27.354.
  • DateTime: an ISO 8601 timestamp followed by timezone name
  • Coord: geographic coordinate in latitude/longitude
  • XStr: extended typed string

And there are three collection types:

  • List: linear sequence of zero or more items
  • Dict: hashmap of name/value pairs
  • Grid: two-dimensional table

Each of these kinds is discussed in further detail below.

Names

Names used for dict tags and grid columns are restricted to the following characters:

  • Must start with ASCII lower case letter (a-z)
  • Must contain only ASCII letters, digits, or underbar (a-z, A-Z, 0-9, _)

By convention we use camel case (fooBarBaz) for generating names. Restricting names ensures they may be safely and easily used as identifiers in programming languages and databases.
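
As an illustration, the two naming rules above can be checked with a single regular expression. This is a minimal sketch in Python; the helper name is_valid_name is hypothetical and not part of any Haystack library.

```python
import re

# Pattern derived from the two rules above: one ASCII lower case letter,
# followed by any number of ASCII letters, digits, or underbars.
_NAME_RE = re.compile(r"[a-z][a-zA-Z0-9_]*")


def is_valid_name(name: str) -> bool:
    """Return True if name satisfies the tag and column naming rules."""
    return _NAME_RE.fullmatch(name) is not None


for candidate in ("fooBarBaz", "FooBar", "foo-bar", "area_2"):
    print(candidate, is_valid_name(candidate))
```

Using fullmatch rather than match ensures the whole string is checked, so a name with a trailing invalid character is rejected.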

Marker

Marker is a singleton used to create "label" tags. Markers are used to express typing information. For example the equip tag is used on any dict that represents an equipment asset.

Encodings:

M                     // Zinc cell
name without colon    // Zinc meta, nested dict, Trio
{ "_kind": "marker" } // JSON

NA

NA is a singleton for not available. It fills a similar role to the NA constant in the R language, as a placeholder for missing or invalid data values. In Haystack it is most often used in historized data to indicate that a timestamp sample is in error.

Encodings:

NA                // Zinc, Trio
{ "_kind": "na" } // JSON

Remove

Remove is a singleton used in dicts to indicate removal of a tag. It is reserved for future HTTP ops that perform entity updates.

Encodings:

R                     // Zinc, Trio
{ "_kind": "remove" } // JSON

Bool

Bool is the truth data type with the two values true and false.

Encodings:

T or F           // Zinc, Trio
true or false    // JSON boolean

Number

Number is an integer or floating point value with an optional unit of measurement. Implementations should represent a number as a 64-bit IEEE 754 floating point and provide 52 bits of lossless integer representation.

All Haystack Numbers may include an optional unit of measurement. This unit must be a symbol defined in the standard unit database.

Encodings:

// Unitless in Zinc, Trio, JSON are just simple numbers
45             // unitless integer
-23.45         // unitless floating point
5.4e-7         // unitless exponent format

// Zinc, Trio append unit using a literal syntax
45°F           // integer with unit
-23.45m²       // floating point with unit
10_000         // underbar is allowed as a separator
5.4E+8kW       // exponent format with unit

// JSON uses an object for numbers with a unit
{ "_kind": "number", "val": 45, "unit": "°F" }
{ "_kind": "number", "val": -23.45, "unit": "m²" }
{ "_kind": "number", "val": 5.4E+8, "unit": "kW" }

The three special Number values:

// Zinc, Trio use a literal syntax
NaN
INF
-INF

// JSON uses object syntax
{ "_kind": "number", "val": "NaN" }
{ "_kind": "number", "val": "INF" }
{ "_kind": "number", "val": "-INF" }

It is invalid for NaN to include a unit. A unit may be included with INF and -INF, however it is discouraged.
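
The JSON forms above can be produced mechanically. The following Python sketch is illustrative only: the function name number_to_json is hypothetical, and attaching a "unit" key to the INF and -INF objects is an assumption, since the examples above show those objects only without a unit.

```python
import math
from typing import Optional, Union


def number_to_json(val: float, unit: Optional[str] = None) -> Union[float, dict]:
    """Encode a Haystack Number using the JSON forms shown above."""
    if math.isnan(val):
        # The text states that NaN must not carry a unit.
        if unit is not None:
            raise ValueError("NaN must not include a unit")
        return {"_kind": "number", "val": "NaN"}
    if math.isinf(val):
        obj = {"_kind": "number", "val": "INF" if val > 0 else "-INF"}
        if unit is not None:
            # Assumption: a unit on INF/-INF uses the same "unit" key.
            obj["unit"] = unit
        return obj
    if unit is None:
        # Unitless numbers are plain JSON numbers.
        return val
    return {"_kind": "number", "val": val, "unit": unit}


print(number_to_json(45))
print(number_to_json(-23.45, "m²"))
print(number_to_json(float("-inf")))
```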

Str

Str is a sequence of zero or more Unicode characters. Implementations must fully support at least the Basic Multilingual Plane (plane 0), which covers all the 16-bit code points. All text formats must be encoded using UTF-8 unless explicitly specified otherwise (such as via a charset parameter in an HTTP Content-Type).

Strings are encoded using double quotes and C style backslash escapes:

"haystack"         // Zinc, Trio, JSON
"Line 1\nLine 2"   // Zinc, Trio, JSON with backslash escape newline

Note that Zinc and Trio require the "$" character to be backslash escaped.
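
A writer for Zinc or Trio string literals might look like the following Python sketch. The function name zinc_str is hypothetical, and the set of escapes handled here (backslash, double quote, newline, carriage return, tab, and "$") is an assumption based on common C style escapes; the Zinc chapter remains the authority on the full escape grammar.

```python
# Assumed escape table: common C style escapes plus the "$" rule noted above.
_ZINC_ESCAPES = {
    "\\": "\\\\",
    '"': '\\"',
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
    "$": "\\$",
}


def zinc_str(s: str) -> str:
    """Return s as a double-quoted literal with backslash escapes."""
    return '"' + "".join(_ZINC_ESCAPES.get(ch, ch) for ch in s) + '"'


print(zinc_str("haystack"))
print(zinc_str("Line 1\nLine 2"))
print(zinc_str("costs $5"))
```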

Strings are also used for enumerated types. Enumerations define their range via the enum type.

Uri

Uri is the data type used to represent Uniform Resource Identifiers according to RFC 3986.

Encodings:

// Zinc, Trio use back tick quotes
`http://project-haystack.org/`

// JSON
{ "_kind": "uri", "val": "http://project-haystack.org/" }

Ref

Refs are the data type for instance data identifiers. All entities are identified via the id tag and a unique ref data value. Relationships cross-reference entities with ref tags, and operations such as the read or hisRead ops identify an entity by its ref id.

Refs must adhere to the following limited set of ASCII characters:

  • ASCII lower case letter a-z
  • ASCII upper case letter A-Z
  • ASCII digit 0-9
  • Underbar "_"
  • Colon ":"
  • Dash "-"
  • Period "."
  • Tilde "~"

Haystack does not prescribe any specific format for refs. Client software must treat refs as opaque identifiers. It is suggested that implementations generate their refs as UUIDs to discourage their use as anything other than an opaque id; the dis tag should be used for human display names.

The scope of uniqueness for a ref depends on the contextual data set. When working with a flat file of Haystack data, ids are guaranteed unique only within that data set (much like RDF blank nodes). When working with the HTTP API, refs must be unique within the endpoint. Never assume that refs are globally unique.

Refs are encoded using "@" as a prefix and may optionally include the display name of the entity with a quoted string literal:

@foo-bar                  // Zinc, Trio
@foo-bar "Display Name"   // Zinc, Trio with display name

// JSON
{ "_kind": "ref", "val": "foo-bar" }

// JSON with display name
{ "_kind": "ref", "val": "foo-bar", "dis": "Display Name" }
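
To tie the character rules and the JSON encoding together, here is a minimal Python sketch. The function name ref_to_json is hypothetical, and rejecting an empty identifier is an assumption that the text above does not state explicitly.

```python
import re
from typing import Optional

# Allowed characters from the list above: letters, digits, and _ : - . ~
_REF_ID_RE = re.compile(r"[a-zA-Z0-9_:\-.~]+")


def ref_to_json(ref_id: str, dis: Optional[str] = None) -> dict:
    """Encode a ref, with an optional display name, in the JSON form above."""
    if _REF_ID_RE.fullmatch(ref_id) is None:
        raise ValueError("invalid ref id: " + repr(ref_id))
    obj = {"_kind": "ref", "val": ref_id}
    if dis is not None:
        obj["dis"] = dis
    return obj


print(ref_to_json("foo-bar"))
print(ref_to_json("foo-bar", "Display Name"))
```

The identifier is validated but never parsed, which is consistent with treating refs as opaque.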

Symbol

Symbols are the data type for def identifiers.

Symbols follow the same naming conventions as refs: only ASCII letters, digits, underbar, colon, dash, period, or tilde. Only a subset of these punctuation characters is used today: dashes are used for conjunct symbols and the colon is used for feature key symbols.

Symbols are encoded using "^" as a prefix:

^elec-meter  // Zinc, Trio

// JSON
{ "_kind": "symbol", "val": "elec-meter" }

Date

Date is an ISO 8601 calendar date. It is encoded as YYYY-MM-DD:

2020-07-17   // Zinc, Trio

// JSON
{ "_kind": "date", "val": "2020-07-17" }

Time

Time is an ISO 8601 time of day. It is encoded as hh:mm:ss.sss:

14:30:00    // Zinc, Trio for 2:30pm

// JSON
{ "_kind": "time", "val": "14:30:00" }

DateTime

DateTime is an ISO 8601 timestamp paired with a timezone name. Haystack requires all timestamps to include a timezone. Timezone names are standardized as the city name from the zoneinfo timezone database. Implementations should support DateTime precision at least down to the millisecond.

The encoding of DateTime is the ISO 8601 representation followed by a space and the timezone name:

2020-07-17T16:55:42.977-04:00 New_York  // Zinc, Trio
2020-07-17T23:30:00Z                    // May omit UTC timezone if offset is Z

// JSON
{ "_kind": "dateTime", "val": "2020-07-17T16:55:42.977-04:00", "tz": "New_York" }
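
The relationship between the Zinc literal and the JSON object can be sketched in Python as follows. The function name zinc_datetime_to_json is hypothetical, and using "UTC" as the timezone name when it is omitted is an assumption. The sketch only splits the literal and does not validate the ISO 8601 portion.

```python
def zinc_datetime_to_json(text: str) -> dict:
    """Split a Zinc DateTime literal into the JSON object form shown above."""
    iso, sep, tz = text.strip().partition(" ")
    if not sep:
        # The timezone name may be omitted only when the offset is Z.
        if not iso.endswith("Z"):
            raise ValueError("timezone name is required unless the offset is Z")
        tz = "UTC"  # assumption: name used for the omitted UTC timezone
    return {"_kind": "dateTime", "val": iso, "tz": tz}


print(zinc_datetime_to_json("2020-07-17T16:55:42.977-04:00 New_York"))
print(zinc_datetime_to_json("2020-07-17T23:30:00Z"))
```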

Coord

Coord is a specialized data type to represent a geographic coordinate as a latitude and longitude. Haystack uses a special atomic type for coordinates to optimize historization of geolocation for transportation applications (versus a collection data type such as dict).

Latitude and longitude are represented in decimal degrees. Implementations should support precision down to the micro-degree (6 decimal places) which provides accuracy to ~100mm and can be packed into a 64-bit integer.

Coord is encoded using positive/negative latitude, longitude in decimal degrees:

C(37.5458266,-77.4491888)   // Trio, Zinc

// JSON
{ "_kind": "coord", "lat": 37.5458266, "lng": -77.4491888 }
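
To illustrate the remark that a micro-degree coordinate fits in a 64-bit integer, here is one possible packing in Python. The layout (latitude in the high 32 bits, longitude in the low 32 bits, both shifted to be non-negative) and the function names are assumptions made for this sketch; Haystack does not prescribe a binary layout here.

```python
def pack_coord(lat: float, lng: float) -> int:
    """Pack a latitude/longitude pair into one 64-bit integer (assumed layout)."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lng <= 180.0):
        raise ValueError("coordinate out of range")
    # Shift to non-negative values and scale to micro-degrees.
    ulat = round((lat + 90.0) * 1_000_000)   # at most 180,000,000
    ulng = round((lng + 180.0) * 1_000_000)  # at most 360,000,000
    return (ulat << 32) | ulng


def unpack_coord(packed: int) -> tuple:
    """Reverse pack_coord, returning (lat, lng) in decimal degrees."""
    ulat = packed >> 32
    ulng = packed & 0xFFFFFFFF
    return ulat / 1_000_000 - 90.0, ulng / 1_000_000 - 180.0


packed = pack_coord(37.5458266, -77.4491888)
print(packed.bit_length() <= 64, unpack_coord(packed))
```

Both scaled values stay below 2^32, so each fits in its 32-bit half. The round trip loses anything finer than a micro-degree.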

XStr

XStr is a tuple of a "type name" and string encoded value. The type name must follow tag naming rules except it must start with an ASCII upper case letter (A-Z). XStrs provide a mechanism for vendors to round trip specific string encoded atomic values. The type name is not currently standardized by Project Haystack. However it should be assumed that future versions of this specification may standardize a set of XStr type names.

Encodings:

Type("value")   // Zinc, Trio
Color("red")    // Zinc, Trio

// JSON
{ "_kind": "xstr", "type": "Color", "val": "red" }

List

List is a collection data type. Lists are ordered sequences and may contain any other valid Haystack data types.

Lists are encoded using square brackets exactly like JSON arrays:

[1, "two", 3]    // Zinc, Trio, JSON
[]               // empty list

Dict

Dict (or dictionary) is the primary collection data type in Haystack. Dicts are an unordered collection of name/value pairs that we call tags. The name keys of a dict are restricted to ASCII letters, digits and underbar as discussed in the names section. The values may be any other valid Haystack data type.

Dicts are encoded using curly braces similar to JSON objects:

{x:123, y:456}       // Zinc, Trio (commas optional, trailing comma allowed)
{"x":123, "y":456}   // JSON object

Grid

Grid is a two-dimensional tabular data type. Grids are essentially a list of dicts. However, grids may also include grid-level and column-level metadata, each modeled as a dict. Grids are the fundamental unit of data exchange over the HTTP API.

Grids are commonly used to encode a list of dicts into a single table. Consider three dicts that model buildings:

id: @site-a
dis: "Site A"
site
area: 45000ft²

id: @site-b
dis: "Site B"
site

id: @site-c
dis: "Site C"
site
area: 62000ft²
phone: "(804) 555-1234"

Our three entities all have the id, dis, and site tags. In addition, two have the area tag, and one has a phone tag. Combining these three entities into a grid gives five columns and three rows:

id       dis       site  area      phone
-------  -------   ----  --------  ------
@site-a  "Site A"  ✓     45000ft²
@site-b  "Site B"  ✓
@site-c  "Site C"  ✓     62000ft²  "(804) 555-1234"

Note that the columns are the union of all tags used by the entities. Because not every entity has every tag, some cells are sparse (null). We could further add grid-level or column-level meta.
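
The union-of-tags construction can be sketched in Python as follows. This is an illustration only: the function name dicts_to_grid, the MARKER sentinel, the use of None for null cells, and the plain strings standing in for refs and numbers with units are all assumptions, not a Haystack API.

```python
MARKER = object()  # hypothetical stand-in for the Marker singleton


def dicts_to_grid(dicts):
    """Return (columns, rows), where columns is the union of all tag names."""
    cols = []
    for d in dicts:
        for name in d:
            if name not in cols:
                cols.append(name)  # keep first-seen order
    # A tag missing from a dict becomes a null (None) cell.
    rows = [[d.get(name) for name in cols] for d in dicts]
    return cols, rows


sites = [
    {"id": "@site-a", "dis": "Site A", "site": MARKER, "area": "45000ft²"},
    {"id": "@site-b", "dis": "Site B", "site": MARKER},
    {"id": "@site-c", "dis": "Site C", "site": MARKER, "area": "62000ft²",
     "phone": "(804) 555-1234"},
]
cols, rows = dicts_to_grid(sites)
print(cols)
```

For the three site dicts this produces the five columns id, dis, site, area, phone and three rows, with None in the cells that an entity does not define.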

The data above is encoded into Zinc as follows:

ver:"3.0"
id,       dis,       site,  area,      phone
@site-a,  "Site A",  M,     45000ft²,  N
@site-b,  "Site B",  M,     N,         N
@site-c,  "Site C",  M,     62000ft²,  "(804) 555-1234"
  

See the Zinc and JSON chapters for details on grid encoding.

Defs

All the data types are formally defined by name with a def.

Subtyping is used to narrow the core kinds into tag definitions. For example all ref tags will subtype from ref.