#1125 Proposal: Synthetic Ontology

Mike Melillo Wed 26 Jun

Synthetic Ontology Proposal

Intro & Purpose

This proposal originally started as an effort to model simulation data using Project Haystack, really a follow-up to a presentation at last year's Haystack Connect. However, after digging at the ideas for a while, it became evident that there is a lot of common ground between simulation and machine learning. In an effort to group these ideas under one umbrella, but leave room to distinguish and develop within each camp, the term synthetic was chosen as a sort of parent type. At this point, we would like to put the proposal out for community review and comment.

Thanks to the folks on the Haystack Labs group for helping work through the details on this, and especially thanks to Jan Široký from the Machine Learning working group for helping to iron out the synergies here. I reference his proposal in several places below, but just so it is not missed, it can be found here: https://project-haystack.org/forum/topic/1070#c10. For the sake of simplicity, I have included the defs from his post below to allow them to be viewed as a whole.

Synthetics have been a part of the Haystack practitioner’s toolkit for a long time, but never officially. With the growth in popularity of methods to generate and/or create time-series trend data for use in digital twins and analytical models, like physics-based whole-building sustainability modeling, machine learning, artificial intelligence, and others, the Haystack community would benefit from formalizing an ontology to support the use of these data sets. A key point of this ontology solution proposal is that it is method-agnostic. That is, if practitioners are producing Synthetic contextual time-series data using one toolkit or another, they can use the Synthetic ontology structure to manage the points.

To create consistency within this Synthetic ontology proposal, the synthetic tag is proposed as a pointFunction. Everything beneath that uses is: ^synthetic to extend and specify the method/use-case.

Main Terms

Synthetic: A point that contains time-series trend data (historical and future point values). This covers time-series points created from physics-based whole-building sustainability modeling and/or generated from historical sensor/meter data using analytics or statistical regression calculations.

Sim: Sim is an abbreviation for Simulation, which designates a type of synthetic point where the data is created by a physics-based whole-building sustainability model.

ML: ML is an abbreviation for Machine Learning, which designates a type of synthetic point where the data is generated by a machine learning model (random forest, linear regressions, etc.).

Computed: Computed designates a type of synthetic point where the data is generated by simple traditional mathematical calculations based on other input data.


simRef is used to link points (usually sensor-points) to their related Sims. This tag should be applied as a list when one sensor data point has multiple Synthetics (e.g., predictive data based upon one building performance strategy vs. another).

simScenario is used as a choice to identify Synthetic scenarios for optionality.

For example, physics-based whole-building sustainability modeling can create numerous decarbonization scenarios of energy conservation measures and bundles of measures for consideration. Similarly, machine learning and artificial intelligence can test optional scenarios for ultimate selection. For further reading on machine learning within Project Haystack, see this thread from the Machine Learning working group: https://project-haystack.org/forum/topic/1070.

This proposal introduces three (3) base Sim types, but practitioners may add further ad hoc cases by using is: simScenario for their own custom definitions. These scenarios are detailed in the full list below.

Base Defs

def: ^synthetic
is: ^pointFunction
tagOn: ^point
doc: Synthetic point, which can be Sim, ML, or Computed
mandatory
Note: This implies sim-synthetic-point as a conjunct.

def: ^sim
is: ^synthetic
doc: Simulation point

def: ^ml
is: ^synthetic
doc: Machine Learning point
Note: Full definitions related to machine learning found in: https://project-haystack.org/forum/topic/1070#c9

def: ^computed
is: ^synthetic
doc: Computed data point

is: ^ref
doc: Refs a synthetic point back to a real value
of: ^point
tagOn: ^sim-point

Model Definitions

def: ^model
is: ^entity
doc: Generic model entity definition. This can be a model that exists wholly within the application, or the proxy of a model from a remote application.

def: ^modelRef
is: ^ref
of: ^model
doc: Some synthetic point referring to the model that generated it.
tagOn: ^synthetic-point

def: ^simModel
is: ^model
doc: Simulation model for a group of sim-points.

def: ^mlModel
is: ^model
doc: Machine learning model for a group of ml-points.

def: ^computedModel
is: ^model
doc: Computed data model for a group of computed-points.

Sim Model Defs

is: ^choice
of: ^simScenario
doc:  Defines the type of simulation scenario
tagOn: ^simModel

is: ^simScenario
doc: Operational physics-based whole-building sustainability model.  Represents the as-designed & as-constructed operational conditions. The operational model is the base calibrated model upon which all other simulations are created.

is: ^simScenario
doc: Interrogation physics-based whole-building sustainability model.  Simulation to interrogate system performance under certain criteria.

is: ^simScenario
doc: Optimum physics-based whole-building sustainability model.  Simulation to define the optimum decarbonization potential of the building.
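As a rough illustration of how simRef, modelRef, and simScenario could fit together, here is a minimal Python sketch using plain dicts in place of Haystack records. All ids (kwSensor, kwSimA, modelA, and so on) are hypothetical, the marker name simOperational is an assumed name for the operational scenario, and encoding the scenario choice as a marker on the model record is also an assumption rather than something this proposal specifies.

```python
M = "marker"  # stand-in for a Haystack marker value

# Hypothetical records keyed by id. Ids and scenario marker names are
# assumptions made for illustration only.
RECS = {
    "kwSensor": {"point": M, "sensor": M, "simRef": ["kwSimA", "kwSimB"]},
    "kwSimA": {"point": M, "synthetic": M, "sim": M, "modelRef": "modelA"},
    "kwSimB": {"point": M, "synthetic": M, "sim": M, "modelRef": "modelB"},
    "modelA": {"model": M, "simModel": M, "simOperational": M},
    "modelB": {"model": M, "simModel": M, "simInterrogation": M},
}


def related_sims(recs, point_id, scenario=None):
    """Return ids of sim points listed in a point's simRef.

    If scenario is given, keep only sims whose model record carries
    that scenario marker.
    """
    refs = recs[point_id].get("simRef", [])
    if isinstance(refs, str):  # tolerate a single ref instead of a list
        refs = [refs]
    found = []
    for ref in refs:
        sim = recs[ref]
        model = recs.get(sim.get("modelRef", ""), {})
        if scenario is None or scenario in model:
            found.append(ref)
    return found


print(related_sims(RECS, "kwSensor"))
print(related_sims(RECS, "kwSensor", "simInterrogation"))
```

The first call lists every Sim tied to the sensor; the second narrows the list to Sims produced by a model tagged with one scenario.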

ML Model Defs (included for reference from Jan's ML group post)

def: ^model
is: ^entity
doc: Generic model entity definition. This can be a model that exists wholly 
within the application, or the proxy of a model from a remote application.

def: ^mlModel
is: ^model
doc: Machine learning model entity representing an overarching container for 
various components, including inputs, outputs, parameters, and metrics. 

List of independent variables, also known as model inputs or features,
associated with a machine learning model.

Dependent variable, also known as the model output or target,
associated with a machine learning model.
Represents the predicted outcome generated by the model.

Training period description, known as the identification period
or baseline, utilized during the model training process.

Result of model identification, which may appear as a list of
model parameters for simpler models or as a reference to a stored model,
in the form of a file uri. The structure of the dict is user-specific.

Goodness-of-fit metrics provided in the form of a simple dictionary.
For example: {r2:0.7889, cvrmse:58}.

Machine learning variable representing both model inputs and outputs.

Reference to a point associated with a machine learning variable,
known as a machine learning variable point.

Filter used for querying points by tags, providing more flexibility
than mlVarPoint, although it is not mandatory.

doc: Reference to a machine learning variable.

def: ^modelRef
is: ^ref
of: ^model
tagOn: ^synthetic-point
doc: Some synthetic point referring to the model that generated it.

is: ^pointFunction
tagOn: ^point
doc: Synthetic point, which can be Sim, ML, or Computed

is: ^synthetic
doc: Machine Learning point
doc: Point is a machine learning based prediction or forecast of another point. 
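For readers unfamiliar with the goodness-of-fit dictionary mentioned in the ML model defs (for example {r2:0.7889, cvrmse:58}), the following Python sketch shows one common way such values could be computed. The function name fit_metrics is hypothetical, and the formulas are assumptions: R² as 1 minus the ratio of residual to total sum of squares, and CV(RMSE) as a percentage of the mean of the actual values, using n rather than a degrees-of-freedom adjustment. The proposal itself does not prescribe how the metrics are calculated.

```python
import math


def fit_metrics(actual, predicted):
    """Return a dict with r2 and cvrmse (in percent) for two equal-length series."""
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("series must be non-empty and of equal length")
    mean = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    if ss_tot == 0 or mean == 0:
        raise ValueError("metrics undefined for constant or zero-mean data")
    rmse = math.sqrt(ss_res / n)  # assumption: plain n, no adjustment
    return {"r2": 1 - ss_res / ss_tot, "cvrmse": 100 * rmse / mean}


metrics = fit_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 30.0])
print(metrics)
```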

Brian Frank Wed 26 Jun

Thanks Mike, this is a great write-up.

I think in our last webcast, we talked about not using ch because computed points are not necessarily just computing historical data, but could also be computing a real-time curVal. So I think it might be better to use the tag computed (which is consistent with the term computedModel).

Mike Melillo Wed 26 Jun

Ah, good catch, revised above. Thanks Brian.

annie dehghani Thu 27 Jun

Agreed with Brian, this is a great writeup. Thank you for posting this Mike!

A question came up for us about where these synth points should "live" in the hierarchy. Normally we would put them under the same equipment as the actual sensor point.

Is that what others are doing as well? If that's the standard approach, should it be included in the proposal or should it be left to the modeler's discretion?

An example to illustrate my question: say you have a simulated CO2 sensor on an AHU.

@synthCO2Sensor - point, synthetic, pointRef: @realCO2Sensor 
@realCO2Sensor - point, air, co2, concentration, sensor, equipRef: @ahu
@ahu - ahu, equip 

Should @synthCO2Sensor also have an equipRef to @ahu in this example?

Sherri Simms Thu 27 Jun


Thank you for providing more documentation! This is super helpful!

I commented on Jan's Machine Learning proposal (https://project-haystack.org/forum/topic/1070#c10), but I also want to mention one more thing to consider. Since it relates to simulations, I am posting it on this topic instead...

As the IoT industry expands, SIM cards and their information may one day be incorporated into Haystack, and the prefix sim may then cause confusion. I know using abbreviations has both pros and cons, and I don't know how much time we should spend now guarding against unknowns in the future, but I figured it was worth mentioning. Maybe those who developed the existing Haystack tags for the Information and Communication Technology library (https://www.project-haystack.org/doc/lib-phIct/index) would have input on whether abbreviating simulation as "sim" would impact anything.

Thanks, ~Sherri SIMms


Mike Melillo Mon 1 Jul

Thanks for the feedback, answering comments in order.

From Annie:

Should @synthCO2Sensor also have an equipRef to @ahu in this example?

The two options that come to mind are:

  • Only synthetic-point records exist, and they can equipRef to real equips
  • synthetic-points can also exist as part of a completely synthetic equip

I think my preference is to put all points with the real equip, but I don't see a reason to exclude one way or another (or others I'm not seeing).

From Sherri:

I think both of your notes on model and sim are worth taking into account. For sim/SIM card, I wonder how often the term sim is used in isolation to refer to "Subscriber Identity Module", and whether this term could simply become simCard in Haystack if the need arose. That said, it probably shouldn't be our goal to close doors for folks down the road.

For model, perhaps the generic just becomes syntheticModel which removes any ambiguity to other uses of the model term... after all, in this ontology, syntheticModel is a sort of abstract parent just to get you to simModel or mlModel depending on your application.

Brian Frank Mon 1 Jul

I think sim is a pretty safe prefix to use in our domain, so I would say we stick with that.

From a navigation perspective, putting the points under the equipment just like the real points would be simplest. You can imagine a UI where you can select actual points or points from a specific synthetic model.

However, I agree model might be too generic. Since the key marker tag is called synthetic, syntheticModel would make the most sense. But I think if we do that, then the ref tag should be syntheticModelRef too.
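The navigation idea above, where synthetic points share an equipRef with the real points, might be sketched in Python as follows. The records reuse the ids from Annie's example (realCO2Sensor, synthCO2Sensor, ahu); the model id co2Model, the modelRef value, and the helper name points_under are hypothetical, and this is only one way a UI could filter points.

```python
M = "marker"  # stand-in for a Haystack marker value

# Records based on the example earlier in the thread; the equipRef and
# modelRef on the synthetic point are assumptions for illustration.
POINTS = [
    {"id": "realCO2Sensor", "point": M, "sensor": M, "equipRef": "ahu"},
    {
        "id": "synthCO2Sensor",
        "point": M,
        "synthetic": M,
        "pointRef": "realCO2Sensor",
        "equipRef": "ahu",
        "modelRef": "co2Model",
    },
]


def points_under(points, equip_id, model_id=None):
    """List point ids under an equip.

    With model_id=None, return actual (non-synthetic) points; otherwise
    return synthetic points generated by that model.
    """
    found = []
    for p in points:
        if p.get("equipRef") != equip_id:
            continue
        is_synth = "synthetic" in p
        if model_id is None and not is_synth:
            found.append(p["id"])
        elif model_id is not None and is_synth and p.get("modelRef") == model_id:
            found.append(p["id"])
    return found


print(points_under(POINTS, "ahu"))
print(points_under(POINTS, "ahu", "co2Model"))
```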

Richard McElhinney Tue 2 Jul

Hi All,

thank you to everyone involved for preparing this write up and thank you to Mike for posting. It's fantastic to yet again see the power of community in continuing to evolve Project Haystack and to see domain experts carry on this work.

Just an observation on many of the definitions above.

In a number of places the definitions of terms, tags, etc. refer to seemingly only being relevant to "physics based whole-building sustainability modelling".

In our work we do a lot of modelling using ML to be able to predict future behaviour of complex machines only, not whole of building. It seems that the current definition is a little exclusive. There is a lot of modelling that is not whole-building based and focuses on components, sub-systems, or types of equipment.

So I was wondering if the definitions could be a little broader so as to not be so focused on "whole-building" modelling which we actually don't see much of in our work in the field when doing chiller plant optimisation.

Cheers, Richard

Jon Schoenfeld Wed 3 Jul

Hi All,

Great work above. We are excited to put this into practice.

I'd like to recommend an addition to the simScenario choices. After taking corrective actions in a building or implementing ECMs, the operational model must change to reflect the new operation of the building. The old operational model then becomes the baseline, for lack of a better term, for the quantification of the impact of the work that was completed. As new work is performed, the process repeats, i.e., new operational models are created and old operational models become "baseline" models.

I believe a different simScenario is needed for this baseline. The simInterrogation is used to simulate future scenarios, i.e., if I did this, what would the outcome be. simBaseline (or whatever it should be called) is what the performance used to be and is essential for quantifying energy savings or performance improvements.
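To make the role of a baseline concrete, here is a minimal Python sketch of the savings calculation it enables, assuming two aligned interval series of energy use: one from the old (baseline) model and one reflecting current operation. The function name energy_savings and the sample numbers are hypothetical, and real measurement and verification workflows typically involve more than a simple difference.

```python
def energy_savings(baseline_kwh, operational_kwh):
    """Return per-interval savings and their total for two aligned series."""
    if len(baseline_kwh) != len(operational_kwh):
        raise ValueError("series must cover the same intervals")
    per_interval = [b - o for b, o in zip(baseline_kwh, operational_kwh)]
    return per_interval, sum(per_interval)


# Made-up sample values, purely for illustration
per_interval, total = energy_savings([100.0, 120.0, 110.0], [90.0, 100.0, 105.0])
print(per_interval, total)
```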

Thanks, Jon

Mike Melillo Sun 7 Jul

Summarizing a few changes/requests from the above:

  • syntheticModel + syntheticModelRef is probably the best transition to avoid the generic model tag
  • Some doc language around sim should be generalized to not focus solely on whole-building. Richard, if you have recommendations here, I'm all ears.
  • Additional simScenario def for simBaseline if I'm reading Jon's comments correctly

We should be having a Labs WG meeting next week (edited); ideally we can clarify these items and provide resolutions afterward. Following that, and barring other comments, I'll look to draft up the actual defs.
