#1125 Proposal: Synthetic Ontology

Mike Melillo Wed 26 Jun 2024

Synthetic Ontology Proposal

Intro & Purpose

This proposal originally started as an effort to model simulation data using Project Haystack, really a follow-up to a presentation at last year's Haystack Connect. However, after digging at the ideas for a while, it became evident that there is a lot of common ground between simulation and machine learning. In an effort to group these ideas under one umbrella, but leave room to distinguish and develop within each camp, the term synthetic was chosen as a sort of parent type. At this point, we would like to put the proposal out for community review and comment.

Thanks to the folks on the Haystack Labs group for helping work through the details on this, and especially thanks to Jan Široký from the Machine Learning working group for helping to iron out the synergies here. I reference his proposal in several places below, but just so it is not missed, it can be found here: https://project-haystack.org/forum/topic/1070#c10. For the sake of simplicity, I have included the defs from his post below to allow them to be viewed as a whole.

Synthetics have been a part of the Haystack practitioner’s toolkit for a long time, but never officially. With the growth in popularity of methods to generate and/or create time-series trend data for use in digital twins and analytical models, like physics-based whole-building sustainability modeling, machine learning, artificial intelligence, and others, the Haystack community would benefit from formalizing an ontology to support the use of these data sets. A key point of this ontology solution proposal is that it is method-agnostic. That is, if practitioners are producing Synthetic contextual time-series data using one toolkit or another, they can use the Synthetic ontology structure to manage the points.

To create consistency within this Synthetic ontology proposal, the synthetic tag is proposed as a pointFunction. Everything beneath that uses is: ^synthetic to extend and specify the method/use-case.

Main Terms

Synthetic: A point that contains time-series trend data (historical and future point values). Time-series points created from physics-based whole-building sustainability modeling and/or time-series points generated from historical sensor/meter data using analytics or statistical regression calculations.

Sim: Sim is an abbreviation for Simulation, which designates a type of synthetic point where the data is created by a physics-based whole-building sustainability model.

ML: ML is an abbreviation for Machine Learning, which designates a type of synthetic point where the data is generated by a machine learning model (random forest, linear regressions, etc.).

Computed: Computed designates a type of synthetic point where the data is generated by simple traditional mathematical calculations based on other input data.

Structure

simRef is used to link points (usually sensor points) to their related Sims. This tag should be applied as a list when one sensor data point has multiple Synthetics (e.g., predictive data based upon one building performance strategy vs. another).

simScenario is used as a choice to identify Synthetic scenarios for optionality.

For example, physics-based whole-building sustainability modeling can create numerous decarbonization scenarios of energy conservation measures and bundles of measures for consideration. Similarly, machine learning and artificial intelligence can test optional scenarios for ultimate selection. For further reading on machine learning within Project Haystack, see this thread from the Machine Learning working group: https://project-haystack.org/forum/topic/1070.

This proposal introduces three (3) base Sim types, but practitioners may add further ad hoc cases by using is: simScenario for their own custom definitions. These scenarios are detailed in the full list below.

Base Defs

synthetic
is: ^pointFunction
tagOn: ^point
doc: Synthetic point which can be sim, ml, or computed
mandatory
Note: This implies sim-synthetic-point as a conjunct.

sim
is: ^synthetic
doc: Simulation point

ml
is: ^synthetic
doc: Machine Learning point
Note: Full definitions related to machine learning found in: https://project-haystack.org/forum/topic/1070#c9 

computed
is: ^synthetic
doc: Computed data point

pointRef
is: ^ref
doc: Refs a synthetic point back to a real value
of: ^point
tagOn: ^sim-point
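
As a rough illustration of how the is: links in the base defs could be consumed (this is not part of the proposal itself), the sketch below walks the inheritance chain in Python. The DEFS table and the functions is_a and synthetic_kinds are hypothetical names made up for this example, and the table holds only the defs listed above.

```python
# Minimal in-memory copy of the "is" links from the base defs above.
# Supertypes of pointFunction are deliberately left out of this sketch.
DEFS = {
    "pointFunction": [],
    "synthetic": ["pointFunction"],
    "sim": ["synthetic"],
    "ml": ["synthetic"],
    "computed": ["synthetic"],
}


def is_a(name: str, base: str) -> bool:
    """Return True if `name` equals `base` or inherits from it via `is`."""
    if name == base:
        return True
    # Tags that are not in the table simply have no supertypes here.
    return any(is_a(parent, base) for parent in DEFS.get(name, []))


def synthetic_kinds(point: dict) -> list[str]:
    """List tags on a point that are subtypes of synthetic (not synthetic itself)."""
    return sorted(t for t in point if t != "synthetic" and is_a(t, "synthetic"))


# Hypothetical sim point, written as a plain dict of tags.
point = {"point": True, "synthetic": True, "sim": True, "pointRef": "@realCO2Sensor"}
kinds = synthetic_kinds(point)
```

With this table, any tool that understands synthetic can treat sim, ml, and computed points uniformly and only branch on the specific kind when it needs to.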

Model Definitions

model
is: ^entity
doc: Generic model entity definition. This can be a model that exists wholly within the application, or the proxy of a model from a remote application.
mandatory

modelRef
is: ^ref
of: ^model
doc: Some synthetic point referring to the model that generated it.
tagOn: ^synthetic-point


simModel
is: ^model
doc: Simulation model for a group of sim-points.

mlModel
is: ^model
doc: Machine learning model for a group of ml-points.

computedModel
is: ^model
doc: Computed data model for group of computed-points.

Sim Model Defs

simScenario
is: ^choice
of: ^simScenario
doc:  Defines the type of simulation scenario
tagOn: ^simModel

simOperational
is: ^simScenario
doc: Operational physics-based whole-building sustainability model.  Represents the as-designed & as-constructed operational conditions. The operational model is the base calibrated model upon which all other simulations are created.

simInterrogation
is: ^simScenario
doc: Interrogation physics-based whole-building sustainability model.  Simulation to interrogate system performance under certain criteria.

simOptimum
is: ^simScenario
doc: Optimum physics-based whole-building sustainability model.  Simulation to define the optimum decarbonization potential of the building.
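
To make the relationship between pointRef, modelRef, and simScenario more concrete, here is a small Python sketch that groups the sim points for one real sensor by scenario. It assumes the scenario is applied as a marker tag on the simModel record, as with other Haystack choices; the record ids, the records dict, and the function names are all hypothetical.

```python
# The three base scenarios from the Sim Model Defs above.
SCENARIOS = ("simOperational", "simInterrogation", "simOptimum")

# Hypothetical records, written as plain dicts keyed by id.
records = {
    "@zoneTemp": {"point": True, "sensor": True},
    "@modelOp": {"simModel": True, "simOperational": True},
    "@modelOpt": {"simModel": True, "simOptimum": True},
    "@zoneTempOp": {"point": True, "synthetic": True, "sim": True,
                    "pointRef": "@zoneTemp", "modelRef": "@modelOp"},
    "@zoneTempOpt": {"point": True, "synthetic": True, "sim": True,
                     "pointRef": "@zoneTemp", "modelRef": "@modelOpt"},
}


def scenario_of(model: dict) -> str:
    """Return the first scenario marker found on a model record."""
    for name in SCENARIOS:
        if name in model:
            return name
    return "unknown"


def sims_by_scenario(records: dict, real_id: str) -> dict:
    """Map scenario name -> ids of sim points whose pointRef targets real_id."""
    grouped = {}
    for rec_id, rec in records.items():
        if "sim" in rec and rec.get("pointRef") == real_id:
            # A missing or dangling modelRef falls back to an empty model.
            model = records.get(rec.get("modelRef"), {})
            grouped.setdefault(scenario_of(model), []).append(rec_id)
    return grouped


grouped = sims_by_scenario(records, "@zoneTemp")
```

A custom scenario added through is: simScenario would only need to be appended to SCENARIOS in this sketch.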

ML Model Defs (included for reference from Jan's ML group post)

def: ^model
is: ^entity
mandatory
doc: Generic model entity definition. This can be a model that exists wholly 
within the application, or the proxy of a model from a remote application.

def: ^mlModel
is: ^model
doc: Machine learning model entity representing an overarching container for 
various components, including inputs, outputs, parameters, and metrics. 

def:^mlInputVarRefs
is:^list
of:^mlVarRef
tagOn:^mlModel
doc:
List of independent variables, also known as model inputs or features,
associated with a machine learning model.


def:^mlOutputVarRef
is:^ref
of:^mlVar
tagOn:^mlModel
doc:
Dependent variable, also known as the model output or target,
associated with a machine learning model.
Represents the predicted outcome generated by the model.

def:^mlIdentificationPeriod
is:^span
tagOn:^mlModel
doc:
Training period description, known as the identification period
or baseline, utilized during the model training process.

def:^mlModelParameters
is:^dict
ro
tagOn:^mlModel
doc:
Result of model identification, which may appear as a list of
model parameters for simpler models or as a reference to a stored model,
in the form of a file uri. The structure of the dict is user-specific.

def:^mlModelMetrics
is:^dict
ro
tagOn:^mlModel
doc:
Goodness-of-fit metrics provided in the form of a simple dictionary.
For example: {r2:0.7889, cvrmse:58}.

def:^mlVar
is:^entity
mandatory
doc:
Machine learning variable representing both model inputs and outputs.

def:^mlVarPoint
is:^ref
of:^point
tagOn:^mlVar
doc:
Reference to a point associated with a machine learning variable,
known as a machine learning variable point.

def:^mlVarFilter
is:^filterStr
tagOn:^mlVar
doc:
Filter used for querying points by tags, providing more flexibility
than mlVarPoint, although it is not mandatory.

def:^mlVarRef
is:^ref
of:^mlVar
doc: Reference to a machine learning variable.

def: ^modelRef
is: ^ref
of: ^model
tagOn: ^synthetic-point
doc: Some synthetic point referring to the model that generated it.

def:^synthetic
is: ^pointFunction
tagOn: ^point
mandatory
doc: Synthetic point which can be sim, ml, or computed

def:^ml
is: ^synthetic
doc: Machine learning point. Point is a machine learning based prediction or forecast of another point.
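
For readers who want to see how the ML defs fit together, the following Python sketch builds one mlModel with one input and one output variable and resolves them back to points. The record ids and the function resolve_model_points are hypothetical, the metrics values are the ones from the mlModelMetrics doc above, and the sketch only follows mlVarPoint (it ignores mlVarFilter).

```python
# Hypothetical records, written as plain dicts keyed by id.
records = {
    "@oat": {"point": True, "sensor": True},
    "@kwh": {"point": True, "sensor": True},
    "@varOat": {"mlVar": True, "mlVarPoint": "@oat"},
    "@varKwh": {"mlVar": True, "mlVarPoint": "@kwh"},
    "@model": {
        "mlModel": True,
        "mlInputVarRefs": ["@varOat"],   # list of mlVarRef
        "mlOutputVarRef": "@varKwh",     # single ref to an mlVar
        "mlModelMetrics": {"r2": 0.7889, "cvrmse": 58},
    },
}


def resolve_model_points(records: dict, model_id: str) -> dict:
    """Follow mlVar refs on a model to the point ids they reference."""
    model = records[model_id]

    def var_point(var_id: str) -> str:
        # Only mlVarPoint is handled; a var using mlVarFilter would raise KeyError.
        return records[var_id]["mlVarPoint"]

    return {
        "inputs": [var_point(v) for v in model.get("mlInputVarRefs", [])],
        "output": var_point(model["mlOutputVarRef"]),
    }


resolved = resolve_model_points(records, "@model")
```

The extra hop through mlVar is what lets a variable be bound either to one point or to a filter without changing the model record.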

Brian Frank Wed 26 Jun 2024

Thanks Mike, this is a great write-up.

I think in our last webcast, we talked about not using ch because computed points are not necessarily just computing historical data, but could also be computing a real-time curVal. So I think it might be better to use the tag computed (which is consistent with the term computedModel).

Mike Melillo Wed 26 Jun 2024

Ah, good catch, revised above. Thanks Brian.

annie dehghani Thu 27 Jun 2024

Agreed with Brian, this is a great writeup. Thank you for posting this Mike!

A question came up for us about where these synth points should "live" in the hierarchy. Normally we would put them under the same equipment as the actual sensor point.

Is that what others are doing as well? If that's the standard approach, should it be included in the proposal or should it be left to the modeler's discretion?

Example to illustrate my purpose. Say you have a simulated CO2 sensor on an AHU.

@synthCO2Sensor - point, synthetic, pointRef: @realCO2Sensor 
@realCO2Sensor - point, air, co2, concentration, sensor, equipRef: @ahu
@ahu - ahu, equip

Should @synthCO2Sensor also have an equipRef to @ahu in this example?

Sherri Simms Thu 27 Jun 2024

Mike,

Thank you for providing more documentation! This is super helpful!

I commented on Jan's Machine Learning proposal (https://project-haystack.org/forum/topic/1070#c10), but I also want to mention one more thing to consider. Since it concerns simulations, I am posting it on this topic instead...

As the IoT industry expands, SIM cards and their information may also one day be incorporated into Haystack, and the prefix sim may cause confusion. I know using abbreviations has both pros and cons, and I don't know how much time we should spend now guarding against unknowns in the future, but I figured it is worth mentioning. Maybe those who developed the existing Haystack tags for the Information and Communication Technology library (https://www.project-haystack.org/doc/lib-phIct/index) would have input about whether abbreviating simulation as "sim" would affect anything.

Thanks, ~Sherri SIMms

;)

Mike Melillo Mon 1 Jul 2024

Thanks for the feedback, answering comments in order.

From Annie:

Should @synthCO2Sensor also have an equipRef to @ahu in this example?

The two options that come to mind are:

Only synthetic-point records exist, and they can equipRef to real equips
synthetic-points can also exist as part of a completely synthetic equip

I think my preference is to put all points with the real equip, but I don't see a reason to exclude one way or another (or others I'm not seeing).
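
A minimal Python sketch of the first option, using Annie's example records with an equipRef added to the synthetic point. The records dict and the function points_under are hypothetical names for illustration; the point is only that one equipRef query returns both kinds of points, which can then be split on the synthetic marker.

```python
# Annie's example, with the synthetic point also carrying equipRef: @ahu.
records = {
    "@ahu": {"ahu": True, "equip": True},
    "@realCO2Sensor": {"point": True, "air": True, "co2": True,
                       "concentration": True, "sensor": True,
                       "equipRef": "@ahu"},
    "@synthCO2Sensor": {"point": True, "synthetic": True,
                        "pointRef": "@realCO2Sensor", "equipRef": "@ahu"},
}


def points_under(records: dict, equip_id: str) -> tuple[list, list]:
    """Return (real point ids, synthetic point ids) under one equip."""
    real, synthetic = [], []
    for rec_id, rec in records.items():
        if "point" in rec and rec.get("equipRef") == equip_id:
            (synthetic if "synthetic" in rec else real).append(rec_id)
    return real, synthetic


real_points, synth_points = points_under(records, "@ahu")
```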

From Sherri:

I think both of your notes on model and sim are worth taking into account. For sim/SIM card, I wonder how often the term sim is used in isolation to refer to "Subscriber Identity Module", and whether that term could simply become simCard in Haystack if the need arose. That said, it probably shouldn't be our goal to close doors for folks down the road.

For model, perhaps the generic just becomes syntheticModel which removes any ambiguity to other uses of the model term... after all, in this ontology, syntheticModel is a sort of abstract parent just to get you to simModel or mlModel depending on your application.

Brian Frank Mon 1 Jul 2024

I think sim is a pretty safe prefix to use in our domain, so I would say we stick with that.

From a navigation perspective, putting the points under the equipment just like the real points would be simplest. You can imagine a UI where you can select actual points or points from a specific synthetic model.

However, I agree model might be too generic. Since the key marker tag is called synthetic, syntheticModel would make the most sense. But I think if we do that, then the ref tag should be syntheticModelRef too.

Richard McElhinney Tue 2 Jul 2024

Hi All,

thank you to everyone involved for preparing this write up and thank you to Mike for posting. It's fantastic to yet again see the power of community in continuing to evolve Project Haystack and to see domain experts carry on this work.

Just an observation on many of the definitions above.

In a number of places the definitions of terms, tags, etc. refer to seemingly only being relevant to "physics based whole-building sustainability modelling".

In our work we do a lot of modelling using ML to predict the future behaviour of complex machines only, not the whole building. It seems that the current definition is a little exclusive. There is a lot of modelling that is not whole-building based and focuses on components, sub-systems, or types of equipment.

So I was wondering if the definitions could be a little broader so as to not be so focused on "whole-building" modelling which we actually don't see much of in our work in the field when doing chiller plant optimisation.

Cheers, Richard

Jon Schoenfeld Wed 3 Jul 2024

Hi All,

Great work above. We are excited to put this into practice.

I'd like to recommend an addition to the simScenario choices. After taking corrective actions in a building or implementing ECMs, the operational model must change to reflect the new operation of the building. The old operational model then becomes the baseline, for lack of a better term, for the quantification of the impact of the work that was completed. As new work is performed, the process repeats, i.e., new operational models are created and old operational models become "baseline" models.

I believe a different simScenario is needed for this baseline. The simInterrogation is used to simulate future scenarios, i.e., if I did this, what would the outcome be. simBaseline (or whatever it should be called) is what the performance used to be and is essential for quantifying energy savings or performance improvements.
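
To show why the baseline matters for quantification, here is a small Python sketch that compares a baseline series against the current operational series. The function savings and the monthly kWh values are made up for illustration, and simBaseline is only the tentative name suggested above.

```python
def savings(baseline: list[float], operational: list[float]) -> tuple[float, float]:
    """Return (absolute savings, fractional savings) over two aligned series."""
    if len(baseline) != len(operational):
        raise ValueError("series must cover the same intervals")
    base_total = sum(baseline)
    saved = base_total - sum(operational)
    # Guard against an all-zero baseline to avoid dividing by zero.
    fraction = saved / base_total if base_total else 0.0
    return saved, fraction


# Hypothetical monthly energy (kWh) from a simBaseline model and from the
# updated simOperational model after an ECM was implemented.
baseline_kwh = [1200.0, 1100.0, 1000.0]
operational_kwh = [1000.0, 950.0, 850.0]

saved_kwh, saved_fraction = savings(baseline_kwh, operational_kwh)
```

Without a distinct tag for the old operational model, a query like this has no reliable way to tell which of two operational models is the "before" case.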

Thanks, Jon

Mike Melillo Sun 7 Jul 2024

Summarizing a few changes/requests from the above:

syntheticModel + ref probably best transition to avoid the generic model tag
Some doc language around sim should be generalized to not focus solely on whole-building. Richard if you have recommendations here, I'm all ears.
Additional simScenario def for simBaseline if I'm reading Jon's comments correctly

We should be having a Labs WG meeting next week (edited); ideally we can clarify these items and provide resolutions afterward. Following that, and barring other comments, I'll look to draft up the actual defs.

Mike Melillo Fri 30 Aug 2024

Update: after much too much laboring on syntax and formatting, I have created a pull request incorporating relevant notes and comments from both this thread and conversations with the Haystack Labs group.

You can view the PR here: https://github.com/Project-Haystack/haystack-defs/pull/22

Thanks all and again especially to Jan for his input on the machine learning piece.

Brian Frank Tue 3 Sep 2024

I merged the PR.

However, there were conflicts with the new computed tag because that was previously being used in relationships to mark the side that was computed from its reciprocal. I renamed the usage of that tag to computedFromReciprocal (it was only used in three places for some core defs and was kind of a special case)

James Gessel Thu 6 Feb

Love all the work that has gone into this - very valuable, thanks all!!

I haven’t seen much discussion on confidence intervals or other statistically relevant data for synthetic points. For me, having this data is non-negotiable, but I struggle with where to put it. Do I bother modeling it? If so, what tags?

One approach is to calculate it on the fly, but that's not always possible. Another is to create separate points for upper and lower bounds, but that feels wrong; it’s not really a standalone data point since it requires context to get any value. It's more like an appendage to the ML-generated point, or time-series metadata.
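
For the "calculate it on the fly" option, one possible sketch in Python is below. It is only an approximation under stated assumptions: it treats the model's past residuals as roughly normal with constant variance and builds a band of plus or minus z standard deviations around each forecast value. The function residual_band and all the sample values are hypothetical.

```python
from statistics import stdev


def residual_band(actual, predicted_hist, forecast, z=1.96):
    """Return (lower, upper) pairs around each forecast value.

    Assumes residuals on the historical period are roughly normal with
    constant variance, so z=1.96 gives an approximate 95% band.
    """
    residuals = [a - p for a, p in zip(actual, predicted_hist)]
    sigma = stdev(residuals)  # needs at least two residuals
    return [(f - z * sigma, f + z * sigma) for f in forecast]


# Hypothetical history of the real point and the ml point over the same
# intervals, followed by two forecast values from the ml point.
actual = [10.0, 12.0, 11.0, 13.0]
predicted_hist = [9.0, 13.0, 11.0, 12.0]
forecast = [12.0, 12.5]

band = residual_band(actual, predicted_hist, forecast)
```

This avoids extra bound points, but it only works where both the real history and the synthetic history are available to the client at read time.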

I've run into a few "point appendage" scenarios. Some of them get out of hand quickly - dozens of extraneous "points". Should these be modeled, if so how? I'd certainly prefer some sort of "appendage" type entity that can store extraneous his data, but avoiding higher point counts doesn't seem like a strong motivator for platform providers :)

What do you all do? I assume I'm not the only one who likes confidence intervals :)