Jan Široký Thu 8 Jun 2023
Hi, my colleagues and I have been developing various machine learning applications in Haystack-compatible environments (see https://energytwin.io/). We have consistently faced challenges in defining machine learning model tags. Questions have arisen, such as how to link a model with a specific point, whether it is possible to define multiple models for one point, and how to reference a model's independent variables, among others.
I would like to propose the creation of a working group to help define tags related to machine learning. We will be happy to share our experiences and ideas with the community and collaborate in the process of defining these machine learning tags.
Who in the community would be interested in joining this working group?
Rick Jennings Thu 8 Jun 2023
Hi Jan,
This sounds like a great initiative and I would be interested in joining this working group.
Thanks for setting this up!
Rick
Georgios Grigoriou Fri 9 Jun 2023
Hi Jan,
Happy to join the cause and contribute!
Georgios
Stephen Frank Fri 9 Jun 2023
I am also interested. We have some custom tagging we developed at NREL for this purpose that we can share.
annie dehghani Fri 9 Jun 2023
I would be happy to contribute what we have developed for this purpose as well. Like Stephen, we have developed some custom tagging. It would be great to define this in a more formal and uniform way.
Also, there was an ask at Haystack Connect to help model simulation data, and it occurs to me that there may be a lot of crossover with this group. Maybe this group is open to modelling "simulation" results more broadly, whether it's physics-based simulation, machine learning, simple regression, etc.?
Adam J Wallen Mon 12 Jun 2023
Hi ML Team,
Craig Stevenson and I presented at Haystack Connect and are interested in the possible synergy between physics-based simulations tagging and machine learning tagging. Thanks for the shout-out, Annie!
Thanks, Adam
Jan Široký Fri 16 Jun 2023
Hi Annie,
This is a good point. We can consider referencing simulation data as well. This referencing principle can be applied to both physics-based models and machine learning models.
Thanks, Jan
Jan Široký Tue 20 Jun 2023
Hi all,
I have sent you the invitation for the kickoff meeting scheduled on August 10th at 11 am ET.
Please feel free to share your ideas or requirements with me prior to the kickoff. I will make an effort to incorporate them into the first draft.
Thanks, Jan
Keith Bishoρ Tue 20 Jun 2023
Looking forward to working with this group.
Jan Široký Fri 10 Nov 2023
It took us some time to process all the inputs (special thanks to Keith B. and Stephen F.). However, we have successfully formulated the initial proposal.
As deliberated during the ML WG calls, our preference is to commence with a less exhaustive definition, allowing flexibility for the end user (e.g., we refrain from specifying how the identified ML parameters should be stored).
The definition provided below is a preliminary draft; we invite your comments on any aspect, with a particular focus on naming conventions, any missing elements, and potential use cases that may not be addressed.
def:^mlModel
is:^entity
mandatory
doc:
  Machine learning model entity representing an overarching container for
  various components, including inputs, outputs, parameters, and metrics.
---
def:^mlInputVarRefs
is:^list
of:^mlVarRef
tagOn:^mlModel
doc:
  List of independent variables, also known as model inputs or features,
  associated with a machine learning model.
---
def:^mlOutputVarRef
is:^ref
of:^mlVar
tagOn:^mlModel
doc:
  Dependent variable, also known as the model output or target,
  associated with a machine learning model.
  Represents the predicted outcome generated by the model.
---
def:^mlIdentificationPeriod
is:^span
tagOn:^mlModel
doc:
  Training period description, known as the identification period
  or baseline, utilized during the model training process.
---
def:^mlModelParameters
is:^dict
ro
tagOn:^mlModel
doc:
  Result of model identification, which may appear as a list of
  model parameters for simpler models or as a reference to a stored model,
  in the form of a file URI. The structure of the dict is user-specific.
---
def:^mlModelMetrics
is:^dict
ro
tagOn:^mlModel
doc:
  Goodness-of-fit metrics provided in the form of a simple dictionary.
  For example: {r2:0.7889, cvrmse:58}.
---
def:^mlVar
is:^entity
mandatory
doc:
  Machine learning variable representing both model inputs and outputs.
---
def:^mlVarPoint
is:^ref
of:^point
tagOn:^mlVar
doc:
  Reference to a point associated with a machine learning variable,
  known as a machine learning variable point.
---
def:^mlVarFilter
is:^filterStr
tagOn:^mlVar
doc:
  Filter used for querying points by tags, providing more flexibility
  than mlVarPoint, although it is not mandatory.
---
def:^mlVarRef
is:^ref
of:^mlVar
doc: Reference to a machine learning variable.
---
def:^mlModelRef
is:^ref
of:^mlModel
tagOn:^mlPrediction
doc:
  Applied to a prediction point, referencing the specific
  machine learning model used for generating predictions.
---
def:^mlPrediction
is:^pointFunction
doc: Point is a prediction or forecast of another point.
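To make the linking questions from the start of this thread concrete, here is a minimal Python sketch of how records tagged with the draft defs above could be traversed. It is only an illustration under stated assumptions: the record ids, the plain-dict representation, and the helper names (`match_filter`, `var_points`, `model_io`, `models_predicting`) are hypothetical, and `match_filter` is a simplified stand-in that only understands marker names joined by "and", not the full Haystack filter grammar.

```python
# Hypothetical records modelled as plain dicts keyed by id; markers are True.
recs = {
    "p-power": {"id": "p-power", "point": True, "power": True, "sensor": True},
    "p-oat": {"id": "p-oat", "point": True, "outside": True, "air": True,
              "temp": True, "sensor": True},
    # mlVar entities: one uses a direct point ref, the other a filter.
    "v-power": {"id": "v-power", "mlVar": True, "mlVarPoint": "p-power"},
    "v-oat": {"id": "v-oat", "mlVar": True,
              "mlVarFilter": "outside and air and temp and sensor"},
    # Two models sharing the same output variable (multiple models, one point).
    "m-1": {"id": "m-1", "mlModel": True,
            "mlInputVarRefs": ["v-oat"], "mlOutputVarRef": "v-power"},
    "m-2": {"id": "m-2", "mlModel": True,
            "mlInputVarRefs": ["v-oat"], "mlOutputVarRef": "v-power"},
    # Prediction point referencing the model that generates it.
    "p-pred": {"id": "p-pred", "point": True, "mlPrediction": True,
               "mlModelRef": "m-1"},
}


def match_filter(rec, filter_str):
    """Simplified filter: every marker in 'a and b and c' must be on the rec."""
    return all(name.strip() in rec for name in filter_str.split(" and "))


def var_points(var):
    """Resolve an mlVar to point ids via mlVarPoint, else via mlVarFilter."""
    if "mlVarPoint" in var:
        return [var["mlVarPoint"]]
    if "mlVarFilter" in var:
        return [r["id"] for r in recs.values()
                if "point" in r and match_filter(r, var["mlVarFilter"])]
    return []


def model_io(model_id):
    """Return (input point ids, output point ids) for an mlModel."""
    model = recs[model_id]
    inputs = [pid for ref in model["mlInputVarRefs"]
              for pid in var_points(recs[ref])]
    outputs = var_points(recs[model["mlOutputVarRef"]])
    return inputs, outputs


def models_predicting(point_id):
    """List ids of all mlModel records whose output resolves to point_id."""
    return [r["id"] for r in recs.values()
            if "mlModel" in r
            and point_id in var_points(recs[r["mlOutputVarRef"]])]


print(model_io("m-1"))
print(models_predicting("p-power"))
```

In this sketch nothing prevents several mlModel records from sharing one output variable, which is how "multiple models for one point" would be expressed.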
Jan Široký Tue 25 Jun 2024
We have successfully used this ML WG proposal in real-world applications over the last few months. We did not find any need for changes.
However, there has been an initiative related to the Synthetic Ontology Proposal due to its overlap with ML WG. The Synthetic Ontology Proposal will be introduced separately, but it is worth noting that it provides an abstraction for three categories of synthetic points:
Simulation (created by a physics-based whole-building sustainability model),
Machine Learning (generated by a machine learning model),
Computed History (generated by simple traditional mathematical calculations from computed history).
Based on a fruitful discussion with the Synthetic Ontology Proposal team (special thanks to Michael Melillo), we found a convenient way to merge the ML WG and Synthetic Ontology Proposal. See the updated proposal below. Here is a brief summary of the changes introduced compared to the ML WG proposal from November 2023:
mlModel is no longer a direct subtype of entity and is no longer mandatory; it now derives from model, a newly introduced entity
mlPrediction was replaced by ml, a subtype of the newly introduced synthetic, which is itself a pointFunction
mlModelRef was replaced by the more general modelRef
def:^model
is:^entity
mandatory
doc:
  Generic model entity definition. This can be a model that exists wholly
  within the application, or the proxy of a model from a remote application.
---
def:^mlModel
is:^model
doc:
  Machine learning model entity representing an overarching container for
  various components, including inputs, outputs, parameters, and metrics.
---
def:^mlInputVarRefs
is:^list
of:^mlVarRef
tagOn:^mlModel
doc:
  List of independent variables, also known as model inputs or features,
  associated with a machine learning model.
---
def:^mlOutputVarRef
is:^ref
of:^mlVar
tagOn:^mlModel
doc:
  Dependent variable, also known as the model output or target,
  associated with a machine learning model.
  Represents the predicted outcome generated by the model.
---
def:^mlIdentificationPeriod
is:^span
tagOn:^mlModel
doc:
  Training period description, known as the identification period
  or baseline, utilized during the model training process.
---
def:^mlModelParameters
is:^dict
ro
tagOn:^mlModel
doc:
  Result of model identification, which may appear as a list of
  model parameters for simpler models or as a reference to a stored model,
  in the form of a file URI. The structure of the dict is user-specific.
---
def:^mlModelMetrics
is:^dict
ro
tagOn:^mlModel
doc:
  Goodness-of-fit metrics provided in the form of a simple dictionary.
  For example: {r2:0.7889, cvrmse:58}.
---
def:^mlVar
is:^entity
mandatory
doc:
  Machine learning variable representing both model inputs and outputs.
---
def:^mlVarPoint
is:^ref
of:^point
tagOn:^mlVar
doc:
  Reference to a point associated with a machine learning variable,
  known as a machine learning variable point.
---
def:^mlVarFilter
is:^filterStr
tagOn:^mlVar
doc:
  Filter used for querying points by tags, providing more flexibility
  than mlVarPoint, although it is not mandatory.
---
def:^mlVarRef
is:^ref
of:^mlVar
doc: Reference to a machine learning variable.
---
def:^modelRef
is:^ref
of:^model
tagOn:^synthetic-point
doc: Some synthetic point referring to the model that generated it.
---
def:^synthetic
is:^pointFunction
tagOn:^point
mandatory
doc: Synthetic point, which can be Sim, Ml, or Ch
---
def:^ml
is:^synthetic
doc:
  Machine learning point. Point is a machine learning based prediction
  or forecast of another point.
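As a rough illustration of the merged proposal, the Python sketch below follows modelRef from synthetic ml points back to their models and compares two models that predict the same point. Everything concrete in it is an assumption made for the example: the record ids, the metric values, the dict representation, and the helper names `predictions_of` and `best_by_r2`. Tagging the prediction points with both synthetic and ml is also an assumption, based on synthetic being marked mandatory.

```python
# Hypothetical records; markers are True, refs are plain id strings.
recs = [
    {"id": "p-power", "point": True},
    {"id": "v-power", "mlVar": True, "mlVarPoint": "p-power"},
    # Two ML models with the same output variable; metric values are made up.
    {"id": "m-lin", "model": True, "mlModel": True,
     "mlOutputVarRef": "v-power",
     "mlModelMetrics": {"r2": 0.71, "cvrmse": 64}},
    {"id": "m-gbm", "model": True, "mlModel": True,
     "mlOutputVarRef": "v-power",
     "mlModelMetrics": {"r2": 0.79, "cvrmse": 58}},
    # Synthetic ML points, each referring to the model that generated it.
    {"id": "p-pred-lin", "point": True, "synthetic": True, "ml": True,
     "modelRef": "m-lin"},
    {"id": "p-pred-gbm", "point": True, "synthetic": True, "ml": True,
     "modelRef": "m-gbm"},
]
by_id = {r["id"]: r for r in recs}


def predictions_of(point_id):
    """Return (prediction point id, model rec) pairs targeting point_id."""
    found = []
    for rec in recs:
        if "ml" not in rec or "modelRef" not in rec:
            continue
        model = by_id[rec["modelRef"]]
        out_var = by_id[model["mlOutputVarRef"]]
        if out_var.get("mlVarPoint") == point_id:
            found.append((rec["id"], model))
    return found


def best_by_r2(point_id):
    """Pick the prediction point whose model reports the highest r2."""
    pairs = predictions_of(point_id)
    if not pairs:
        return None
    return max(pairs, key=lambda pair: pair[1]["mlModelMetrics"]["r2"])[0]


print([pid for pid, _ in predictions_of("p-power")])
print(best_by_r2("p-power"))
```

Because the structure of mlModelMetrics is left to the user, a comparison like `best_by_r2` only works where both models happen to report the same metric key.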
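Sherri Simms Thu 27 Jun 2024
Re: https://www.project-haystack.org/forum/topic/1125 and this topic: https://project-haystack.org/forum/topic/1070#c10
Jan,
These are looking really good! Some things that popped out to me as considerations are: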
Model entity: Using ^scientificModel (https://en.wikipedia.org/wiki/Scientific_modelling) instead of plain ^model (https://en.wikipedia.org/wiki/Model, specifically see section on Model in specific contexts) to eliminate possible confusion of other models such as design, layout, equipment or product models. The last discussion from the Labs WG I recall regarding attributes used ^modelName and ^modelNumber and appears to be on pause, so it may not be a conflict at all, but it may cause confusion for those using the tag if it is plain "model" and the potential for someone mistakenly thinking these three tags are possibly related. One of the things I see a lot when evaluating already tagged data is the misuse of tags, so anything we can do to make it intuitively clearer, the better. Note other names I considered were ^syntheticModel or ^dataModel but after a lot of thought, I think ^scientificModel allows for the most flexibility and clarity.
Algorithms: How do you know how the Machine Learning Model was generated? What if you wanted to have a couple of different ones on the same points and compare them? How do you differentiate between which type of modeling was used (Linear Regression, etc.)? Consider another def for relaying the algorithm used as a tagOn the model.
Baseline Span: A more generic tag such as ^baselineSpan could be a useful def as a way for tracking any baseline data modeling timeframe and be the tagOn ^mlModel (and any other model) instead of specifically ^mlIdentificationPeriod.
Vars: Wondering how useful it may be for potential other purposes to make the ^mlVar entity (and its related proposed defs: ^mlInputVarRefs, ^mlOutputVarRef, etc) be more generic such as ^var, ^varPoint, ^varInputRefs and ^varOutputRefs and then have a ^scientificModelRef on it.
Computed History: Change "Ch" to "computed" in def for ^synthetic doc; this goes in line with topic #1125 and makes more sense.
I am having trouble wrapping my head around this all quickly and being able to provide adequate feedback, but I hope this helps and either way, at least you know your group's hard work is being appreciated!
~Sherri
Jan Široký Mon 8 Jul 2024
Hi Sherri,
Thank you for the fruitful reply. Please see my comments below.
1, Model entity - I agree that model is quite general and can have different contexts in the built environment. Based on the discussion in Synthetic Ontology, it seems we are heading towards syntheticModel. Let's wait for the conclusion from the Synthetic Ontology team.
2, Algorithms - There is certainly a need to know the ML algorithm. However, if you want to include this in the ML WG proposal and expect some interoperability, you may also need to have a dictionary of expected algorithms (Linear regression != linear_regression != OLS, etc.). This is exactly the borderline we do not want to cross in the ML WG proposal since there are countless ML algorithms available. It can also raise questions about ML algorithm parameters, which are even more complex. Therefore, we want to keep this out of the ML WG proposal and let users define their model algorithms and parameters freely.
3, Baseline span - There is also a discussion in the Synthetic Ontology WG on this topic; however, in ML, it has a quite specific meaning. We want to explicitly express that a particular span was used for the identification (training) of the ML model. In such cases, you may focus on excluding time spans with erroneous data measurements, outliers, non-routine events, etc. I see the "base span" as a more general term used in conjunction with the evaluation span. However, in my understanding, the base span does not always have to be 100% equal to the identification span.
4, Vars - That is an interesting idea. It depends on whether other syntheticModels can make use of that, if I understand your point correctly. I see simModel as a model calculated in dedicated software, such as TRNSYS, using building construction parameters rather than data points as inputs. In the case of computedModel, it may be relevant.
5, Computed History - We will use the version of def:^synthetic and other general defs from the final version of Synthetic Ontology Proposal.
Thanks, Jan
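The distinction in point 3 between an identification span and a more general base span can be sketched in Python as follows. This is a minimal sketch under assumptions, not part of the proposal: the sample rows, the span representation (inclusive start, exclusive end), and the helper names are hypothetical, and CV(RMSE) is computed here with a plain n in the denominator and expressed in percent, whereas some guidelines adjust for the number of model parameters.

```python
from datetime import datetime
from math import sqrt


def in_span(ts, span):
    """True if ts falls in the span (inclusive start, exclusive end)."""
    start, end = span
    return start <= ts < end


def identification_rows(rows, ident_span, excluded_spans=()):
    """Keep rows inside the identification span and outside excluded spans."""
    return [(ts, actual, pred) for ts, actual, pred in rows
            if in_span(ts, ident_span)
            and not any(in_span(ts, s) for s in excluded_spans)]


def fit_metrics(rows):
    """Compute r2 and CV(RMSE) in percent; assumes non-constant, non-zero-mean data."""
    actual = [a for _, a, _ in rows]
    pred = [p for _, _, p in rows]
    n = len(actual)
    mean = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    rmse = sqrt(ss_res / n)
    return {"r2": 1 - ss_res / ss_tot, "cvrmse": 100 * rmse / mean}


# Hypothetical rows: (timestamp, measured value, model output).
rows = [
    (datetime(2023, 1, 1), 10.0, 12.0),
    (datetime(2023, 1, 2), 20.0, 18.0),
    (datetime(2023, 1, 3), 500.0, 25.0),  # erroneous measurement, excluded
    (datetime(2023, 1, 4), 30.0, 30.0),
    (datetime(2023, 2, 1), 40.0, 35.0),   # outside the identification span
]
ident_span = (datetime(2023, 1, 1), datetime(2023, 2, 1))
excluded = [(datetime(2023, 1, 3), datetime(2023, 1, 4))]

kept = identification_rows(rows, ident_span, excluded)
metrics = fit_metrics(kept)
print(len(kept), metrics)
```

A dict like `metrics` is the kind of value that could be stored in mlModelMetrics, with `ident_span` corresponding to mlIdentificationPeriod.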
Jan Široký Tue 24 Sep 2024
As announced earlier, we have merged the ML WG proposal into the Synthetic Ontology proposal. This work has recently been finalized; you can see the details in this Git pull request: https://github.com/Project-Haystack/haystack-defs/pull/22.
This concludes the recent work on the ML WG—thanks to all the contributors.