#1070 Working Group for Machine Learning

Jan Široký Thu 8 Jun 2023

Hi, my colleagues and I have been developing various machine learning applications in Haystack-compatible environments (see https://energytwin.io/). We have consistently faced challenges in defining machine learning model tags. Questions have arisen, such as how to link a model with a specific point, whether it is possible to define multiple models for one point, and how to reference model-independent variables, among others.

I would like to propose the creation of a working group to help define tags related to machine learning. We will be happy to share our experiences and ideas with the community and collaborate in the process of defining these machine learning tags.

Who in the community would be interested in joining this working group?

Rick Jennings Thu 8 Jun 2023

Hi Jan,

This sounds like a great initiative and I would be interested in joining this working group.

Thanks for setting this up!

Rick

Georgios Grigoriou Fri 9 Jun 2023

Hi Jan,

Happy to join the cause and contribute!

Georgios

Stephen Frank Fri 9 Jun 2023

I am also interested. We have some custom tagging we developed at NREL for this purpose that we can share.

annie dehghani Fri 9 Jun 2023

I would be happy to contribute what we have developed for this purpose as well. Like Stephen, we have developed some custom tagging. It would be great to define this in a more formal and uniform way.

Also, there was an ask at Haystack Connect to help model simulation data, and it occurs to me that there may be a lot of crossover with this group. Maybe this group is open to modelling "simulation" results more broadly, whether it's physics-based simulation, machine learning, simple regression, etc.?

Adam J Wallen Mon 12 Jun 2023

Hi ML Team,

Craig Stevenson and I presented at Haystack Connect and are interested in the possible synergy between physics-based simulations tagging and machine learning tagging. Thanks for the shout-out, Annie!

Thanks, Adam

Jan Široký Fri 16 Jun 2023

Hi Annie,

This is a good point. We can consider referencing simulation data as well. This referencing principle can be applied to both physics-based models and machine learning models.

Thanks, Jan

Jan Široký Tue 20 Jun 2023

Hi all,

I have sent you the invitation for the kickoff meeting scheduled on August 10th at 11 am ET.

Please feel free to share your ideas or requirements with me prior to the kickoff. I will make an effort to incorporate them into the first draft.

Thanks, Jan

Keith Bishop Tue 20 Jun 2023

Looking forward to working with this group.

Jan Široký Fri 10 Nov 2023

It took us some time to process all the inputs (special thanks to Keith B. and Stephen F.). However, we have successfully formulated the initial proposal.

As deliberated during the ML WG calls, our preference is to commence with a less exhaustive definition, allowing flexibility for the end user (e.g., we refrain from specifying how the identified ML parameters should be stored).

The definition provided below is a preliminary draft; we invite your comments on any aspect, with a particular focus on naming conventions, any missing elements, and potential use cases that may not be addressed.

def:^mlModel
is:^entity
mandatory
doc:
  Machine learning model entity representing an overarching container for 
  various components, including inputs, outputs, parameters, and metrics.
---
def:^mlInputVarRefs
is:^list
of:^mlVarRef
tagOn:^mlModel
doc:
  List of independent variables, also known as model inputs or features,
  associated with a machine learning model.
---
def:^mlOutputVarRef
is:^ref
of:^mlVar
tagOn:^mlModel
doc:
  Dependent variable, also known as the model output or target,
  associated with a machine learning model.
  Represents the predicted outcome generated by the model.
---
def:^mlIdentificationPeriod
is:^span
tagOn:^mlModel
doc:
  Training period description, known as the identification period
  or baseline, utilized during the model training process.
---
def:^mlModelParameters
is:^dict
ro
tagOn:^mlModel
doc:
  Result of model identification, which may appear as a list of
  model parameters for simpler models or as a reference to a stored model,
  in the form of a file uri. The structure of the dict is user-specific.
---
def:^mlModelMetrics
is:^dict
ro
tagOn:^mlModel
doc:
  Goodness-of-fit metrics provided in the form of a simple dictionary.
  For example: {r2:0.7889, cvrmse:58}.
---
def:^mlVar
is:^entity
mandatory
doc:
  Machine learning variable representing both model inputs and outputs.
---
def:^mlVarPoint
is:^ref
of:^point
tagOn:^mlVar
doc:
  Reference to a point associated with a machine learning variable,
  known as a machine learning variable point.
---
def:^mlVarFilter
is:^filterStr
tagOn:^mlVar
doc:
  Filter used for querying points by tags, providing more flexibility
  than mlVarPoint, although it is not mandatory.
---
def:^mlVarRef
is:^ref
of:^mlVar
doc: Reference to a machine learning variable.
---
def:^mlModelRef
is:^ref
of:^mlModel
tagOn:^mlPrediction
doc:
  Applied to a prediction point, referencing the specific
  machine learning model used for generating predictions.
---
def:^mlPrediction
is:^pointFunction
doc: Point is a prediction or forecast of another point. 
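
To make the draft more concrete, here is a minimal Python sketch of how records carrying these tags might reference each other. It is only an illustration: the ids, display names, the second model, and the helper functions are all hypothetical, records are plain dicts rather than data from a real Haystack client, and marker tags are represented as `True`. It also assumes that the output variable's mlVarPoint refers to the measured target point, which the draft does not state explicitly.

```python
# Hypothetical records keyed by id. Marker tags are True, refs are id strings.
RECS = {
    "p-oat": {"point": True, "dis": "Outside air temperature"},
    "p-kw": {"point": True, "dis": "Building power"},
    "v-oat": {"mlVar": True, "mlVarPoint": "p-oat"},
    "v-kw": {"mlVar": True, "mlVarPoint": "p-kw"},
    # Two models share one output variable, so one point can have several models.
    "m-lin": {"mlModel": True, "mlInputVarRefs": ["v-oat"], "mlOutputVarRef": "v-kw",
              "mlModelMetrics": {"r2": 0.7889, "cvrmse": 58}},
    "m-tree": {"mlModel": True, "mlInputVarRefs": ["v-oat"], "mlOutputVarRef": "v-kw"},
    # Prediction point that names the model which generated it.
    "p-kw-pred": {"point": True, "mlPrediction": True, "mlModelRef": "m-lin"},
}

def input_points(model_id, recs=RECS):
    """Resolve mlInputVarRefs -> mlVar -> mlVarPoint for one model."""
    return [recs[var_id]["mlVarPoint"] for var_id in recs[model_id]["mlInputVarRefs"]]

def models_for_point(point_id, recs=RECS):
    """Ids of all mlModel records whose output variable references the point."""
    return sorted(
        rec_id for rec_id, rec in recs.items()
        if rec.get("mlModel") and recs[rec["mlOutputVarRef"]]["mlVarPoint"] == point_id
    )

def model_of_prediction(point_id, recs=RECS):
    """Follow mlModelRef from a prediction point to its model id."""
    return recs[point_id]["mlModelRef"]

print(input_points("m-lin"))
print(models_for_point("p-kw"))
print(model_of_prediction("p-kw-pred"))
```

Under these assumptions, the questions from the first post map onto simple ref lookups: a model is linked to a point through its mlVar records, and several models can target the same point by sharing an output variable.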

Jan Široký Tue 25 Jun

We have successfully used this ML WG proposal in real-world applications over the last few months. We did not find any need for changes.

However, there has been an initiative related to the Synthetic Ontology Proposal due to its overlap with ML WG. The Synthetic Ontology Proposal will be introduced separately, but it is worth noting that it provides an abstraction for three categories of synthetic points:

  • Simulation (created by a physics-based whole-building sustainability model),
  • Machine Learning (generated by a machine learning model),
  • Computed History (generated by simple traditional mathematical calculations from computed history).

Based on a fruitful discussion with the Synthetic Ontology Proposal team (special thanks to Michael Melillo), we found a convenient way to merge the ML WG and Synthetic Ontology Proposal. See the updated proposal below. Here is a brief summary of the changes introduced compared to the ML WG proposal from November 2023:

  • mlModel is no longer a mandatory entity in its own right; it is now a subtype of the newly introduced model entity
  • mlPrediction was replaced by ml, a subtype of the newly introduced synthetic, which is itself a pointFunction
  • mlModelRef was replaced by the more general modelRef

def: ^model
is: ^entity
mandatory
doc: Generic model entity definition. This can be a model that exists wholly 
  within the application, or the proxy of a model from a remote application.
---
def: ^mlModel
is: ^model
doc: Machine learning model entity representing an overarching container for 
  various components, including inputs, outputs, parameters, and metrics. 
---
def:^mlInputVarRefs
is:^list
of:^mlVarRef
tagOn:^mlModel
doc:
  List of independent variables, also known as model inputs or features,
  associated with a machine learning model.
---
def:^mlOutputVarRef
is:^ref
of:^mlVar
tagOn:^mlModel
doc:
  Dependent variable, also known as the model output or target,
  associated with a machine learning model.
  Represents the predicted outcome generated by the model.
---
def:^mlIdentificationPeriod
is:^span
tagOn:^mlModel
doc:
  Training period description, known as the identification period
  or baseline, utilized during the model training process.
---
def:^mlModelParameters
is:^dict
ro
tagOn:^mlModel
doc:
  Result of model identification, which may appear as a list of
  model parameters for simpler models or as a reference to a stored model,
  in the form of a file uri. The structure of the dict is user-specific.
---
def:^mlModelMetrics
is:^dict
ro
tagOn:^mlModel
doc:
  Goodness-of-fit metrics provided in the form of a simple dictionary.
  For example: {r2:0.7889, cvrmse:58}.
---
def:^mlVar
is:^entity
mandatory
doc:
  Machine learning variable representing both model inputs and outputs.
---
def:^mlVarPoint
is:^ref
of:^point
tagOn:^mlVar
doc:
  Reference to a point associated with a machine learning variable,
  known as a machine learning variable point.
---
def:^mlVarFilter
is:^filterStr
tagOn:^mlVar
doc:
  Filter used for querying points by tags, providing more flexibility
  than mlVarPoint, although it is not mandatory.
---
def:^mlVarRef
is:^ref
of:^mlVar
doc: Reference to a machine learning variable.
---
def: ^modelRef
is: ^ref
of: ^model
tagOn: ^synthetic-point
doc: Some synthetic point referring to the model that generated it.
---
def:^synthetic
is: ^pointFunction
tagOn: ^point
mandatory
doc: Synthetic point which can be Sim, Ml, or Ch
---
def:^ml
is: ^synthetic
doc: Machine learning point. Point is a machine learning based prediction or forecast of another point.
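
The subtype relationships introduced by the merge can be sketched in a few lines of Python. This is an illustration only: the `SUPERTYPE` table copies the `is:` lines from the updated proposal, while the function names, record ids, and the dict-based record representation (marker tags as `True`) are hypothetical.

```python
# Supertype of each def, taken from the `is:` lines of the updated proposal.
SUPERTYPE = {
    "model": "entity",
    "mlModel": "model",
    "synthetic": "pointFunction",
    "ml": "synthetic",
}

def fits(tag, base):
    """True if `tag` equals `base` or inherits from it via the `is:` chain."""
    while tag is not None:
        if tag == base:
            return True
        tag = SUPERTYPE.get(tag)
    return False

def is_valid_model_ref(point, recs):
    """Check that a point's modelRef targets a record that is some kind of model."""
    target = recs.get(point.get("modelRef"))
    if target is None:
        return False
    # Only marker tags (stored as True) are considered as candidate types.
    return any(fits(tag, "model") for tag, val in target.items() if val is True)

# Hypothetical records: one ML model and one unrelated site.
recs = {"m1": {"mlModel": True}, "s1": {"site": True}}
print(fits("mlModel", "model"))
print(fits("ml", "pointFunction"))
print(is_valid_model_ref({"point": True, "ml": True, "modelRef": "m1"}, recs))
```

The point of the generalization is visible in `is_valid_model_ref`: because modelRef targets model rather than mlModel, the same check would accept any other model subtype that the Synthetic Ontology Proposal may add.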

Sherri Simms Thu 27 Jun

Re: https://www.project-haystack.org/forum/topic/1125 and this topic: https://project-haystack.org/forum/topic/1070#c10

Jan,

These are looking really good! Some things that popped out to me as considerations are:

  1. Model entity: Consider using ^scientificModel (https://en.wikipedia.org/wiki/Scientific_modelling) instead of plain ^model (https://en.wikipedia.org/wiki/Model, specifically see the section on Model in specific contexts) to avoid confusion with other kinds of models, such as design, layout, equipment, or product models. The last Labs WG discussion on attributes that I recall used ^modelName and ^modelNumber and appears to be on pause, so there may be no conflict at all. Still, a plain "model" tag could confuse users, who might mistakenly assume these three tags are related. One of the things I see a lot when evaluating already tagged data is the misuse of tags, so the more intuitively clear we can make it, the better. Other names I considered were ^syntheticModel and ^dataModel, but after a lot of thought, I think ^scientificModel allows for the most flexibility and clarity.
  2. Algorithms: How do you know how the Machine Learning Model was generated? What if you wanted to have a couple different ones on the same points and compare them? How do you differentiate between which type of modeling was used (Linear Regression, etc)? Consider another def for relaying the algorithm used as a tagOn the model.
  3. Baseline Span: A more generic tag such as ^baselineSpan could be a useful def as a way for tracking any baseline data modeling timeframe and be the tagOn ^mlModel (and any other model) instead of specifically ^mlIdentificationPeriod.
  4. Vars: Wondering how useful it may be for potential other purposes to make the ^mlVar entity (and its related proposed defs: ^mlInputVarRefs, ^mlOutputVarRef, etc) be more generic such as ^var, ^varPoint, ^varInputRefs and ^varOutputRefs and then have a ^scientificModelRef on it.
  5. Computed History: Change "Ch" to "computed" in def for ^synthetic doc; this goes in line with topic #1125 and makes more sense.

I am having trouble wrapping my head around this all quickly and being able to provide adequate feedback, but I hope this helps and either way, at least you know your group's hard work is being appreciated!

~Sherri

Jan Široký Mon 8 Jul

Hi Sherri,

Thank you for the fruitful reply. Please see my comments below.

1, Model entity - I agree that model is quite general and can have different contexts in the built environment. Based on the discussion in Synthetic Ontology, it seems we are heading towards syntheticModel. Let's wait for the conclusion from the Synthetic Ontology team.

2, Algorithms - There is certainly a need to know the ML algorithm. However, if you want to include this in the ML WG proposal and expect some interoperability, you may also need to have a dictionary of expected algorithms (Linear regression != linear_regression != OLS, etc.). This is exactly the borderline we do not want to cross in the ML WG proposal since there are countless ML algorithms available. It can also raise questions about ML algorithm parameters, which are even more complex. Therefore, we want to keep this out of the ML WG proposal and let users define their model algorithms and parameters freely.

3, Baseline span - There is also a discussion in the Synthetic Ontology WG on this topic; however, in ML, it has a quite specific meaning. We want to explicitly express that a particular span was used for the identification (training) of the ML model. In such cases, you may focus on excluding time spans with erroneous data measurements, outliers, non-routine events, etc. I see the "base span" as a more general term used in conjunction with the evaluation span. However, in my understanding, the base span does not always have to be 100% equal to the identification span.

4, Vars - That is an interesting idea. It depends on whether other syntheticModels can make use of that, if I understand your point correctly. I see simModel as a model calculated in dedicated software, such as TRNSYS, using building construction parameters rather than data points as inputs. In the case of computedModel, it may be relevant.

5, Computed History - We will use the version of def:^synthetic and other general defs from the final version of Synthetic Ontology Proposal.
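
To illustrate the interoperability concern in point 2, here is a small Python sketch. The `normalize` helper and the `ALIASES` table are hypothetical and not part of any proposal; they only show that cleaning up free-text labels does not make synonyms match unless someone maintains a shared vocabulary.

```python
def normalize(name):
    """Naive normalization: trim, lowercase, and unify separators."""
    return name.strip().lower().replace("_", " ").replace("-", " ")

# The three labels from point 2, as different users might write them.
labels = ["Linear regression", "linear_regression", "OLS"]
normalized = {normalize(label) for label in labels}

# Normalization merges the first two labels, but "ols" stays separate.
print(normalized)

# Equating synonyms requires an agreed alias table (hypothetical here),
# which is the dictionary of expected algorithms mentioned above.
ALIASES = {"ols": "linear regression"}
canonical = {ALIASES.get(name, name) for name in normalized}
print(canonical)
```

Every new algorithm name or abbreviation would need another entry in such a table, which is why the ML WG proposal leaves algorithm naming to the user.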

Thanks, Jan
