Provenance Tracking

In brief

Provenance tracking is the practice of recording “information that describes the production process of an end product, which can be anything from a piece of data to a physical object. […] Essentially, provenance can be seen as meta-data that, instead of describing data, describes a production process.” [1]

More in Detail

“Traceability in AI shares part of its scope with general-purpose recommendations for provenance …” [2]. Provenance, in turn, is “any information that describes the production process of an end product, which can be anything from a piece of data to a physical object” [1].

Keeping track of provenance is vital in various settings. It helps increase trust in the systems produced. According to Gil et al. [3], recording the provenance of scientific results, i.e., “how results were derived, what parameters influenced the derivation, what datasets were used as input to the experiment, etc.”, facilitates the reproducibility of the whole process.

Data and workflows deployed in an AI system are two key ingredients in traceability and provenance tracking. For workflows, the literature distinguishes between prospective and retrospective provenance. Prospective provenance models workflows in an abstract and informative way, as templates composed of tasks that can be instantiated, modified, and combined. Retrospective provenance models past workflow executions, recording which tasks were executed and how data or other artifacts were derived [4].

Herschel et al. [1] claim that “provenance can be seen as metadata that, instead of describing data, describes a production process”. Considering the central role metadata play in tracking provenance, this section discusses some popular models for tracking it. In particular, Garijo et al. [5] have provided a holistic, Linked Data compliant, and ready-to-use solution to document workflow specifications and their executions, which exploits PROV-O [6], P-PLAN [7], and the Open Provenance Model for Workflows (OPMW; see footnote 1).
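The distinction between prospective and retrospective provenance can be illustrated with a minimal Python sketch. It is not taken from any of the cited models: the class names `TemplateStep` and `ExecutedStep`, the `conforms` helper, and the file names are all hypothetical and only serve to show a template on one side and a recorded run on the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateStep:
    """Prospective provenance: an abstract task with named input/output slots."""
    name: str
    inputs: tuple
    outputs: tuple


@dataclass
class ExecutedStep:
    """Retrospective provenance: one concrete run of a template step."""
    template: TemplateStep
    used: dict        # slot name -> concrete artifact that was read
    generated: dict   # slot name -> concrete artifact that was written


def conforms(run: ExecutedStep) -> bool:
    """Check that a run bound exactly the slots declared by its template."""
    return (set(run.used) == set(run.template.inputs)
            and set(run.generated) == set(run.template.outputs))


# Hypothetical training step (the template) and one recorded execution of it.
train = TemplateStep("train_model",
                     inputs=("dataset", "learning_rate"),
                     outputs=("model",))
run = ExecutedStep(
    template=train,
    used={"dataset": "data/v3.csv", "learning_rate": 0.01},
    generated={"model": "models/run-17.bin"},
)
```

Here `train` can be reused for many runs, while each `ExecutedStep` records what one particular run actually read and wrote; `conforms(run)` is one simple way to relate the two views.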

PROV is a metadata model defined as a W3C recommendation. It captures provenance by documenting the entities, agents, and actions involved in a production chain, and the relations among them (e.g., attribution and usage). PROV acknowledges the need to represent workflows, also called plans, by including a construct such as prov:Plan. However, “it does not elaborate any further on how plans can be described or related to other provenance elements of the execution.” [7].
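As a rough illustration of how such statements can be used, the sketch below writes a few PROV statements as plain (subject, property, object) tuples and walks them. The `prov:` terms are PROV-O classes and properties; every `ex:` identifier and both helper functions are assumptions made up for this example, and a real deployment would more likely rely on an RDF library than on Python lists.

```python
# PROV statements written as (subject, property, object) triples.
# The prov: terms come from PROV-O; every ex: identifier is a made-up example.
triples = [
    ("ex:dataset", "rdf:type", "prov:Entity"),
    ("ex:config", "rdf:type", "prov:Entity"),
    ("ex:model", "rdf:type", "prov:Entity"),
    ("ex:training", "rdf:type", "prov:Activity"),
    ("ex:alice", "rdf:type", "prov:Agent"),
    ("ex:training", "prov:used", "ex:dataset"),            # usage
    ("ex:training", "prov:used", "ex:config"),
    ("ex:model", "prov:wasGeneratedBy", "ex:training"),    # generation
    ("ex:training", "prov:wasAssociatedWith", "ex:alice"),
    ("ex:model", "prov:wasAttributedTo", "ex:alice"),      # attribution
]


def inputs_of(entity, triples):
    """Return the entities used by the activities that generated `entity`."""
    activities = {o for s, p, o in triples
                  if s == entity and p == "prov:wasGeneratedBy"}
    return {o for s, p, o in triples
            if s in activities and p == "prov:used"}


def attributed_to(entity, triples):
    """Return the agents that `entity` is attributed to."""
    return {o for s, p, o in triples
            if s == entity and p == "prov:wasAttributedTo"}
```

Calling `inputs_of("ex:model", triples)` follows one generation link and then the usage links of that activity, which is the kind of question ("what was this result derived from?") that provenance records are meant to answer.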

The P-PLAN vocabulary extends PROV-O by introducing constructs for plans (p-plan:Plan, a subclass of prov:Plan), their steps (p-plan:Step), and their input and output variables (p-plan:Variable). Still, P-PLAN does not model a full-fledged notion of workflow.
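The sketch below suggests how a plan described with P-PLAN terms already carries enough information to derive an order among its steps: a step that consumes a variable must come after the step that produces it. The `p-plan:` properties are from the P-PLAN vocabulary; the `ex:` identifiers and the `step_order` function are hypothetical, and the code assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# A two-step plan written as triples; all ex: identifiers are made up.
plan = [
    ("ex:clean", "p-plan:isStepOfPlan", "ex:plan"),
    ("ex:train", "p-plan:isStepOfPlan", "ex:plan"),
    ("ex:clean", "p-plan:hasInputVar", "ex:raw"),
    ("ex:cleaned", "p-plan:isOutputVarOf", "ex:clean"),
    ("ex:train", "p-plan:hasInputVar", "ex:cleaned"),
    ("ex:model", "p-plan:isOutputVarOf", "ex:train"),
]


def step_order(triples):
    """Order the steps so that producers of a variable precede its consumers."""
    # variable -> step that produces it
    producer = {s: o for s, p, o in triples if p == "p-plan:isOutputVarOf"}
    # step -> set of steps it depends on
    deps = {s: set() for s, p, o in triples if p == "p-plan:isStepOfPlan"}
    for s, p, o in triples:
        if p == "p-plan:hasInputVar" and o in producer:
            deps.setdefault(s, set()).add(producer[o])
    return list(TopologicalSorter(deps).static_order())
```

Variables such as `ex:raw` that no step produces are treated as external inputs and impose no ordering constraint.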

OPMW extends P-PLAN and the “Open Provenance Model (OPM), a legacy provenance model developed by the workflow community that was used as a reference to create PROV” [5]. OPMW distinguishes between workflow specifications, namely templates, and their workflow execution traces.

OPMW specifies workflow templates as instances of the class opmw:WorkflowTemplate (a subclass of p-plan:Plan), the template processes or actions as instances of opmw:WorkflowTemplateProcess (a subclass of p-plan:Step), and the template artifacts manipulated or produced by those processes as instances of opmw:WorkflowTemplateArtifact (a subclass of p-plan:Variable). Accordingly, the template for the generic n-th step is an instance of opmw:WorkflowTemplateProcess. Its inputs and outputs are indicated by the properties p-plan:hasInputVar and p-plan:isOutputVarOf and are instances of opmw:WorkflowTemplateArtifact, representing any expected file, parameter, or collection of documents considered and manipulated by the template step.

The classes opmw:WorkflowExecutionAccount, opmw:WorkflowExecutionProcess, and opmw:WorkflowExecutionArtifact represent the execution counterparts of the template classes. The properties opmw:correspondsToTemplate, opmw:correspondsToTemplateProcess, and opmw:correspondsToTemplateArtifact bind each execution element to its template counterpart. Thus, the n-th execution step is the actual execution of the n-th template step; it is an instance of opmw:WorkflowExecutionProcess, which is a specialization of the class prov:Activity. Its inputs and outputs are indicated by the PROV properties prov:used and prov:wasGeneratedBy and are instances of opmw:WorkflowExecutionArtifact, which is a particular kind of prov:Entity. Albertoni et al. [8] provide examples of the use of the above metadata models when documenting scientific experiments.
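The binding between an executed step and its template can be sketched as a small function that emits the corresponding statements. The `opmw:` and `prov:` terms are the ones named above; the function `record_execution`, its parameters, and all `ex:` identifiers are hypothetical and only illustrate the pattern.

```python
def record_execution(run_id, template_step, used, generated):
    """Emit triples describing one executed step and its link to the template.

    `used` and `generated` map each execution artifact to the template
    artifact it corresponds to.
    """
    triples = [
        (run_id, "rdf:type", "opmw:WorkflowExecutionProcess"),
        (run_id, "opmw:correspondsToTemplateProcess", template_step),
    ]
    for artifact, template_artifact in used.items():
        triples += [
            (artifact, "rdf:type", "opmw:WorkflowExecutionArtifact"),
            (artifact, "opmw:correspondsToTemplateArtifact", template_artifact),
            (run_id, "prov:used", artifact),
        ]
    for artifact, template_artifact in generated.items():
        triples += [
            (artifact, "rdf:type", "opmw:WorkflowExecutionArtifact"),
            (artifact, "opmw:correspondsToTemplateArtifact", template_artifact),
            (artifact, "prov:wasGeneratedBy", run_id),
        ]
    return triples


# One made-up run of a made-up template step.
trace = record_execution(
    run_id="ex:train_run17",
    template_step="ex:train",
    used={"ex:data_v3": "ex:dataset"},
    generated={"ex:model_17": "ex:model"},
)
```

Note the direction of the two PROV properties: `prov:used` points from the process to its input, whereas `prov:wasGeneratedBy` points from the output back to the process.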

Although not specific to AI experiments and systems, the models mentioned above offer a solid foundation and a backbone for describing data, actors, and other kinds of entities, and how these relate to one another in experiments. This foundation needs to be refined and extended to capture the gist of specific AI experiments. AI-related controlled terminologies might be required, for example, to complement the backbone with the hyper-parameters, tasks, and metrics of AI techniques. Adopting a backbone that is defined according to linked data best practices makes it possible to combine different models and terminologies as needed, easing the tailoring of the backbone with the required AI-specific and community-governed refinements.
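A minimal sketch of such a refinement is given below. The backbone statements use OPMW and PROV terms, while the `ai:` properties (`ai:hasHyperparameter`, `ai:hasMetric`, `ai:name`, `ai:value`) are purely hypothetical placeholders for whatever AI-specific terminology a community might agree on; the identifiers and numeric values are invented for illustration and do not come from any real experiment.

```python
# Backbone statements use OPMW/PROV terms; the ai: vocabulary is hypothetical.
backbone = [
    ("ex:run17", "rdf:type", "opmw:WorkflowExecutionProcess"),
    ("ex:model17", "prov:wasGeneratedBy", "ex:run17"),
]
ai_extension = [
    ("ex:run17", "ai:hasHyperparameter", "ex:lr17"),
    ("ex:lr17", "ai:name", "learning_rate"),
    ("ex:lr17", "ai:value", 0.01),          # illustrative value
    ("ex:model17", "ai:hasMetric", "ex:acc17"),
    ("ex:acc17", "ai:name", "accuracy"),
    ("ex:acc17", "ai:value", 0.93),         # illustrative value
]
# Because both parts are plain triples, combining them is a concatenation.
graph = backbone + ai_extension


def describe(node, link, graph):
    """Collect {name: value} for the nodes attached to `node` through `link`."""
    result = {}
    for s, p, o in graph:
        if s == node and p == link:
            props = {p2: o2 for s2, p2, o2 in graph if s2 == o}
            result[props["ai:name"]] = props["ai:value"]
    return result
```

The backbone statements are left untouched; the AI-specific layer only adds statements about the same identifiers, which is what makes this kind of extension easy to combine or replace.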

Bibliography

[1] Melanie Herschel, Ralf Diestelkämper, and Houssem Ben Lahmar. A survey on provenance: What for? What form? What from? VLDB Journal, 26(6):881–906, 2017. doi:10.1007/s00778-017-0486-1.

[2] Marçal Mora-Cantallops, Salvador Sánchez-Alonso, Elena García-Barriocanal, and Miguel-Angel Sicilia. Traceability for trustworthy AI: a review of models and tools. Big Data and Cognitive Computing, 2021. URL: https://www.mdpi.com/2504-2289/5/2/20, doi:10.3390/bdcc5020020.

[3] Yolanda Gil, Ewa Deelman, Mark H. Ellisman, Thomas Fahringer, Geoffrey C. Fox, Dennis Gannon, Carole A. Goble, Miron Livny, Luc Moreau, and Jim Myers. Examining the challenges of scientific workflows. IEEE Computer, 40(12):24–32, 2007. URL: https://doi.org/10.1109/MC.2007.421, doi:10.1109/MC.2007.421.

[4] Chunhyeok Lim, Shiyong Lu, Artem Chebotko, and Farshad Fotouhi. Prospective and retrospective provenance collection in scientific workflow environments. In IEEE SCC, 449–456. IEEE Computer Society, 2010. URL: http://dblp.uni-trier.de/db/conf/IEEEscc/scc2010.html#LimLCF10.

[5] Daniel Garijo, Yolanda Gil, and Óscar Corcho. Abstract, link, publish, exploit: an end to end framework for workflow sharing. Future Generation Comp. Syst., 75:271–283, 2017. doi:10.1016/j.future.2017.01.008.

[6] Deborah McGuinness, Timothy Lebo, and Satya Sahoo. PROV-O: the PROV ontology. W3C Recommendation, W3C, April 2013. URL: http://www.w3.org/TR/2013/REC-prov-o-20130430/.

[7] Daniel Garijo and Yolanda Gil. Augmenting PROV with Plans in P-PLAN: scientific processes as linked data. In Proceedings of the 2nd International Workshop on Linked Science, volume 951 of CEUR Workshop Proceedings. 2012. URL: http://oa.upm.es/19478/.

[8] Riccardo Albertoni, Monica De Martino, and Alfonso Quarati. Documenting context-based quality assessment of controlled vocabularies. IEEE Trans. Emerg. Top. Comput., 9(1):144–160, 2021. URL: https://doi.org/10.1109/TETC.2018.2865094, doi:10.1109/TETC.2018.2865094.

This entry was written by Riccardo Albertoni.

Footnote 1: http://www.opmw.org/model/OPMW/