Internet Engineering Task Force (IETF)                         B. Claise
Request for Comments: 9417                                   J. Quilbeuf
Category: Informational                                           Huawei
ISSN: 2070-1721                                                 D. Lopez
                                                         Telefonica I+D
                                                               D. Voyer
                                                            Bell Canada
                                                            T. Arumugam
                                                             Consultant
                                                              July 2023


      Service Assurance for Intent-Based Networking Architecture

Abstract

  This document describes an architecture that provides some assurance
  that service instances are running as expected.  As services rely
  upon multiple subservices provided by a variety of elements,
  including the underlying network devices and functions, getting the
  assurance of a healthy service is only possible with a holistic view
  of all involved elements.  This architecture not only helps to
  correlate the service degradation with symptoms of a specific network
  component but, it also lists the services impacted by the failure or
  degradation of a specific network component.

Status of This Memo

  This document is not an Internet Standards Track specification; it is
  published for informational purposes.

  This document is a product of the Internet Engineering Task Force
  (IETF).  It represents the consensus of the IETF community.  It has
  received public review and has been approved for publication by the
  Internet Engineering Steering Group (IESG).  Not all documents
  approved by the IESG are candidates for any level of Internet
  Standard; see Section 2 of RFC 7841.

  Information about the current status of this document, any errata,
  and how to provide feedback on it may be obtained at
  https://www.rfc-editor.org/info/rfc9417.

Copyright Notice

  Copyright (c) 2023 IETF Trust and the persons identified as the
  document authors.  All rights reserved.

  This document is subject to BCP 78 and the IETF Trust's Legal
  Provisions Relating to IETF Documents
  (https://trustee.ietf.org/license-info) in effect on the date of
  publication of this document.  Please review these documents
  carefully, as they describe your rights and restrictions with respect
  to this document.  Code Components extracted from this document must
  include Revised BSD License text as described in Section 4.e of the
  Trust Legal Provisions and are provided without warranty as described
  in the Revised BSD License.

Table of Contents

  1.  Introduction
  2.  Terminology
  3.  A Functional Architecture
    3.1.  Translating a Service Instance Configuration into an
          Assurance Graph
      3.1.1.  Circular Dependencies
    3.2.  Intent and Assurance Graph
    3.3.  Subservices
    3.4.  Building the Expression Graph from the Assurance Graph
    3.5.  Open Interfaces with YANG Modules
    3.6.  Handling Maintenance Windows
    3.7.  Flexible Functional Architecture
    3.8.  Time Window for Symptoms' History
    3.9.  New Assurance Graph Generation
  4.  IANA Considerations
  5.  Security Considerations
  6.  References
    6.1.  Normative References
    6.2.  Informative References
  Acknowledgements
  Contributors
  Authors' Addresses

1.  Introduction

  Network Service YANG Modules [RFC8199] describe the configuration,
  state data, operations, and notifications of abstract representations
  of services implemented on one or multiple network elements.

  Service orchestrators use Network Service YANG Modules that will
  infer network-wide configuration and, therefore, the invocation of
  the appropriate device modules (Section 3 of [RFC8969]).  Knowing
  that a configuration is applied doesn't imply that the provisioned
  service instance is up and running as expected.  For instance, the
  service might be degraded because of a failure in the network, the
  service quality may be degraded, or a service function may be
  reachable at the IP level but does not provide its intended function.
  Thus, the network operator must monitor the service's operational
  data at the same time as the configuration (Section 3.3 of
  [RFC8969]).  To fuel that task, the industry has been standardizing
  on telemetry to push network element performance information (e.g.,
  [RFC9375]).

  A network administrator needs to monitor its network and services as
  a whole, independently of the management protocols.  With different
  protocols come different data models and different ways to model the
  same type of information.  When network administrators deal with
  multiple management protocols, the network management entities have
  to perform the difficult and time-consuming job of mapping data
  models, e.g., the model used for configuration with the model used
  for monitoring when separate models or protocols are used.  This
  problem is compounded by a large, disparate set of data sources
  (e.g., MIB modules, YANG data models [RFC7950], IP Flow Information
  Export (IPFIX) information elements [RFC7011], syslog plain text
  [RFC5424], Terminal Access Controller Access-Control System Plus
  (TACACS+) [RFC8907], RADIUS [RFC2865], etc.).  In order to avoid this
  data model mapping, the industry converged on model-driven telemetry
  to stream the service operational data, reusing the YANG data models
  used for configuration.  Model-driven telemetry greatly facilitates
  the notion of closed-loop automation, whereby events and updated
  operational states streamed from the network drive remediation change
  back into the network.

  However, it proves difficult for network operators to correlate the
  service degradation with the network root cause, for example, "Why
  does my layer 3 virtual private network (L3VPN) fail to connect?" or
  "Why is this specific service not highly responsive?"  The reverse,
  i.e., which services are impacted when a network component fails or
  degrades, is also important for operators, for example, "Which
  services are impacted when this specific optic decibel milliwatt
  (dBm) begins to degrade?", "Which applications are impacted by an
  imbalance in this Equal-Cost Multipath (ECMP) bundle?", or "Is that
  issue actually impacting any other customers?"  This task usually
  falls under the so-called "Service Impact Analysis" functional block.

  This document defines an architecture implementing Service Assurance
  for Intent-based Networking (SAIN).  Intent-based approaches are
  often declarative, starting from a statement of "The service works as
  expected" and trying to enforce it.  However, some already-defined
  services might have been designed using a different approach.
  Aligned with Section 3.3 of [RFC7149], and instead of requiring a
  declarative intent as a starting point, this architecture focuses on
  already-defined services and tries to infer the meaning of "The
  service works as expected".  To do so, the architecture works from an
  assurance graph, deduced from the configuration pushed to the device
  for enabling the service instance.  If the SAIN orchestrator supports
  it, the service model (Section 2 of [RFC8309]) or the network model
  (Section 2.1 of [RFC8969]) can also be used to build the assurance
  graph.  In that case and if the service model includes the
  declarative intent as well, the SAIN orchestrator can rely on the
  declared intent instead of inferring it.  The assurance graph may
  also be explicitly completed to add an intent not exposed in the
  service model itself.

  The assurance graph of a service instance is decomposed into
  components, which are then assured independently.  The top of the
  assurance graph represents the service instance to assure, and its
  children represent components identified as its direct dependencies;
  each component can have dependencies as well.  Components involved in
  the assurance graph of a service are called subservices.  The SAIN
  orchestrator updates the assurance graph automatically when the
  service instance is modified.

  When a service is degraded, the SAIN architecture will highlight
  where in the assurance service graph to look, as opposed to going hop
  by hop to troubleshoot the issue.  More precisely, the SAIN
  architecture will associate to each service instance a list of
  symptoms originating from specific subservices, corresponding to
  components of the network.  These components are good candidates for
  explaining the source of a service degradation.  Not only can this
  architecture help to correlate service degradation with network root
  cause/symptoms, but it can deduce from the assurance graph the list
  of service instances impacted by a component degradation/failure.
  This added value informs the operational team where to focus its
  attention for maximum return.  Indeed, the operational team is likely
  to focus their priority on the degrading/failing components impacting
  the highest number of their customers, especially the ones with the
  Service-Level Agreement (SLA) contracts involving penalties in case
  of failure.

  This architecture provides the building blocks to assure both
  physical and virtual entities and is flexible with respect to
  services and subservices of (distributed) graphs and components
  (Section 3.7).

  The architecture presented in this document is implemented by a set
  of YANG modules defined in a companion document [RFC9418].  These
  YANG modules properly define the interfaces between the various
  components of the architecture to foster interoperability.

2.  Terminology

  SAIN agent:  A functional component that communicates with a device,
     a set of devices, or another agent to build an expression graph
     from a received assurance graph and perform the corresponding
     computation of the health status and symptoms.  A SAIN agent might
     be running directly on the device it monitors.

  Assurance case:  "An assurance case is a structured argument,
     supported by evidence, intended to justify that a system is
     acceptably assured relative to a concern (such as safety or
     security) in the intended operating environment" [Piovesan2017].

  Service instance:  A specific instance of a service.

  Intent:  "A set of operational goals (that a network should meet) and
     outcomes (that a network is supposed to deliver) defined in a
     declarative manner without specifying how to achieve or implement
     them" [RFC9315].

  Subservice:  A part or functionality of the network system that can
     be independently assured as a single entity in an assurance graph.

  Assurance graph:  A Directed Acyclic Graph (DAG) representing the
     assurance case for one or several service instances.  The nodes
     (also known as vertices in the context of DAG) are the service
     instances themselves and the subservices; the edges indicate a
     dependency relation.

  SAIN collector:  A functional component that fetches or receives the
     computer-consumable output of the SAIN agent(s) and processes it
     locally (including displaying it in a user-friendly form).

  DAG:  Directed Acyclic Graph.

  ECMP:  Equal-Cost Multipath.

  Expression graph:  A generic term for a DAG representing a
     computation in SAIN.  More specific terms are listed below:

     Subservice expressions:
        An expression graph representing all the computations to
        execute for a subservice.

     Service expressions:
        An expression graph representing all the computations to
        execute for a service instance, i.e., including the
        computations for all dependent subservices.

     Global computation graph:
        An expression graph representing all the computations to
        execute for all services instances (i.e., all computations
        performed).

  Dependency:  The directed relationship between subservice instances
     in the assurance graph.

  Metric:  A piece of information retrieved from the network running
     the assured service.

  Metric engine:  A functional component, part of the SAIN agent, that
     maps metrics to a list of candidate metric implementations,
     depending on the network element.

  Metric implementation:  The actual way of retrieving a metric from a
     network element.

  Network Service YANG Module:  The characteristics of a service, as
     agreed upon with consumers of that service [RFC8199].

  Service orchestrator:  "Network Service YANG Modules describe the
     characteristics of a service, as agreed upon with consumers of
     that service.  That is, a service module does not expose the
     detailed configuration parameters of all participating network
     elements and features but describes an abstract model that allows
     instances of the service to be decomposed into instance data
     according to the Network Element YANG Modules of the participating
     network elements.  The service-to-element decomposition is a
     separate process; the details depend on how the network operator
     chooses to realize the service.  For the purpose of this document,
     the term "orchestrator" is used to describe a system implementing
     such a process" [RFC8199].

  SAIN orchestrator:  A functional component that is in charge of
     fetching the configuration specific to each service instance and
     converting it into an assurance graph.

  Health status:  The score and symptoms indicating whether a service
     instance or a subservice is "healthy".  A non-maximal score must
     always be explained by one or more symptoms.

  Health score:  An integer ranging from 0 to 100 that indicates the
     health of a subservice.  A score of 0 means that the subservice is
     broken, a score of 100 means that the subservice in question is
     operating as expected, and the special value -1 can be used to
     specify that no value could be computed for that health score, for
     instance, if some metric needed for that computation could not be
     collected.

  Strongly connected component:  A subset of a directed graph such that
     there is a (directed) path from any node of the subset to any
     other node.  A DAG does not contain any strongly connected
     component.

  Symptom:  A reason explaining why a service instance or a subservice
     is not completely healthy.

3.  A Functional Architecture

  The goal of SAIN is to assure that service instances are operating as
  expected (i.e., the observed service is matching the expected
  service) and, if not, to pinpoint what is wrong.  More precisely,
  SAIN computes a score for each service instance and outputs symptoms
  explaining that score.  The only valid situation where no symptoms
  are returned is when the score is maximal, indicating that no issues
  were detected for that service instance.  The score augmented with
  the symptoms is called the health status.  The exact meaning of the
  health score value is out of scope of this document.  However, the
  following constraints should be followed: the higher the score, the
  better the service health is and the two extrema are 0 meaning the
  service is completely broken, and 100 meaning the service is
  completely operational.

  The SAIN architecture is a generic architecture, which generates an
  assurance graph from service instance(s), as specified in
  Section 3.1.  This architecture is applicable to not only multiple
  environments (e.g., wireline and wireless) but also different domains
  (e.g., 5G network function virtualization (NFV) domain with a virtual
  infrastructure manager (VIM), etc.) and, as already noted, for
  physical or virtual devices, as well as virtual functions.  Thanks to
  the distributed graph design principle, graphs from different
  environments and orchestrators can be combined to obtain the graph of
  a service instance that spans over multiple domains.

  As an example of a service, let us consider a point-to-point layer 2
  virtual private network (L2VPN).  [RFC8466] specifies the parameters
  for such a service.  Examples of symptoms might be symptoms reported
  by specific subservices, including "Interface has high error rate",
  "Interface flapping", or "Device almost out of memory", as well as
  symptoms more specific to the service (such as "Site disconnected
  from VPN").

  To compute the health status of an instance of such a service, the
  service definition is decomposed into an assurance graph formed by
  subservices linked through dependencies.  Each subservice is then
  turned into an expression graph that details how to fetch metrics
  from the devices and compute the health status of the subservice.
  The subservice expressions are combined according to the dependencies
  between the subservices in order to obtain the expression graph that
  computes the health status of the service instance.

  The overall SAIN architecture is presented in Figure 1.  Based on the
  service configuration provided by the service orchestrator, the SAIN
  orchestrator decomposes the assurance graph.  It then sends to the
  SAIN agents the assurance graph along with some other configuration
  options.  The SAIN agents are responsible for building the expression
  graph and computing the health statuses in a distributed manner.  The
  collector is in charge of collecting and displaying the current
  inferred health status of the service instances and subservices.  The
  collector also detects changes in the assurance graph structures
  (e.g., an occurrence of a switchover from primary to backup path) and
  forwards the information to the orchestrator, which reconfigures the
  agents.  Finally, the automation loop is closed by having the SAIN
  collector provide feedback to the network/service orchestrator.

  In order to make agents, orchestrators, and collectors from different
  vendors interoperable, their interface is defined as a YANG module in
  a companion document [RFC9418].  In Figure 1, the communications that
  are normalized by this YANG module are tagged with a "Y".  The use of
  this YANG module is further explained in Section 3.5.

       +-----------------+
       | Service         |
       | Orchestrator    |<----------------------+
       |                 |                       |
       +-----------------+                       |
          |            ^                         |
          |            | Network                 |
          |            | Service                 | Feedback
          |            | Instance                | Loop
          |            | Configuration           |
          |            |                         |
          |            V                         |
          |        +-----------------+  Graph  +-------------------+
          |        | SAIN            | Updates | SAIN              |
          |        | Orchestrator    |<--------| Collector         |
          |        +-----------------+         +-------------------+
          |            |                          ^
          |           Y| Configuration            | Health Status
          |            | (Assurance Graph)       Y| (Score + Symptoms)
          |            V                          | Streamed
          |     +-------------------+             | via Telemetry
          |     |+-------------------+            |
          |     ||+-------------------+           |
          |     +|| SAIN              |-----------+
          |      +| Agent             |
          |       +-------------------+
          |               ^ ^ ^
          |               | | |
          |               | | |  Metric Collection
          V               V V V
      +-------------------------------------------------------------+
      |           (Network) System                                  |
      |                                                             |
      +-------------------------------------------------------------+

                       Figure 1: SAIN Architecture

  In order to produce the score assigned to a service instance, the
  various involved components perform the following tasks:

  *  Analyze the configuration pushed to the network device(s) for
     configuring the service instance.  From there, determine which
     information (called a metric) must be collected from the device(s)
     and which operations to apply to the metrics to compute the health
     status.

  *  Stream (via telemetry, such as YANG-Push [RFC8641]) operational
     and config metric values when possible, else continuously poll.

  *  Continuously compute the health status of the service instances
     based on the metric values.

  The SAIN architecture requires time synchronization, with the Network
  Time Protocol (NTP) [RFC5905] as a candidate, between all elements:
  monitored entities, SAIN agents, service orchestrator, the SAIN
  collector, as well as the SAIN orchestrator.  This guarantees the
  correlations of all symptoms in the system, correlated with the right
  assurance graph version.

3.1.  Translating a Service Instance Configuration into an Assurance
     Graph

  In order to structure the assurance of a service instance, the SAIN
  orchestrator decomposes the service instance into so-called
  subservice instances.  Each subservice instance focuses on a specific
  feature or subpart of the service.

  The decomposition into subservices is an important function of the
  architecture for the following reasons:

  *  The result of this decomposition provides a relational picture of
     a service instance, which can be represented as a graph (called an
     assurance graph) to the operator.

  *  Subservices provide a scope for particular expertise and thereby
     enable contribution from external experts.  For instance, the
     subservice dealing with the optic's health should be reviewed and
     extended by an expert in optical interfaces.

  *  Subservices that are common to several service instances are
     reused for reducing the amount of computation needed.  For
     instance, the subservice assuring a given interface is reused by
     any service instance relying on that interface.

  The assurance graph of a service instance is a DAG representing the
  structure of the assurance case for the service instance.  The nodes
  of this graph are service instances or subservice instances.  Each
  edge of this graph indicates a dependency between the two nodes at
  its extremities, i.e., the service or subservice at the source of the
  edge depends on the service or subservice at the destination of the
  edge.

  Figure 2 depicts a simplistic example of the assurance graph for a
  tunnel service.  The node at the top is the service instance; the
  nodes below are its dependencies.  In the example, the tunnel service
  instance depends on the "peer1" and "peer2" tunnel interfaces (the
  tunnel interfaces created on the peer1 and peer2 devices,
  respectively), which in turn depend on the respective physical
  interfaces, which finally depend on the respective "peer1" and
  "peer2" devices.  The tunnel service instance also depends on the IP
  connectivity that depends on the IS-IS routing protocol.

                           +------------------+
                           | Tunnel           |
                           | Service Instance |
                           +------------------+
                                     |
                +--------------------+-------------------+
                |                    |                   |
                v                    v                   v
           +-------------+    +--------------+    +-------------+
           | Peer1       |    | IP           |    | Peer2       |
           | Tunnel      |    | Connectivity |    | Tunnel      |
           | Interface   |    |              |    | Interface   |
           +-------------+    +--------------+    +-------------+
                  |                  |                  |
                  |    +-------------+--------------+   |
                  |    |             |              |   |
                  v    v             v              v   v
           +-------------+    +-------------+     +-------------+
           | Peer1       |    | IS-IS       |     | Peer2       |
           | Physical    |    | Routing     |     | Physical    |
           | Interface   |    | Protocol    |     | Interface   |
           +-------------+    +-------------+     +-------------+
                  |                                     |
                  v                                     v
           +-------------+                        +-------------+
           |             |                        |             |
           | Peer1       |                        | Peer2       |
           | Device      |                        | Device      |
           +-------------+                        +-------------+

                    Figure 2: Assurance Graph Example

  Depicting the assurance graph helps the operator to understand (and
  assert) the decomposition.  The assurance graph shall be maintained
  during normal operation with addition, modification, and removal of
  service instances.  A change in the network configuration or topology
  shall automatically be reflected in the assurance graph.  As a first
  example, a change of the routing protocol from IS-IS to OSPF would
  change the assurance graph accordingly.  As a second example, assume
  that the ECMP is in place for the source router for that specific
  tunnel; in that case, multiple interfaces must now be monitored, in
  addition to monitoring the ECMP health itself.

3.1.1.  Circular Dependencies

  The edges of the assurance graph represent dependencies.  An
  assurance graph is a DAG if and only if there are no circular
  dependencies among the subservices, and every assurance graph should
  avoid circular dependencies.  However, in some cases, circular
  dependencies might appear in the assurance graph.

  First, the assurance graph of a whole system is obtained by combining
  the assurance graph of every service running on that system.  Here,
  combining means that two subservices having the same type and the
  same parameters are in fact the same subservice and thus a single
  node in the graph.  For instance, the subservice of type "device"
  with the only parameter (the device ID) set to "PE1" will appear only
  once in the whole assurance graph, even if several service instances
  rely on that device.  Now, if two engineers design assurance graphs
  for two different services, and Engineer A decides that an interface
  depends on the link it is connected to, but Engineer B decides that
  the link depends on the interface it is connected to, then when
  combining the two assurance graphs, we will have a circular
  dependency interface -> link -> interface.

  Another case possibly resulting in circular dependencies is when
  subservices are not properly identified.  Assume that we want to
  assure a cloud-based computing cluster that runs containers.  We
  could represent the cluster by a subservice and the network service
  connecting containers on the cluster by another subservice.  We would
  likely model that as the network service depending on the cluster,
  because the network service runs in a container supported by the
  cluster.  Conversely, the cluster depends on the network service for
  connectivity between containers, which creates a circular dependency.
  A finer decomposition might distinguish between the resources for
  executing containers (a part of our cluster subservice) and the
  communication between the containers (which could be modeled in the
  same way as communication between routers).

  In any case, it is likely that circular dependencies will show up in
  the assurance graph.  A first step would be to detect circular
  dependencies as soon as possible in the SAIN architecture.  Such a
  detection could be carried out by the SAIN orchestrator.  Whenever a
  circular dependency is detected, the newly added service would not be
  monitored until more careful modeling or alignment between the
  different teams (Engineers A and B) remove the circular dependency.

  As a more elaborate solution, we could consider a graph
  transformation:

  *  Decompose the graph into strongly connected components.

  *  For each strongly connected component:

     -  remove all edges between nodes of the strongly connected
        component;

     -  add a new "synthetic" node for the strongly connected
        component;

     -  for each edge pointing to a node in the strongly connected
        component, change the destination to the "synthetic" node; and

     -  add a dependency from the "synthetic" node to every node in the
        strongly connected component.

  Such an algorithm would include all symptoms detected by any
  subservice in one of the strongly connected components and make it
  available to any subservice that depends on it.  Figure 3 shows an
  example of such a transformation.  On the left-hand side, the nodes
  c, d, e, and f form a strongly connected component.  The status of
  node a should depend on the status of nodes c, d, e, f, g, and h, but
  this is hard to compute because of the circular dependency.  On the
  right-hand side, node a depends on all these nodes as well, but the
  circular dependency has been removed.

        +---+    +---+          |                +---+    +---+
        | a |    | b |          |                | a |    | b |
        +---+    +---+          |                +---+    +---+
          |        |            |                  |        |
          v        v            |                  v        v
        +---+    +---+          |                +------------+
        | c |--->| d |          |                |  synthetic |
        +---+    +---+          |                +------------+
          ^        |            |               /   |      |   \
          |        |            |              /    |      |    \
          |        v            |             v     v      v     v
        +---+    +---+          |          +---+  +---+  +---+  +---+
        | f |<---| e |          |          | f |  | c |  | d |  | e |
        +---+    +---+          |          +---+  +---+  +---+  +---+
          |        |            |            |                    |
          v        v            |            v                    v
        +---+    +---+          |          +---+                +---+
        | g |    | h |          |          | g |                | h |
        +---+    +---+          |          +---+                +---+

           Before                                     After
        Transformation                           Transformation

                      Figure 3: Graph Transformation

  We consider a concrete example to illustrate this transformation.
  Let's assume that Engineer A is building an assurance graph dealing
  with IS-IS and Engineer B is building an assurance graph dealing with
  OSPF.  The graph from Engineer A could contain the following:

                  +------------+
                  | IS-IS Link |
                  +------------+
                        |
                        v
                  +------------+
                  | Phys. Link |
                  +------------+
                    |       |
                    v       v
         +-------------+  +-------------+
         | Interface 1 |  | Interface 2 |
         +-------------+  +-------------+

        Figure 4: Fragment of the Assurance Graph from Engineer A

  The graph from Engineer B could contain the following:

                  +------------+
                  | OSPF Link  |
                  +------------+
                    |   |   |
                    v   |   v
       +-------------+  |  +-------------+
       | Interface 1 |  |  | Interface 2 |
       +-------------+  |  +-------------+
                     |  |   |
                     v  v   v
                  +------------+
                  | Phys. Link |
                  +------------+

        Figure 5: Fragment of the Assurance Graph from Engineer B

  The Interface subservices and the Physical Link subservice are common
  to both fragments above.  Each of these subservices appear only once
  in the graph merging the two fragments.  Dependencies from both
  fragments are included in the merged graph, resulting in a circular
  dependency:

        +------------+      +------------+
        | IS-IS Link |      | OSPF Link  |---+
        +------------+      +------------+   |
              |               |     |        |
              |     +-------- +     |        |
              v     v               |        |
        +------------+              |        |
        | Phys. Link |<-------+     |        |
        +------------+        |     |        |
          |  ^     |          |     |        |
          |  |     +-------+  |     |        |
          v  |             v  |     v        |
        +-------------+  +-------------+     |
        | Interface 1 |  | Interface 2 |     |
        +-------------+  +-------------+     |
              ^                              |
              |                              |
              +------------------------------+

             Figure 6: Merging Graphs from Engineers A and B

  The solution presented above would result in a graph looking as
  follows, where a new "synthetic" node is included.  Using that
  transformation, all dependencies are indirectly satisfied for the
  nodes outside the circular dependency, in the sense that both IS-IS
  and OSPF links have indirect dependencies to the two interfaces and
  the link.  However, the dependencies between the link and the
  interfaces are lost since they were causing the circular dependency.

              +------------+      +------------+
              | IS-IS Link |      | OSPF Link  |
              +------------+      +------------+
                         |          |
                         v          v
                        +------------+
                        |  synthetic |
                        +------------+
                              |
                  +-----------+-------------+
                  |           |             |
                  v           v             v
        +-------------+ +------------+ +-------------+
        | Interface 1 | | Phys. Link | | Interface 2 |
        +-------------+ +------------+ +-------------+

      Figure 7: Removing Circular Dependencies after Merging Graphs
                          from Engineers A and B

3.2.  Intent and Assurance Graph

  The SAIN orchestrator analyzes the configuration of a service
  instance to do the following:

  *  Try to capture the intent of the service instance, i.e., What is
     the service instance trying to achieve?  At a minimum, this
     requires the SAIN orchestrator to know the YANG modules that are
     being configured on the devices to enable the service.  Note that,
     if the service model or the network model is known to the SAIN
     orchestrator, the latter can exploit it.  In that case, the intent
     could be directly extracted and include more details, such as the
     notion of sites for a VPN, which is out of scope of the device
     configuration.

  *  Decompose the service instance into subservices representing the
     network features on which the service instance relies.

  The SAIN orchestrator must be able to analyze the configuration
  pushed to various devices of a service instance and produce the
  assurance graph for that service instance.

  To schematize what a SAIN orchestrator does, assume that a service
  instance touches two devices and configures a virtual tunnel
  interface on each device.  Then:

  *  Capturing the intent would start by detecting that the service
     instance is actually a tunnel between the two devices and stating
     that this tunnel must be operational.  This solution is minimally
     invasive, as it does not require modifying nor knowing the service
     model.  If the service model or network model is known by the SAIN
     orchestrator, it can be used to further capture the intent and
     include more information, such as Service-Level Objectives (e.g.,
     the latency and bandwidth requirements for the tunnel) if present
     in the service model.

  *  Decomposing the service instance into subservices would result in
     the assurance graph depicted in Figure 2, for instance.

  The assurance graph, or more precisely the subservices and
  dependencies that a SAIN orchestrator can instantiate, should be
  curated.  The organization of such a process (i.e., ensure that
  existing subservices are reused as much as possible and avoid
  circular dependencies) is out-of-scope for this document.

  To be applied, SAIN requires a mechanism mapping a service instance
  to the configuration actually required on the devices for that
  service instance to run.  While Figure 1 makes a distinction between
  the SAIN orchestrator and a different component providing the service
  instance configuration, in practice those two components are most
  likely combined.  The internals of the orchestrator are out of scope
  of this document.

3.3.  Subservices

  A subservice corresponds to a subpart or a feature of the network
  system that is needed for a service instance to function properly.
  In the context of SAIN, a subservice is associated to its assurance,
  which is the method for assuring that a subservice behaves correctly.

  Subservices, just as with services, have high-level parameters that
  specify the instance to be assured.  The needed parameters depend on
  the subservice type.  For example, assuring a device requires a
  specific deviceId as a parameter and assuring an interface requires a
  specific combination of deviceId and interfaceId.

  When designing a new type of subservice, one should carefully define
  what is the assured object or functionality.  Then, the parameters
  must be chosen as a minimal set that completely identifies the object
  (see examples from the previous paragraph).  Parameters cannot change
  during the life cycle of a subservice.  For instance, an IP address
  is a good parameter when assuring a connectivity towards that address
  (i.e., a given device can reach a given IP address); however, it's
  not a good parameter to identify an interface, as the IP address
  assigned to that interface can be changed.

  A subservice is also characterized by a list of metrics to fetch and
  a list of operations to apply to these metrics in order to infer a
  health status.

3.4.  Building the Expression Graph from the Assurance Graph

  From the assurance graph, a so-called global computation graph is
  derived.  First, each subservice instance is transformed into a set
  of subservice expressions that take metrics and constants as input
  (i.e., sources of the DAG) and produce the status of the subservice
  based on some heuristics.  For instance, the health of an interface
  is 0 (minimal score) with the symptom "interface admin-down" if the
  interface is disabled in the configuration.  Then, for each service
  instance, the service expressions are constructed by combining the
  subservice expressions of its dependencies.  The way service
  expressions are combined depends on the dependency types (impacting
  or informational).  Finally, the global computation graph is built by
  combining the service expressions to get a global view of all
  subservices.  In other words, the global computation graph encodes
  all the operations needed to produce health statuses from the
  collected metrics.

  The two types of dependencies for combining subservices are:

  Informational Dependency:
     The type of dependency whose health score does not impact the
     health score of its parent subservice or service instance(s) in
     the assurance graph.  However, the symptoms should be taken into
     account in the parent service instance or subservice instance(s)
     for informational reasons.

  Impacting Dependency:
     The type of dependency whose health score impacts the health score
     of its parent subservice or service instance(s) in the assurance
     graph.  The symptoms are taken into account in the parent service
     instance or subservice instance(s) as the impacting reasons.

  The set of dependency types presented here is not exhaustive.  More
  specific dependency types can be defined by extending the YANG
  module.  For instance, a connectivity subservice depending on several
  path subservices is partially impacted if only one of these paths
  fails.  Adding these new dependency types requires defining the
  corresponding operation for combining statuses of subservices.

  Subservices shall not be dependent on the protocol used to retrieve
  the metrics.  To justify this, let's consider the interface
  operational status.  Depending on the device capabilities, this
  status can be collected by an industry-accepted YANG module (e.g.,
  IETF or Openconfig [OpenConfig]), by a vendor-specific YANG module,
  or even by a MIB module.  If the subservice was dependent on the
  mechanism to collect the operational status, then we would need
  multiple subservice definitions in order to support all different
  mechanisms.  This also implies that, while waiting for all the
  metrics to be available via standard YANG modules, SAIN agents might
  have to retrieve metric values via nonstandard YANG data models, MIB
  modules, the Command-Line Interface (CLI), etc., effectively
  implementing a normalization layer between data models and
  information models.

  In order to keep subservices independent of metric collection method
  (or, expressed differently, to support multiple combinations of
  platforms, OSes, and even vendors), the architecture introduces the
  concept of "metric engine".  The metric engine maps each device-
  independent metric used in the subservices to a list of device-
  specific metric implementations that precisely define how to fetch
  values for that metric.  The mapping is parameterized by the
  characteristics (i.e., model, OS version, etc.) of the device from
  which the metrics are fetched.  This metric engine is included in the
  SAIN agent.

3.5.  Open Interfaces with YANG Modules

  The interfaces between the architecture components are open thanks to
  the YANG modules specified in [RFC9418]; they specify objects for
  assuring network services based on their decomposition into so-called
  subservices, according to the SAIN architecture.

  These modules are intended for the following use cases:

  *  Assurance graph configuration:

     -  Subservices: Configure a set of subservices to assure by
        specifying their types and parameters.

     -  Dependencies: Configure the dependencies between the
        subservices, along with their types.

  *  Assurance telemetry: Export the health status of the subservices,
     along with the observed symptoms.

  Some examples of YANG instances can be found in Appendix A of
  [RFC9418].

3.6.  Handling Maintenance Windows

  Whenever network components are under maintenance, the operator wants
  to inhibit the emission of symptoms from those components.  A typical
  use case is device maintenance, during which the device is not
  supposed to be operational.  As such, symptoms related to the device
  health should be ignored.  Symptoms related to the device-specific
  subservices, such as the interfaces, might also be ignored because
  their state changes are probably the consequence of the maintenance.

  The ietf-service-assurance model described in [RFC9418] enables
  flagging subservices as under maintenance and, in that case, requires
  a string that identifies the person or process that requested the
  maintenance.  When a service or subservice is flagged as under
  maintenance, it must report a generic "Under Maintenance" symptom for
  propagation towards subservices that depend on this specific
  subservice.  Any other symptom from this service or by one of its
  impacting dependencies must not be reported.

  We illustrate this mechanism on three independent examples based on
  the assurance graph depicted in Figure 2:

  *  Device maintenance, for instance, upgrading the device OS.  The
     operator flags the subservice "Peer1" device as under maintenance.
     This inhibits the emission of symptoms, except "Under Maintenance"
     from "Peer1 Physical Interface", "Peer1 Tunnel Interface", and
     "Tunnel Service Instance".  All other subservices are unaffected.

  *  Interface maintenance, for instance, replacing a broken optic.
     The operator flags the subservice "Peer1 Physical Interface" as
     under maintenance.  This inhibits the emission of symptoms, except
     "Under Maintenance" from "Peer 1 Tunnel Interface" and "Tunnel
     Service Instance".  All other subservices are unaffected.

  *  Routing protocol maintenance, for instance, modifying parameters
     or redistribution.  The operator marks the subservice "IS-IS
     Routing Protocol" as under maintenance.  This inhibits the
     emission of symptoms, except "Under Maintenance" from "IP
     connectivity" and "Tunnel Service Instance".  All other
     subservices are unaffected.

  In each example above, the subservice under maintenance is completely
  impacting the service instance, putting it under maintenance as well.
  There are use cases where the subservice under maintenance only
  partially impacts the service instance.  For instance, consider a
  service instance supported by both a primary and backup path.  If a
  subservice impacting the primary path is under maintenance, the
  service instance might still be functional but degraded.  In that
  case, the status of the service instance might include "Primary path
  Under Maintenance", "No redundancy", as well as other symptoms from
  the backup path to explain the lower health score.  In general, the
  computation of the service instance status from the subservices is
  done in the SAIN collector whose implementation is out of scope for
  this document.

  The maintenance of a subservice might modify or hide modifications of
  the structure of the assurance graph.  Therefore, unflagging a
  subservice as under maintenance should trigger an update of the
  assurance graph.

3.7.  Flexible Functional Architecture

  The SAIN architecture is flexible in terms of components.  While the
  SAIN architecture in Figure 1 makes a distinction between two
  components, the service orchestrator and the SAIN orchestrator, in
  practice the two components are most likely combined.  Similarly, the
  SAIN agents are displayed in Figure 1 as being separate components.
  In practice, the SAIN agents could be either independent components
  or directly integrated in monitored entities.  A practical example is
  an agent in a router.

  The SAIN architecture is also flexible in terms of services and
  subservices.  In the defined architecture, the SAIN orchestrator is
  coupled to a service orchestrator, which defines the kinds of
  services that the architecture handles.  Most examples in this
  document deal with the notion of Network Service YANG Modules with
  well-known services, such as L2VPN or tunnels.  However, the concept
  of services is general enough to cross into different domains.  One
  of them is the domain of service management on network elements,
  which also require their own assurance.  Examples include a DHCP
  server on a Linux server, a data plane, an IPFIX export, etc.  The
  notion of "service" is generic in this architecture and depends on
  the service orchestrator and underlying network system, as
  illustrated by the following examples:

  *  If a main service orchestrator coordinates several lower-level
     controllers, a service for the controller can be a subservice from
     the point of view of the orchestrator.

  *  A DHCP server / data plane / IPFIX export can be considered
     subservices for a device.

  *  A routing instance can be considered a subservice for an L3VPN.

  *  A tunnel can be considered a subservice for an application in the
     cloud.

  *  A service function can be considered a subservice for a service
     function chain [RFC7665].

  The assurance graph is created to be flexible and open, regardless of
  the subservice types, locations, or domains.

  The SAIN architecture is also flexible in terms of distributed
  graphs.  As shown in Figure 1, the architecture comprises several
  agents.  Each agent is responsible for handling a subgraph of the
  assurance graph.  The collector is responsible for fetching the
  subgraphs from the different agents and gluing them together.  As an
  example, in the graph from Figure 2, the subservices relative to Peer
  1 might be handled by a different agent than the subservices relative
  to Peer 2, and the Connectivity and IS-IS subservices might be
  handled by yet another agent.  The agents will export their partial
  graph, and the collector will stitch them together as dependencies of
  the service instance.

  And finally, the SAIN architecture is flexible in terms of what it
  monitors.  Most, if not all, examples in this document refer to
  physical components, but this is not a constraint.  Indeed, the
  assurance of virtual components would follow the same principles, and
  an assurance graph composed of virtualized components (or a mix of
  virtualized and physical ones) is supported by this architecture.

3.8.  Time Window for Symptoms' History

  The health status reported via the YANG modules contains, for each
  subservice, the list of symptoms.  Symptoms have a start and end
  date, making it is possible to report symptoms that are no longer
  occurring.

  The SAIN agent might have to remove some symptoms for specific
  subservice symptoms because they are outdated and no longer relevant
  or simply because the SAIN agent needs to free up some space.
  Regardless of the reason, it's important for a SAIN collector
  connecting/reconnecting to a SAIN agent to understand the effect of
  this garbage collection.

  Therefore, the SAIN agent contains a YANG object specifying the date
  and time at which the symptoms' history starts for the subservice
  instances.  The subservice reports only symptoms that are occurring
  or that have been occurring after the history start date.

3.9.  New Assurance Graph Generation

  The assurance graph will change over time, because services and
  subservices come and go (changing the dependencies between
  subservices) or as a result of resolving maintenance issues.
  Therefore, an assurance graph version must be maintained, along with
  the date and time of its last generation.  The date and time of a
  particular subservice instance (again dependencies or under
  maintenance) might be kept.  From a client point of view, an
  assurance graph change is triggered by the value of the assurance-
  graph-version and assurance-graph-last-change YANG leaves.  At that
  point in time, the client (collector) follows the following process:

  *  Keep the previous assurance-graph-last-change value (let's call it
     time T).

  *  Run through all the subservice instances and process the
     subservice instances for which the last-change is newer than the
     time T.

  *  Keep the new assurance-graph-last-change as the new referenced
     date and time.

4.  IANA Considerations

  This document has no IANA actions.

5.  Security Considerations

  The SAIN architecture helps operators to reduce the mean time to
  detect and the mean time to repair.  However, the SAIN agents must be
  secured; a compromised SAIN agent may be sending incorrect root
  causes or symptoms to the management systems.  Securing the agents
  falls back to ensuring the integrity and confidentiality of the
  assurance graph.  This can be partially achieved by correctly setting
  permissions of each node in the YANG data model, as described in
  Section 6 of [RFC9418].

  Except for the configuration of telemetry, the agents do not need
  "write access" to the devices they monitor.  This configuration is
  applied with a YANG module, whose protection is covered by Secure
  Shell (SSH) [RFC6242] for the Network Configuration Protocol
  (NETCONF) or TLS [RFC8446] for RESTCONF.  Devices should be
  configured so that agents have their own credentials with write
  access only for the YANG nodes configuring the telemetry.

  The data collected by SAIN could potentially be compromising to the
  network or provide more insight into how the network is designed.
  Considering the data that SAIN requires (including CLI access in some
  cases), one should weigh data access concerns with the impact that
  reduced visibility will have on being able to rapidly identify root
  causes.

  For building the assurance graph, the SAIN orchestrator needs to
  obtain the configuration from the service orchestrator.  The latter
  should restrict access of the SAIN orchestrator to information needed
  to build the assurance graph.

  If a closed loop system relies on this architecture, then the well-
  known issue of those systems also applies, i.e., a lying device or
  compromised agent could trigger partial reconfiguration of the
  service or network.  The SAIN architecture neither augments nor
  reduces this risk.  An extension of SAIN, which is out of scope for
  this document, could detect discrepancies between symptoms reported
  by different agents, and thus detect anomalies if an agent or a
  device is lying.

  If NTP service goes down, the devices clocks might lose their
  synchronization.  In that case, correlating information from
  different devices, such as detecting symptoms about a link or
  correlating symptoms from different devices, will give inaccurate
  results.

6.  References

6.1.  Normative References

  [RFC8309]  Wu, Q., Liu, W., and A. Farrel, "Service Models
             Explained", RFC 8309, DOI 10.17487/RFC8309, January 2018,
             <https://www.rfc-editor.org/info/rfc8309>.

  [RFC8969]  Wu, Q., Ed., Boucadair, M., Ed., Lopez, D., Xie, C., and
             L. Geng, "A Framework for Automating Service and Network
             Management with YANG", RFC 8969, DOI 10.17487/RFC8969,
             January 2021, <https://www.rfc-editor.org/info/rfc8969>.

  [RFC9418]  Claise, B., Quilbeuf, J., Lucente, P., Fasano, P., and T.
             Arumugam, "A YANG Data Model for Service Assurance",
             RFC 9418, DOI 10.17487/RFC9418, July 2023,
             <https://www.rfc-editor.org/info/rfc9418>.

6.2.  Informative References

  [OpenConfig]
             "OpenConfig", <https://openconfig.net>.

  [Piovesan2017]
             Piovesan, A. and E. Griffor, "7 - Reasoning About Safety
             and Security: The Logic of Assurance",
             DOI 10.1016/B978-0-12-803773-7.00007-3, 2017,
             <https://doi.org/10.1016/B978-0-12-803773-7.00007-3>.

  [RFC2865]  Rigney, C., Willens, S., Rubens, A., and W. Simpson,
             "Remote Authentication Dial In User Service (RADIUS)",
             RFC 2865, DOI 10.17487/RFC2865, June 2000,
             <https://www.rfc-editor.org/info/rfc2865>.

  [RFC5424]  Gerhards, R., "The Syslog Protocol", RFC 5424,
             DOI 10.17487/RFC5424, March 2009,
             <https://www.rfc-editor.org/info/rfc5424>.

  [RFC5905]  Mills, D., Martin, J., Ed., Burbank, J., and W. Kasch,
             "Network Time Protocol Version 4: Protocol and Algorithms
             Specification", RFC 5905, DOI 10.17487/RFC5905, June 2010,
             <https://www.rfc-editor.org/info/rfc5905>.

  [RFC6242]  Wasserman, M., "Using the NETCONF Protocol over Secure
             Shell (SSH)", RFC 6242, DOI 10.17487/RFC6242, June 2011,
             <https://www.rfc-editor.org/info/rfc6242>.

  [RFC7011]  Claise, B., Ed., Trammell, B., Ed., and P. Aitken,
             "Specification of the IP Flow Information Export (IPFIX)
             Protocol for the Exchange of Flow Information", STD 77,
             RFC 7011, DOI 10.17487/RFC7011, September 2013,
             <https://www.rfc-editor.org/info/rfc7011>.

  [RFC7149]  Boucadair, M. and C. Jacquenet, "Software-Defined
             Networking: A Perspective from within a Service Provider
             Environment", RFC 7149, DOI 10.17487/RFC7149, March 2014,
             <https://www.rfc-editor.org/info/rfc7149>.

  [RFC7665]  Halpern, J., Ed. and C. Pignataro, Ed., "Service Function
             Chaining (SFC) Architecture", RFC 7665,
             DOI 10.17487/RFC7665, October 2015,
             <https://www.rfc-editor.org/info/rfc7665>.

  [RFC7950]  Bjorklund, M., Ed., "The YANG 1.1 Data Modeling Language",
             RFC 7950, DOI 10.17487/RFC7950, August 2016,
             <https://www.rfc-editor.org/info/rfc7950>.

  [RFC8199]  Bogdanovic, D., Claise, B., and C. Moberg, "YANG Module
             Classification", RFC 8199, DOI 10.17487/RFC8199, July
             2017, <https://www.rfc-editor.org/info/rfc8199>.

  [RFC8446]  Rescorla, E., "The Transport Layer Security (TLS) Protocol
             Version 1.3", RFC 8446, DOI 10.17487/RFC8446, August 2018,
             <https://www.rfc-editor.org/info/rfc8446>.

  [RFC8466]  Wen, B., Fioccola, G., Ed., Xie, C., and L. Jalil, "A YANG
             Data Model for Layer 2 Virtual Private Network (L2VPN)
             Service Delivery", RFC 8466, DOI 10.17487/RFC8466, October
             2018, <https://www.rfc-editor.org/info/rfc8466>.

  [RFC8641]  Clemm, A. and E. Voit, "Subscription to YANG Notifications
             for Datastore Updates", RFC 8641, DOI 10.17487/RFC8641,
             September 2019, <https://www.rfc-editor.org/info/rfc8641>.

  [RFC8907]  Dahm, T., Ota, A., Medway Gash, D.C., Carrel, D., and L.
             Grant, "The Terminal Access Controller Access-Control
             System Plus (TACACS+) Protocol", RFC 8907,
             DOI 10.17487/RFC8907, September 2020,
             <https://www.rfc-editor.org/info/rfc8907>.

  [RFC9315]  Clemm, A., Ciavaglia, L., Granville, L. Z., and J.
             Tantsura, "Intent-Based Networking - Concepts and
             Definitions", RFC 9315, DOI 10.17487/RFC9315, October
             2022, <https://www.rfc-editor.org/info/rfc9315>.

  [RFC9375]  Wu, B., Ed., Wu, Q., Ed., Boucadair, M., Ed., Gonzalez de
             Dios, O., and B. Wen, "A YANG Data Model for Network and
             VPN Service Performance Monitoring", RFC 9375,
             DOI 10.17487/RFC9375, April 2023,
             <https://www.rfc-editor.org/info/rfc9375>.

Acknowledgements

  The authors would like to thank Stephane Litkowski, Charles Eckel,
  Rob Wilton, Vladimir Vassiliev, Gustavo Alburquerque, Stefan Vallin,
  Éric Vyncke, Mohamed Boucadair, Dhruv Dhody, Michael Richardson, and
  Rob Wilton for their reviews and feedback.

Contributors

  *  Youssef El Fathi

  *  Éric Vyncke

Authors' Addresses

  Benoit Claise
  Huawei
  Email: [email protected]


  Jean Quilbeuf
  Huawei
  Email: [email protected]


  Diego R. Lopez
  Telefonica I+D
  Don Ramon de la Cruz, 82
  28006 Madrid
  Spain
  Email: [email protected]


  Dan Voyer
  Bell Canada
  Canada
  Email: [email protected]


  Thangavelu Arumugam
  Consultant
  Milpitas, California
  United States of America
  Email: [email protected]