Provenance Tracking

Every calculation tells its story. Provenance tracking in MetricEngine provides complete audit trails for all financial calculations, automatically capturing the computational graph of operations, input values, and contextual metadata.

Imagine being able to answer questions like:

“How was this profit margin calculated?”
“Which input values contributed to this quarterly result?”
“Can I reproduce this calculation exactly?”
“What would happen if I changed this assumption?”

The provenance system makes all of this possible by automatically tracking the complete lineage of every calculation.

Overview

Every FinancialValue in MetricEngine can carry a Provenance record that describes:

Operation Type: What operation created this value (arithmetic, calculation, literal)
Input Dependencies: Which other values were used as inputs
Metadata: Additional context like input names, calculation spans, and timestamps
Unique Identifier: A stable, cryptographic hash that uniquely identifies the provenance

Core Concepts

Provenance Records

A provenance record is an immutable data structure that captures how a financial value was created:

from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True, slots=True)
class Provenance:
    id: str                  # Stable hash of operation + operands + policy
    op: str                  # Operation identifier ("+", "/", "calc:gross_margin", "literal")
    inputs: tuple[str, ...]  # Child provenance IDs
    meta: dict[str, Any]     # Optional metadata (names, tags, constants)
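The `id` field is described as a stable hash of the record's content. The following is a minimal sketch of how such an ID could be derived, not MetricEngine's actual implementation: the helper name `make_provenance_id`, the JSON canonicalization, and the `policy` string are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Any


def make_provenance_id(op: str, inputs: tuple[str, ...],
                       meta: dict[str, Any], policy: str = "") -> str:
    """Hypothetical helper: hash a canonical encoding of the record's content."""
    # sort_keys makes the encoding independent of dict insertion order.
    payload = json.dumps(
        {"op": op, "inputs": list(inputs), "meta": meta, "policy": policy},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = make_provenance_id("-", ("id_revenue", "id_cost"), {"x": 1, "y": 2})
b = make_provenance_id("-", ("id_revenue", "id_cost"), {"y": 2, "x": 1})
c = make_provenance_id("-", ("id_cost", "id_revenue"), {"x": 1, "y": 2})
print(a == b)  # True: same content, different key order
print(a == c)  # False: operand order is part of the identity
```

Keeping the operand order in the hash matters for non-commutative operations such as subtraction and division.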

Operation Types

Provenance tracks different types of operations:

Literals: "literal" - Direct value creation
Arithmetic: "+", "-", "*", "/", "**" - Basic mathematical operations
Calculations: "calc:metric_name" - Engine calculations
Conversions: "as_percentage", "ratio" - Unit conversions
Aggregations: "sum", "avg", "max", "min" - Collection operations
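When analyzing exported records it can help to group operation identifiers by the categories listed above. A small sketch, assuming only the identifier strings shown in that list (the function `classify_op` is hypothetical and not part of MetricEngine):

```python
def classify_op(op: str) -> str:
    """Map an operation identifier to one of the categories listed above."""
    arithmetic = {"+", "-", "*", "/", "**"}
    conversions = {"as_percentage", "ratio"}
    aggregations = {"sum", "avg", "max", "min"}
    if op == "literal":
        return "literal"
    if op in arithmetic:
        return "arithmetic"
    if op.startswith("calc:"):
        return "calculation"
    if op in conversions:
        return "conversion"
    if op in aggregations:
        return "aggregation"
    return "unknown"


for op in ["literal", "**", "calc:gross_margin", "ratio", "avg"]:
    print(op, "->", classify_op(op))
```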

Provenance Graphs

Multiple provenance records form a directed acyclic graph (DAG) that represents the complete calculation history. Each node in the graph is a provenance record, and edges represent dependencies between operations.
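Because the graph is acyclic, it can be walked from the root so that every node is visited after its inputs. The sketch below assumes a plain dictionary with `root` and `nodes` keys, the same shape as the exported JSON shown later on this page; the function name and the short node IDs are made up for illustration.

```python
def evaluation_order(trace: dict) -> list[str]:
    """Return node IDs so that every node appears after all of its inputs."""
    nodes = trace["nodes"]
    order: list[str] = []
    seen: set[str] = set()

    def visit(node_id: str) -> None:
        if node_id in seen:
            return
        seen.add(node_id)
        # Inputs that are missing from the export are treated as leaves.
        for child in nodes.get(node_id, {}).get("inputs", []):
            visit(child)
        order.append(node_id)

    visit(trace["root"])
    return order


# margin = (revenue - cost) / revenue; "rev" is shared by two operations
trace = {
    "root": "div",
    "nodes": {
        "div": {"op": "/", "inputs": ["sub", "rev"]},
        "sub": {"op": "-", "inputs": ["rev", "cost"]},
        "rev": {"op": "literal", "inputs": []},
        "cost": {"op": "literal", "inputs": []},
    },
}
print(evaluation_order(trace))  # ['rev', 'cost', 'sub', 'div']
```

The `seen` set ensures a shared input such as `rev` is listed once even though two operations depend on it.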

Automatic Tracking

Provenance tracking is automatic and transparent: every operation adds a node to the calculation graph.

from metricengine.factories import money
from metricengine.provenance import to_trace_json, explain
import json

# All operations automatically generate provenance
revenue = money(1000)  # Creates literal provenance
cost = money(600)      # Creates literal provenance
profit = revenue - cost  # Creates subtraction provenance

print(f"Profit: {profit}")
print("\nCalculation trace:")
print(explain(profit))

# Export complete provenance graph
trace = to_trace_json(profit)
print("\nProvenance graph:")
print(json.dumps(trace, indent=4))

Output:

Profit: $400.00

Calculation trace:
Value: 400.00
Operation: -
  Inputs: 2 operand(s)
    [0]: b8c4d0e6...
    [1]: c9d5e1f7...

Provenance graph:
{
    "root": "subtract_a7b3c9d2e4f5g6h7",
    "nodes": {
        "subtract_a7b3c9d2e4f5g6h7": {
            "id": "subtract_a7b3c9d2e4f5g6h7",
            "op": "-",
            "inputs": [
                "literal_b8c4d0e6f2g8h4i0",
                "literal_c9d5e1f7g3h9i5j1"
            ],
            "meta": {}
        }
    }
}

Note: The current implementation records provenance one operation at a time. Each FinancialValue keeps its own provenance record, which names the operation that created it and references its inputs by ID. The complete calculation tree is not traversed automatically, so to follow the full calculation flow, inspect the provenance of each intermediate value in turn.

Accessing Provenance

You can access provenance information through several methods:

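from metricengine.factories import money

# Any FinancialValue works here; this one is the result of a subtraction
value = money(1000) - money(600)

# Check if provenance is available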
if value.has_provenance():
    # Get the provenance record
    prov = value.get_provenance()
    print(f"Operation: {prov.op}")
    print(f"Inputs: {prov.inputs}")
    print(f"Metadata: {prov.meta}")

# Get operation type directly
operation = value.get_operation()

# Get input provenance IDs
input_ids = value.get_inputs()

Named Inputs and Engine Calculations

When using the calculation engine, meaningful input names are captured in provenance metadata:

from metricengine import Engine
from metricengine.factories import money
from metricengine.provenance import to_trace_json, explain
import json

engine = Engine()

# Register a calculation
@engine.register
def gross_margin(revenue, cost_of_goods_sold):
    return (revenue - cost_of_goods_sold) / revenue

# Named inputs are captured in provenance
result = engine.calculate("gross_margin", {
    "revenue": money(1000),
    "cost_of_goods_sold": money(600)
})

print(f"Gross Margin: {result.as_percentage()}")
print("\nCalculation with named inputs:")
print(explain(result))

# Export showing input names in metadata
trace = to_trace_json(result)
print("\nProvenance with input names:")
print(json.dumps(trace, indent=4))

Output:

Gross Margin: 40.00%

Calculation with named inputs:
calc:gross_margin(1000.00, 600.00) = 0.40
  └─ divide(400.00, 1000.00) = 0.40
     ├─ subtract(1000.00, 600.00) = 400.00
     │  ├─ literal(1000.00) [revenue]
     │  └─ literal(600.00) [cost_of_goods_sold]
     └─ literal(1000.00) [revenue]

Provenance with input names:
{
    "root": "calc_gross_margin_abc123def456",
    "nodes": {
        "calc_gross_margin_abc123def456": {
            "id": "calc_gross_margin_abc123def456",
            "op": "calc:gross_margin",
            "inputs": [
                "literal_revenue_def456ghi789",
                "literal_cogs_ghi789jkl012"
            ],
            "meta": {
                "input_names": {
                    "literal_revenue_def456ghi789": "revenue",
                    "literal_cogs_ghi789jkl012": "cost_of_goods_sold"
                },
                "calculation": "gross_margin"
            }
        },
        "literal_revenue_def456ghi789": {
            "id": "literal_revenue_def456ghi789",
            "op": "literal",
            "inputs": [],
            "meta": {
                "value": "1000.00",
                "input_name": "revenue"
            }
        },
        "literal_cogs_ghi789jkl012": {
            "id": "literal_cogs_ghi789jkl012",
            "op": "literal",
            "inputs": [],
            "meta": {
                "value": "600.00",
                "input_name": "cost_of_goods_sold"
            }
        }
    }
}
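Once a trace has been exported, the input names can be read back out of the metadata with ordinary dictionary access. The sketch below works on a trimmed copy of the JSON shown above; `named_inputs` is a hypothetical helper, and it assumes the `input_names` and `value` metadata keys appear exactly as in that output.

```python
from typing import Optional


def named_inputs(trace: dict) -> dict[str, Optional[str]]:
    """Map each input name on the root node to its recorded literal value."""
    root = trace["nodes"][trace["root"]]
    names = root.get("meta", {}).get("input_names", {})
    resolved: dict[str, Optional[str]] = {}
    for node_id, name in names.items():
        node = trace["nodes"].get(node_id, {})
        # None marks an input whose node or value is absent from the export.
        resolved[name] = node.get("meta", {}).get("value")
    return resolved


trace = {
    "root": "calc_gross_margin_abc123def456",
    "nodes": {
        "calc_gross_margin_abc123def456": {
            "op": "calc:gross_margin",
            "inputs": ["literal_revenue_def456ghi789", "literal_cogs_ghi789jkl012"],
            "meta": {
                "input_names": {
                    "literal_revenue_def456ghi789": "revenue",
                    "literal_cogs_ghi789jkl012": "cost_of_goods_sold",
                }
            },
        },
        "literal_revenue_def456ghi789": {"op": "literal", "inputs": [], "meta": {"value": "1000.00"}},
        "literal_cogs_ghi789jkl012": {"op": "literal", "inputs": [], "meta": {"value": "600.00"}},
    },
}
print(named_inputs(trace))  # {'revenue': '1000.00', 'cost_of_goods_sold': '600.00'}
```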

Calculation Spans

Group related operations under named spans for better organization and context:

from metricengine.factories import money
from metricengine.provenance import calc_span, to_trace_json, explain
import json

# Group calculations under a meaningful span
with calc_span("q1_2025_analysis", quarter="Q1", year=2025, analyst="John Doe"):
    revenue = money(1000)
    cost = money(600)
    profit = revenue - cost
    margin = profit / revenue

print(f"Q1 Margin: {margin.as_percentage()}")
print("\nCalculation with span context:")
print(explain(margin))

# Export showing span information
trace = to_trace_json(margin)
root_node = trace['nodes'][trace['root']]
print("\nSpan metadata:")
print(json.dumps(root_node['meta'], indent=4))

Output:

Q1 Margin: 40.00%

Calculation with span context:
divide(400.00, 1000.00) = 0.40 [q1_2025_analysis]
  ├─ subtract(1000.00, 600.00) = 400.00 [q1_2025_analysis]
  │  ├─ literal(1000.00) [q1_2025_analysis]
  │  └─ literal(600.00) [q1_2025_analysis]
  └─ literal(1000.00) [q1_2025_analysis]

Span metadata:
{
    "span": "q1_2025_analysis",
    "span_attrs": {
        "quarter": "Q1",
        "year": 2025,
        "analyst": "John Doe"
    }
}
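To illustrate the mechanism, here is a minimal sketch of how a span context manager could attach metadata to operations that run inside it. It is built on Python's standard `contextvars` module and is not MetricEngine's code: the names `span`, `current_span_meta`, and `_active_span` are assumptions, and only the `span` / `span_attrs` keys are taken from the output above.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Any, Iterator, Optional

# Hypothetical module-level state: the span active in the current context.
_active_span: ContextVar[Optional[dict[str, Any]]] = ContextVar(
    "active_span", default=None
)


@contextmanager
def span(name: str, **attrs: Any) -> Iterator[None]:
    """Mark everything executed inside the with-block as part of `name`."""
    token = _active_span.set({"span": name, "span_attrs": attrs})
    try:
        yield
    finally:
        # Restore whatever span (if any) was active before this block.
        _active_span.reset(token)


def current_span_meta() -> dict[str, Any]:
    """Metadata an operation would copy into its provenance record."""
    active = _active_span.get()
    return dict(active) if active else {}


with span("q1_2025_analysis", quarter="Q1", year=2025):
    inside = current_span_meta()
outside = current_span_meta()

print(inside)   # {'span': 'q1_2025_analysis', 'span_attrs': {'quarter': 'Q1', 'year': 2025}}
print(outside)  # {}
```

Resetting with the token rather than clearing the variable means nested spans restore the enclosing span on exit.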

Export and Analysis

Provenance data can be exported for analysis and visualization:

from metricengine.provenance import to_trace_json, explain, get_provenance_graph

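# `result` is the FinancialValue returned by engine.calculate(...) in the
# gross_margin example above

# Export complete provenance graph as JSON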
trace_data = to_trace_json(result)

# Generate human-readable explanation
explanation = explain(result, max_depth=5)
print(explanation)

# Get provenance graph as dictionary
graph = get_provenance_graph(result)
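For visualization, an exported trace can be converted to Graphviz DOT text with a few lines of standard-library Python. This is a sketch under the assumption that the trace is a dictionary with `root` and `nodes` keys, as in the JSON output earlier on this page; `trace_to_dot` is a hypothetical helper, not a MetricEngine function, and the node IDs are shortened for readability.

```python
def trace_to_dot(trace: dict) -> str:
    """Render a trace dictionary as Graphviz DOT, with edges from input to consumer."""
    lines = ["digraph provenance {"]
    for node_id, node in trace["nodes"].items():
        # Escape double quotes so the operation name is a valid DOT label.
        label = node["op"].replace('"', '\\"')
        lines.append(f'  "{node_id}" [label="{label}"];')
        for child in node.get("inputs", []):
            lines.append(f'  "{child}" -> "{node_id}";')
    lines.append("}")
    return "\n".join(lines)


trace = {
    "root": "sub",
    "nodes": {
        "sub": {"op": "-", "inputs": ["rev", "cost"]},
        "rev": {"op": "literal", "inputs": []},
        "cost": {"op": "literal", "inputs": []},
    },
}
dot = trace_to_dot(trace)
print(dot)
```

The resulting text can be saved to a file and rendered with any Graphviz-compatible tool.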

Performance Considerations

Provenance tracking is designed to have minimal performance impact:

Efficient Hashing: Uses SHA-256 for stable, deterministic IDs
Memory Optimization: Uses __slots__ and interning for memory efficiency
Lazy Evaluation: Provenance graphs are only fully constructed when needed
Configurable: Can be disabled or tuned for performance-critical applications

Error Handling

Provenance tracking is designed to degrade gracefully:

Non-Breaking: Provenance failures never break core functionality
Fallback: Missing provenance defaults to reasonable values
Logging: Errors are logged for debugging without interrupting calculations
Configuration: Error handling behavior is fully configurable
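The non-breaking behavior described above can be pictured as a guard around provenance capture: the calculation runs first, and any failure while recording provenance is logged and swallowed. The sketch below uses only the standard library; `safe_provenance` and `subtract` are hypothetical names and the failure is simulated.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("provenance_sketch")


def safe_provenance(build: Callable[[], Any]) -> Optional[Any]:
    """Run a provenance-building callable; log and return None if it fails."""
    try:
        return build()
    except Exception:
        logger.exception("provenance capture failed; continuing without it")
        return None


def subtract(a: float, b: float) -> tuple[float, Optional[Any]]:
    """Return the difference plus a provenance record, or None if capture failed."""
    result = a - b  # the core calculation is computed before any provenance work

    def broken_builder() -> Any:
        raise RuntimeError("simulated provenance failure")

    return result, safe_provenance(broken_builder)


value, prov = subtract(1000.0, 600.0)
print(value)  # 400.0
print(prov)   # None
```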

Thread Safety

Provenance tracking is thread-safe:

Immutable Records: All provenance data is immutable after creation
Context Variables: Span tracking uses context variables, so the active span in one thread is not visible to calculations running in another
No Shared State: Each calculation maintains its own provenance chain
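The isolation that context variables provide can be demonstrated with the standard library alone. In this sketch each worker thread sets its own value for a hypothetical `active_span` variable, and neither the other workers nor the main thread observe it.

```python
import threading
from contextvars import ContextVar

# Hypothetical variable standing in for "the span currently being recorded".
active_span: ContextVar[str] = ContextVar("active_span", default="<none>")
seen: dict[str, str] = {}


def worker(name: str) -> None:
    active_span.set(name)           # affects only this thread's context
    seen[name] = active_span.get()  # each worker reads back its own value


threads = [threading.Thread(target=worker, args=(f"span-{i}",)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(seen.items()))
print(active_span.get())  # <none>: the main thread never saw the workers' values
```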

Security and Integrity

Provenance records are tamper-evident:

Cryptographic Hashing: IDs are SHA-256 hashes, so altering the hashed content of a record yields a different ID
Immutable Data: Records cannot be modified after creation
Deterministic IDs: Same inputs always produce same provenance IDs
Audit Trail: Complete history is preserved for compliance
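Tamper evidence follows from content-derived IDs: if a stored record is edited, recomputing its hash no longer matches the ID it is filed under. The sketch below shows the idea on plain dictionaries. Which fields MetricEngine actually hashes is not specified here, so hashing `op`, `inputs`, and `meta` as JSON is an assumption, and `content_hash`, `make_node`, and `find_tampered` are hypothetical helpers.

```python
import hashlib
import json


def content_hash(op: str, inputs: list, meta: dict) -> str:
    """Assumed ID scheme: SHA-256 over a canonical JSON encoding."""
    payload = json.dumps({"op": op, "inputs": inputs, "meta": meta}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_node(op: str, inputs: list, meta: dict) -> tuple:
    """Build a node and the ID derived from its content."""
    return content_hash(op, inputs, meta), {"op": op, "inputs": inputs, "meta": meta}


def find_tampered(nodes: dict) -> list:
    """Return the IDs whose stored content no longer hashes to the ID."""
    return [
        node_id
        for node_id, node in nodes.items()
        if content_hash(node["op"], node["inputs"], node["meta"]) != node_id
    ]


rev_id, rev = make_node("literal", [], {"value": "1000.00"})
cost_id, cost = make_node("literal", [], {"value": "600.00"})
sub_id, sub = make_node("-", [rev_id, cost_id], {})
nodes = {rev_id: rev, cost_id: cost, sub_id: sub}

print(find_tampered(nodes))               # []
nodes[cost_id]["meta"]["value"] = "500.00"  # edit a stored record in place
print(find_tampered(nodes) == [cost_id])  # True
```

Because a parent's hash includes its children's IDs, replacing a child with a re-hashed forgery would also change every ancestor's ID.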

Memory Management

The system includes several memory management features:

ID Interning: Reduces memory usage from duplicate strings
Weak References: Optional weak references prevent memory leaks
History Truncation: Configurable limits on provenance history depth
Graph Size Limits: Prevents unbounded growth of provenance graphs
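History truncation can be thought of as keeping only the nodes within a fixed number of edges of the root. The following sketch applies that idea to a trace dictionary of the shape used in the exports above; `truncate_trace` and its `max_depth` parameter are hypothetical and do not correspond to a documented MetricEngine setting.

```python
def truncate_trace(trace: dict, max_depth: int) -> dict:
    """Keep only nodes reachable from the root within max_depth edges."""
    nodes = trace["nodes"]
    kept: dict = {}
    frontier = [trace["root"]]
    # Breadth-first walk, one level per iteration (depth 0 is the root itself).
    for _ in range(max_depth + 1):
        next_frontier = []
        for node_id in frontier:
            if node_id in kept or node_id not in nodes:
                continue
            kept[node_id] = nodes[node_id]
            next_frontier.extend(nodes[node_id].get("inputs", []))
        frontier = next_frontier
    return {"root": trace["root"], "nodes": kept}


trace = {
    "root": "div",
    "nodes": {
        "div": {"op": "/", "inputs": ["sub", "rev"]},
        "sub": {"op": "-", "inputs": ["rev", "cost"]},
        "rev": {"op": "literal", "inputs": []},
        "cost": {"op": "literal", "inputs": []},
    },
}
print(sorted(truncate_trace(trace, 0)["nodes"]))  # ['div']
print(sorted(truncate_trace(trace, 1)["nodes"]))  # ['div', 'rev', 'sub']
print(sorted(truncate_trace(trace, 2)["nodes"]))  # ['cost', 'div', 'rev', 'sub']
```

Kept nodes still list the IDs of pruned inputs, so a reader of the truncated graph can tell where history was cut off.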

Configuration

Provenance behavior is fully configurable through the global configuration system. See the Configuration Guide for detailed information on tuning provenance tracking for your specific needs.