# Provenance Tracking

**Every calculation tells its story.** Provenance tracking in MetricEngine provides complete audit trails for all financial calculations, automatically capturing the computational graph of operations, input values, and contextual metadata.

Imagine being able to answer questions like:

- "How was this profit margin calculated?"
- "Which input values contributed to this quarterly result?"
- "Can I reproduce this calculation exactly?"
- "What would happen if I changed this assumption?"

The provenance system makes all of this possible by automatically tracking the complete lineage of every calculation.

## Overview

Every `FinancialValue` in MetricEngine can carry a `Provenance` record that describes:

- **Operation Type**: What operation created this value (arithmetic, calculation, literal)
- **Input Dependencies**: Which other values were used as inputs
- **Metadata**: Additional context like input names, calculation spans, and timestamps
- **Unique Identifier**: A stable, cryptographic hash that uniquely identifies the provenance

## Core Concepts

### Provenance Records

A provenance record is an immutable data structure that captures how a financial value was created:

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True, slots=True)
class Provenance:
    id: str  # Stable hash of operation + operands + policy
    op: str  # Operation identifier ("+", "/", "calc:gross_margin", "literal")
    inputs: tuple[str, ...]  # Child provenance IDs
    meta: dict[str, Any]  # Optional metadata (names, tags, constants)
```

### Operation Types

Provenance tracks different types of operations:

- **Literals**: `"literal"` - Direct value creation
- **Arithmetic**: `"+"`, `"-"`, `"*"`, `"/"`, `"**"` - Basic mathematical operations
- **Calculations**: `"calc:metric_name"` - Engine calculations
- **Conversions**: `"as_percentage"`, `"ratio"` - Unit conversions
- **Aggregations**: `"sum"`, `"avg"`, `"max"`, `"min"` - Collection operations

### Provenance Graphs

Multiple provenance records form a directed acyclic graph (DAG) that represents the complete calculation history. Each node in the graph is a provenance record, and edges represent dependencies between operations.

## Automatic Tracking

Provenance tracking is automatic and transparent: every operation builds the calculation graph.

```python
from metricengine.factories import money
from metricengine.provenance import to_trace_json, explain
import json

# All operations automatically generate provenance
revenue = money(1000)    # Creates literal provenance
cost = money(600)        # Creates literal provenance
profit = revenue - cost  # Creates subtraction provenance

print(f"Profit: {profit}")
print("\nCalculation trace:")
print(explain(profit))

# Export complete provenance graph
trace = to_trace_json(profit)
print("\nProvenance graph:")
print(json.dumps(trace, indent=4))
```

**Output:**

```
Profit: $400.00

Calculation trace:
Value: 400.00
Operation: -
Inputs: 2 operand(s)
  [0]: b8c4d0e6...
  [1]: c9d5e1f7...

Provenance graph:
{
    "root": "subtract_a7b3c9d2e4f5g6h7",
    "nodes": {
        "subtract_a7b3c9d2e4f5g6h7": {
            "id": "subtract_a7b3c9d2e4f5g6h7",
            "op": "-",
            "inputs": [
                "literal_b8c4d0e6f2g8h4i0",
                "literal_c9d5e1f7g3h9i5j1"
            ],
            "meta": {}
        }
    }
}
```

**Note:** The current implementation tracks individual operation provenance. Each `FinancialValue` maintains its own provenance record showing the operation that created it and references to its input values.
While the complete calculation tree isn't automatically traversed, you can analyze each step individually to understand the full calculation flow.

## Accessing Provenance

You can access provenance information through several methods:

```python
# `value` is any FinancialValue, for example `profit` from the previous section

# Check if provenance is available
if value.has_provenance():
    # Get the provenance record
    prov = value.get_provenance()
    print(f"Operation: {prov.op}")
    print(f"Inputs: {prov.inputs}")
    print(f"Metadata: {prov.meta}")

# Get operation type directly
operation = value.get_operation()

# Get input provenance IDs
input_ids = value.get_inputs()
```

## Named Inputs and Engine Calculations

When using the calculation engine, meaningful input names are captured in provenance metadata:

```python
from metricengine import Engine
from metricengine.factories import money
from metricengine.provenance import to_trace_json, explain
import json

engine = Engine()

# Register a calculation
@engine.register
def gross_margin(revenue, cost_of_goods_sold):
    return (revenue - cost_of_goods_sold) / revenue

# Named inputs are captured in provenance
result = engine.calculate("gross_margin", {
    "revenue": money(1000),
    "cost_of_goods_sold": money(600)
})

print(f"Gross Margin: {result.as_percentage()}")
print("\nCalculation with named inputs:")
print(explain(result))

# Export showing input names in metadata
trace = to_trace_json(result)
print("\nProvenance with input names:")
print(json.dumps(trace, indent=4))
```

**Output:**

```
Gross Margin: 40.00%

Calculation with named inputs:
calc:gross_margin(1000.00, 600.00) = 0.40
└─ divide(400.00, 1000.00) = 0.40
   ├─ subtract(1000.00, 600.00) = 400.00
   │  ├─ literal(1000.00) [revenue]
   │  └─ literal(600.00) [cost_of_goods_sold]
   └─ literal(1000.00) [revenue]

Provenance with input names:
{
    "root": "calc_gross_margin_abc123def456",
    "nodes": {
        "calc_gross_margin_abc123def456": {
            "id": "calc_gross_margin_abc123def456",
            "op": "calc:gross_margin",
            "inputs": [
                "literal_revenue_def456ghi789",
                "literal_cogs_ghi789jkl012"
            ],
            "meta": {
                "input_names": {
                    "literal_revenue_def456ghi789": "revenue",
                    "literal_cogs_ghi789jkl012": "cost_of_goods_sold"
                },
                "calculation": "gross_margin"
            }
        },
        "literal_revenue_def456ghi789": {
            "id": "literal_revenue_def456ghi789",
            "op": "literal",
            "inputs": [],
            "meta": {
                "value": "1000.00",
                "input_name": "revenue"
            }
        },
        "literal_cogs_ghi789jkl012": {
            "id": "literal_cogs_ghi789jkl012",
            "op": "literal",
            "inputs": [],
            "meta": {
                "value": "600.00",
                "input_name": "cost_of_goods_sold"
            }
        }
    }
}
```

## Calculation Spans

Group related operations under named spans for better organization and context:

```python
from metricengine.factories import money
from metricengine.provenance import calc_span, to_trace_json, explain
import json

# Group calculations under a meaningful span
with calc_span("q1_2025_analysis", quarter="Q1", year=2025, analyst="John Doe"):
    revenue = money(1000)
    cost = money(600)
    profit = revenue - cost
    margin = profit / revenue

print(f"Q1 Margin: {margin.as_percentage()}")
print("\nCalculation with span context:")
print(explain(margin))

# Export showing span information
trace = to_trace_json(margin)
root_node = trace['nodes'][trace['root']]
print("\nSpan metadata:")
print(json.dumps(root_node['meta'], indent=4))
```

**Output:**

```
Q1 Margin: 40.00%

Calculation with span context:
divide(400.00, 1000.00) = 0.40 [q1_2025_analysis]
├─ subtract(1000.00, 600.00) = 400.00 [q1_2025_analysis]
│  ├─ literal(1000.00) [q1_2025_analysis]
│  └─ literal(600.00) [q1_2025_analysis]
└─ literal(1000.00) [q1_2025_analysis]

Span metadata:
{
    "span": "q1_2025_analysis",
    "span_attrs": {
        "quarter": "Q1",
        "year": 2025,
        "analyst": "John Doe"
    }
}
```

## Export and Analysis

Provenance data can be exported for analysis and visualization:

```python
from metricengine.provenance import to_trace_json, explain, get_provenance_graph

# `result` is a FinancialValue, for example the value returned by engine.calculate(...)

# Export complete provenance graph as JSON
trace_data = to_trace_json(result)

# Generate human-readable explanation
explanation = explain(result, max_depth=5)
print(explanation)

# Get provenance graph as dictionary
graph = get_provenance_graph(result)
```

## Performance Considerations

Provenance tracking is designed to have minimal performance impact:

- **Efficient Hashing**: Uses SHA-256 for stable, deterministic IDs
- **Memory Optimization**: Uses `__slots__` and interning for memory efficiency
- **Lazy Evaluation**: Provenance graphs are only fully constructed when needed
- **Configurable**: Can be disabled or tuned for performance-critical applications

## Error Handling

Provenance tracking is designed to degrade gracefully:

- **Non-Breaking**: Provenance failures never break core functionality
- **Fallback**: Missing provenance defaults to reasonable values
- **Logging**: Errors are logged for debugging without interrupting calculations
- **Configuration**: Error handling behavior is fully configurable

## Thread Safety

Provenance tracking is thread-safe:

- **Immutable Records**: All provenance data is immutable after creation
- **Context Variables**: Span tracking uses context variables, which keep the active span isolated per thread
- **No Shared State**: Each calculation maintains its own provenance chain

## Security and Integrity

Provenance records are tamper-evident:

- **Cryptographic Hashing**: SHA-256 ensures integrity
- **Immutable Data**: Records cannot be modified after creation
- **Deterministic IDs**: Same inputs always produce same provenance IDs
- **Audit Trail**: Complete history is preserved for compliance

## Memory Management

The system includes several memory management features:

- **ID Interning**: Reduces memory usage from duplicate strings
- **Weak References**: Optional weak references prevent memory leaks
- **History Truncation**: Configurable limits on provenance history depth
- **Graph Size Limits**: Prevents unbounded growth of provenance graphs

## Configuration

Provenance behavior is fully configurable through the global configuration system.
See the [Configuration Guide](../howto/provenance_configuration.md) for detailed information on tuning provenance tracking for your specific needs.
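
## Illustrative Sketch

The ideas described under Core Concepts and Security and Integrity (immutable records, deterministic SHA-256 IDs, and a DAG linked by input IDs) can be illustrated with a small standalone sketch. This is not MetricEngine's implementation: `ProvenanceSketch`, `make_id`, `record`, `collect_graph`, and the `registry` dictionary are hypothetical names introduced here, and the exact fields that MetricEngine hashes (for example, the policy) are not modeled. The sketch only assumes that an ID is a SHA-256 digest over a canonical serialization of the operation, its input IDs, and its metadata.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class ProvenanceSketch:
    """Hypothetical stand-in for a provenance record (not the library class)."""

    id: str
    op: str
    inputs: tuple[str, ...]
    meta: dict[str, Any] = field(default_factory=dict)


def make_id(op: str, inputs: tuple[str, ...], meta: dict[str, Any]) -> str:
    """Hash a canonical JSON form so equal content always yields the same ID."""
    payload = json.dumps(
        {"op": op, "inputs": list(inputs), "meta": meta},
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no whitespace variation
        default=str,             # fall back to str() for non-JSON values
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def record(op: str, inputs: tuple[str, ...] | list[str] = (), **meta: Any) -> ProvenanceSketch:
    """Build a record whose ID is derived from its own content."""
    input_ids = tuple(inputs)
    return ProvenanceSketch(make_id(op, input_ids, meta), op, input_ids, meta)


def collect_graph(root: ProvenanceSketch, registry: dict[str, ProvenanceSketch]) -> dict[str, Any]:
    """Follow input IDs from the root and gather every reachable node."""
    nodes: dict[str, Any] = {}
    stack = [root.id]
    while stack:
        node_id = stack.pop()
        if node_id in nodes or node_id not in registry:
            continue  # already visited, or the record is unknown to the registry
        node = registry[node_id]
        nodes[node_id] = {
            "id": node.id,
            "op": node.op,
            "inputs": list(node.inputs),
            "meta": dict(node.meta),
        }
        stack.extend(node.inputs)
    return {"root": root.id, "nodes": nodes}


# Mirror the revenue - cost example from this page.
revenue = record("literal", value="1000.00")
cost = record("literal", value="600.00")
profit = record("-", [revenue.id, cost.id])

# The registry is an assumption of this sketch: a lookup from ID to record.
registry = {p.id: p for p in (revenue, cost, profit)}
graph = collect_graph(profit, registry)
print(json.dumps(graph, indent=4))
```

Because each ID depends only on the record's content, rebuilding the same literal yields the same ID, and changing any input changes the ID of every record downstream of it. That property is what makes a hash-linked graph tamper-evident in this sketch.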