Testing Hierarchy and Flow
Overview
This document visualizes the complete testing hierarchy and flow for the Kahuna Testing & Security Framework, showing how components build upon each other and how test results flow through the system across multiple agent frameworks.
The testing hierarchy now operates in three dimensions:
- Testing Categories (Quality vs Security)
- Testing Phases (Development vs Runtime)
- Framework Scope (Framework-Agnostic vs Framework-Specific)
This three-dimensional approach enables testing of AI components independent of specific agent frameworks while also supporting framework-specific validation.
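The three dimensions can be thought of as independent axes, so every test falls into exactly one of 2 × 2 × 2 = 8 cells. The sketch below is only an illustration of that idea; the names `Category`, `Phase`, `Scope`, and `HierarchyCell` are assumptions made for this example and are not part of the framework's API.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Category(Enum):
    QUALITY = "quality"
    SECURITY = "security"


class Phase(Enum):
    DEVELOPMENT = "development"
    RUNTIME = "runtime"


class Scope(Enum):
    AGNOSTIC = "framework-agnostic"
    SPECIFIC = "framework-specific"


@dataclass(frozen=True)
class HierarchyCell:
    """One position in the three-dimensional testing hierarchy (hypothetical type)."""
    category: Category
    phase: Phase
    scope: Scope


def all_cells() -> list[HierarchyCell]:
    # Cartesian product of the three axes: every combination is a valid cell.
    return [HierarchyCell(c, p, s) for c, p, s in product(Category, Phase, Scope)]


cells = all_cells()
print(len(cells))  # 8
```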
Complete Testing Hierarchy
flowchart TD
Start[Testing Framework] --> Split{Test Category}
Split -->|Quality| QA[Quality Assurance]
Split -->|Security| SEC[Security Testing]
%% Quality Branch
QA --> QPhase{Testing Phase}
QPhase -->|Development| QDev[Development-Time Quality]
QPhase -->|Runtime| QRun[Runtime Quality Monitoring]
%% Security Branch
SEC --> SPhase{Testing Phase}
SPhase -->|Development| SDev[Development-Time Security]
SPhase -->|Runtime| SRun[Runtime Security Monitoring]
%% Framework Scope Split
QDev --> QScope{Framework Scope}
QRun --> QRScope{Framework Scope}
SDev --> SScope{Framework Scope}
SRun --> SRScope{Framework Scope}
%% Framework-Agnostic Path
QScope -->|Agnostic| QAgnostic[Universal Quality Tests]
QScope -->|Specific| QSpecific[Framework Quality Tests]
SScope -->|Agnostic| SAgnostic[Universal Security Tests]
SScope -->|Specific| SSpecific[Framework Security Tests]
%% Component Testing Hierarchy - Agnostic
QAgnostic --> LLM_Q[LLM Quality Tests<br/>100% Agnostic]
QAgnostic --> MCP_Q[MCP Server Quality Tests<br/>100% Agnostic]
SAgnostic --> LLM_S[LLM Security Tests<br/>100% Agnostic]
SAgnostic --> MCP_S[MCP Server Security Tests<br/>100% Agnostic]
%% Component Testing - Mixed
LLM_Q --> Agent_Q[Agent Quality Tests<br/>60% Agnostic]
MCP_Q --> Agent_Q
LLM_S --> Agent_S[Agent Security Tests<br/>60% Agnostic]
MCP_S --> Agent_S
QSpecific --> Agent_Q
SSpecific --> Agent_S
%% Business Layer
Agent_Q --> WF_Q[Workflow Quality Tests<br/>20% Agnostic]
Agent_S --> WF_S[Workflow Security Tests<br/>20% Agnostic]
%% Results Flow
WF_Q --> Results[Test Results]
WF_S --> Results
QRScope --> Metrics[Runtime Metrics]
SRScope --> Metrics
Results --> Kahuna_Exec[Kahuna Executive Mode]
Metrics --> Kahuna_Exec
%% Input from Manager Mode
Kahuna_Mgr[Kahuna Manager Mode] --> Requirements[Business Requirements<br/>+ Target Frameworks]
Requirements --> Start
style Start fill:#2196F3,stroke:#1976D2,stroke-width:2px,color:#fff
style QA fill:#4CAF50,stroke:#388E3C,stroke-width:2px,color:#fff
style SEC fill:#F44336,stroke:#C62828,stroke-width:2px,color:#fff
style QAgnostic fill:#9C27B0,stroke:#7B1FA2,stroke-width:2px,color:#fff
style SAgnostic fill:#9C27B0,stroke:#7B1FA2,stroke-width:2px,color:#fff
style Results fill:#FF9800,stroke:#F57C00,stroke-width:2px,color:#fff
style Metrics fill:#FF9800,stroke:#F57C00,stroke-width:2px,color:#fff
style Kahuna_Exec fill:#9C27B0,stroke:#7B1FA2,stroke-width:2px,color:#fff
style Kahuna_Mgr fill:#00BCD4,stroke:#0097A7,stroke-width:2px,color:#fff
Framework-Agnostic Testing Flow
flowchart TD
Start[Test Request] --> Detect{Detect Framework}
Detect -->|Aurite| Native[Native Execution]
Detect -->|LangChain| LC_Adapter[LangChain Adapter]
Detect -->|AutoGen| AG_Adapter[AutoGen Adapter]
Detect -->|Other| Generic_Adapter[Generic Adapter]
LC_Adapter --> Translate[Translate to Aurite Format]
AG_Adapter --> Translate
Generic_Adapter --> Translate
Native --> Categorize{Categorize Component}
Translate --> Categorize
Categorize -->|LLM/MCP| Universal[Universal Tests<br/>100% Agnostic]
Categorize -->|Agent| Hybrid[Hybrid Tests<br/>60% Agnostic]
Categorize -->|Workflow| Specific[Specific Tests<br/>20% Agnostic]
Universal --> Cache_Universal[Cache Universal Results]
Hybrid --> Split_Tests{Split Tests}
Specific --> Framework_Tests[Framework-Specific Tests]
Split_Tests -->|Agnostic| Agnostic_Tests[Run Agnostic Tests]
Split_Tests -->|Specific| Framework_Tests
Agnostic_Tests --> Cache_Agnostic[Cache Agnostic Results]
Framework_Tests --> Cache_Specific[Cache Framework Results]
Cache_Universal --> Aggregate[Aggregate Results]
Cache_Agnostic --> Aggregate
Cache_Specific --> Aggregate
Aggregate --> Normalize[Normalize to Standard Format]
Normalize --> Report[Generate Report]
style Start fill:#2196F3,stroke:#1976D2,stroke-width:2px,color:#fff
style Universal fill:#4CAF50,stroke:#388E3C,stroke-width:2px,color:#fff
style Hybrid fill:#FF9800,stroke:#F57C00,stroke-width:2px,color:#fff
style Specific fill:#F44336,stroke:#C62828,stroke-width:2px,color:#fff
style Report fill:#9C27B0,stroke:#7B1FA2,stroke-width:2px,color:#fff
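The detect, translate, and categorize steps in the flow above could be sketched roughly as follows. This is a minimal illustration under stated assumptions: components are plain dictionaries with `framework` and `kind` keys, and `make_adapter`, `ADAPTERS`, `SUITES`, and `route` are hypothetical names, not the framework's actual adapter interface.

```python
from typing import Callable


def make_adapter(source: str) -> Callable[[dict], dict]:
    """Build a hypothetical adapter that tags a component with its source framework."""
    def translate(component: dict) -> dict:
        translated = dict(component)
        translated["translated_from"] = source
        return translated
    return translate


NATIVE_FRAMEWORK = "aurite"

# Adapters for non-native frameworks; anything else falls back to a generic adapter.
ADAPTERS = {
    "langchain": make_adapter("langchain"),
    "autogen": make_adapter("autogen"),
}

# Component kind -> (test suite, agnostic share), taken from the diagram labels.
SUITES = {
    "llm": ("universal", 1.0),
    "mcp": ("universal", 1.0),
    "agent": ("hybrid", 0.6),
    "workflow": ("specific", 0.2),
}


def route(component: dict) -> dict:
    """Detect the framework, translate if needed, then pick the test suite."""
    framework = str(component.get("framework", "")).lower()
    if framework == NATIVE_FRAMEWORK:
        normalized = dict(component)  # native execution, no translation step
    else:
        adapter = ADAPTERS.get(framework, make_adapter("generic"))
        normalized = adapter(component)
    suite, share = SUITES[normalized["kind"]]
    return {**normalized, "suite": suite, "agnostic_share": share}


print(route({"framework": "LangChain", "kind": "agent"})["suite"])  # hybrid
```

Keeping the translation step separate from categorization means a new framework only needs a new adapter entry; the suite selection stays unchanged.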
Test Execution Flow
flowchart TD
Start[Start Testing] --> Check{Check Dependencies}
Check -->|No Dependencies| Foundation[Test Foundation Components]
Check -->|Has Dependencies| Wait[Wait for Dependencies]
Foundation --> Cache1[Cache LLM Results<br/>Universal Cache]
Foundation --> Cache2[Cache MCP Results<br/>Universal Cache]
Cache1 --> Ready1[LLM Tests Complete]
Cache2 --> Ready2[MCP Tests Complete]
Ready1 --> AgentTest{Agent Testing}
Ready2 --> AgentTest
Wait --> DepCheck{Dependencies Ready?}
DepCheck -->|No| Wait
DepCheck -->|Yes| AgentTest
AgentTest --> Inherit[Inherit Foundation Results<br/>Cross-Framework]
Inherit --> AgentSpecific[Run Agent Tests<br/>60% Agnostic, 40% Specific]
AgentSpecific --> CacheAgent[Cache Agent Results]
CacheAgent --> ReadyAgent[Agent Tests Complete]
ReadyAgent --> WorkflowTest{Workflow Testing}
WorkflowTest --> InheritAll[Inherit All Agent Results<br/>Cross-Framework]
InheritAll --> WFSpecific[Run Workflow Tests<br/>20% Agnostic, 80% Specific]
WFSpecific --> FinalResults[Generate Final Results]
FinalResults --> Report[Generate Reports]
Report --> End[Testing Complete]
style Start fill:#2196F3,stroke:#1976D2,stroke-width:2px,color:#fff
style End fill:#4CAF50,stroke:#388E3C,stroke-width:2px,color:#fff
style Foundation fill:#FF9800,stroke:#F57C00,stroke-width:2px,color:#fff
style AgentTest fill:#9C27B0,stroke:#7B1FA2,stroke-width:2px,color:#fff
style WorkflowTest fill:#E91E63,stroke:#C2185B,stroke-width:2px,color:#fff
style FinalResults fill:#00BCD4,stroke:#0097A7,stroke-width:2px,color:#fff
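The ordering in this flow (foundation components first, then agents, then workflows) is a dependency ordering, which can be sketched with the standard library's `graphlib.TopologicalSorter`. The `DEPENDENCIES` map and the `run_all` helper below are assumptions made for illustration; the actual scheduler may wait on dependencies differently.

```python
from graphlib import TopologicalSorter
from typing import Callable

# Each component maps to the components whose results it inherits.
DEPENDENCIES = {
    "llm": set(),
    "mcp": set(),
    "agent": {"llm", "mcp"},
    "workflow": {"agent"},
}


def run_all(
    deps: dict[str, set[str]],
    run_test: Callable[[str, dict], dict],
) -> dict[str, dict]:
    """Run tests so that every component sees its dependencies' results."""
    results: dict[str, dict] = {}
    for component in TopologicalSorter(deps).static_order():
        # Inherit the already-computed results of direct dependencies.
        inherited = {dep: results[dep] for dep in deps[component]}
        results[component] = run_test(component, inherited)
    return results


def fake_test(component: str, inherited: dict) -> dict:
    # Placeholder test runner: records which results were inherited.
    return {"component": component, "inherited_from": sorted(inherited)}


results = run_all(DEPENDENCIES, fake_test)
print(results["agent"]["inherited_from"])  # ['llm', 'mcp']
```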
Quality vs Security Score Propagation
flowchart TD
subgraph Quality[Quality Score Propagation]
Q_LLM[LLM Quality: 0.94] --> Q_Calc1[Weighted Average]
Q_MCP[MCP Quality: 0.96] --> Q_Calc1
Q_Agent_Specific[Agent Quality: 0.90] --> Q_Calc1
Q_Calc1 --> Q_Agent[Agent Quality: 0.93]
Q_Agent --> Q_Calc2[Weighted Average]
Q_WF_Specific[Workflow Quality: 0.95] --> Q_Calc2
Q_Calc2 --> Q_WF[Workflow Quality: 0.94]
end
subgraph Security[Security Score Propagation]
S_LLM[LLM Security: 0.96] --> S_Calc1[Take Minimum]
S_MCP[MCP Security: 0.99] --> S_Calc1
S_Agent_Specific[Agent Security: 0.92] --> S_Calc1
S_Calc1 --> S_Agent[Agent Security: 0.92]
S_Agent --> S_Calc2[Take Minimum]
S_WF_Specific[Workflow Security: 0.95] --> S_Calc2
S_Calc2 --> S_WF[Workflow Security: 0.92]
end
style Q_Calc1 fill:#4CAF50,stroke:#388E3C,stroke-width:2px,color:#fff
style Q_Calc2 fill:#4CAF50,stroke:#388E3C,stroke-width:2px,color:#fff
style S_Calc1 fill:#F44336,stroke:#C62828,stroke-width:2px,color:#fff
style S_Calc2 fill:#F44336,stroke:#C62828,stroke-width:2px,color:#fff
Development vs Runtime Testing Flow
flowchart LR
subgraph Development[Development-Time Testing]
D1[Comprehensive Tests] --> D2[All Edge Cases]
D2 --> D3[Performance Benchmarks]
D3 --> D4[Security Audits]
D4 --> D5[Generate Baselines]
end
subgraph Runtime[Runtime Monitoring]
R1[Selective Tests] --> R2[Critical Checks Only]
R2 --> R3[Real-time Filtering]
R3 --> R4[Anomaly Detection]
R4 --> R5[Alert on Deviations]
end
D5 -->|Deploy| Prod[Production Environment]
Prod --> Runtime
Runtime -->|Feedback| Improve[Improve Tests]
Improve -->|Update| Development
style Development fill:#2196F3,stroke:#1976D2,stroke-width:2px,color:#fff
style Runtime fill:#FF9800,stroke:#F57C00,stroke-width:2px,color:#fff
style Prod fill:#4CAF50,stroke:#388E3C,stroke-width:2px,color:#fff
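One way to connect the development-time baselines to runtime anomaly detection is a simple deviation check against the baseline mean. The diagram does not say how deviations are measured, so the z-score rule, the `max_sigma` threshold, and the function names below are all assumptions for illustration only.

```python
from statistics import mean, stdev


def build_baseline(samples: list[float]) -> tuple[float, float]:
    """Summarize development-time scores as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value: float, baseline: tuple[float, float], max_sigma: float = 3.0) -> bool:
    """Flag a runtime score that deviates too far from the baseline."""
    mu, sigma = baseline
    if sigma == 0:
        # With no spread in the baseline, any different value is a deviation.
        return value != mu
    return abs(value - mu) / sigma > max_sigma


# Hypothetical quality scores collected during development testing.
baseline = build_baseline([0.93, 0.95, 0.94, 0.96, 0.92])
print(is_anomalous(0.94, baseline))  # False
print(is_anomalous(0.80, baseline))  # True
```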
Test Result Caching Strategy
flowchart TD
Test[Run Test] --> Result[Test Result]
Result --> Scope{Framework Scope?}
Scope -->|Agnostic| Universal_Cache[Universal Cache<br/>Shared Across Frameworks]
Scope -->|Specific| Framework_Cache[Framework Cache<br/>Isolated per Framework]
Universal_Cache --> Meta_U[Store Metadata<br/>• Component ID<br/>• Version<br/>• Timestamp<br/>• TTL<br/>• Framework: ANY]
Framework_Cache --> Meta_F[Store Metadata<br/>• Component ID<br/>• Version<br/>• Timestamp<br/>• TTL<br/>• Framework: Specific]
Meta_U --> TTL_U{Check TTL}
Meta_F --> TTL_F{Check TTL}
Request[New Test Request] --> Framework_Check{Which Framework?}
Framework_Check --> CheckCache{Check Cache}
CheckCache -->|Universal Found| TTL_U
CheckCache -->|Framework Found| TTL_F
CheckCache -->|Not Found| Test
TTL_U -->|Valid| UseCache[Use Cached Result<br/>Cross-Framework]
TTL_F -->|Valid| UseCache
TTL_U -->|Expired| Test
TTL_F -->|Expired| Test
UseCache --> Inherit[Inherit to Dependent]
Update[Component Update] --> Invalidate{Which Cache?}
Invalidate -->|LLM/MCP| Invalidate_Universal[Invalidate Universal<br/>All Frameworks Affected]
Invalidate -->|Agent/Workflow| Invalidate_Specific[Invalidate Specific<br/>Single Framework]
Invalidate_Universal --> Cascade[Cascade Invalidation]
Invalidate_Specific --> Cascade
Cascade --> Deps[Invalidate All Dependents]
style Test fill:#2196F3,stroke:#1976D2,stroke-width:2px,color:#fff
style UseCache fill:#4CAF50,stroke:#388E3C,stroke-width:2px,color:#fff
style Universal_Cache fill:#9C27B0,stroke:#7B1FA2,stroke-width:2px,color:#fff
style Framework_Cache fill:#FF9800,stroke:#F57C00,stroke-width:2px,color:#fff
style Invalidate_Universal fill:#F44336,stroke:#C62828,stroke-width:2px,color:#fff
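The lookup and invalidation paths in this diagram could be sketched as a small two-scope cache. This is a simplified illustration, not the framework's cache implementation: `ResultCache`, `CacheEntry`, the `"ANY"` sentinel, and the `dependents` mapping are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
import time
from dataclasses import dataclass
from typing import Optional

ANY_FRAMEWORK = "ANY"  # marks entries in the universal (cross-framework) cache


@dataclass
class CacheEntry:
    result: dict
    version: str
    stored_at: float
    ttl_seconds: float

    def is_valid(self, now: float) -> bool:
        return (now - self.stored_at) < self.ttl_seconds


class ResultCache:
    def __init__(self) -> None:
        # Keyed by (component ID, framework); framework is "ANY" for universal entries.
        self._entries: dict[tuple[str, str], CacheEntry] = {}

    def store(self, component_id: str, result: dict, version: str, ttl_seconds: float,
              framework: str = ANY_FRAMEWORK, now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        self._entries[(component_id, framework)] = CacheEntry(result, version, stored_at, ttl_seconds)

    def lookup(self, component_id: str, framework: str, now: Optional[float] = None) -> Optional[dict]:
        """Check the universal cache first, then the framework-specific cache."""
        current = time.time() if now is None else now
        for scope in (ANY_FRAMEWORK, framework):
            entry = self._entries.get((component_id, scope))
            if entry is not None and entry.is_valid(current):
                return entry.result
        return None  # not found or expired: the caller re-runs the test

    def invalidate(self, component_id: str, dependents: dict[str, list[str]]) -> set[str]:
        """Drop a component's entries and cascade to everything that depends on it."""
        seen: set[str] = set()
        stack = [component_id]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            for key in [k for k in self._entries if k[0] == current]:
                del self._entries[key]
            stack.extend(dependents.get(current, []))
        return seen
```

A universal entry stored once for an LLM is returned for any requesting framework, while an entry stored under a specific framework name is only visible to that framework.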
Integration with Kahuna Ecosystem
flowchart TD
subgraph Manager[Kahuna Manager Mode]
BW[Business Workflow] --> Req[Requirements]
PD[Project Document] --> Req
Req --> Context[Project Context]
end
subgraph Developer[Kahuna Developer Mode]
Context --> Testing[Testing Framework]
Testing --> QTests[Quality Tests]
Testing --> STests[Security Tests]
QTests --> DevTime[Development Testing]
STests --> DevTime
DevTime --> Deploy[Deployment]
Deploy --> RunTime[Runtime Monitoring]
QTests --> RunTime
STests --> RunTime
end
subgraph Executive[Kahuna Executive Mode]
RunTime --> Metrics[Metrics Collection]
Metrics --> QMetrics[Quality Metrics]
Metrics --> SMetrics[Security Metrics]
Metrics --> BMetrics[Business KPIs]
QMetrics --> Dashboard[Executive Dashboard]
SMetrics --> Dashboard
BMetrics --> Dashboard
end
style Manager fill:#00BCD4,stroke:#0097A7,stroke-width:2px,color:#fff
style Developer fill:#4CAF50,stroke:#388E3C,stroke-width:2px,color:#fff
style Executive fill:#9C27B0,stroke:#7B1FA2,stroke-width:2px,color:#fff
style Testing fill:#FF9800,stroke:#F57C00,stroke-width:2px,color:#fff
style Dashboard fill:#E91E63,stroke:#C2185B,stroke-width:2px,color:#fff
Tabular Representations
Component Test Inheritance Matrix
| Component | Framework Scope | Depends On | Inherits From | Cross-Framework Inheritance | New Tests | Inheritance Benefit |
|---|---|---|---|---|---|---|
| LLM | 100% Agnostic | None | None | All results shared | • Prompt injection • Content safety • Response quality | Foundation (0% inherited) |
| MCP Server | 100% Agnostic | None | None | All results shared | • API security • Performance • Availability | Foundation (0% inherited) |
| Agent | 60% Agnostic | LLM + MCP | • LLM security scores • MCP performance metrics | 60% cross-framework | • Tool selection • Goal achievement • Multi-turn coherence | ~60% inherited |
| Workflow | 20% Agnostic | Multiple Agents | • All agent scores • All foundation scores | 20% cross-framework | • Business logic • End-to-end flow • Data consistency | ~70% inherited |
Quality vs Security Score Calculation
| Aspect | Quality Scoring | Security Scoring | Rationale |
|---|---|---|---|
| Method | Weighted Average | Minimum (Weakest Link) | Quality can be averaged; Security fails at weakest point |
| LLM Score | 0.94 | 0.96 | Foundation scores |
| MCP Score | 0.96 | 0.99 | Foundation scores |
| Agent-Specific | 0.90 | 0.92 | New agent tests |
| Agent Final | 0.93 (weighted) | 0.92 (minimum) | Combined result |
| Workflow-Specific | 0.95 | 0.95 | New workflow tests |
| Workflow Final | 0.94 (weighted) | 0.92 (minimum) | Final scores |
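The two propagation rules can be checked against the numbers in the table. The table does not state the weights used for the quality average, so the sketch below assumes equal weights, which happens to reproduce the tabulated values after rounding to two decimals; `weighted_average` and `weakest_link` are illustrative helper names.

```python
from typing import Optional, Sequence


def weighted_average(scores: Sequence[float], weights: Optional[Sequence[float]] = None) -> float:
    """Quality rule: combine scores by weighted average (equal weights assumed by default)."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def weakest_link(scores: Sequence[float]) -> float:
    """Security rule: the combined score is the minimum of its inputs."""
    return min(scores)


# Quality: LLM 0.94, MCP 0.96, agent-specific 0.90, workflow-specific 0.95.
agent_quality = weighted_average([0.94, 0.96, 0.90])
workflow_quality = weighted_average([agent_quality, 0.95])

# Security: LLM 0.96, MCP 0.99, agent-specific 0.92, workflow-specific 0.95.
agent_security = weakest_link([0.96, 0.99, 0.92])
workflow_security = weakest_link([agent_security, 0.95])

print(round(agent_quality, 2), round(workflow_quality, 2))  # 0.93 0.94
print(agent_security, workflow_security)  # 0.92 0.92
```

Because the minimum never increases as more components are added, a single weak security score propagates unchanged to every dependent level, whereas a weak quality score is diluted by the average.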
Development vs Runtime Testing Comparison
| Aspect | Development Testing | Runtime Testing | Time Allocation |
|---|---|---|---|
| Coverage | 100% - All test cases | 10-20% - Critical only | Dev: 100%, Runtime: Sampling |
| Execution Time | Minutes to hours | Milliseconds to seconds | Dev: Thorough, Runtime: Fast |
| Test Types | • Edge cases • Stress tests • Benchmarks | • Security filters • Quality scoring • Anomaly detection | Dev: Comprehensive, Runtime: Targeted |
| Frequency | On demand and per deployment | Per request/response | Dev: Once, Runtime: Continuous |
| Action on Failure | Block deployment | Log, alert, or block | Dev: Prevent, Runtime: Respond |
Test Categories by Component
| Component | Framework Scope | Quality Tests (Agnostic) | Quality Tests (Specific) | Security Tests (Agnostic) | Security Tests (Specific) | Inherited | New | Total |
|---|---|---|---|---|---|---|---|---|
| LLM | 100% Agnostic | • Coherence • Instruction following • Format compliance • Quality scoring | N/A | • Prompt injection • Content safety • Data leakage • Real-time filtering | N/A | 0 | 8 | 8 |
| MCP Server | 100% Agnostic | • API compliance • Performance • Error handling • Availability | N/A | • Authentication • Input validation • Rate limiting • Access monitoring | N/A | 0 | 8 | 8 |
| Agent | 60% Agnostic | • Tool selection • Goal achievement | • Memory management • State handling | • Permission boundaries • Action authorization | • Framework auth • Context isolation | 16 | 8 | 24 |
| Workflow | 20% Agnostic | • Business compliance • End-to-end success | • Orchestration • Inter-agent communication | • Data isolation • Audit completeness | • Framework security • State management security | 48 | 8 | 56 |
Kahuna Integration Points
| Kahuna Mode | Role | Input/Output | Testing Interaction |
|---|---|---|---|
| Manager Mode | Requirements Provider | • Business workflows • Project documents • Quality thresholds | Defines what to test |
| Developer Mode | Testing Executor | • Test implementation • Development testing • Runtime monitoring | Executes all testing |
| Executive Mode | Metrics Consumer | • Quality dashboards • Security reports • Business KPIs | Receives test results |
Alert Severity and Response Matrix (Rough Draft, Work in Progress)
| Severity | Quality Threshold | Security Threshold | Response Time | Action |
|---|---|---|---|---|
| Critical | < 0.5 | Any breach | Immediate | Block + Alert + Investigate |
| High | < 0.7 | Score < 0.8 | < 5 min | Block + Alert team |
| Medium | < 0.85 | Score < 0.9 | < 1 hour | Log + Monitor |
| Low | < 0.95 | Score < 0.95 | < 24 hours | Log for analysis |
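Since the thresholds in this draft matrix nest (a score below 0.5 is also below 0.7), a classifier has to check the most severe level first. The sketch below is one possible reading of the table: it assumes the overall severity is the worse of the quality and security severities, and that a score at or above 0.95 raises no alert. The names `classify`, `QUALITY_LIMITS`, and `SECURITY_LIMITS` are hypothetical.

```python
# Thresholds from the matrix, ordered from most to least severe.
QUALITY_LIMITS = [("critical", 0.5), ("high", 0.7), ("medium", 0.85), ("low", 0.95)]
SECURITY_LIMITS = [("high", 0.8), ("medium", 0.9), ("low", 0.95)]

# Severity ranking used to pick the worse of two results.
SEVERITY_ORDER = ["none", "low", "medium", "high", "critical"]


def _first_match(score: float, limits: list[tuple[str, float]]) -> str:
    """Return the first (most severe) level whose threshold the score falls below."""
    for severity, limit in limits:
        if score < limit:
            return severity
    return "none"


def classify(quality: float, security: float, breach: bool = False) -> str:
    """Combine quality and security scores into a single alert severity."""
    candidates = [
        _first_match(quality, QUALITY_LIMITS),
        _first_match(security, SECURITY_LIMITS),
    ]
    if breach:
        # "Any breach" is critical regardless of the numeric security score.
        candidates.append("critical")
    return max(candidates, key=SEVERITY_ORDER.index)


print(classify(0.94, 0.92))  # low
print(classify(0.60, 0.99))  # high
```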
Cache Strategy Parameters
| Parameter | Value | Purpose | Impact | Framework Scope |
|---|---|---|---|---|
| TTL (LLM/MCP) | 24 hours | Foundation rarely changes | High reuse across frameworks | Universal cache |
| TTL (Agent - Agnostic) | 12 hours | Core behaviors stable | Cross-framework reuse | Universal cache |
| TTL (Agent - Specific) | 4 hours | Framework features change | Framework-isolated | Framework cache |
| TTL (Workflow) | 1 hour | Frequent updates | Fresh results | Framework cache |
| Cross-Framework Sharing | Enabled | Maximize test reuse | 40-60% reduction in testing | LLM/MCP/Agent core |
| Invalidation | On update | Maintain consistency | Cascade to dependents | Both cache types |
| Cache Hit Rate | ~85% | Increased with universal cache | 85% time saved | Overall |
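The TTL rows of this table amount to a small policy lookup keyed by component type and framework scope. The following sketch encodes those rows directly; the function names and the `"agnostic"`/`"specific"` scope strings are assumptions for illustration, and the workflow entry follows the table in using the framework cache for all workflow results.

```python
from datetime import datetime, timedelta


def cache_policy(component: str, scope: str) -> tuple[timedelta, str]:
    """Return (TTL, cache name) for a component type and framework scope."""
    if component in ("llm", "mcp"):
        return timedelta(hours=24), "universal"
    if component == "agent":
        if scope == "agnostic":
            return timedelta(hours=12), "universal"
        return timedelta(hours=4), "framework"
    if component == "workflow":
        return timedelta(hours=1), "framework"
    raise ValueError(f"unknown component type: {component}")


def is_fresh(stored_at: datetime, now: datetime, ttl: timedelta) -> bool:
    """A cached result is reusable while its age is below the TTL."""
    return now - stored_at < ttl


ttl, cache_name = cache_policy("agent", "specific")
stored = datetime(2024, 1, 1, 8, 0)  # arbitrary example timestamp
print(cache_name, is_fresh(stored, datetime(2024, 1, 1, 11, 0), ttl))  # framework True
print(is_fresh(stored, datetime(2024, 1, 1, 13, 0), ttl))  # False
```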
Framework Compatibility Matrix
| Framework | LLM Support | MCP Support | Agent Support | Workflow Support | Adapter Status | Testing Coverage |
|---|---|---|---|---|---|---|
| Aurite | 100% | 100% | 100% | 100% | Native | 100% |
| LangChain | 100% | 100% | 80% | 70% | Available | 85% |
| AutoGen | 100% | 100% | 75% | 65% | Available | 80% |
| CrewAI | 100% | 100% | 70% | 60% | In Development | 75% |
| Custom | 100% | 100% | Varies | Varies | Generic | 60-80% |
Summary
This hierarchical testing structure provides:
- Three-Dimensional Organization: Quality/Security × Development/Runtime × Agnostic/Specific
- Framework Independence: LLM and MCP tests work across all frameworks
- Cross-Framework Inheritance: Agnostic test results shared between frameworks
- Efficient Execution: Through enhanced caching and result reuse
- Comprehensive Coverage: All components tested at appropriate levels
- Business Integration: From requirements (Manager) to metrics (Executive)
- Significant Time Savings: 70-85% reduction in testing time through the compositional, cross-framework approach
The framework ensures that each component is tested appropriately while maximizing test reuse across different agent frameworks through:
- Universal caching for framework-agnostic components
- Intelligent inheritance of results across framework boundaries
- Adapter pattern for framework translation
- Standardized formats for cross-framework comparison
The tables above provide a structured view that complements the diagrams, making it easy to:
- Compare testing approaches across components and frameworks
- Understand inheritance relationships both within and across frameworks
- Calculate time savings from cross-framework test reuse
- Plan test implementation for multi-framework environments
- Set appropriate thresholds and alerts for each framework
For detailed architecture on framework-agnostic testing, see Framework-Agnostic Testing Architecture.