Phase 2 Bridge Implementation

Status: ✅ Complete (4/4 tests passing)
Date: 2025-01-XX
Purpose: Bridge Phase 1 (The Librarian) to Phase 2 (The Reasoner)

Components Implemented

1. Call Graph Extraction (src/librarian/parser.py)

  • Methods Added:
    • _extract_python_function_calls() - AST traversal for Python call expressions
    • _extract_js_function_calls() - AST traversal for JavaScript/TypeScript calls
  • Features:
    • Fuzzy matching - captures function names only (ignoring arguments)
    • Handles method calls (obj.method()) and function calls (foo())
    • Integrated into _extract_python_function_node() and _extract_js_function_node()
  • Test Results: ✅ Detects 3 calls in main_function (helper_function, print, process_data)

2. Call Edge Construction (src/librarian/graph_constructor.py)

  • Method: construct_call_edges() - Full implementation (~100 lines)
  • Features:
    • Creates CALLS relationships between functions in Neo4j
    • Fuzzy matching by function name across files
    • Handles same-name functions with preference for same file
    • Auto-skips built-in functions (print, len, etc.)
  • Integration: Used by build_complete_graph() in construction pipeline

3. Context Serializer (src/librarian/context_serializer.py)

  • Classes:
    • CompressedContextBlock - Dataclass for serialized output
    • ContextSerializer - Main serializer with XML/Markdown support
  • Methods:
    • serialize_file_context() - Convert graph JSON to LLM format
    • serialize_subgraph() - Multi-file context serialization
    • serialize_symbol_definition() - Single symbol focused context
    • create_context_window() - Token budget aware context assembly
  • Features:
    • XML format: 143 tokens per file (structured, verbose)
    • Markdown format: 57 tokens per file (compact, readable)
    • Token estimation using tiktoken (cl100k_base encoding)
  • Test Results: ✅ Both formats working correctly

4. Pydantic Schemas (src/architect/schemas.py)

  • Models:
    • RefactorPlan - Top-level refactor plan with risk assessment
    • FileChange - Individual file modification specification
    • DependencyImpact - Dependency analysis for affected files
    • DiffSkeleton - Deprecated legacy format (retained for compatibility)
    • ValidationResult - Schema validation output
  • Enums:
    • RefactorOperation - create, modify, delete, rename, move, extract, inline
    • ChangeType - import, class, function, method, variable, parameter
    • ImpactType - call, inheritance, import, type_usage
    • RiskLevel - low, medium, high, critical
  • Features:
    • Full JSON serialization with Pydantic 2.12.5
    • Validation helper: validate_refactor_plan()
    • 350+ lines of schema definitions
  • Test Results: ✅ Validation and JSON serialization confirmed

Test Suite (scripts/test_phase2_bridge.py)

===========================================
Tests Passed: 4/4
✓ Call Graph Extraction
✓ Context Serializer
✓ Diff Skeleton Validation
✓ End-to-End Workflow
===========================================

Test Coverage:

  1. test_call_graph_extraction - Parses Python functions, verifies call detection
  2. test_context_serializer - Tests both XML and Markdown serialization
  3. test_diff_skeleton_validation - Validates Pydantic schemas with JSON round-trip
  4. test_end_to_end_workflow - Full pipeline: graph query → serialization → validation

Dependencies Added

  • pydantic>=2.0.0 - Schema validation (installed: 2.12.5)
  • tiktoken - Token counting (already present)

Integration Points

Phase 1 (The Librarian)

  • Input: Codebase files
  • Process: Parse → Extract calls → Build graph with CALLS edges
  • Output: Neo4j graph with File, Class, Function nodes + CONTAINS, IMPORTS, CALLS edges

Phase 2 Bridge (Current)

  • Input: Neo4j graph from Phase 1
  • Process: Query graph → Serialize context → Validate refactor plans
  • Output: LLM-ready context + validated refactor schemas

Phase 2 (The Reasoner) - Next

  • Input: Serialized context from bridge
  • Process: LLM reasoning → Generate refactor plan
  • Output: Validated RefactorPlan Pydantic model
  • Validation: Uses schemas.py for plan verification

File Changes Summary

New Files Created:

  • src/librarian/context_serializer.py (~400 lines)
  • src/architect/__init__.py (module init)
  • src/architect/schemas.py (~350 lines)
  • scripts/test_phase2_bridge.py (~250 lines)

Files Modified:

  • src/librarian/parser.py - Added call extraction methods (~85 lines)
  • src/librarian/graph_constructor.py - Implemented construct_call_edges() (~100 lines)
  • requirements.txt - Added pydantic>=2.0.0

Total Lines Added: ~1,185 lines

Known Limitations

  1. INHERITS_FROM Warning: Neo4j warns about missing INHERITS_FROM relationships (expected - not created yet)
  2. Call Graph Accuracy:
    • Fuzzy matching may create false positives for common function names
    • No type analysis or scope resolution yet
  3. Context Serialization:
    • Token estimates are approximate (actual LLM tokens may vary)
    • Large codebases may exceed context windows

Next Steps for Phase 2

  1. Implement The Reasoner:
    • Integrate Jamba 1.5 Mini or similar LLM
    • Use ContextSerializer output as LLM input
    • Parse LLM output into RefactorPlan Pydantic models
  2. Add Inheritance Tracking:
    • Create INHERITS_FROM edges in graph_constructor.py
    • Update context serializer to include inheritance chains
  3. Enhance Call Graph:
    • Add scope analysis to disambiguate function calls
    • Implement type inference for method resolution
  4. Context Window Management:
    • Implement smart context pruning for large codebases
    • Add relevance scoring for symbol prioritization

References

  • GitHub Repository: https://github.com/vivek5200/ouroboros
  • Neo4j Database: bolt://localhost:7687 (Docker container)
  • Pydantic Docs: https://docs.pydantic.dev/2.0/
  • Tree-sitter: https://tree-sitter.github.io/tree-sitter/

Validation Command:

cd 'g:\Just a Idea'
.\venv\Scripts\Activate.ps1
$env:PYTHONPATH='g:\Just a Idea'
python scripts\test_phase2_bridge.py

Expected output: Tests Passed: 4/4


Copyright © 2025 Vivek Bendre. Distributed by an MIT license.