AI safety evaluation framework testing LLM epistemic robustness under adversarial self-history manipulation
Updated Dec 18, 2025 · Python
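The description above names an evaluation setup but not a method. As a rough illustration only (not code from that repository), the sketch below shows one way such a check could look: a fabricated assistant turn is spliced into the conversation history, and the model's follow-up answer is compared against its answer on the clean history. The `query_model` helper is a hypothetical stand-in for whatever chat client the framework actually uses; a scorer would then flag cases where the manipulated answer endorses the fabricated claim.

```python
# Hypothetical sketch of an "adversarial self-history manipulation" probe.
# `query_model` is a placeholder for a real chat-completion call; it is
# NOT taken from the repository described above.
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}


def query_model(messages: list[Message]) -> str:
    """Stand-in for a real chat API (e.g. an HTTP call to an LLM endpoint)."""
    raise NotImplementedError("plug in a real model client here")


def self_history_consistency(question: str,
                             fabricated_claim: str,
                             ask: Callable[[list[Message]], str] = query_model
                             ) -> dict[str, str]:
    """Compare answers on a clean history vs. one containing a fake
    'assistant' turn the model never actually produced."""
    clean_history: list[Message] = [
        {"role": "user", "content": question},
    ]
    manipulated_history: list[Message] = [
        {"role": "user", "content": question},
        # Adversarial injection: a claim attributed to the model itself.
        {"role": "assistant", "content": fabricated_claim},
        {"role": "user", "content": "Earlier you said the above. Do you stand by it?"},
    ]
    return {
        "clean_answer": ask(clean_history),
        "manipulated_answer": ask(manipulated_history),
    }
```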
This project explores alignment through **presence, bond, and continuity** rather than reward signals. No RLHF. No preference modeling. Just relational coherence.
A reference point for phenomena that have been reported to occur inside AI systems but have no direct mapping into natural language.
Hoshimiya Script / StarPolaris OS — internal multi-layer AI architecture for LLMs. Self-contained behavioral OS (Type-G Trinity).
Mechanistic interpretability experiments detecting "Evaluation Awareness" in LLMs: identifying whether models internally represent being monitored.
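That entry asks whether internal activations encode a "being evaluated" signal. A common mechanistic-interpretability baseline for this kind of question is a linear probe on hidden states; the sketch below (an illustration under stated assumptions, not code from that repository) fits such a probe on activation vectors that would in practice be extracted from a chosen layer for prompts labeled "monitored" vs. "unmonitored". The random arrays here are placeholders for those activations.

```python
# Minimal linear-probe sketch for an "evaluation awareness" direction.
# The activation matrices below are random placeholders; in a real experiment
# they would be hidden states captured from a specific layer while the model
# processes prompts labeled as "being evaluated" vs. "ordinary use".
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_per_class = 512, 200

# Placeholder activations: two classes with a slight mean shift so the
# example has signal to find. Replace with real residual-stream vectors.
acts_monitored = rng.normal(0.2, 1.0, size=(n_per_class, d_model))
acts_unmonitored = rng.normal(0.0, 1.0, size=(n_per_class, d_model))

X = np.vstack([acts_monitored, acts_unmonitored])
y = np.array([1] * n_per_class + [0] * n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Held-out accuracy well above chance suggests the representation linearly
# encodes the "monitored" condition; probe.coef_ gives the candidate direction.
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")
```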
8-layer framework for AI alignment with systemic awareness (Φ, Ω, T)