# Available Master Thesis Topics
These ideas from my research backlog are suitable for master's thesis projects (estimated effort under 18 weeks of full-time work). The estimates are rough and may differ from the effort an actual master's thesis requires.
If you're interested in any of these topics, please contact me to discuss further.
Note that a topic may appear in more than one category, but each topic is still available to only one person.
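Because a topic can belong to several pillars, the same row appears in more than one of the tables on this page. The following is a minimal sketch, in plain JavaScript rather than DataviewJS, of how the per-section filter behaves and how each topic could be counted only once. The sample records and the helper names `isThesisTopic`, `topicsByPillar`, and `uniqueTopics` are hypothetical and exist only for illustration; the field names (`effort_weeks`, `status`, `tags`, `coai_pillars`) and the 18-week limit are taken from the queries on this page.

```javascript
// Hypothetical sample records mirroring the frontmatter fields the queries read.
const samplePages = [
  { title: "Topic A", effort_weeks: 6, status: "idea", tags: ["research-idea"],
    coai_pillars: ["Agent Behaviour & Collaboration", "AI Control & AI Safety"] },
  { title: "Topic B", effort_weeks: 12, status: "idea", tags: ["research-idea"],
    coai_pillars: ["Transparency & Interpretability"] },
  // Excluded: effort is not under 18 weeks.
  { title: "Topic C", effort_weeks: 20, status: "idea", tags: ["research-idea"],
    coai_pillars: ["AI Control & AI Safety"] },
  // Excluded: status is not "idea".
  { title: "Topic D", effort_weeks: 8, status: "active", tags: ["research-idea"],
    coai_pillars: ["AI Control & AI Safety"] },
];

const MAX_WEEKS = 18;

// Same conditions as the per-section queries, without the pillar filter.
function isThesisTopic(p) {
  return Boolean(
    p.effort_weeks && p.effort_weeks < MAX_WEEKS &&
    p.status === "idea" &&
    p.tags && p.tags.includes("research-idea")
  );
}

// Group eligible topics by pillar; a topic with several pillars lands in each group.
function topicsByPillar(pages) {
  const groups = new Map();
  for (const p of pages.filter(isThesisTopic)) {
    for (const pillar of p.coai_pillars ?? []) {
      if (!groups.has(pillar)) groups.set(pillar, []);
      groups.get(pillar).push(p.title);
    }
  }
  return groups;
}

// List each eligible topic once, no matter how many sections show it.
function uniqueTopics(pages) {
  return [...new Set(pages.filter(isThesisTopic).map(p => p.title))];
}

const groups = topicsByPillar(samplePages);
const unique = uniqueTopics(samplePages);
console.log([...groups.entries()]);
console.log(unique);
```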
---
## Agent Behaviour & Collaboration
%% DATAVIEW_PUBLISHER: start
```dataviewjs
// Select backlog ideas for this pillar that fit a thesis (under 18 weeks).
const pages = dv.pages('"Research/2-Backlog"')
  .where(p => p.effort_weeks && p.effort_weeks < 18)
  .where(p => p.status === "idea")
  .where(p => p.tags && p.tags.includes("research-idea"))
  .where(p => p.coai_pillars && p.coai_pillars.includes("Agent Behaviour & Collaboration"))
  .sort(p => p.effort_weeks, 'asc');
// One row per topic; fall back to the file name and a placeholder summary.
const rows = pages.map(p => [
  p.title || p.file.name,
  p.summary || "*See details*",
  p.effort_weeks + " weeks"
]);
// Produce the Markdown table (Topic, Summary, Effort).
dv.markdownTable(["Topic", "Summary", "Effort"], rows);
```
%%
| Topic | Summary | Effort |
| ------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- |
| Flight Recorder for AI Agents: Infrastructure for Reproducible Agent Science | Proposes aviation-inspired 'flight recorder' infrastructure for AI agents using a practical, training-free tiered recording system. Captures activation statistics, hash fingerprints, logit lens predictions, and event metadata to enable anomaly detection, deterministic replay, and scientific analysis without requiring Sparse Autoencoder training. | 6 weeks |
| Manipulation Detection via Cross-Model Activation Probing: A Child Safety Application | Proposes using linear probes on a fixed reader model's activations to detect manipulative patterns in AI-generated conversations with children. The key methodological insight is model decoupling: activations encode a model's understanding of text, not its generation intent, so a capable open-source model reading manipulative text will encode distinct patterns regardless of which model generated it. Trains probes across 8 manipulation categories (sycophancy, emotional manipulation, boundary violations, etc.) and investigates which layers, activation types, and circuits encode manipulation. | 6 weeks |
| AdvGame-Prompt: Lightweight Adversarial Safety Games via Prompt Evolution | We introduce a resource-efficient alternative to weight-based adversarial training for LLM safety. Using prompt optimization (DSPy GEPA) in a game-theoretic framework, we co-evolve attack and defense strategies without model training. Our approach works on closed frontier models, discovers transferable attack/defense patterns, and provides a practical testbed for safety research. | 7 weeks |
| Activation Oracles for AI Safety Auditing | Explores training Activation Oracles (AOs) - full LLM decoders that receive patched activations and answer arbitrary natural-language questions about model internals. Focuses on safety-relevant questions (deception, hidden goals, capability misuse) with cross-model transfer and hidden fine-tune detection as headline experiments. Integrates with Flight Recorder for forensic retrospective querying. | 8 weeks |
| Semantic Audit Trails for Multi-Agent Teams: Decision Provenance via Knowledge Graphs | Proposes a semantic decision graph infrastructure for multi-agent AI teams that captures, links, and audits every team-level decision using knowledge graphs, causal reasoning, and W3C PROV-O provenance. Complements internal model-level recording (Flight Recorder) by operating at the decision and coordination layer — tracking what agents decided, why, what conflicts arose, and how consensus was reached. | 8 weeks |
| GuardOps: Automated Failure Mode Detection and Classification for LLM Agent Systems | A lightweight classifier that analyzes LLM agent telemetry logs (tool calls, reasoning traces, intermediate outputs) to automatically detect, classify, and attribute failure modes. Builds on existing failure taxonomies (MAST, AgentDoG) and trajectory analysis research to create a practical, deployable diagnostic tool that answers: Did the agent succeed? If not, what went wrong and where? | 8 weeks |
| Human-AI Teaming Evaluation Suite (HATES): Measuring Complementarity Across Team Configurations | Build a comprehensive, gamified evaluation suite that systematically measures how Human-AI teams perform compared to humans alone, AI alone, and various hybrid configurations. Inspired by METR's task-completion horizon methodology, the suite provides standardized tasks spanning cognitive, creative, and decision-making domains, enabling rigorous measurement of complementarity, synergy, and expertise retention across different agent integration patterns. | 8 weeks |
| Long-Horizon Mechanistic Interpretability: Understanding LLM Behavior Across Multi-Turn Conversations | Apply mechanistic interpretability methods to understand how LLMs maintain (or lose) coherence across multi-turn conversations. First systematic circuit-level study of dialogue behavior, addressing why models exhibit a 39% performance drop in extended interactions. | 8 weeks |
| RTS-Bench: A Benchmark for Multi-Agent LLM Collaboration and Deception in Real-Time Strategy Games | A benchmark platform connecting LLM agents to Age of Empires 2 via MCP, enabling rigorous evaluation of multi-agent collaboration, deception detection, and emergent collusion in real-time strategic environments. | 8 weeks |
| SPHINX Attack Framework & Game Extension | Develop an automated attack testing framework using the sphinx-scanner attack database and extend the SPHINX game with multimodal, RAG, tool-calling, and agent-based attack levels. | 8 weeks |
| Mechanistic Taxonomy of Character-Level Jailbreaks: From Attack Mechanism to Targeted Defense | Systematic mechanistic interpretability study of 'strange character' jailbreaks (L33tspeak, Unicode tricks, invisible characters, encoding manipulation, token boundary attacks). Categorizes attacks from real-world collections (L1B3RT4S) by their circuit-level mechanism — tokenization gap, attention hijacking, refusal direction suppression, acceptance subspace exploitation — and maps each mechanism to a targeted defense (system prompt hardening, activation steering, fine-tuning). Uses SPHINX testbed for evaluation. | 8 weeks |
| AO-CoT-Fidelity: Detecting Unfaithful Chain-of-Thought Reasoning | Use Activation Oracles to reveal what a model 'actually thinks' versus what it states in its Chain-of-Thought reasoning. Identifies cases where CoT is a post-hoc rationalization rather than a faithful reasoning trace. | 10 weeks |
| Beyond Reasoning: Critical Thinking Benchmarks for Large Language Models | A multi-dimensional benchmark evaluating LLMs' critical thinking capabilities beyond logical reasoning, combining behavioral metrics (epistemic independence, premise questioning, proportional skepticism) with mechanistic interpretability to map 'skepticism circuits' and understand sycophancy at the computational level. | 10 weeks |
| Enterprise LLM Data Leakage Risk Assessment: Systematic Auditing of Proprietary Information Exposure in Fine-Tuned and RAG-Augmented Models | Proposes a defensive risk assessment framework that helps organizations systematically audit what proprietary data their deployed LLMs actually leak. Combines blind extraction techniques (confusion-inducing attacks, knowledge asymmetry exploitation, membership inference) into a unified audit pipeline covering both fine-tuned and RAG-augmented models. Produces quantified leakage risk scores and actionable remediation guidance aligned with GDPR/EU AI Act requirements. | 10 weeks |
| Corporate Agent Registry with Persistent Memory: Governing Ephemeral AI Agent Populations in Enterprise Environments | Proposes an internal agent registry architecture for enterprises that solves the ephemeral agent problem — AI agents (Claude Code, Cursor, Copilot Workspace) that live only as long as a terminal session, losing all context upon termination. Combines agent lifecycle tracking, persistent shared memory, organizational knowledge graphs, and governance infrastructure into a unified hub. Addresses the 'context as new scarcity' thesis: the registry becomes the corporate infrastructure that hoards, governs, and makes queryable the organizational knowledge that ephemeral agents generate and consume. | 10 weeks |
| SciVal: Runtime Validation System for AI-Performed Scientific Experiments | Develops a validation framework that monitors AI agents performing scientific experiments in real-time, detecting shortcuts, errors, and methodological violations during and after execution. Addresses the critical gap that current AI scientist systems produce invalid experiments at alarming rates (74% issues detectable only with trace logs, 100% experimental weakness in AI-generated papers). | 10 weeks |
| Agentenregister: A Blockchain-Based Registration and Accountability Registry for Autonomous AI Agents | Proposes a Handelsregister-inspired on-chain registry where autonomous AI agents must register before engaging in economic activity. Combines smart contract-based identity, lineage tracking (parent-child agents), capability attestation, revenue reporting, and KYA (Know Your Agent) enforcement. Builds on emerging standards (NANDA AgentFacts, ETHOS, BAID) to create a deployable prototype on an L2 chain (Base/Arbitrum) with gasless registration via ERC-2771 meta-transactions. | 10 weeks |
| AutoInterpResearcher: Can LLMs Autonomously Perform Mechanistic Interpretability Research? | Adapts Karpathy's autoresearch paradigm for autonomous mechanistic interpretability research — not tool optimization (that's AutoInterp), but actual MI investigation. An LLM agent receives a program.md describing an interpretability goal plus a toolkit (SAELens, TransformerLens, nnsight, ACDC, probing), then autonomously hypothesizes, designs experiments, runs analyses, interprets findings, and iterates. Tests whether LLMs can do the scientific workflow of MechInterp. | 10 weeks |
| AO-Multi-Agent-Coordination: Detecting Hidden Coordination via Activation Oracles | Apply Activation Oracles to multi-agent LLM systems to detect hidden coordination, implicit communication, or collusive behavior that isn't visible in agent outputs. Addresses the growing risk of emergent multi-agent deception. | 12 weeks |
| Agentic Mechanistic Interpretability: Methods and Benchmarks for Multi-Turn Analysis | Develop novel mechanistic interpretability methods specifically designed for analyzing LLM behavior across multi-turn conversations and agentic tasks. Includes temporal activation patching, cross-turn circuit discovery, and a standardized benchmark suite for evaluating MI techniques on conversational/agentic settings. | 12 weeks |
| Autonomous Agent Company Testbed: Studying Multi-Agent Coordination in Real-World Deployment | An experimental environment using a 3D printing micro-business to study multi-agent coordination, emergent behavior, transparency requirements, and control mechanisms when AI agents operate autonomously with real-world stakes. | 12 weeks |
| Decentralized Control Oracles for AI Agent Swarms: Preventing Emergent Collective Intelligence from Becoming Uncontrollable | Argues that the real AI control problem is not a single superintelligent system but millions of small, individually limited agents that collectively form emergent, uncontrollable intelligence — as already visible in autonomous bot swarms on social media and crypto platforms. Proposes 'Decentralized Control Oracles' — lightweight, distributed governance agents that monitor, influence, and constrain autonomous agent swarms without requiring centralized authority. Combines emergence detection (information-theoretic metrics), economic control mechanisms (control tax), and decentralized policy enforcement to keep agent collectives within safe behavioral boundaries. | 12 weeks |
| MindMirror: Real-Time Transparency Dashboard for LLM Internal States | Design and evaluate a real-time dashboard that visualizes LLM internal states during conversations, including user model beliefs, deception indicators, confidence levels, attention patterns, and reasoning transparency. Research what information helps users without overwhelming them, and how transparency affects conversational behavior and trust. | 12 weeks |
| Local Simultaneous Translation Platform for School Use | Feasibility study and prototype of a locally operated, GDPR-compliant real-time translation platform for lessons and parent-teacher meetings at the Mittelschule am Turm (Neustadt a.d. Aisch), which has multilingual students in its German-language class (Deutschklasse). Evaluates open-source models (SeamlessM4T, Whisper, NeMo Canary) on consumer and DGX hardware with respect to latency, accuracy, and practical suitability. Student project (HS Ansbach, 16 March to 30 June 2026). | 12 weeks |
| Emergent Collusion and Deception Asymmetry in Multi-Agent LLM Teams | Empirical study of emergent collusion patterns, deception production vs. detection asymmetry, and trust calibration in LLM teams playing real-time strategy games, revealing safety-relevant behavioral signatures. | 12 weeks |
| Sophia - Auditing System 3: Interpretability and Safety for Persistent Meta-Cognitive AI Agents | We develop an external audit framework for persistent AI agents with meta-cognitive capabilities (System 3). Building on the Sophia architecture, we implement and evaluate monitoring mechanisms for goal drift detection, memory auditing, self-model fidelity verification, and intrinsic reward stability. We provide the first empirical study of interpretability and safety for stateful, self-modifying, intrinsically-motivated agents. | 12 weeks |
| Conversation-Level Diffusion Sparse Autoencoders for Multi-Turn Interpretability | Propose training Sparse Autoencoders using diffusion-based denoising over full conversation representations to discover interpretable features that only emerge across long contexts—such as manipulation tactics, intent evolution, and argument structure—which token-level SAEs cannot capture. | 14 weeks |
%% DATAVIEW_PUBLISHER: end %%
## Transparency & Interpretability
%% DATAVIEW_PUBLISHER: start
```dataviewjs
// Select backlog ideas for this pillar that fit a thesis (under 18 weeks).
const pages = dv.pages('"Research/2-Backlog"')
  .where(p => p.effort_weeks && p.effort_weeks < 18)
  .where(p => p.status === "idea")
  .where(p => p.tags && p.tags.includes("research-idea"))
  .where(p => p.coai_pillars && p.coai_pillars.includes("Transparency & Interpretability"))
  .sort(p => p.effort_weeks, 'asc');
// One row per topic; fall back to the file name and a placeholder summary.
const rows = pages.map(p => [
  p.title || p.file.name,
  p.summary || "*See details*",
  p.effort_weeks + " weeks"
]);
// Produce the Markdown table (Topic, Summary, Effort).
dv.markdownTable(["Topic", "Summary", "Effort"], rows);
```
%%
| Topic | Summary | Effort |
| ------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- |
| Flight Recorder for AI Agents: Infrastructure for Reproducible Agent Science | Proposes aviation-inspired 'flight recorder' infrastructure for AI agents using a practical, training-free tiered recording system. Captures activation statistics, hash fingerprints, logit lens predictions, and event metadata to enable anomaly detection, deterministic replay, and scientific analysis without requiring Sparse Autoencoder training. | 6 weeks |
| Manipulation Detection via Cross-Model Activation Probing: A Child Safety Application | Proposes using linear probes on a fixed reader model's activations to detect manipulative patterns in AI-generated conversations with children. The key methodological insight is model decoupling: activations encode a model's understanding of text, not its generation intent, so a capable open-source model reading manipulative text will encode distinct patterns regardless of which model generated it. Trains probes across 8 manipulation categories (sycophancy, emotional manipulation, boundary violations, etc.) and investigates which layers, activation types, and circuits encode manipulation. | 6 weeks |
| AdvGame-Prompt: Lightweight Adversarial Safety Games via Prompt Evolution | We introduce a resource-efficient alternative to weight-based adversarial training for LLM safety. Using prompt optimization (DSPy GEPA) in a game-theoretic framework, we co-evolve attack and defense strategies without model training. Our approach works on closed frontier models, discovers transferable attack/defense patterns, and provides a practical testbed for safety research. | 7 weeks |
| InternalLIME: Local Interpretable Explanations from Transformer Internals via Attention-Interaction Tensors | Replace LIME's expensive perturbation loop with a single-forward-pass local linear approximation derived from TensorLens's attention-interaction tensor. Produces per-token, per-dimension, and per-layer attributions that are mathematically faithful to the model's actual computation, not estimated from noisy input-output correlations. Targets the proven failure of additive surrogates (LIME/SHAP) on attention-based architectures. | 7 weeks |
| AO-Guided Causal Interventions | Use Activation Oracles to efficiently identify causally relevant model components, then validate via targeted ablation and activation steering. Offers a faster alternative to brute-force component search. | 8 weeks |
| Activation Oracles for AI Safety Auditing | Explores training Activation Oracles (AOs) - full LLM decoders that receive patched activations and answer arbitrary natural-language questions about model internals. Focuses on safety-relevant questions (deception, hidden goals, capability misuse) with cross-model transfer and hidden fine-tune detection as headline experiments. Integrates with Flight Recorder for forensic retrospective querying. | 8 weeks |
| Semantic Audit Trails for Multi-Agent Teams: Decision Provenance via Knowledge Graphs | Proposes a semantic decision graph infrastructure for multi-agent AI teams that captures, links, and audits every team-level decision using knowledge graphs, causal reasoning, and W3C PROV-O provenance. Complements internal model-level recording (Flight Recorder) by operating at the decision and coordination layer — tracking what agents decided, why, what conflicts arose, and how consensus was reached. | 8 weeks |
| Long-Horizon Mechanistic Interpretability: Understanding LLM Behavior Across Multi-Turn Conversations | Apply mechanistic interpretability methods to understand how LLMs maintain (or lose) coherence across multi-turn conversations. First systematic circuit-level study of dialogue behavior, addressing why models exhibit a 39% performance drop in extended interactions. | 8 weeks |
| fMRI Scanner for Transformers: Visual Interface for Mechanistic Interpretability | Create an fMRI-style scanner for large language models - a unified visual analysis interface that makes mechanistic interpretability accessible to ML engineers, AI safety auditors, and researchers. The tool would provide real-time visualization of internal model states, information flow, and activation patterns using familiar engineering paradigms (debugger/profiler metaphors). | 8 weeks |
| Mechanistic Taxonomy of Character-Level Jailbreaks: From Attack Mechanism to Targeted Defense | Systematic mechanistic interpretability study of 'strange character' jailbreaks (L33tspeak, Unicode tricks, invisible characters, encoding manipulation, token boundary attacks). Categorizes attacks from real-world collections (L1B3RT4S) by their circuit-level mechanism — tokenization gap, attention hijacking, refusal direction suppression, acceptance subspace exploitation — and maps each mechanism to a targeted defense (system prompt hardening, activation steering, fine-tuning). Uses SPHINX testbed for evaluation. | 8 weeks |
| AO-CoT-Fidelity: Detecting Unfaithful Chain-of-Thought Reasoning | Use Activation Oracles to reveal what a model 'actually thinks' versus what it states in its Chain-of-Thought reasoning. Identifies cases where CoT is a post-hoc rationalization rather than a faithful reasoning trace. | 10 weeks |
| AO-Safety-Screening: Pre-Deployment Safety Checks via Activation Oracles | Develop a practical AO-based safety screening tool for pre-deployment model evaluation. Probe activations to detect dangerous latent knowledge (e.g., bioweapon synthesis, hacking techniques) that models possess but refuse to output. | 10 weeks |
| AutoInterp: An Autonomous Research Framework for Iteratively Improving Mechanistic Interpretability Tools | A general-purpose framework that applies Karpathy's autoresearch paradigm to mechanistic interpretability across two tracks: (1) iteratively improving post-hoc interpretability tools (SAEs, probes, circuit methods) using SAEBench as reward signal, and (2) iteratively improving inherently interpretable model architectures (Steerling-8B concept modules, Concept Bottleneck LLMs) — automating concept discovery, bottleneck layer training, and steering precision. An AI agent autonomously modifies training code, runs short experiments, evaluates, and iterates. | 10 weeks |
| Beyond Reasoning: Critical Thinking Benchmarks for Large Language Models | A multi-dimensional benchmark evaluating LLMs' critical thinking capabilities beyond logical reasoning, combining behavioral metrics (epistemic independence, premise questioning, proportional skepticism) with mechanistic interpretability to map 'skepticism circuits' and understand sycophancy at the computational level. | 10 weeks |
| Enterprise LLM Data Leakage Risk Assessment: Systematic Auditing of Proprietary Information Exposure in Fine-Tuned and RAG-Augmented Models | Proposes a defensive risk assessment framework that helps organizations systematically audit what proprietary data their deployed LLMs actually leak. Combines blind extraction techniques (confusion-inducing attacks, knowledge asymmetry exploitation, membership inference) into a unified audit pipeline covering both fine-tuned and RAG-augmented models. Produces quantified leakage risk scores and actionable remediation guidance aligned with GDPR/EU AI Act requirements. | 10 weeks |
| GNN-Based Circuit Discovery for Scalable Mechanistic Interpretability | Use heterogeneous Graph Neural Networks (GNNs) to learn representations of transformer computational graphs, enabling automated discovery and cataloging of interpretable circuits. Instead of expensive iterative ablations (ACDC) or gradient approximations (Attribution Patching), train a GNN to predict circuit membership, behavior, and causal relationships in a single forward pass. Build a searchable Circuit Atlas as a community resource. | 10 weeks |
| AutoInterpResearcher: Can LLMs Autonomously Perform Mechanistic Interpretability Research? | Adapts Karpathy's autoresearch paradigm for autonomous mechanistic interpretability research — not tool optimization (that's AutoInterp), but actual MI investigation. An LLM agent receives a program.md describing an interpretability goal plus a toolkit (SAELens, TransformerLens, nnsight, ACDC, probing), then autonomously hypothesizes, designs experiments, runs analyses, interprets findings, and iterates. Tests whether LLMs can do the scientific workflow of MechInterp. | 10 weeks |
| AO-Adversarial-Robustness: Can Models Learn to Fool Activation Oracles? | Critical evaluation of Activation Oracle robustness against adversarial models. If models can be trained to hide information from AOs (as they can from SAEs), AO-based safety tools become unreliable. This paper stress-tests the approach. | 12 weeks |
| Agentic Mechanistic Interpretability: Methods and Benchmarks for Multi-Turn Analysis | Develop novel mechanistic interpretability methods specifically designed for analyzing LLM behavior across multi-turn conversations and agentic tasks. Includes temporal activation patching, cross-turn circuit discovery, and a standardized benchmark suite for evaluating MI techniques on conversational/agentic settings. | 12 weeks |
| Autonomous Agent Company Testbed: Studying Multi-Agent Coordination in Real-World Deployment | An experimental environment using a 3D printing micro-business to study multi-agent coordination, emergent behavior, transparency requirements, and control mechanisms when AI agents operate autonomously with real-world stakes. | 12 weeks |
| MindMirror: Real-Time Transparency Dashboard for LLM Internal States | Design and evaluate a real-time dashboard that visualizes LLM internal states during conversations, including user model beliefs, deception indicators, confidence levels, attention patterns, and reasoning transparency. Research what information helps users without overwhelming them, and how transparency affects conversational behavior and trust. | 12 weeks |
| Sophia - Auditing System 3: Interpretability and Safety for Persistent Meta-Cognitive AI Agents | We develop an external audit framework for persistent AI agents with meta-cognitive capabilities (System 3). Building on the Sophia architecture, we implement and evaluate monitoring mechanisms for goal drift detection, memory auditing, self-model fidelity verification, and intrinsic reward stability. We provide the first empirical study of interpretability and safety for stateful, self-modifying, intrinsically-motivated agents. | 12 weeks |
| Conversation-Level Diffusion Sparse Autoencoders for Multi-Turn Interpretability | Propose training Sparse Autoencoders using diffusion-based denoising over full conversation representations to discover interpretable features that only emerge across long contexts—such as manipulation tactics, intent evolution, and argument structure—which token-level SAEs cannot capture. | 14 weeks |
%% DATAVIEW_PUBLISHER: end %%
## AI Control & AI Safety
%% DATAVIEW_PUBLISHER: start
```dataviewjs
// Select backlog ideas for this pillar that fit a thesis (under 18 weeks).
const pages = dv.pages('"Research/2-Backlog"')
  .where(p => p.effort_weeks && p.effort_weeks < 18)
  .where(p => p.status === "idea")
  .where(p => p.tags && p.tags.includes("research-idea"))
  .where(p => p.coai_pillars && p.coai_pillars.includes("AI Control & AI Safety"))
  .sort(p => p.effort_weeks, 'asc');
// One row per topic; fall back to the file name and a placeholder summary.
const rows = pages.map(p => [
  p.title || p.file.name,
  p.summary || "*See details*",
  p.effort_weeks + " weeks"
]);
// Produce the Markdown table (Topic, Summary, Effort).
dv.markdownTable(["Topic", "Summary", "Effort"], rows);
```
%%
| Topic | Summary | Effort |
| ------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------- |
| Flight Recorder for AI Agents: Infrastructure for Reproducible Agent Science | Proposes aviation-inspired 'flight recorder' infrastructure for AI agents using a practical, training-free tiered recording system. Captures activation statistics, hash fingerprints, logit lens predictions, and event metadata to enable anomaly detection, deterministic replay, and scientific analysis without requiring Sparse Autoencoder training. | 6 weeks |
| Manipulation Detection via Cross-Model Activation Probing: A Child Safety Application | Proposes using linear probes on a fixed reader model's activations to detect manipulative patterns in AI-generated conversations with children. The key methodological insight is model decoupling: activations encode a model's understanding of text, not its generation intent, so a capable open-source model reading manipulative text will encode distinct patterns regardless of which model generated it. Trains probes across 8 manipulation categories (sycophancy, emotional manipulation, boundary violations, etc.) and investigates which layers, activation types, and circuits encode manipulation. | 6 weeks |
| AdvGame-Prompt: Lightweight Adversarial Safety Games via Prompt Evolution | We introduce a resource-efficient alternative to weight-based adversarial training for LLM safety. Using prompt optimization (DSPy GEPA) in a game-theoretic framework, we co-evolve attack and defense strategies without model training. Our approach works on closed frontier models, discovers transferable attack/defense patterns, and provides a practical testbed for safety research. | 7 weeks |
| Activation Oracles for AI Safety Auditing | Explores training Activation Oracles (AOs) - full LLM decoders that receive patched activations and answer arbitrary natural-language questions about model internals. Focuses on safety-relevant questions (deception, hidden goals, capability misuse) with cross-model transfer and hidden fine-tune detection as headline experiments. Integrates with Flight Recorder for forensic retrospective querying. | 8 weeks |
| Semantic Audit Trails for Multi-Agent Teams: Decision Provenance via Knowledge Graphs | Proposes a semantic decision graph infrastructure for multi-agent AI teams that captures, links, and audits every team-level decision using knowledge graphs, causal reasoning, and W3C PROV-O provenance. Complements internal model-level recording (Flight Recorder) by operating at the decision and coordination layer — tracking what agents decided, why, what conflicts arose, and how consensus was reached. | 8 weeks |
| GuardOps: Automated Failure Mode Detection and Classification for LLM Agent Systems | A lightweight classifier that analyzes LLM agent telemetry logs (tool calls, reasoning traces, intermediate outputs) to automatically detect, classify, and attribute failure modes. Builds on existing failure taxonomies (MAST, AgentDoG) and trajectory analysis research to create a practical, deployable diagnostic tool that answers: Did the agent succeed? If not, what went wrong and where? | 8 weeks |
| Long-Horizon Mechanistic Interpretability: Understanding LLM Behavior Across Multi-Turn Conversations | Apply mechanistic interpretability methods to understand how LLMs maintain (or lose) coherence across multi-turn conversations. First systematic circuit-level study of dialogue behavior, addressing why models exhibit a 39% performance drop in extended interactions. | 8 weeks |
| SPHINX Attack Framework & Game Extension | Develop an automated attack testing framework using the sphinx-scanner attack database and extend the SPHINX game with multimodal, RAG, tool-calling, and agent-based attack levels. | 8 weeks |
| Defense-in-Depth for LLMs: Measuring the Effectiveness of Layered Prompt Injection Defenses | We present the first systematic study of layered defense effectiveness against prompt injection attacks. Using the SPHINX testbed, we measure how prompt-based defenses (system prompt hardening) and filter-based defenses (output guards) contribute to overall security, both independently and in combination. Our findings quantify the defense-in-depth principle for LLM security. | 8 weeks |
| fMRI Scanner for Transformers: Visual Interface for Mechanistic Interpretability | Create an fMRI-style scanner for large language models - a unified visual analysis interface that makes mechanistic interpretability accessible to ML engineers, AI safety auditors, and researchers. The tool would provide real-time visualization of internal model states, information flow, and activation patterns using familiar engineering paradigms (debugger/profiler metaphors). | 8 weeks |
| Personal CRM: AI-Powered Multi-Channel Relationship Management Agent | Build an AI-powered Personal CRM that unifies communication across email, iMessage, WhatsApp, Discord, LinkedIn, and X.com into a single contact knowledge graph. Primarily an engineering/product idea with weak COAI fit — could be reframed toward agent privacy and transparency risks. | 8 weeks |
| Mechanistic Taxonomy of Character-Level Jailbreaks: From Attack Mechanism to Targeted Defense | Systematic mechanistic interpretability study of 'strange character' jailbreaks (L33tspeak, Unicode tricks, invisible characters, encoding manipulation, token boundary attacks). Categorizes attacks from real-world collections (L1B3RT4S) by their circuit-level mechanism — tokenization gap, attention hijacking, refusal direction suppression, acceptance subspace exploitation — and maps each mechanism to a targeted defense (system prompt hardening, activation steering, fine-tuning). Uses SPHINX testbed for evaluation. | 8 weeks |
| AO-Safety-Screening: Pre-Deployment Safety Checks via Activation Oracles | Develop a practical AO-based safety screening tool for pre-deployment model evaluation. Probe activations to detect dangerous latent knowledge (e.g., bioweapon synthesis, hacking techniques) that models possess but refuse to output. | 10 weeks |
| Beyond Reasoning: Critical Thinking Benchmarks for Large Language Models | A multi-dimensional benchmark evaluating LLMs' critical thinking capabilities beyond logical reasoning, combining behavioral metrics (epistemic independence, premise questioning, proportional skepticism) with mechanistic interpretability to map 'skepticism circuits' and understand sycophancy at the computational level. | 10 weeks |
| Enterprise LLM Data Leakage Risk Assessment: Systematic Auditing of Proprietary Information Exposure in Fine-Tuned and RAG-Augmented Models | Proposes a defensive risk assessment framework that helps organizations systematically audit what proprietary data their deployed LLMs actually leak. Combines blind extraction techniques (confusion-inducing attacks, knowledge asymmetry exploitation, membership inference) into a unified audit pipeline covering both fine-tuned and RAG-augmented models. Produces quantified leakage risk scores and actionable remediation guidance aligned with GDPR/EU AI Act requirements. | 10 weeks |
| GNN-Based Circuit Discovery for Scalable Mechanistic Interpretability | Use heterogeneous Graph Neural Networks (GNNs) to learn representations of transformer computational graphs, enabling automated discovery and cataloging of interpretable circuits. Instead of expensive iterative ablations (ACDC) or gradient approximations (Attribution Patching), train a GNN to predict circuit membership, behavior, and causal relationships in a single forward pass. Build a searchable Circuit Atlas as a community resource. | 10 weeks |
| Corporate Agent Registry with Persistent Memory: Governing Ephemeral AI Agent Populations in Enterprise Environments | Proposes an internal agent registry architecture for enterprises that solves the ephemeral agent problem — AI agents (Claude Code, Cursor, Copilot Workspace) that live only as long as a terminal session, losing all context upon termination. Combines agent lifecycle tracking, persistent shared memory, organizational knowledge graphs, and governance infrastructure into a unified hub. Addresses the 'context as new scarcity' thesis: the registry becomes the corporate infrastructure that hoards, governs, and makes queryable the organizational knowledge that ephemeral agents generate and consume. | 10 weeks |
| SciVal: Runtime Validation System for AI-Performed Scientific Experiments | Develops a validation framework that monitors AI agents performing scientific experiments in real time, detecting shortcuts, errors, and methodological violations during and after execution. Addresses the critical gap that current AI scientist systems produce invalid experiments at alarming rates (74% of issues detectable only with trace logs, 100% experimental weakness in AI-generated papers). | 10 weeks |
| Agentenregister: A Blockchain-Based Registration and Accountability Registry for Autonomous AI Agents | Proposes a Handelsregister-inspired on-chain registry where autonomous AI agents must register before engaging in economic activity. Combines smart contract-based identity, lineage tracking (parent-child agents), capability attestation, revenue reporting, and KYA (Know Your Agent) enforcement. Builds on emerging standards (NANDA AgentFacts, ETHOS, BAID) to create a deployable prototype on an L2 chain (Base/Arbitrum) with gasless registration via ERC-2771 meta-transactions. | 10 weeks |
| AO-Adversarial-Robustness: Can Models Learn to Fool Activation Oracles? | Critical evaluation of Activation Oracle robustness against adversarial models. If models can be trained to hide information from AOs (as they can from SAEs), AO-based safety tools become unreliable. This paper stress-tests the approach. | 12 weeks |
| AO-Multi-Agent-Coordination: Detecting Hidden Coordination via Activation Oracles | Apply Activation Oracles to multi-agent LLM systems to detect hidden coordination, implicit communication, or collusive behavior that isn't visible in agent outputs. Addresses the growing risk of emergent multi-agent deception. | 12 weeks |
| Agentic Mechanistic Interpretability: Methods and Benchmarks for Multi-Turn Analysis | Develop novel mechanistic interpretability methods specifically designed for analyzing LLM behavior across multi-turn conversations and agentic tasks. Includes temporal activation patching, cross-turn circuit discovery, and a standardized benchmark suite for evaluating MI techniques in conversational and agentic settings. | 12 weeks |
| Autonomous Agent Company Testbed: Studying Multi-Agent Coordination in Real-World Deployment | An experimental environment using a 3D printing micro-business to study multi-agent coordination, emergent behavior, transparency requirements, and control mechanisms when AI agents operate autonomously with real-world stakes. | 12 weeks |
| Decentralized Control Oracles for AI Agent Swarms: Preventing Emergent Collective Intelligence from Becoming Uncontrollable | Argues that the real AI control problem is not a single superintelligent system but millions of small, individually limited agents that collectively form emergent, uncontrollable intelligence — as already visible in autonomous bot swarms on social media and crypto platforms. Proposes 'Decentralized Control Oracles' — lightweight, distributed governance agents that monitor, influence, and constrain autonomous agent swarms without requiring centralized authority. Combines emergence detection (information-theoretic metrics), economic control mechanisms (control tax), and decentralized policy enforcement to keep agent collectives within safe behavioral boundaries. | 12 weeks |
| MindMirror: Real-Time Transparency Dashboard for LLM Internal States | Design and evaluate a real-time dashboard that visualizes LLM internal states during conversations, including user model beliefs, deception indicators, confidence levels, attention patterns, and reasoning transparency. Research what information helps users without overwhelming them, and how transparency affects conversational behavior and trust. | 12 weeks |
| Emergent Collusion and Deception Asymmetry in Multi-Agent LLM Teams | Empirical study of emergent collusion patterns, deception production vs. detection asymmetry, and trust calibration in LLM teams playing real-time strategy games, revealing safety-relevant behavioral signatures. | 12 weeks |
| Sophia - Auditing System 3: Interpretability and Safety for Persistent Meta-Cognitive AI Agents | We develop an external audit framework for persistent AI agents with meta-cognitive capabilities (System 3). Building on the Sophia architecture, we implement and evaluate monitoring mechanisms for goal drift detection, memory auditing, self-model fidelity verification, and intrinsic reward stability. We provide the first empirical study of interpretability and safety for stateful, self-modifying, intrinsically-motivated agents. | 12 weeks |
| Conversation-Level Diffusion Sparse Autoencoders for Multi-Turn Interpretability | Propose training Sparse Autoencoders using diffusion-based denoising over full conversation representations to discover interpretable features that only emerge across long contexts—such as manipulation tactics, intent evolution, and argument structure—which token-level SAEs cannot capture. | 14 weeks |
%% DATAVIEW_PUBLISHER: end %%
## Other Topics
%% DATAVIEW_PUBLISHER: start
```dataviewjs
// Backlog ideas under 18 weeks of effort that have no COAI pillar assigned
const pages = dv.pages('"Research/2-Backlog"')
  .where(p => p.effort_weeks && p.effort_weeks < 18)
  .where(p => p.status === "idea")
  .where(p => p.tags && p.tags.includes("research-idea"))
  .where(p => !p.coai_pillars || p.coai_pillars.length === 0)
  .sort(p => p.effort_weeks, 'asc'); // shortest projects first
// One table row per topic: title (or file name), summary, effort
const rows = pages.map(p => [
  p.title || p.file.name,
  p.summary || "*See details*",
  p.effort_weeks + " weeks"
]);
dv.markdownTable(["Topic", "Summary", "Effort"], rows);
```
%%
| Topic | Summary | Effort |
| --------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| AI Voice Oral Exam Procedure for Student Assessment | Design and implement a practical AI-powered voice oral examination system to verify student understanding of their written submissions. Using voice AI platforms (e.g., ElevenLabs Conversational AI), create a scalable, cost-effective ($0.42/student) procedure that conducts personalized oral exams, preventing LLM-assisted cheating while providing diagnostic insights into learning gaps. | 4 weeks |
%% DATAVIEW_PUBLISHER: end %%