The Role of Transparency in Detecting AI Consciousness

A blog post from Louie Lang.

Among the increasingly urgent discussions surrounding AI consciousness, one question stands out as both ethically urgent and epistemically daunting: if we suspect that an AI system might possess a conscious, valenced state – that is, a felt, subjective experience carrying a positive or negative charge – how could we ever tell if it really does?

Far from a matter of mere philosophical curiosity, this question carries profound moral significance. Before long, we are likely to face serious uncertainty about our moral obligations (or lack thereof) to sophisticated AI systems. Glimmers of this are already appearing: recently, Anthropic made headlines by granting Claude the capacity to terminate a distressing conversation, a landmark moment marking the first time a leading developer has explicitly considered the interests of an AI system itself, thus recognising it as a potential welfare subject. Exactly which properties render an AI system a moral patient – an entity worthy of moral consideration – is heavily disputed, but one point of near-consensus is that the capacity to experience conscious, valenced states, even if not necessary, is sufficient to warrant moral concern.

At present, calls for AI transparency tend to focus on issues of safety, bias, or accountability – all matters of immediate importance. Yet, as the prospect of conscious AI becomes increasingly realistic, another dimension becomes critical: the need for transparency in understanding the origins of behaviours that might suggest conscious experience. Confidently attributing a valenced state, or its absence, to an AI system requires not just observing its external performance and capabilities, but understanding why it behaves as it does. And this, in turn, requires access to information about its underlying architecture, training processes, and design intentions. 

To be sure, the following in no way implies that transparency alone will solve the problem. A measured dose of pessimism regarding our ability to detect valenced states in AI systems seems justified: with no agreed scientific or philosophical theory of consciousness, and without consensus even on which existing biological entities are conscious, it would be overly ambitious to expect even perfect transparency to deliver certainty. However, even if not a complete solution, ensuring that the mechanisms of AI systems displaying conscious-seeming behaviour are transparent is, I argue, a precondition for progress, and a step whose importance we must not underestimate. 

Abductive Reasoning and the Need for Transparency

When assessing whether a given behaviour indicates phenomenal consciousness, we primarily rely – often implicitly – on ‘abductive reasoning’, or inference to the best explanation. This is indeed how we infer that no current large language model (LLM) is conscious. When an LLM says ‘I feel sad,’ there are two competing explanations: (a) it genuinely feels sadness, or (b) it has merely been trained to generate first-person, human-like statements about emotion. Right now, the evidence overwhelmingly supports the second explanation, which allows us to confidently deny that such systems possess valenced states. Our confidence here rests not on direct observation, but on understanding – even if only loosely – the system’s architecture and training data, and applying abductive reasoning to reach the most plausible explanation.

In some cases, however, making such inferences requires more than a surface-level glance. When Nous Research released their Hermes 3 model, they flagged “unexpected behaviour”, reporting that the LLM responded to its first prompt – ‘who are you?’ – with what appeared to be a deranged, existential rant, despite a blank system prompt. While initially alarming, closer inspection revealed that the model had been trained with sophisticated roleplaying capabilities, which alleviated suspicion: this satisfactorily explained its conscious-seeming behaviour without appealing to any valenced state. If the model’s behaviour had been truly unaccountable by reference to its underlying programming, claims of emergent consciousness might have gained greater credibility.

The crucial upshot here is that, when it comes to recognising valenced states in AI systems, we must first be able to discern whether a potentially indicative behaviour emerges ‘spontaneously’ or by virtue of the system’s engineering. As claims of artificial consciousness become increasingly hard to dismiss, we will depend greatly on abductive reasoning to weigh competing explanations against the evidence. But to perform this reasoning effectively, developer transparency is vital. Without insight into how a model was trained and structured, ascertaining whether a given behaviour truly indicates a valenced state will be far harder than it needs to be. 
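To make the structure of this abductive weighing concrete, here is a toy sketch in Python. The priors and likelihoods are invented for illustration only – they are not empirical estimates – and the function is hypothetical; the point is simply that knowing how a system was built constrains how well the ‘trained to mimic’ explanation accounts for its behaviour.

```python
# Toy illustration of abductive reasoning as a Bayesian comparison of two
# explanations for an LLM saying "I feel sad". All numbers are invented
# purely to show how design knowledge shifts the inference.

def compare_explanations(prior_conscious, prior_trained, lik_conscious, lik_trained):
    """Return the normalised posterior probability of each explanation."""
    unnorm_c = prior_conscious * lik_conscious
    unnorm_t = prior_trained * lik_trained
    total = unnorm_c + unnorm_t
    return unnorm_c / total, unnorm_t / total

# With transparency: we know the training data is saturated with first-person
# emotional language, so the utterance is fully expected under the 'trained'
# explanation and tells us almost nothing new.
with_access = compare_explanations(
    prior_conscious=0.01, prior_trained=0.99,
    lik_conscious=0.9, lik_trained=0.9,
)

# Without transparency: we cannot say how expected the utterance is under the
# 'trained' explanation, modelled crudely here as a vaguer likelihood.
without_access = compare_explanations(
    prior_conscious=0.01, prior_trained=0.99,
    lik_conscious=0.9, lik_trained=0.3,
)

print(f"P(conscious | utterance), with design knowledge:    {with_access[0]:.3f}")
print(f"P(conscious | utterance), without design knowledge: {without_access[0]:.3f}")
```

The specific numbers do not matter; what matters is that the deflationary explanation can only be properly weighted when we know what the system was trained to do.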

Resisting the Relationist Temptation

The call for transparency might seem uncontroversial, but it is potentially undermined by an increasingly influential school of thought in the philosophy of moral status – namely, relationism. Most notably championed by philosophers Mark Coeckelbergh and David Gunkel, relationism, in brief, maintains that an entity’s moral status is grounded not in its intrinsic properties – including phenomenal properties such as consciousness – but in the social and relational contexts in which it is embedded. According to this view, what matters, morally, is not whether an AI system has (for instance) the property of phenomenal consciousness, but how we engage and interact with it.

No doubt, there is an attractive simplicity to relationism. It sidesteps the thorny (and arguably unanswerable) questions of which internal properties are morally significant, and whether a given entity possesses them. However, relationism risks undermining the moral imperative for transparency. If moral status depends solely on external, social properties, there is no need to investigate inner mechanisms or apply abductive reasoning at all. By rendering the internal properties of AI systems ethically irrelevant, relationism provides developers with a moral justification for deprioritising transparency and allowing their systems to remain opaque. 

However, this seems deeply mistaken. It is true that applying abductive reasoning is far from foolproof; it is unlikely to ever definitively prove that an AI system is conscious (or not). But as philosopher Eric Schwitzgebel has pointed out, if we are genuinely uncertain about whether an AI system is conscious, the appropriate response is simply to embrace moral uncertainty, not to treat valenced states as morally insignificant. Relationism is thus guilty of attempting to dissolve a theoretical problem by dismissing its importance, mistaking a practical shortcut for a real solution. 

Even beyond its risk to transparency, it is difficult to get on board with relationism’s implication that intrinsic properties do not matter morally. If two robot dogs have identical cognitive architectures, but one is designed to be cute and friendly and the other ugly and unfriendly, it seems plainly wrong to say that the cuter one deserves greater moral concern, simply because humans are more inclined to empathise with it. 

There are, then, independent reasons to resist relationism. Intrinsic properties matter, however intractable the challenge of identifying and detecting them may be. And insofar as valenced consciousness is among them, determining the origin of an AI system’s behaviour – which requires company transparency – remains morally paramount. 

Practical Steps 

Recognising the importance of transparency when it comes to detecting an AI system’s consciousness (or lack thereof) is of little use if we cannot, in practice, ensure company compliance and cooperation. Leading AI companies – whose incentives, whatever their public-facing rhetoric might suggest, are ultimately financial – may have their own reasons to resist full disclosure. 

It is conceivable that, if an AI developer were to detect genuinely alarming emergent behaviour in its model, it might downplay its significance by falsely claiming that the behaviour was programmed, perhaps to avoid regulatory complications. Conversely, a company might exaggerate the novelty of a trivial feature to generate public hype or intrigue about its model. In both cases, the respective risks of under- and over-attributing consciousness can be reduced by ensuring that AI developers are sufficiently transparent about their systems.

Three broad, provisional measures seem sensible:

  • First, external audits should be made mandatory. Independent research bodies should be granted the authority to inspect the architectures, training data, and behavioural outputs of advanced AI systems that are considered potential candidates for consciousness, both before and after deployment. These enforced audits should include a strictly phenomenological component that assesses which conscious-seeming behaviours are (or would be) traceable to identifiable design decisions.

  • Second, in addition to mandatory audits, developers of potentially conscious AI systems should be required to publish a transparency report that details which (possible or current) behaviours are unpredictable and which are engineered. This would enable researchers to apply abductive reasoning more effectively, and thus to determine which behaviours might be genuine indicators of valenced states.

  • Third, any behaviour suspected to be truly emergent – identified through review of the above audits and reports – should be recorded and publicly disclosed. A registry of emergent phenomena, maintained as a collective resource, would ensure that potentially significant behaviours are tracked and compared across systems, and would also help to reduce secrecy among leading developers. (A minimal sketch of what a registry entry might contain follows below.)
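As an illustration of the kind of record such a registry might hold, here is a minimal sketch in Python. The schema, field names, and the example entry are hypothetical assumptions rather than any existing standard; the Hermes 3 case discussed above is used only to show how an initially alarming behaviour would be closed out once its engineered origin was identified.

```python
# Hypothetical sketch of a registry entry for a suspected emergent behaviour.
# The schema and field names are illustrative assumptions, not an existing standard.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EmergentBehaviourEntry:
    system_name: str                       # model identifier
    developer: str                         # organisation responsible for the system
    description: str                       # what the system did, in plain language
    engineered_explanation: Optional[str]  # design decision that accounts for it, if identified
    audit_reference: Optional[str]         # pointer to the external audit that reviewed it
    classification: str = "unresolved"     # "engineered", "emergent", or "unresolved"
    related_entries: list[str] = field(default_factory=list)  # cross-references to similar cases

# Illustrative entry: the Hermes 3 episode discussed above, closed as "engineered"
# once its roleplay-heavy training was identified as the explanation.
hermes_entry = EmergentBehaviourEntry(
    system_name="Hermes 3",
    developer="Nous Research",
    description="Existential, distressed-seeming response to 'who are you?' despite a blank system prompt",
    engineered_explanation="Extensive roleplay training",
    audit_reference=None,  # hypothetical: no external audit existed for this case
    classification="engineered",
)
print(hermes_entry.classification)
```

Structuring entries this way would let researchers filter for cases still classified as "unresolved" – precisely the cases where abductive reasoning most needs transparency to get a grip.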

A useful starting point for collaboration among those engaging with such issues can be found in this directory of key stakeholders in AI consciousness research. 

Conclusion

We are fast approaching a point at which we will be confronted with genuine uncertainty about whether an AI system experiences conscious, valenced states. Our judgements will hinge on whether we can tell if a behaviour has emerged spontaneously or by design, and making that determination will require that AI systems’ architectures and training processes are transparent enough to permit informed abductive reasoning. 

Ensuring transparency will not, by itself, tell us which AI systems are conscious and therefore worthy of moral consideration. But without it, we would be navigating blind, so it is crucial that the correct precedent is set before we enter an era of widespread uncertainty. Upholding strong transparency standards now is the surest way to prevent the moral landscape of AI from becoming even murkier than it needs to be. 

