Agentic Pixel

Constructor Theory: Building software with physics in mind

A constructor-theoretic lens on program synthesis, test design, and why mutation score beats line coverage

There is a branch of theoretical physics called constructor theory, developed by David Deutsch and Chiara Marletto. It attempts to rephrase all of physics in terms of counterfactuals: which transformations can be made to happen and which cannot.

It turns out this framing maps, with surprising precision, onto problems that I and other software engineers face every day: writing correct programs, choosing test strategies, and measuring test quality.

This article presents three claims, each grounded in published computer science and physics literature. The mapping between the two fields is novel when applied to testing: no published work connects constructor theory to software engineering, despite a subset of computer programs themselves being examples of programmable constructors.

Think of this as a long-held perspective of mine on a familiar problem.

Background: Constructor Theory in 90 Seconds

Constructor theory, introduced by Deutsch in 2013, reframes physics around a single question: which tasks are possible and which are impossible, and why? A task is a specified transformation of a physical system (the substrate) from one state to another. A constructor is any entity that can cause a task to happen and remain able to cause it again, like a chemical catalyst abstracted into a general concept.

The crucial insight is that the theory privileges impossibility statements over dynamical laws. In Deutsch’s formulation, the fundamental objects of the theory are statements about which transformations can and cannot be performed, and the constructor itself can be abstracted away from the explanation. The constructor is defined by the partition between the tasks it performs and those it refuses.

Deutsch and Marletto extended this framework to information in 2015, defining information media entirely through possible and impossible transformations on physical substrates.

Key references:

Claim 1: Program Synthesis Is Constructor Instantiation

Summary

Program synthesis is the search for a constructor C given a partition of the task space into T (possible) and I (impossible). The necessity of the I-set is not a design choice: it is a theorem (Gold 1967).

The CEGIS loop is the constructive realisation: propose from T, refute from I, converge on C. No published paper makes this bridge to Deutsch-Marletto constructor theory explicit, but the structural isomorphism (T, I) → C is exact in both domains.

Full Explanation

Consider what a program synthesiser actually does. You give it a specification, some combination of examples the program should handle correctly and behaviours it must reject, and it searches for a program that satisfies both constraints.

In constructor-theoretic terms: the positive examples define the T-set (tasks the program must perform), the negative examples define the I-set (tasks it must refuse), and the synthesised program is the constructor C that realises exactly this partition.

This is not a loose metaphor. The structure is identical to the most influential synthesis algorithms in the literature.

Gold’s impossibility theorem (1967): proved that no learner can identify a language from positive examples alone when the target class contains all finite languages plus at least one infinite language. (The term “superfinite” for this class was coined later by the learning theory community, not by Gold himself, but the mathematical content is the same.) The implication is stark: negative examples, the I-set, are not optional. They are epistemically necessary for learning.

Angluin’s L* algorithm (1987): learns regular languages using membership queries (Does string s belong to the language?) and equivalence queries (Is my current hypothesis correct?). Equivalence queries return counterexamples on failure: elements of the I-set. The algorithm cannot converge without them.

CEGIS (Counterexample-Guided Inductive Synthesis):

Introduced in Solar-Lezama's 2008 PhD thesis, CEGIS formalises synthesis as a loop: a synthesiser proposes a candidate program consistent with the examples seen so far; a verifier checks the candidate against the full specification; if the check fails, the verifier returns a counterexample, which is added to the example set, and the loop repeats.

This is structurally identical to constructor theory’s (possible tasks, impossible tasks) → constructor.
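The loop can be sketched in a few lines. Everything below is an illustrative toy, not a real synthesis tool: the hidden `spec`, the finite `DOMAIN`, and the grid of coefficient candidates are all assumptions chosen to keep the example self-contained.

```python
# A toy CEGIS loop: synthesise coefficients (a, b) for f(x) = a*x + b
# so that f matches a hidden specification on a finite domain.

DOMAIN = range(-10, 11)

def spec(x):
    """The ground truth the synthesiser must match (hidden from propose)."""
    return 3 * x + 2

def propose(examples):
    """Return (a, b) consistent with all collected examples, or None."""
    for a in range(-5, 6):
        for b in range(-5, 6):
            if all(a * x + b == y for x, y in examples):
                return (a, b)
    return None

def verify(a, b):
    """Return a counterexample input, or None if the candidate is correct."""
    for x in DOMAIN:
        if a * x + b != spec(x):
            return x
    return None

def cegis():
    examples = []                        # grows with each counterexample
    while True:
        cand = propose(examples)         # T-side: fit the known examples
        if cand is None:
            return None                  # no program in the grammar works
        cex = verify(*cand)              # I-side: search for a refutation
        if cex is None:
            return cand                  # converged on the constructor
        examples.append((cex, spec(cex)))

print(cegis())  # → (3, 2)
```

Note how the I-set does the work: the first candidate, (-5, -5), fits the (empty) example set perfectly; only the counterexample returned by `verify` forces convergence.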

SyGuS (Syntax-Guided Synthesis):

Formalised by Alur and nine co-authors at FMCAD 2013, SyGuS generalises CEGIS: the search space is restricted by a syntactic grammar, and the semantic specification is given as a logical formula.

The specification encodes both what must hold and what counterexamples refine.

Key references

Claim 2: Tests Are T/I Boundary Probes, and the Type of Test Determines Which Region It Covers

Summary

Each testing technique probes a different region of the T/I boundary.

This unified taxonomy does not appear in any published paper: it is a novel organisation of individually well-established observations.

Full Explanation

If a program is a constructor defined by its T/I partition, then a test is a probe of that partition’s boundary. Different testing techniques sample different regions of the boundary, with different biases and resolutions.

Unit tests are point evaluations at known elements of T

A unit test asserts that a specific input produces a specific output: it confirms that one particular task is possible. The survey by Zhu, Hall, and May (1997) in ACM Computing Surveys defines test adequacy criteria as measures of how thoroughly the specification space has been covered.

Unit tests, by this definition, are discrete point samples of the input domain. They confirm T-membership one point at a time.
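As a minimal sketch of what "point sample" means here (the `slugify` function is an illustrative assumption, not from any real codebase): each assertion confirms T-membership at exactly one input.

```python
def slugify(title):
    """Toy example: lowercase a title and join words with hyphens."""
    return "-".join(title.lower().split())

assert slugify("Hello World") == "hello-world"   # one point in T
assert slugify("A  B") == "a-b"                  # another point in T
# Nothing here probes I: no assertion says what slugify must refuse,
# and nothing is claimed about the infinitely many untested inputs.
```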

Property-based tests are stochastic sampling of the T-space with incidental boundary contact

Claessen and Hughes introduced QuickCheck at ICFP 2000, which randomly generates inputs and checks that programmer-specified properties hold.

An important nuance: QuickCheck is biased toward the T-space (checking that properties hold), not toward the T/I boundary specifically. It only probes the boundary when a randomly generated input happens to land near it. The bias toward the boundary only strengthens when properties explicitly encode forbidden behaviours.
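QuickCheck's idea can be sketched without the library itself. Below is a hand-rolled property driver; `forall`, `safe_div`, and `rejects_zero` are illustrative assumptions, not real APIs. The first property samples the T-space; the second explicitly encodes a forbidden behaviour, biasing the probe toward the T/I boundary.

```python
import random

def safe_div(a, b):
    """Integer division that refuses b = 0 (the I-set for this function)."""
    if b == 0:
        raise ValueError("division by zero")
    return a // b

def forall(gen, prop, trials=200):
    """Minimal QuickCheck-style driver: sample inputs, check the property."""
    for _ in range(trials):
        args = gen()
        assert prop(*args), f"property failed for {args}"

# T-space property: dividing a*b by b recovers a for any non-zero b.
forall(lambda: (random.randint(-10**6, 10**6),
                random.choice([b for b in range(-9, 10) if b != 0])),
       lambda a, b: safe_div(a * b, b) == a)

# I-boundary property: the forbidden input b = 0 must be rejected.
def rejects_zero(a):
    try:
        safe_div(a, 0)
        return False
    except ValueError:
        return True

forall(lambda: (random.randint(-10**6, 10**6),), rejects_zero)
print("all properties held")
```

The first `forall` only touches the boundary if the generator happens to land near it; the second aims at the boundary by construction.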

Mutation testing measures I-boundary coverage

This is the strongest and best-supported mapping, detailed in Claim 3 below. A mutant is a deliberate, small perturbation of the program: it represents a program that should not pass the test suite. Killing a mutant means the test suite can distinguish the correct constructor from a nearby incorrect one.

This is precisely I-boundary resolution: the ability to detect when the program has drifted into the impossible region.

Fuzzing is adversarial boundary search biased toward I

Fuzzers (Takanen, DeMott, and Miller, 2008) generate inputs designed to trigger unexpected behaviour: crashes, hangs and memory violations. By design, they search for inputs that land in the I-space. The bias toward I is in the definition: a fuzzer that only found valid inputs would be useless.
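The I-bias is easy to see in a minimal random fuzzer. The target `parse_header` below is an illustrative toy parser, not a real library, and the fuzzer is the crudest possible sketch (real fuzzers like AFL are coverage-guided and mutational).

```python
import random

def parse_header(data: bytes):
    """Toy parser: expects b'LEN:<n>;' followed by exactly n payload bytes."""
    head, _, rest = data.partition(b";")
    n = int(head.split(b":")[1].decode())   # raises on malformed input
    assert len(rest) == n, "length mismatch"
    return rest

def fuzz(trials=1000):
    random.seed(0)
    failures = []
    for _ in range(trials):
        data = bytes(random.randrange(256)
                     for _ in range(random.randrange(1, 20)))
        try:
            parse_header(data)
        except Exception as e:              # input landed in the I-space
            failures.append((data, type(e).__name__))
    return failures

crashes = fuzz()
print(f"{len(crashes)} of 1000 random inputs landed in I")
```

A well-formed input like `parse_header(b"LEN:3;abc")` sits in T; essentially every random byte string lands in I, which is exactly the region the fuzzer is built to explore.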

Line and branch coverage are proxies for T-space density, not I-boundary resolution

Chekam, Papadakis, Le Traon, and Harman (2017) showed empirically at ICSE 2017 that statement and branch coverage are unreliable predictors of fault revelation. Coverage tells you how much of the program's structure was exercised by your tests; it measures exploration of the T-space, but it says nothing about whether your tests can detect deviations into I. Gopinath, Jensen, and Groce (2014) reached compatible conclusions at ICSE 2014, finding that coverage alone poorly predicts fault detection.

Key references

Claim 3: Mutation Score Beats Line Coverage for Fault Detection, Because It Measures the I-Boundary

Summary

Mutation score correlates more strongly with fault detection than any structural coverage metric. This is empirically established across at least five independent large-scale studies and one formal proof. The constructor-theoretic explanation, that mutation testing measures I-boundary resolution while line coverage measures T-space exploration, is a novel theoretical interpretation of this established empirical fact.

No published paper offers this specific explanation, but the empirical finding it explains is among the most robust results in software testing research.

Full Explanation

The foundational paper by DeMillo, Lipton, and Sayward (1978) introduced mutation testing and defined two hypotheses that underpin it.

The competent programmer hypothesis says that programs written by competent developers are close to correct: they differ from the correct version by small, simple faults.

The coupling effect says that test suites capable of detecting simple faults are also capable of detecting complex faults. Together these hypotheses mean that if your test suite can distinguish the correct program from every small perturbation (mutant) of it, it can probably detect real bugs too.

In constructor-theoretic terms, each mutant represents a program that has drifted slightly into the I-space: it performs at least one task it should not, or fails at least one task it should perform. A high mutation score means the test suite has fine-grained resolution along the T/I boundary: it can detect even small deviations from the correct constructor.
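This can be made concrete with a hand-rolled sketch. Real tools (mutmut, PIT, and the like) automate the mutation step over an AST; here `is_adult` and its single mutant are illustrative assumptions.

```python
def is_adult(age):
    """The correct constructor."""
    return age >= 18

def is_adult_mutant(age):
    """Mutant: the relational operator >= was replaced by >."""
    return age > 18

def weak_suite(f):
    # These tests execute every statement (100% line coverage)...
    return f(30) is True and f(5) is False

def strong_suite(f):
    # ...but only a probe at the boundary age == 18 distinguishes
    # the correct program from the mutant.
    return weak_suite(f) and f(18) is True

print(weak_suite(is_adult), weak_suite(is_adult_mutant))      # → True True
print(strong_suite(is_adult), strong_suite(is_adult_mutant))  # → True False
```

The weak suite has full line coverage of both programs yet the mutant survives; the mutation score exposes exactly the missing boundary probe.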

Frankl and Weyuker (1993) provided the theoretical backbone. Their paper in IEEE TSE 19(10), pp. 962–975, showed that mutation testing is provably superior to branch testing at fault detection under two probabilistic measures. An important precision: the paper does not prove simple subsumption (inclusion of test requirements) but rather proper coverage, a formally distinct and stronger notion, meaning guaranteed superior fault detection under specific probabilistic conditions.

Andrews, Briand, Labiche, and Namin (2006) conducted a large-scale empirical study published in IEEE TSE 32(8), pp. 608–624, finding a statistically significant relationship between mutation score and fault detection that was stronger than the relationship for branch coverage.

Jia and Harman (2011) published the authoritative survey of mutation testing in IEEE TSE 37(5), pp. 649–678, establishing it as the gold standard for test adequacy evaluation. This survey has accumulated well over 1,600 citations on Semantic Scholar alone.

Chekam, Papadakis, Le Traon, and Harman (2017) provided the strongest single piece of empirical evidence at ICSE 2017. Their study avoids the “clean program assumption” (which biases earlier studies) and finds that strong mutation testing has the highest fault revelation of four widely-used adequacy criteria, while statement and branch coverage show little independent effect. They also identify a threshold phenomenon: fault revelation only increases meaningfully at high coverage levels.

Papadakis, Kintis, Zhang, Jia, Le Traon, and Harman (2019) published an updated survey in Advances in Computers 112, pp. 275–378, confirming mutation testing’s superiority over structural coverage metrics for fault revelation.

Petrović, Ivanković, Fraser, and Just (2021) presented a Google-scale empirical study at ICSE 2021, pp. 910–921, finding that mutation testing improves fault detection over code-coverage only approaches in industrial practice.

Tengeri et al. (2016) found that code coverage is necessary but not sufficient and that mutation score is a better predictor of defect density. This paper appeared at ICST Workshops (ICSTW) 2016, pp. 174–179, not the main ICST conference: a meaningful distinction regarding peer review level.

Key references

Novel vs Established

Established:

Each of these is individually well-cited within its field.

Novel:

None of these connections appear in any published paper based on a literature search as of early 2026. The structural correspondence between the two frameworks is formally exact, as demonstrated above; whether this correspondence yields new predictive power beyond that which existing software testing theory already provides is an open question.

The Practical Aspect

The final practical takeaway from this article is that your testing strategy and its implementation have a structure. That structure is defined by which regions of the specification space each technique probes, and different strategies are comparable in those terms.

Line coverage tells you how much of your code your tests touched; it says nothing about whether those tests would catch a bug. Mutation testing asks that question directly: if I introduced a small fault, would my tests notice? That is the difference between exploring the T-space and probing the I-boundary.

If your CI pipeline reports 90% line coverage and you feel confident, consider that high line coverage is a necessary baseline but not a reliable indicator of test quality on its own. The gap between line coverage and mutation score is the gap between knowing that your program runs and knowing that your program is correct. Without accompanying mutation analysis, that confidence may not be warranted.

The constructor-theoretic framing does not change any well-established software testing advice, but it does offer an explanation for common knowledge and empirical results. And, most valuable for me, it offers alternative language to explain why we all know you can test every line of code and still ship bugs into production.