Finding a billion-user project for Eiffel: How DbC catches the security flaws that Rust misses

by Finnian Reilly (modified: 2026 Jun 02)

Contents

Preface
Introduction: Eiffel's visibility problem
The memory safety moment
What is a CVE and why should Eiffel developers care?
Enter Rust: the memory safety champion
The flaw in the memory safety argument
Design by Contract: the missing piece
Choosing the right target: criteria for a billion-user Eiffel project
Why libexpat is the ideal candidate
Introducing xpact
Eiffel's practical advantages for this specific task
xpact vs expat: bridging the performance gap
XML name interning: a zero-copy, cache-efficient strategy for xpact callbacks
Running the expat test suite as a correctness proof
Eiffel's static verification ecosystem: AutoTest and AutoProof
A living contract suite: learning from the entire XML parser ecosystem
Addressing the honest objections
Beyond xpact: other candidates worth considering
What success would look like for Eiffel
Conclusion: correctness is the next frontier
Code listings for C_STRING_8 and benchmark
Foot notes
See also
References

Estimated reading time: 45 to 55 minutes, approximately 10,900 words.

Preface

This essay was developed in collaboration with Claude. The technical observations and strategic judgements are my own, drawn from decades of Eiffel development. The AI contributed structure, prose, and breadth of reference across the security landscape.

Introduction: Eiffel's visibility problem

Eiffel is one of the most carefully designed programming languages in existence. Its type system eliminates null pointer dereferences statically. Its Design by Contract mechanism allows developers to express and verify correctness properties directly in source code. Its compile-to-C backend produces portable, auditable output that works with any C compiler on any platform. Its garbage collector can be selectively disabled for performance-critical sections. It has been battle-tested for decades across finance, aerospace, and enterprise software.

And almost nobody outside those circles has heard of it.

This is not a new observation. The Eiffel community has been aware of this visibility gap for years. Various explanations have been offered — the commercial history of EiffelStudio, the academic associations, the relatively small developer pool, the absence of a breakout open source project that non-Eiffel developers encounter in their daily work.

This essay argues that the conditions are now better than they have ever been for Eiffel to change that situation — not through marketing, not through language advocacy, but through building one specific thing that the world genuinely needs and that Eiffel is uniquely qualified to provide.

That thing is a streaming XML parser with a libexpat-compatible interface, written in Eiffel, with full Design by Contract annotations, called xpact.

To understand why this matters, we need to understand what is happening in the broader software security landscape right now.

The memory safety moment

Something significant has shifted in how governments and major technology organisations talk about software security. For decades the security conversation focused on processes — patch management, vulnerability disclosure, penetration testing, secure development guidelines. The underlying assumption was that insecure code was an unfortunate but inevitable consequence of the difficulty of programming, and the best you could do was find and fix vulnerabilities quickly.

That assumption is now being challenged at the highest levels.

In 2022 the US National Security Agency published guidance explicitly recommending that organisations migrate away from C and C++ toward memory-safe languages. In December 2023 CISA, the Cybersecurity and Infrastructure Security Agency, published joint guidance with international partners urging software manufacturers to plan their move to memory-safe languages. In February 2024 the White House Office of the National Cyber Director followed with a report calling on the software industry to eliminate entire classes of vulnerability through language choice rather than reactive patching.

The specific concern is a class of bugs that have plagued C and C++ programs since their inception: buffer overflows, use-after-free errors, double-free errors, null pointer dereferences, and related memory corruption vulnerabilities. These bugs arise from the fact that C gives programmers direct control over memory with no runtime safety net. Get it wrong and the consequences range from crashes to complete system compromise. Decades of security research have shown that this class of bug accounts for a disproportionate fraction of critical security vulnerabilities: Microsoft has stated that approximately 70% of its security vulnerabilities over a period of years were memory safety issues, and Google has reported similar figures for Chrome.

The proposed solution is to replace C and C++ with languages that eliminate memory safety bugs by construction — languages where the type system or runtime makes the entire class of error impossible. Rust has emerged as the industry's favoured answer to this challenge, and it is worth understanding exactly what Rust does well before examining what it cannot do.

What is a CVE and why should Eiffel developers care?

A CVE — Common Vulnerabilities and Exposures — is a standardised identifier assigned to a publicly disclosed security vulnerability. Each CVE has the format CVE-YEAR-NUMBER, for example CVE-2024-8176. The year indicates when the identifier was assigned, not necessarily when the vulnerability was introduced or discovered. The system is maintained by MITRE Corporation and sponsored by the US Department of Homeland Security.

When a vulnerability is discovered and responsibly disclosed, it receives a CVE identifier along with a description of the flaw, a CVSS score (a number from 0 to 10 indicating severity), and references to affected versions and available patches. Security scanning tools used by enterprises, cloud providers, and Linux distributions all track CVE identifiers to flag vulnerable software automatically.

For a library developer, a CVE is simultaneously a technical failure and a reputational event. Each CVE represents a bug that reached production, was discovered by a researcher or attacker, and required an emergency patch and coordinated disclosure across every downstream project that embeds the library. For widely-embedded libraries this process affects hundreds of projects simultaneously.

Why should Eiffel developers care about CVEs? Because a new Eiffel library that can credibly claim a lower CVE rate than its C equivalent — and more importantly explain architecturally *why* that is the case — has a compelling story to tell to exactly the audience that the memory safety moment has made receptive. Security teams, procurement processes, and engineering leadership at major organisations are actively looking for justification to replace C libraries. A well-argued, well-demonstrated Eiffel alternative is the kind of thing that gets written about on Hacker News, cited in security conference talks, and referenced in procurement decisions.

Zero CVEs at launch is easy to claim. Explaining why the architecture makes entire classes of CVE structurally impossible is the claim worth making — and the one that requires Eiffel's specific capabilities to support.

Enter Rust: the memory safety champion

Rust was created by Graydon Hoare, a programmer at Mozilla, initially as a personal project begun around 2006. The story of its origin is almost poetic: returning home one day to find his apartment building's elevator broken due to a software crash, Hoare was struck by the absurdity of computer people being unable to write elevator software that didn't crash. He knew, as any systems programmer would, that such crashes are frequently caused by memory safety bugs. That frustration seeded years of work that would eventually become Rust.

Mozilla began sponsoring Rust around 2009, recognising that Firefox's enormous C++ codebase was a perpetual source of security vulnerabilities. They funded both the language itself and Servo — an experimental browser engine written entirely in Rust that served as a proving ground for the language's capabilities. In 2021 the independent Rust Foundation was established with backing from Google, Microsoft, Amazon, and others, cementing Rust's position as an industry-supported language rather than a Mozilla project.

Rust's core innovation is the borrow checker — a compile-time analysis that enforces strict rules about ownership and borrowing of memory. The rules ensure that memory is always either owned by exactly one binding or borrowed under controlled conditions, making it impossible to have two mutable references to the same data simultaneously, impossible to use memory after it has been freed, and impossible to free memory twice. These guarantees are checked at compile time with no runtime overhead — the resulting binary is as fast as equivalent C code while being provably free of the specific class of memory corruption bugs that the checker addresses.

The results have been significant. Stylo, Firefox's CSS style engine rewritten in Rust, eliminated an entire category of memory corruption CVEs from that component. Android's Rust adoption has correlated with measurable reductions in memory safety vulnerabilities. The Linux kernel began accepting Rust code in version 6.1.

Perhaps most visibly for this discussion, sudo-rs — a Rust rewrite of the sudo utility — shipped as the default sudo implementation in Ubuntu 25.10, released in October 2025, ahead of its anticipated adoption in the Ubuntu 26.04 LTS release expected in April 2026. This would place a Rust-based security-critical utility on tens of millions of systems.

Rust is a genuine achievement and its memory safety guarantees are real. But there is a flaw in the argument that memory safety is sufficient for security — and it showed up almost immediately in sudo-rs.

The flaw in the memory safety argument

In 2025, as sudo-rs was being adopted in Ubuntu, researchers discovered two vulnerabilities in the implementation. Neither was a memory safety violation. Both were logic errors.

One allowed a user with limited sudo privileges to enumerate other users' sudo permissions — information that should have been inaccessible. The other allowed authentication to be bypassed under specific conditions involving cached credentials. Both vulnerabilities were entirely invisible to Rust's borrow checker, because the borrow checker has no opinion about authentication logic, permission models, or what information one user should be allowed to see about another.

This is not a criticism of Rust specifically. It is an illustration of a fundamental limitation of the memory safety framing. Memory safety addresses one class of bug — the class arising from incorrect manual memory management. It says nothing about whether your authentication logic is correct, whether your permission checks are complete, whether your state machine can reach an invalid state, or whether a crafted input can drive your algorithm into quadratic time complexity.

Consider the CVE history of libexpat, the widely-deployed C XML parser:

CVE-2024-8176 — a stack overflow triggered by deeply nested entity references, causing a crash through recursive exhaustion
CVE-2025-59375 — insufficient controls on dynamic memory allocation, allowing a small crafted XML document to trigger massive heap allocation
CVE-2025-66382 — a crafted 2MB XML file causing dozens of seconds of CPU time through algorithmic complexity
Multiple integer overflow CVEs from 2024 — negative length values accepted without validation, integer wraparound on 32-bit platforms

Look carefully at this list. The stack overflow is caused by absent recursion depth checking — a missing precondition. The unbounded memory allocation is caused by absent allocation limits — a missing invariant. The algorithmic complexity attack is caused by absent input complexity bounds — a missing precondition. The integer overflows are caused by absent range validation — missing preconditions.

Every single one of these CVEs is a logic error expressible as a Design by Contract annotation. Every single one would have been caught during development if the code had been written with explicit preconditions and invariants. None of them is a memory corruption bug. None of them would have been prevented by rewriting libexpat in Rust.

This is the flaw in the memory safety argument: it addresses the implementation language's failure modes, not the algorithm's failure modes. A Rust rewrite of libexpat would eliminate use-after-free and buffer overflows — which libexpat has also had — but would do nothing for the logic errors that comprise the majority of its recent CVE history.

Design by Contract addresses logic errors directly. That is the gap Eiffel can fill.

Design by Contract: the missing piece

Design by Contract was introduced by Bertrand Meyer as a central feature of the Eiffel programming language in the late 1980s and formally described in his book Object-Oriented Software Construction. The core idea is straightforward but profound: every routine in a program has a contract with its callers, consisting of three parts.

A precondition expresses what must be true when the routine is called — the caller's obligation. A postcondition expresses what will be true when the routine returns — the routine's guarantee. A class invariant expresses what must always be true about a valid object of that class — the consistency guarantee that every routine must maintain.

In Eiffel these are not comments or documentation. They are executable code, evaluated at runtime during development and testing. A precondition violation means the caller broke the contract. A postcondition violation means the routine failed to deliver its guarantee. An invariant violation means an object has reached an inconsistent state. Each violation is a precisely located, automatically detected bug.
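To make the three parts concrete without an Eiffel toolchain, here is a minimal Python sketch that imitates them with plain assertions. The class name NestingTracker, its limit of 3, and its method names are hypothetical and chosen only for illustration; Eiffel expresses the same checks natively with require, ensure, and invariant clauses, which Python lacks.

```python
class NestingTracker:
    """Tracks XML nesting depth with contract-style checks (illustrative only)."""

    def __init__(self, max_depth: int) -> None:
        assert max_depth > 0, "precondition: max_depth must be positive"
        self.max_depth = max_depth
        self.depth = 0
        self._check_invariant()

    def _check_invariant(self) -> None:
        # Class invariant: must hold whenever a caller can observe the object.
        assert 0 <= self.depth <= self.max_depth, "invariant: depth out of range"

    def enter(self) -> None:
        # Precondition: the caller's obligation.
        assert self.depth < self.max_depth, "precondition: nesting limit reached"
        old_depth = self.depth
        self.depth += 1
        # Postcondition: the routine's guarantee.
        assert self.depth == old_depth + 1, "postcondition: depth not incremented"
        self._check_invariant()

    def leave(self) -> None:
        assert self.depth > 0, "precondition: nothing to leave"
        old_depth = self.depth
        self.depth -= 1
        assert self.depth == old_depth - 1, "postcondition: depth not decremented"
        self._check_invariant()


tracker = NestingTracker(max_depth=3)
for _ in range(3):
    tracker.enter()

try:
    tracker.enter()  # a fourth level breaks the caller's side of the contract
    violation = None
except AssertionError as error:
    violation = str(error)

print(violation)
```

The failure message names the broken clause, which is the property the paragraph above describes: the violation is located at the contract, not at some later crash. Running Python with -O strips assert statements, loosely comparable to building an Eiffel system with contract monitoring switched off.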

Consider how DbC would express the constraints that libexpat's CVE history reveals were missing:

    parse_entity_reference (depth: INTEGER)
        require
            depth_within_bounds: depth <= Maximum_entity_depth
        ...
        ensure
            stack_within_limits: current_stack_depth <= Maximum_stack_depth
        end

    allocate_buffer (requested_size: INTEGER)
        require
            size_is_positive: requested_size > 0
            size_within_amplification_limit: requested_size <= input_size * Maximum_amplification_factor
        ...

    parse_chunk (buffer: SPECIAL [CHARACTER_8]; length: INTEGER)
        require
            length_non_negative: length >= 0
            length_within_buffer: length <= buffer.count
        ...

Each of these preconditions is a one-liner that directly corresponds to a class of CVE that libexpat has shipped. They are not afterthoughts — they are the specification. Writing them forces the developer to think about the contract at the moment of implementation, before any test is written, before any fuzzer runs, before any deployment.

During development and testing, every violation of every contract is automatically detected and reported with a precise location and the values that triggered the violation. The XML specification itself defines precise rules about valid documents, valid entity expansion ratios, valid nesting depths, and valid encoding sequences — all of which map directly to preconditions and invariants. An Eiffel XML parser is a natural home for a machine-readable encoding of the XML specification.

Eiffel also supports loop invariants and loop variants — formal correctness proofs embedded directly in loop constructs. The loop invariant expresses what remains true on every iteration; the loop variant is an integer expression proven to decrease on every iteration and remain non-negative, formally guaranteeing termination. For a parser like xpact whose hot paths consist largely of byte-scanning loops, these are not academic curiosities — they are the mechanism by which the correctness of each scanning loop is formally established rather than assumed.
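A byte-scanning loop of the kind described can be annotated in the same spirit. The sketch below is Python rather than Eiffel, so the invariant and variant are checked by hand with assertions instead of by the language itself; the function name find_delimiter is hypothetical.

```python
def find_delimiter(data: bytes, delimiter: int) -> int:
    """Return the index of the first `delimiter` byte, or len(data) if absent."""
    i = 0
    previous_variant = len(data) + 1
    while i < len(data) and data[i] != delimiter:
        # Loop invariant: no delimiter occurs in the part already scanned.
        # (This check is O(n) per iteration; it is here for illustration only.)
        assert delimiter not in data[:i]
        # Loop variant: non-negative and strictly decreasing, so the loop ends.
        variant = len(data) - i
        assert 0 <= variant < previous_variant
        previous_variant = variant
        i += 1
    # Postcondition: either the input is exhausted or i points at the delimiter.
    assert i == len(data) or data[i] == delimiter
    return i


print(find_delimiter(b"<name attr='1'>", ord(">")))
print(find_delimiter(b"no closing bracket", ord(">")))
```

In Eiffel the invariant and variant clauses sit inside the loop construct itself, so the bookkeeping variable previous_variant used here is unnecessary.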

The crucial point is that contracts do their most valuable work during development even though they may be disabled in the production build. By the time you ship, every test run, every fuzz input, every regression test has been checked against every contract. The logic errors that became libexpat CVEs would have surfaced as contract violations during testing — not as crash reports from production systems.

Choosing the right target: criteria for a billion-user Eiffel project

Not every C library is a good candidate for an Eiffel reimplementation intended to raise Eiffel's profile. The ideal candidate has several specific properties.

Widely deployed — the library should be present on a very large number of systems, ideally in the hundreds of millions. This is what gives the "billion-user" claim its meaning. A library used by Python, by Mozilla, by hundreds of Linux distributions, by embedded systems, by enterprise software stacks is a library whose replacement would be noticed.

Clear CVE history — the library should have a documented track record of security vulnerabilities, specifically of the logic error variety that DbC addresses rather than exclusively memory corruption bugs. This is what makes the "catches the flaws that Rust misses" argument concrete and falsifiable rather than theoretical.

No serious Rust rewrite underway — the project needs a clear field to operate in. If a well-funded Rust replacement already exists and is shipping, the strategic window has closed. The goal is to arrive first, not to compete with an entrenched alternative.

Achievable scope — the library should be small enough that a focused team of Eiffel developers can produce a credible implementation in a reasonable timeframe. A project that requires five years and twenty developers before it can be evaluated publicly is not a visibility win — it is a long-term bet that may never pay off.

Domain where DbC is visibly valuable — the library's domain should be one where the correctness properties are well-defined and expressible as contracts. Parsers and protocol implementations are ideal because they implement formal specifications — the XML specification, the HTTP specification, the DNS specification — where preconditions and postconditions can be derived directly from the standard.

Permissive licence — the original C library should be under a licence that permits a clean-room reimplementation without legal complications. MIT and BSD licences are ideal.

libexpat satisfies every one of these criteria.

Why libexpat is the ideal candidate

libexpat is a stream-oriented XML parser written in C, originally created by James Clark — the technical lead of the W3C XML Working Group that produced the XML specification itself. Clark named it after his own status as a British expatriate living in Thailand, with the secondary meaning of XML Parser Toolkit. He released it in 1998 under what is now the MIT licence, made it the parser used to add XML support to Netscape 5, and handed it to a maintenance team around 2000.

It is embedded in an extraordinary range of software. Python's standard library uses it as the backend for xml.parsers.expat and the higher-level xml.etree and xml.sax modules — meaning it is present on every Python installation on every platform. It was used by Mozilla. It is a dependency of hundreds of Linux packages. It appears in embedded systems and hardware devices. Conservative estimates of the number of systems carrying libexpat run into the hundreds of millions.

Its CVE history is extensive and ongoing. The list includes integer overflows, stack overflows from recursive entity expansion, unbounded memory allocation, negative length acceptance, algorithmic complexity attacks, and null pointer dereferences — with new CVEs appearing as recently as early 2026. Critically, the recent CVEs are predominantly logic errors rather than memory corruption bugs, which means a Rust rewrite would not have prevented them.

The library is small. The core implementation lives in approximately five source files — xmlparse.c (around 7,000-8,000 lines), xmltok.c with its multiply-included encoding variants, xmlrole.c (the prolog state machine), and supporting headers. The entire codebase is perhaps 12,000-14,000 lines of C. In Eiffel, with generics handling the encoding variants that C addresses through macro-based multiple inclusion, and with the state machine mapping naturally to a class hierarchy, a complete implementation would likely be around 4,000-6,000 lines — achievable by a small focused team.

libexpat is entirely single-threaded. There is no internal concurrency, no mutex, no shared state between parser instances. The thread safety model is simply one parser instance per thread. This simplifies the implementation considerably and, as we will see, opens interesting opportunities for Eiffel's SCOOP model to add value beyond the baseline.

No serious Rust rewrite of libexpat exists. A mechanical C2Rust transpilation called rexpat exists, maintained by Immunant, with the stated ambition of eventually becoming safe idiomatic Rust — but it remains explicitly marked as a work-in-progress producing unsafe Rust, has no published releases, and is not on crates.io. It is not for production use. The field is open.

The licence is MIT. There are no legal complications.

libexpat's maintainer has publicly sought help from the organisations embedding it, writing to approximately forty companies asking for security contacts and assistance. There is institutional awareness of the problem and an audience already primed to care about a credible replacement.

Introducing xpact

The proposed project is called xpact. The name works on two levels simultaneously: it sounds like expat, immediately signalling to existing developers what it is and what it replaces; and pact means a formal agreement — a direct reference to Design by Contract without spelling it out. The tagline is straightforward: xpact — XML parsing, by contract.

The architecture has several key components.

The core parser is written in Eiffel, implementing the XML 1.0 specification as a streaming parser with the same event-driven callback model as libexpat. The key difference is that every significant operation carries explicit DbC annotations derived directly from the XML specification and from the known constraints that libexpat's CVE history reveals were missing.

The C-compatible interface is a thin layer that exports the libexpat public API — XML_ParserCreate, XML_SetElementHandler, XML_SetCharacterDataHandler, XML_Parse, and the remaining public functions — as C-callable functions. This makes xpact a drop-in replacement: any application currently linking against libexpat can switch to xpact by substituting the library file without changing a single line of application code.
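For readers who have not used the event-driven model, the same API shape is visible through Python's standard xml.parsers.expat module, which wraps libexpat. The sketch below registers handlers and feeds the parser in two chunks. The sample document and the events list are made up for illustration, and nothing here is xpact code.

```python
import xml.parsers.expat

events = []


def start_element(name, attrs):
    events.append(("start", name, dict(attrs)))


def end_element(name):
    events.append(("end", name))


def character_data(text):
    if text.strip():
        events.append(("text", text))


parser = xml.parsers.expat.ParserCreate()      # cf. XML_ParserCreate
parser.StartElementHandler = start_element     # cf. XML_SetElementHandler
parser.EndElementHandler = end_element
parser.CharacterDataHandler = character_data   # cf. XML_SetCharacterDataHandler

# Streaming: the document may arrive in arbitrary chunks (cf. XML_Parse).
# The split here falls inside the closing tag of <item>.
parser.Parse(b"<order id='42'><item>widget</it", False)
parser.Parse(b"em></order>", True)

for event in events:
    print(event)
```

A drop-in replacement has to preserve exactly this contract with its callers: handlers registered up front, input delivered incrementally, and events emitted in document order.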

The shared library target uses EiffelStudio's ability to compile Eiffel to a DLL on Windows or a shared library on Linux and macOS, with the Eiffel runtime bundled within the library. The library presents a standard C ABI to its callers. The Eiffel internals are invisible.

The Python binding provides a PyPI-installable package that wraps xpact for Python use. Given that Python's standard library already uses libexpat as its XML parsing backend, a well-packaged Python binding with clear security documentation would reach Python's enormous developer community directly. A Python developer choosing between the standard library's expat module and xpact with published contract annotations and CVE-regression test results is exactly the kind of choice that generates discussion and adoption.

The name libxpact for the shared library and xpact for the Python package completes the naming scheme.

A .NET parser component. EiffelStudio's .NET backend compiles Eiffel directly to Common Intermediate Language (CIL), producing a proper .NET assembly rather than a native shared library. This means xpact could ship simultaneously as a native libxpact.so / xpact.dll for C and Python consumers, and as a NuGet-installable .NET assembly consumable directly from C#, F#, or VB.NET — all from the same Eiffel source base. For .NET consumers the Eiffel classes appear as native .NET types with no P/Invoke FFI layer required. More importantly, the .NET build exposes xpact's parsing events as native CLR delegates rather than C function pointers, giving C# and F# consumers an idiomatic event-driven API that feels entirely natural within the .NET ecosystem — a considerably more appealing product than a thin wrapper around C conventions would be. The GC interaction concern that applies to the native build disappears entirely in the .NET target, since everything runs inside the CLR's own managed environment. Since the .NET runtime is already present on any .NET-capable system, the runtime bundling overhead of the native build does not arise here either — the CLR simply hosts the assembly alongside everything else. This is a capability Rust cannot easily match, and it significantly widens xpact's potential audience to include the enormous .NET enterprise ecosystem. A single Eiffel codebase therefore delivers three deployment targets — native C-compatible shared library, Python PyPI package, and .NET NuGet package — each reaching a distinct developer community with an API idiomatic to that community.

Eiffel's practical advantages for this specific task

Several specific properties of Eiffel and EiffelStudio make xpact not merely possible but genuinely well-suited to this task in ways that are worth stating explicitly.

Compile-to-C portability. EiffelStudio generates C as an intermediate representation, which is then compiled by whatever C compiler is available on the target platform — GCC, Clang, or MSVC. This means xpact is portable across Windows, Linux, and macOS without platform-specific code, and integrates naturally into build systems that already use those compilers. This is a significant advantage over Rust, which manages its own compilation pipeline and has historically required effort to integrate into Windows MSVC-based projects.

SPECIAL arrays for hot-path performance. The innermost loop of an XML parser — scanning bytes, recognising token boundaries, checking encoding validity — needs to be fast. Eiffel's SPECIAL type is a contiguous block of typed memory with no object overhead, no GC pressure, and cache-friendly sequential access. Operations on SPECIAL arrays generate tight C loops equivalent to what a C programmer would write directly. The hot path of xpact can be SPECIAL-backed throughout, delivering performance competitive with the C original.

GC management during parsing. The garbage collector can be disabled for the duration of a parse and all memory reclaimed in a single operation when the parse completes. A parse has a natural transactional boundary — start, process, finish — that maps perfectly to this model. Intermediate objects (token strings, attribute value buffers, namespace prefix strings) are all short-lived and their lifetime is bounded by the parse. Turning the GC off during parsing means no GC pauses during the critical path and entirely predictable memory behaviour — important for callers embedding xpact in performance-sensitive contexts.

String pool recycling. XML documents are highly repetitive in terms of string vocabulary — the same element names, attribute names, and namespace prefixes appear thousands of times in a large document. A string pool that interns these strings, allocating each unique string once and returning a reference for subsequent occurrences, reduces allocation pressure dramatically. This pattern is used in Eiffel-Loop and is directly applicable to xpact. Combined with GC-off parsing, the memory behaviour becomes competitive with the most optimised C implementations.
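A minimal sketch of the interning idea, written in Python for brevity. The StringPool class and its intern method are hypothetical names and do not reflect the Eiffel-Loop API; the point is only that repeated names resolve to a single shared object.

```python
class StringPool:
    """Returns one shared str object per distinct name (illustrative sketch)."""

    def __init__(self) -> None:
        self._pool: dict[bytes, str] = {}
        self.hits = 0

    def intern(self, raw: bytes) -> str:
        name = self._pool.get(raw)
        if name is None:
            # First occurrence: decode and allocate exactly once.
            name = raw.decode("utf-8")
            self._pool[raw] = name
        else:
            self.hits += 1
        return name


pool = StringPool()
tokens = [b"item", b"price", b"item", b"item", b"price", b"sku"]
names = [pool.intern(token) for token in tokens]

print(len(pool._pool), "unique names,", pool.hits, "pool hits")
```

Every hit is an allocation avoided, so the saving grows with how repetitive the document's vocabulary is.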

Simplified C callbacks without freezing. Normally, passing Eiffel object references into C callback contexts requires "freezing" the objects — pinning them in memory so the GC cannot relocate them during a collection cycle. This adds complexity and is a potential source of bugs if forgotten. With the GC disabled during parsing, objects do not move, and the entire freeze/unfreeze ceremony is unnecessary. The C-compatible callback layer becomes straightforward to implement and impossible to get wrong in this specific way.

SCOOP for pipeline parallelism. libexpat is single-threaded by design. xpact can offer an optional parallel parsing mode using Eiffel's SCOOP concurrency model, implementing a pipeline where the tokenizer, attribute parser, and callback dispatcher run as separate concurrent processors passing work through queues. From the caller's perspective the API is identical — register handlers, call parse, receive callbacks. The parallelism is an internal implementation detail. The correctness of the pipeline is guaranteed by SCOOP's model rather than requiring manual mutex management. This would be a genuine capability improvement over libexpat — the same API, internally pipelined for multi-core hardware.
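The pipeline shape can be sketched with ordinary threads and queues. This Python version illustrates only the data flow between stages: it offers none of SCOOP's guarantees, it will not show a real speed-up under CPython's global interpreter lock, and the stage functions (tokenize, parse_attributes) and the toy tag format are invented for the example.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def run_stage(transform, inbox, outbox):
    """Apply `transform` to each item from inbox and forward the result."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(transform(item))


def tokenize(raw: str) -> list:
    # "<item id=1 qty=2>" -> ["item", "id=1", "qty=2"]
    return raw.strip("<>").split()


def parse_attributes(tokens: list) -> tuple:
    name, *pairs = tokens
    return name, dict(pair.split("=", 1) for pair in pairs)


raw_tags, token_lists, elements = queue.Queue(), queue.Queue(), queue.Queue()
stages = [
    threading.Thread(target=run_stage, args=(tokenize, raw_tags, token_lists)),
    threading.Thread(target=run_stage, args=(parse_attributes, token_lists, elements)),
]
for stage in stages:
    stage.start()

for tag in ["<order id=42>", "<item sku=A1 qty=2>"]:
    raw_tags.put(tag)
raw_tags.put(DONE)

# The dispatcher runs on the caller's thread, so callbacks arrive in order.
received = []
while (element := elements.get()) is not DONE:
    received.append(element)

for stage in stages:
    stage.join()
print(received)
```

Because each queue is first-in first-out and each stage has a single worker, document order is preserved end to end, which is what lets the caller-facing API stay identical to the sequential one.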

Void safety. Eiffel's void safety guarantee is static and always active regardless of whether contracts are enabled in the production build. It eliminates null pointer dereference — one of libexpat's historical CVE types — at compile time. No runtime check is needed because the type system proves it cannot occur.

Decades-tested IDE and tooling. EiffelStudio has been refined over decades around a single philosophy — that code should be comprehensible, navigable, and provably correct. Its diagram tool generates live class diagrams directly from source. Its contract browser presents preconditions, postconditions and invariants as first-class navigable elements alongside the code they annotate. The flat and short forms of a class expose the complete inherited interface without manual hierarchy navigation. This level of integration between correctness mechanism and development environment has no equivalent in the Rust ecosystem, where IDE support — while good and improving — is provided through general-purpose plugins rather than an environment built around the language's specific philosophy. For a security-critical library expected to be maintained for years by contributors who may not have written the original code, this difference matters practically. xpact's maintainability story is not just about Eiffel's language properties — it is about the entire environment in which it is developed and evolved.

Incidental obfuscation. There is one unintended security benefit worth noting with a smile: EiffelStudio's generated C is so thoroughly transformed from the original Eiffel source that conventional C vulnerability research tools and techniques produce largely meaningless results against it. The security-relevant behaviour is expressed in the Eiffel contracts, not in identifier-mangled, macro-expanded generated C that bears no resemblance to hand-written code. This is not a security strategy anyone would recommend deliberately. But as an accidental consequence of the compilation model, it is not nothing.

xpact vs expat: bridging the performance gap

A correctness argument without a performance argument is only half a case. For xpact to be taken seriously by the libexpat community — and for Python and C developers to consider it as a genuine drop-in replacement — it must deliver parse throughput competitive with the C original. This section addresses that challenge directly, and argues that the right architectural choices make xpact's performance model not merely competitive with libexpat but potentially superior for large document parsing.

The SPECIAL array limitation

Eiffel's SPECIAL type delivers near-C performance for numeric and byte operations. However for string operations — the dominant operation in an XML parser doing token recognition — SPECIAL-backed strings carry abstraction overhead that accumulates over millions of comparisons. An XML parser continuously asks questions like: does this token start with xmlns:? Does this attribute name start with xml? Is this the opening of a <![CDATA[ section? Each such check on a STRING_8 goes through Eiffel's bounds-checking and character access abstractions. For a parser processing large documents that overhead is not negligible.

The C_STRING_8 approach

The benchmarks are driven by C_STRING_8, a lightweight proof-of-concept class developed purely for this benchmarking exercise. It is not a production-ready string library — it implements only the handful of operations needed to construct a meaningful comparison with STRING_8: character access, prefix testing, occurrence counting, delimiter searching, and zero-copy substring extraction. Its purpose is to demonstrate that the performance gap is real and measurable, and to provide a concrete architectural reference for whoever takes on the Phase 2 optimisation work in xpact. The class wraps C-allocated memory directly, with string operations delegated to optimised C library functions — memcmp for comparison and memcpy for initialisation — and inherits directly from MANAGED_POINTER rather than using composition, giving direct access to the underlying C memory with a single object header and no indirection. Despite its limited scope, each operation carries full DbC postconditions verifying equivalence with STRING_8, so the benchmark simultaneously measures performance and proves correctness on every debug run.
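To make the delegation to the C library concrete, here is a minimal C sketch of the kind of work a C-buffer-backed string hands to memcmp and memchr. The str_view type and the view_* function names are hypothetical and are not taken from C_STRING_8 or Eiffel-Loop; only memcmp, memchr and strlen are real library calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-delimited view over C memory (illustration only). */
typedef struct {
    const char *data;   /* not necessarily null-terminated */
    size_t      count;  /* number of bytes in the view */
} str_view;

/* Wrap a null-terminated string; no bytes are copied. */
str_view view_of(const char *s)
{
    assert(s != NULL);                      /* precondition */
    str_view v = { s, strlen(s) };
    return v;
}

/* Prefix test: one length check, then a single memcmp. */
bool view_starts_with(str_view v, const char *prefix)
{
    size_t n = strlen(prefix);
    return n <= v.count && memcmp(v.data, prefix, n) == 0;
}

/* Count occurrences of byte c, letting memchr do the scanning. */
size_t view_occurrences(str_view v, char c)
{
    size_t total = 0;
    const char *p = v.data;
    const char *end = v.data + v.count;
    while (p < end && (p = memchr(p, c, (size_t)(end - p))) != NULL) {
        total++;
        p++;                                /* resume after the match */
    }
    return total;
}
```

A parser would ask view_starts_with(token, "xmlns:") at each attribute name; the whole check is a length comparison plus one library call, with no per-character access through an abstraction layer.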

Benchmark results

The following benchmark compares STRING_8 (SPECIAL-backed) against C_STRING_8 (C-buffer-backed) for three operations critical to an XML parser. Test data is a list of 64 strings of varying lengths — the English titles of the 64 I Ching hexagrams — exercising realistic short-string workloads typical of XML token recognition.

(Passes over 1000 millisecs in descending order)

RESULTS: starts_with
    C buffer starts_with : 2178.0 times (100%)
    SPECIAL starts_with  :  878.0 times (-59.7%)

RESULTS: occurrences
    C buffer occurrences : 2170.0 times (100%)
    SPECIAL occurrences  : 1687.0 times (-22.3%)

RESULTS: CSV line parsing
    C buffer parse_csv   :   65.0 times (100%)
    SPECIAL parse_csv    :   52.0 times (-20.0%)

The starts_with result is the most striking: STRING_8's SPECIAL-based implementation completes 59.7% fewer passes than memcmp on C-allocated memory, which puts the C buffer at roughly 2.5 times the throughput (2178 / 878 ≈ 2.48). Since token prefix recognition is the dominant operation in XML parsing (recognising element names, namespace prefixes, attribute names), this difference compounds across a full document parse into a substantial throughput advantage. The occurrences result shows STRING_8 completing 22.3% fewer passes, which is a 28.6% throughput advantage for C-buffer access (2170 / 1687 ≈ 1.29).

The CSV line parsing benchmark is the most relevant to xpact's actual workload: it exercises the full token-extraction workflow of finding delimiter characters, creating substrings, and returning results. STRING_8 completes 20% fewer passes here, a 25% throughput advantage for the C buffer (65 / 52 = 1.25). The gap is more modest because the measurement includes object creation overhead that partially offsets the zero-copy savings. For short strings the copy cost is small; the real payoff of the zero-copy architecture reveals itself at document scale.

Zero-copy parsing through shared C buffers

The performance argument extends beyond individual string operations to the parse architecture as a whole. The standard approach when interfacing Eiffel with C involves copying strings across the language boundary — materialising Eiffel STRING_8 objects from C char* pointers at every token boundary. For a streaming parser this copying is continuous and cumulative.

C_STRING_8 eliminates this copying through two complementary mechanisms. First, a make_shared constructor — renamed from MANAGED_POINTER's share_from_pointer — wraps existing C-allocated memory without copying or taking ownership. The caller's input buffer is wrapped directly and the parser operates on it in place. Second, a substring feature creates a new C_STRING_8 that shares a slice of the parent buffer by pointer offset and length, again without copying. Every token the parser recognises — element name, attribute name, attribute value, namespace prefix, text content — becomes a shared substring of the original input buffer. The callback fires with a C_STRING_8 that is a direct window into the caller's own input memory.
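The two mechanisms can be sketched in C as follows. This is an illustration of the idea rather than the C_STRING_8 implementation: shared_view, shared_make, shared_substring and shared_next_token are hypothetical names, and the asserts stand in for the preconditions an Eiffel version would declare.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical non-owning window onto caller memory (illustration only). */
typedef struct {
    const char *data;
    size_t      count;
} shared_view;

/* Wrap caller-owned memory without copying or taking ownership. */
shared_view shared_make(const char *buffer, size_t count)
{
    assert(buffer != NULL);                     /* precondition */
    shared_view v = { buffer, count };
    return v;
}

/* Slice [start, start + count) of parent by pointer offset only. */
shared_view shared_substring(shared_view parent, size_t start, size_t count)
{
    assert(start <= parent.count);              /* precondition: valid start */
    assert(count <= parent.count - start);      /* precondition: slice fits */
    shared_view v = { parent.data + start, count };
    return v;
}

/* Return the token before the next delimiter and advance *rest past it. */
shared_view shared_next_token(shared_view *rest, char delimiter)
{
    const char *hit = rest->count
        ? memchr(rest->data, delimiter, rest->count) : NULL;
    size_t len = hit ? (size_t)(hit - rest->data) : rest->count;
    shared_view token = shared_substring(*rest, 0, len);
    size_t skip = hit ? len + 1 : len;          /* step over the delimiter */
    *rest = shared_substring(*rest, skip, rest->count - skip);
    return token;
}
```

Every view returned here points into the original buffer, so the tokens stay valid only for as long as the caller keeps that buffer alive and unmodified.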

For documents with rich vocabularies (many unique element and attribute names) the zero-copy model also removes character-buffer allocation from the hot parse path. No new character buffers are allocated for tokens regardless of document size. The memory footprint of the parse is bounded by the input buffer itself plus lightweight wrapper objects.

The safety of this model rests on immutability enforced by contract. The DbC postconditions on every C_STRING_8 operation verify equivalence with STRING_8 automatically during development — the benchmark class itself contains postconditions asserting that CSV parsing produces identical results to STRING_8.split. Any deviation is caught as a contract violation rather than a silent correctness error.

How libexpat handles its input buffer

A careful reading of libexpat's internals reveals an architectural difference that works in xpact's favour. When XML_Parse() is called on a default build (one with XML_CONTEXT_BYTES enabled), libexpat copies the caller's input into its own internal buffer via memcpy before tokenising it. Every byte of input is therefore touched twice before parsing begins: once by the caller filling the input buffer, and once by libexpat copying it internally. Callers can sidestep the copy by filling libexpat's buffer directly through XML_GetBuffer() and XML_ParseBuffer(), but code that uses the simpler XML_Parse() entry point pays it on every call. For large documents or high-frequency parsing this is a measurable overhead that xpact's zero-copy architecture avoids entirely.

How libexpat handles name lookups — and why xpact may be faster

libexpat maintains hash tables for element names, attribute names, namespace prefixes, and other named entities. The hash function used is SipHash — a cryptographically strong algorithm chosen specifically to prevent hash collision DoS attacks. SipHash is deliberately more expensive than a simple hash precisely because security requires it. Critically, SipHash is recomputed on every occurrence of a name — if <item> appears 10,000 times in a document, libexpat computes SipHash 10,000 times. There is no hash code caching in libexpat's lookup path.

xpact's C_NULLED_STRING_8_NAME_CACHE replaces hash computation entirely with a bucketed search. A bucket is selected by the first character of the name (a single array index operation) and the search within the bucket requires at most 3 comparisons for the small buckets that typical XML vocabularies produce. There is no hash function to compute, no collision handling, and no rehashing overhead. For a document where <item> appears 10,000 times, xpact reads one character and performs one or two memcmp calls on each of those occurrences.

The security advantage is structural rather than incidental: xpact's bucketed binary search is immune to hash flooding attacks by design. There is no hash function to exploit. libexpat had to adopt SipHash specifically because its earlier hash function was vulnerable to crafted inputs with many collisions — a vulnerability class that does not exist in xpact's architecture.

Callback string types and the appropriate Eiffel class

Not all callback strings benefit equally from the name cache. The following table maps each libexpat callback type to its string representation, the appropriate Eiffel class, and whether caching is beneficial:

Callback type	String type in libexpat API	Eiffel class	Benefits from caching
Element name	Null-terminated (`const XML_Char *`)	C_NULLED_STRING_8	Yes — element names repeat thousands of times in typical documents; intern table lookup after first occurrence is near-zero cost
Attribute name	Null-terminated (`const XML_Char *`)	C_NULLED_STRING_8	Yes — attribute names are part of the document vocabulary and repeat with the same frequency as element names
Namespace prefix	Null-terminated (`const XML_Char *`)	C_NULLED_STRING_8	Yes — namespace prefixes are a small fixed vocabulary; typically only 2–5 unique prefixes per document
Attribute value	Null-terminated (`const XML_Char *`)	C_NULLED_STRING_8	Rarely — attribute values are typically unique per element occurrence; caching provides no benefit for dynamic values such as identifiers, URLs, or numeric data. Fixed enumerated values (e.g. `type="text"`) would benefit.
Character data / text content	Fixed-length buffer + length (`const XML_Char *, int len`)	C_STRING_8	No — text content is almost always unique; a shared substring window into the input buffer is sufficient and correct. No null terminator is needed since the length is provided explicitly.
Comment	Null-terminated (`const XML_Char *`)	C_NULLED_STRING_8	No — comments are typically unique prose; caching provides no benefit. A fresh C_NULLED_STRING_8 per comment is appropriate.
Processing instruction target	Null-terminated (`const XML_Char *`)	C_NULLED_STRING_8	Yes — processing instruction targets (e.g. `xml-stylesheet`) are a small fixed vocabulary that repeats across documents
Processing instruction data	Null-terminated (`const XML_Char *`)	C_NULLED_STRING_8	No — processing instruction data is typically unique per occurrence

What this means for xpact's overall performance

The cumulative picture suggests xpact could realistically match or modestly exceed libexpat's throughput for typical document workloads once Phase 2 optimisations are in place:

The zero-copy input architecture eliminates libexpat's initial memcpy of every input byte
The bucketed binary search eliminates SipHash recomputation on every name occurrence
The C_NULLED_STRING_8 intern table means callback strings are allocated once and remain CPU-cache-hot throughout the parse
STRING_8 objects carry two heap-allocated objects per string (instance + SPECIAL [CHARACTER_8]) while C_STRING_8 requires only one, halving the GC object count per token
Disabling the GC during parsing eliminates GC pause interference entirely

The advantage is most pronounced for documents with repeated structure — RSS feeds, SOAP messages, configuration files, data serialisation formats — where the same element and attribute names appear thousands of times. For these workloads libexpat computes SipHash thousands of times while xpact performs a single bucket lookup and one or two memcmp calls per name occurrence.

The advantage is less clear for documents with large unique vocabularies or very short documents where the vocabulary discovery phase dominates, and libexpat's 25 years of hot-path micro-optimisation in the tokeniser remains a genuine advantage that EiffelStudio-generated C will not automatically match.

Managing GC during parsing

The Eiffel runtime's MEMORY class provides sophisticated control over garbage collection that maps cleanly onto xpact's parse lifecycle requirements. The key insight is that a parse has a natural transactional boundary — start, process, finish — and the MEMORY class provides exactly the tools needed to manage GC across that boundary safely and efficiently.

execute_without_collection — the idiomatic solution

Rather than manually calling collection_off and collection_on around the parse, the preferred approach is execute_without_collection, which accepts the parse procedure as an agent:

execute_without_collection (a_action: PROCEDURE)
        -- Execute `a_action' with the garbage collector disabled.
        -- If `a_action' modifies the status of `collecting', we restore
        -- it no matter what at the end.
    ensure
        collection_status_preserved: collecting = old collecting

The postcondition collection_status_preserved and an internal rescue clause guarantee that GC status is always restored to its original state even if an exception occurs during parsing. This eliminates the risk of leaving the GC permanently disabled after a parse error — a subtle but serious bug that manual collection_off / collection_on bookkeeping can introduce.

The large document safety valve

With GC disabled, Eiffel heap allocations accumulate without collection. For xpact's hot parse path — where string data lives in C memory via C_STRING_8 and C_NULLED_STRING_8 — Eiffel heap allocation should be minimal. However for pathologically large documents a safety valve is advisable. set_memory_threshold provides exactly this:

set_memory_threshold (value: INTEGER)
        -- Set a new `memory_threshold' in bytes. Whenever the memory
        -- allocated for Eiffel reaches this value, an automatic
        -- collection is performed.

Setting a generous threshold — say 50MB — allows the parse to run effectively GC-free for all typical documents while triggering an automatic collection only if the Eiffel heap grows unexpectedly large. The threshold is a configurable class constant in xpact, adjustable for memory-constrained deployment environments.

allocate_fast and full_coalesce — the complete parse lifecycle

Two further features complete the picture. allocate_fast switches the runtime to speed-optimised allocation mode before the parse begins — prioritising throughput over memory compactness, which is the right trade-off during a high-frequency parse. After the parse, full_coalesce merges adjacent free memory blocks to reduce fragmentation that accumulated during the GC-off period — the ISE documentation explicitly notes it is "useful when a lot of memory is allocated with garbage collector off", which is written for exactly this use case.

The recommended xpact parse lifecycle combining all of these:

parse_document (input: C_STRING_8)
    do
        allocate_fast
        set_memory_threshold (Default_memory_threshold)
        execute_without_collection (agent do_parse (input))
        full_collect
        full_coalesce
        allocate_compact
    ensure
        gc_restored: collecting = old collecting
    end

Default_memory_threshold: INTEGER = 50_000_000
        -- 50MB safety valve for large document parsing

Benchmarking GC behaviour with memory_statistics

The MEMORY class also provides precise measurement tools invaluable for Anders' benchmark suite:

memory_statistics (memory_type: INTEGER): MEM_INFO

gc_statistics (collector_type: INTEGER): GC_INFO

memory_count_map: HASH_TABLE [INTEGER, INTEGER]
        -- Number of instances per dynamic type present in system

These allow the benchmark to measure exactly how much Eiffel heap allocation occurs during a parse and how many GC cycles were triggered — providing concrete numbers for the claim that xpact's hot parse path generates near-zero GC pressure. memory_count_map is particularly useful during development: calling it before and after a parse confirms that the zero-allocation design is working as intended by verifying that no unexpected object types were created on the Eiffel heap during the parse.

The sophistication of the MEMORY class means xpact's GC management strategy is not a collection of ad hoc workarounds but a properly specified, well-documented, and safely recoverable part of the parser's architecture — expressed in Eiffel with the same contract discipline as every other component.

Disclaimer

All of the above is theoretical reasoning based on architectural analysis of libexpat's internals and xpact's proposed design. Performance claims in systems programming are notoriously unreliable until measured on real hardware with real workloads. Cache behaviour, branch prediction, compiler optimisation, and document structure all interact in ways that are impossible to fully anticipate analytically. The argument presented here is that xpact's architecture gives it a plausible path to competitive performance — not a guarantee of it. We won't know until we build the thing and benchmark it properly against a representative corpus of real-world XML documents.

That said — if xpact were to achieve even a modest performance edge over libexpat, it would be a remarkable result. An Eiffel XML parser outperforming a 25-year-old hand-optimised C library would be a story worth telling well beyond the Eiffel community. That alone would be worth writing home about. I am excited to try.

XML name interning: a zero-copy, cache-efficient strategy for xpact callbacks

(Class names link to GitHub)

One of the subtler challenges in implementing a libexpat-compatible API in Eiffel is the null-termination contract. The libexpat callback API passes element names, attribute names, and namespace prefixes as null-terminated C strings — a requirement inherited from C's string conventions. xpact's internal representation uses C_STRING_8 shared substrings that are direct windows into the input buffer and are deliberately not null-terminated. Bridging this gap efficiently without abandoning the zero-copy architecture requires a name interning strategy.

The intern table

The solution is a name intern table — a cache mapping non-null-terminated C_STRING_8 token references to null-terminated C_NULLED_STRING_8 instances that satisfy the callback API's contract. The first time an element or attribute name is encountered during parsing, a null-terminated copy is created and stored in the cache. Every subsequent occurrence of the same name finds the cached entry immediately and returns the already-cached pointer. The C callback client receives a pointer into CPU-cache-hot memory.

This is the right target for interning because element and attribute names are precisely the strings that are highly repetitive — the same names appear potentially thousands of times in a large structured document — and short, typically fitting within one or two cache lines. Once the document's name vocabulary is established, which typically happens within the first few hundred bytes of a well-formed document, every name lookup is a pure cache hit with no new allocation.

The elegance of C_NULLED_STRING_8

Rather than pinning Eiffel objects in memory to prevent GC relocation — a mechanism that adds bookkeeping complexity and requires careful lifecycle management — the solution is simpler: C_NULLED_STRING_8, a class inheriting from C_STRING_8 that allocates its character data in C memory rather than on the Eiffel heap. Since C-allocated memory is invisible to the GC entirely, there is nothing to pin. The GC cannot relocate what it does not manage.

C_NULLED_STRING_8 is created once from a C_STRING_8 token, copies the bytes into a C buffer one byte larger than the source, and writes a null terminator at the end. The resulting object satisfies the libexpat callback contract — a valid null-terminated C string pointer — with no GC involvement, no pinning ceremony, and a contract that verifies correctness automatically in development mode:

make (str: C_STRING_8)
        -- initialize from `str' and terminate with NULL character
    do
        make_sized (str.count + 1)
        area.memory_copy (str.area, str.count)
        put_character ('%U', str.count)
    ensure
        room_for_null: count = str.count + 1
        null_terminated: item (count) = '%U'
        same_string: str.to_string ~ to_string
    end

The three postconditions express the complete contract: the buffer is exactly one byte larger than the source, the final byte is null, and the string content is identical to the source. Any deviation is caught automatically during development.
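For readers more at home in C, the same construction can be sketched with asserts playing the role of the three postconditions. The function name nulled_copy is hypothetical, and C asserts are a much weaker stand-in for contracts: nothing obliges a caller to keep them enabled or a maintainer to keep them in step with the code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy count bytes from src into a fresh C buffer with a trailing '\0'.
   Returns NULL if allocation fails; the caller releases it with free(). */
char *nulled_copy(const char *src, size_t count)
{
    assert(src != NULL);                        /* precondition */
    assert(count < SIZE_MAX);                   /* count + 1 must not wrap */

    char *copy = malloc(count + 1);             /* room_for_null */
    if (copy == NULL)
        return NULL;
    memcpy(copy, src, count);
    copy[count] = '\0';

    assert(copy[count] == '\0');                /* null_terminated */
    assert(memcmp(copy, src, count) == 0);      /* same_string */
    return copy;
}
```

Calling nulled_copy(input + 1, 4) on the buffer "<item>" yields a heap string that strlen and strcmp can consume directly, which is what a libexpat-style callback client expects.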

Why binary search rather than a hash table

The obvious data structure for an intern table is a hash table. For xpact's typical workload however a sorted arrayed map list with binary search is likely more efficient. The reasoning is straightforward: XML documents in practice have modest name vocabularies — commonly 20 to 50 unique element and attribute names — and for collections of this size a sorted array that fits entirely in CPU cache outperforms a hash table. Hash lookup requires computing a hash code, handling potential collisions, and accessing memory at a hash-determined offset. Binary search on a cache-hot sorted array of 30 names requires at most 5 comparisons on data that is almost certainly already in L1 cache. There is no hash computation, no collision handling, and no rehashing overhead.
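The comparison count quoted above follows from the halving behaviour of binary search. A small C helper, written only for this illustration (the name is hypothetical), counts the worst-case number of three-way comparisons needed to search a sorted array of n entries:

```c
#include <stddef.h>

/* Worst-case probes for binary search over n sorted entries:
   each probe halves the remaining range until nothing is left. */
unsigned worst_case_comparisons(size_t n)
{
    unsigned probes = 0;
    while (n > 0) {
        probes++;
        n /= 2;
    }
    return probes;
}
```

For n = 30 the helper returns 5, matching the figure given for a 30-name vocabulary.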

The insert cost — maintaining sort order when a new name is added — is paid only during the vocabulary discovery phase, which ends quickly. Once all names have been seen at least once, the table is static and every subsequent lookup is a pure search with no structural changes.

Bucketing by first character

For documents with larger vocabularies (schemas with hundreds of element types, configuration files with complex attribute sets) the sorted map list is bucketed by the first character of the name. C_NULLED_STRING_8_NAME_CACHE is an array of EL_KEY_INDEXED_ARRAYED_MAP_LIST instances indexed by first character code. This reduces the effective search space from the full vocabulary to the subset sharing the same initial character. With 200 unique names spread evenly across 26 initial letters the average bucket contains 7 to 8 entries, and binary search on 8 entries requires at most 4 comparisons (3 for a bucket of 7). The bucket array itself is small enough to stay resident in L1 cache, making the two-level lookup (bucket selection by character code, then search within the bucket) extremely cache-friendly.

The item feature implements the full lookup and lazy creation in one place:

item (name: C_STRING_8): C_NULLED_STRING_8
        -- cached null terminated name for fixed length C string `name'
        -- creating one if it doesn't exist already
    require
        not_empty: name.count > 0
    do
        if attached array_item (name [1].code) as map_list then
            if map_list.is_empty then
                create Result.make (name)
                map_list.extend (name, Result)
            else
                search (map_list, name)
                if map_list.after then
                    create Result.make (name)
                    map_list.extend (name, Result)
                else
                    Result := map_list.item_value
                end
            end
        end
    end

Linear search or binary search — chosen automatically

The search feature applies a threshold of 10 entries per bucket: a bucket holding 10 entries or fewer is searched linearly, and binary search takes over once the count exceeds 10. For typical XML documents with modest vocabularies most buckets will have 1 to 3 entries and the linear path dominates. The binary search path is insurance for schemas with many names sharing the same initial letter. The threshold is expressed as a named constant, making it visible and easily tunable:

search (map_list: like new_map_list; name: C_STRING_8)
    do
        if map_list.count > Linear_search_count then
            map_list.binary_search (name)
        else
            map_list.start
            map_list.search_key (name)
        end
    end

Linear_search_count: INTEGER = 10

The unit test proves it works

The implementation is verified by a unit test in CONTAINER_STRUCTURE_TEST_SET using a vocabulary of 15 names — a dozen words beginning with 'a' to stress the bucketing and search logic, plus three names from other buckets. The critical assertion is identity rather than equality: the same name looked up twice must return the same object reference, confirming that the cache is genuinely interning rather than creating fresh instances:

test_key_indexed_arrayed_map_list
    local
        cache: C_NULLED_STRING_8_NAME_CACHE
        name: C_STRING_8; name_null_1, name_null_2: C_NULLED_STRING_8
        name_str, dozen_a_words: STRING_8
    do
        create cache.make
        dozen_a_words := "able,archery,android,anchor,average,ant,ancestor,anca,all,attached,artery,arc"
        assert ("one dozen", dozen_a_words.occurrences (',') + 1 = 12)
        across (dozen_a_words + ",Zig,zag,zebra").split (',') as word loop
            name := word.item
            name_null_1 := cache.item (name)
            name_null_2 := cache.item (name)
            assert ("same reference", name_null_1 = name_null_2)
            create name_str.make_from_c (name_null_1.area) -- calls C function strlen
            assert ("same name", name_str ~ word.item)
        end
        assert ("3 buckets have items", cache.used_count = 3)
    end

The make_from_c call is particularly significant — it calls the C standard library strlen function on the C_NULLED_STRING_8 area pointer to reconstruct a STRING_8, exactly as a C callback client would. The test confirms that the null terminator is in the right place and the content is correct. The dozen 'a' words exercise the binary search threshold — 12 entries in the 'a' bucket exceeds the threshold of 10, confirming that binary search engages correctly for larger buckets.

A C programmer's perspective

A C programmer reading this architecture would recognise each individual component — bucket arrays, sorted lists, binary search, null termination — but implementing the whole safely in C would require manual maintenance of sort order across multiple data structures, manual memory management for each cached string, manual verification that null terminators are written correctly at every allocation site, and no automatic detection of any error. A subtle bug in the bucket index calculation or an off-by-one on the null terminator would produce silent memory corruption discoverable only under specific document shapes and specific vocabularies. In Eiffel the postconditions on C_NULLED_STRING_8.make verify null termination and content correctness automatically on every construction during development. The type system prevents the category of pointer arithmetic errors that make C implementations of this pattern genuinely hazardous to maintain. The architecture is not simpler in Eiffel — it is the same architecture — but it is safe to build, safe to modify, and self-verifying in a way that the C equivalent simply cannot be.
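To show what that manual bookkeeping looks like, here is a compact C sketch of the same two-level scheme. It is an illustration under stated assumptions, not a port of C_NULLED_STRING_8_NAME_CACHE: the types, function names, the 256-entry bucket array and the growth policy are all hypothetical, and entries are never freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

enum { BUCKET_COUNT = 256, LINEAR_SEARCH_COUNT = 10 };

typedef struct {
    char  **names;      /* sorted, null-terminated, heap-allocated copies */
    size_t  count;
    size_t  capacity;
} bucket;

typedef struct {
    bucket buckets[BUCKET_COUNT];   /* indexed by first byte of the name */
} name_cache;                       /* must start zero-initialised */

/* Order a length-delimited key against a null-terminated entry. */
static int compare_key(const char *key, size_t len, const char *entry)
{
    int c = strncmp(key, entry, len);
    if (c != 0)
        return c;
    return entry[len] == '\0' ? 0 : -1;     /* entry longer: key sorts first */
}

/* Index of the first entry not less than key; *found flags an exact match. */
static size_t find_slot(const bucket *b, const char *key, size_t len, int *found)
{
    size_t lo = 0, hi = b->count;
    *found = 0;
    if (b->count <= LINEAR_SEARCH_COUNT) {  /* small bucket: linear scan */
        for (; lo < hi; lo++) {
            int c = compare_key(key, len, b->names[lo]);
            if (c <= 0) { *found = (c == 0); break; }
        }
        return lo;
    }
    while (lo < hi) {                       /* larger bucket: binary search */
        size_t mid = lo + (hi - lo) / 2;
        int c = compare_key(key, len, b->names[mid]);
        if (c == 0) { *found = 1; return mid; }
        if (c < 0) hi = mid; else lo = mid + 1;
    }
    return lo;
}

/* Return the cached null-terminated copy of key, creating it on first sight.
   Returns NULL only if allocation fails. */
const char *cache_intern(name_cache *cache, const char *key, size_t len)
{
    assert(cache != NULL && key != NULL && len > 0);    /* not_empty */
    assert(memchr(key, '\0', len) == NULL);             /* no embedded NUL */
    bucket *b = &cache->buckets[(unsigned char)key[0]];
    int found;
    size_t slot = find_slot(b, key, len, &found);
    if (found)
        return b->names[slot];
    if (b->count == b->capacity) {                      /* grow the bucket */
        size_t cap = b->capacity ? b->capacity * 2 : 4;
        char **grown = realloc(b->names, cap * sizeof *grown);
        if (grown == NULL)
            return NULL;
        b->names = grown;
        b->capacity = cap;
    }
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return NULL;
    memcpy(copy, key, len);
    copy[len] = '\0';
    /* shift the tail up by one so the bucket stays sorted */
    memmove(&b->names[slot + 1], &b->names[slot],
            (b->count - slot) * sizeof *b->names);
    b->names[slot] = copy;
    b->count++;
    return copy;
}
```

Every line that touches len + 1, slot + 1 or the memmove byte count is a place where an off-by-one would corrupt memory silently, and the only checks present are the two asserts the author remembered to write.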

What this means for xpact

This name interning strategy is not a theoretical optimisation appended to the design as an afterthought. Each component addresses a specific, measurable cost: the zero-copy architecture eliminates allocation pressure on the hot parse path, the bucketed search eliminates hash function overhead for typical vocabulary sizes, and the C-allocated intern table eliminates both null-termination copying and GC pressure on the callback boundary. Together they constitute a parse memory architecture that is competitive with the most optimised C XML parser implementations — not despite being written in Eiffel, but in part because Eiffel makes the necessary contracts expressible and automatically verifiable.

Anders Persson's Phase 1 xpact implementation establishes the correctness baseline. The interning strategy described here and demonstrated in working Eiffel-Loop code defines the Phase 2 performance architecture — available when the time comes, documented, reasoned through, tested, and ready.

Running the expat test suite as a correctness proof

libexpat has a comprehensive test suite with greater than 90% code coverage. More importantly for xpact, it includes regression tests for every historical CVE — specific crafted inputs that triggered each past vulnerability, ensuring that fixed bugs do not regress.

The xpact development methodology is straightforward: run the complete libexpat test suite against xpact in development mode with full contract checking enabled. Every precondition, postcondition, and invariant is evaluated on every test input. Contract violations become test failures.

The CVE regression tests are particularly valuable in this context. The crafted input that triggered CVE-2024-8176 — deeply nested entity references designed to exhaust the stack — will, when run against xpact, trigger the recursion depth precondition before any stack exhaustion can occur. The crafted input for CVE-2025-59375 — a small document triggering massive allocation — will trigger the amplification factor precondition. The 2MB document from CVE-2025-66382 designed to cause dozens of seconds of parse time will trigger the complexity bound precondition.
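The shape of such guards can be sketched in C as explicit checks that run before any recursion or allocation takes place. Everything here is hypothetical: the parse_state fields, the function names and the two limits are illustrative values chosen for the example, not xpact's contracts and not libexpat's actual thresholds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative limits only. */
enum { MAX_ENTITY_DEPTH = 64, MAX_AMPLIFICATION = 100 };

typedef struct {
    size_t entity_depth;    /* current nesting of entity expansion */
    size_t input_bytes;     /* bytes consumed from the document so far */
    size_t expanded_bytes;  /* bytes produced by entity expansion so far */
} parse_state;

/* Precondition for entering a nested entity reference. */
bool may_enter_entity(const parse_state *s)
{
    return s->entity_depth < MAX_ENTITY_DEPTH;
}

/* Precondition for producing `extra` more expanded bytes. */
bool may_expand(const parse_state *s, size_t extra)
{
    size_t base = s->input_bytes > 0 ? s->input_bytes : 1;
    size_t budget = base > SIZE_MAX / MAX_AMPLIFICATION
        ? SIZE_MAX : base * MAX_AMPLIFICATION;      /* avoid overflow */
    return s->expanded_bytes <= budget
        && extra <= budget - s->expanded_bytes;
}

/* Reports a violation by returning false instead of recursing or allocating. */
bool enter_entity(parse_state *s, size_t expansion_size)
{
    if (!may_enter_entity(s) || !may_expand(s, expansion_size))
        return false;
    s->entity_depth++;
    s->expanded_bytes += expansion_size;
    return true;
}

void leave_entity(parse_state *s)
{
    assert(s->entity_depth > 0);    /* precondition: balanced enter/leave */
    s->entity_depth--;
}
```

In Eiffel the two may_* predicates would appear in a require clause, so the violation is raised at the call boundary with the name of the broken assertion rather than handled by a return code the caller might ignore.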

Each historical CVE becomes a contract violation caught in testing rather than a silent failure in production. The claim that xpact can then make is specific and verifiable: xpact passed the complete libexpat test suite including all CVE regression tests with full Design by Contract checking enabled, with zero contract violations. This is an empirical correctness statement, not a theoretical one, backed by the most battle-hardened XML parser test suite in existence.

No C implementation can make an equivalent statement because C has no equivalent mechanism. The most a C parser can say is "our tests passed", but those tests do not check the contract between each function and its callers on every call the way Eiffel's runtime-checked contracts do.

The role of fuzzing

The test suite proves correctness for known inputs. Adversarial inputs in the wild are a different challenge. A sufficiently motivated attacker will try inputs that no test author anticipated — crafted encodings, malformed document structures, inputs designed to drive parser state into unexpected combinations.

Fuzzing addresses this. libexpat itself participates in OSS-Fuzz, Google's continuous fuzzing infrastructure, which generates millions of malformed inputs automatically and reports crashes and hangs. xpact should participate in the same infrastructure.

Eiffel's contracts make fuzzing more productive, not less. When a fuzzer finds an input that triggers a contract violation in development mode, it produces a precise, machine-generated report of exactly which invariant was broken and what input caused it — not a crash dump requiring reverse engineering. The feedback loop from fuzzer to fix is correspondingly faster and more informative.

Eiffel's static verification ecosystem: AutoTest and AutoProof

AutoTest is a contract-driven automated test generation tool integrated into EiffelStudio. Rather than requiring developers to write test cases manually, AutoTest reads existing DbC contracts and generates test inputs automatically — using random and systematic strategies to find inputs that satisfy preconditions, executing routines, and checking whether postconditions and invariants hold. A contract violation becomes an automatically discovered bug without a human having written the test that found it. For xpact, AutoTest would automatically generate both valid and malformed XML inputs, exercising the parser's contract boundaries continuously as development proceeds. The contracts serve double duty: specification and test oracle simultaneously.

AutoProof goes further still — it attempts to prove Eiffel contracts correct at compile time using SMT solvers (Satisfiability Modulo Theories), most notably Microsoft Research's Z3 engine. Rather than finding violations empirically, AutoProof seeks mathematical proof that violations are impossible for all inputs. If a proof succeeds, the guarantee is not "we tested this extensively" but "this is provably correct." AutoProof translates Eiffel contracts into logical propositions and attempts to determine their truth mathematically. A successful proof eliminates an entire class of bug not just for tested inputs but for every possible input. AutoProof currently remains a research prototype — scaling to production codebases of arbitrary complexity is an unsolved problem, and many correct programs cannot yet be automatically proven within practical time bounds. However xpact's bounded, specification-driven domain — a streaming parser implementing a formal grammar with well-defined invariants derived directly from the XML specification — represents close to an ideal case for AutoProof. The possibility of mathematically proving xpact's core parsing contracts correct is realistic in a way that it would not be for a large general-purpose application.

The layered correctness picture for xpact is therefore:

Void safety — static, always on, eliminates null dereference at compile time regardless of contract mode
DbC contracts — verified against the complete libexpat test suite including all CVE regression tests, with full contract checking enabled
AutoTest — automatic test generation driven by contracts, continuously exercising contract boundaries without manual test authorship
AutoProof — mathematical proof of correctness for provable components, eliminating entire bug classes for all possible inputs
Fuzzing — adversarial input coverage via OSS-Fuzz for the unprovable remainder

No other language or toolchain currently offers this combination for a security-critical parser. Rust eliminates memory corruption. Eiffel eliminates memory corruption, proves logic correctness, generates tests from specifications, and offers a path to mathematical verification — all from the same codebase, with the same annotations serving every layer.

A living contract suite: learning from the entire XML parser ecosystem

One of the less obvious advantages of xpact's contract-driven architecture is that it can learn continuously from security research happening across the entire XML parsing ecosystem — not just libexpat's CVE history.

Every XML parser CVE ever published represents a logic error that a correctly written contract would have prevented. libexpat, libxml2, Xerces-C++, MSXML, RapidXML — each has its own CVE history, and many of those vulnerabilities share common root causes: unbounded recursion depth, unconstrained entity expansion ratios, integer overflow on length calculations, negative length acceptance, algorithmic complexity attacks from crafted input structures. These are not libexpat-specific failures. They are recurring patterns in XML parser implementations across languages, decades, and organisations.

This collective CVE history is, in effect, a collaboratively written specification of what an XML parser must constrain — contributed inadvertently by security researchers and attackers worldwide. xpact can treat it as such.

The AI-assisted contract derivation workflow

Claude and similar AI tools are well suited to systematic CVE analysis. The workflow is straightforward:

Fetch CVE descriptions and root cause analyses for all major XML parsers
Categorise each CVE by the class of constraint that was missing — recursion depth, expansion ratio, buffer length, integer range, complexity bound
Map each category to a specific Eiffel precondition or invariant
Identify which constraints xpact already expresses as contracts and which are absent
Suggest specific Eiffel contract annotations for the gaps

This turns the entire XML parser CVE database into a continuously updated contract specification. As new CVEs are published for any XML parser, the same analysis applies — a new vulnerability anywhere in the ecosystem becomes a contract suggestion for xpact within days of disclosure rather than years after exploitation.
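As an illustration of the mapping step, the sketch below turns two of the categories named above, expansion ratio and integer range, into a precondition and an invariant. The class XP_EXPANSION_BUDGET, its feature names, and the idea of a single fixed ratio are assumptions made for this example; they do not describe xpact's actual design or libexpat's own accounting.

```eiffel
class
	XP_EXPANSION_BUDGET
		-- Hypothetical sketch: limits entity expansion output to a
		-- fixed multiple of the input consumed so far.
		-- Assumes counts stay far below the INTEGER_64 maximum.

create
	make

feature {NONE} -- Initialization

	make (a_max_ratio: INTEGER_64)
			-- Allow at most `a_max_ratio' expanded bytes per input byte.
		require
			ratio_at_least_one: a_max_ratio >= 1
		do
			max_ratio := a_max_ratio
		ensure
			ratio_set: max_ratio = a_max_ratio
		end

feature -- Access

	input_bytes: INTEGER_64
			-- Bytes of raw input consumed so far.

	output_bytes: INTEGER_64
			-- Bytes produced by entity expansion so far.

	max_ratio: INTEGER_64
			-- Permitted amplification factor.

feature -- Status report

	can_expand (n: INTEGER_64): BOOLEAN
			-- Would `n' more expanded bytes stay within the budget?
			-- Negative lengths are rejected outright.
		do
			Result := n >= 0 and then output_bytes + n <= input_bytes * max_ratio
		end

feature -- Element change

	record_input (n: INTEGER_64)
			-- Account for `n' bytes of raw input.
		require
			non_negative: n >= 0
		do
			input_bytes := input_bytes + n
		ensure
			added: input_bytes = old input_bytes + n
		end

	record_expansion (n: INTEGER_64)
			-- Account for `n' bytes of expanded output.
		require
			within_budget: can_expand (n)
		do
			output_bytes := output_bytes + n
		ensure
			added: output_bytes = old output_bytes + n
		end

invariant
	non_negative_counts: input_bytes >= 0 and output_bytes >= 0
	ratio_respected: output_bytes <= input_bytes * max_ratio

end
```

Each missing-constraint category from a CVE analysis would end up as one named assertion of this kind, which makes it straightforward to check whether a given category is already covered.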

The inversion worth noting

This inverts the usual security dynamic. Normally attackers discover vulnerabilities and defenders react. In xpact's model, every vulnerability discovered anywhere in the XML parsing ecosystem, whether past, present, or future, becomes an input that strengthens xpact's contract suite. Attackers inadvertently contribute to xpact's correctness story.

The other significant XML parsers

libexpat is the most widely embedded XML parser but the CVE landscape is broader. libxml2 — used by GNOME, Python's lxml, PHP, and many others — has an even longer CVE history and supports XPath, XPointer, and full validation, making it a natural Phase 2 target once xpact establishes its libexpat-compatible core. Xerces-C++ (Apache) has a significant enterprise deployment footprint with its own history of denial-of-service vulnerabilities from crafted inputs. RapidXML and pugixml are widely used in game development and embedded systems — less security scrutiny means their CVE surface may be underreported rather than absent.

The living contract suite

Combined with AutoTest and OSS-Fuzz continuous fuzzing, xpact's contract suite would be updated from three complementary sources: systematic analysis of historical CVEs across all XML parsers, new CVEs as they are published, and AutoTest-discovered violations during active development. This is a specification that grows more complete over time automatically — the opposite of how most security-critical libraries evolve, where specifications are static and vulnerabilities accumulate. xpact's contracts do not just reflect what we know today. They are designed to incorporate what the security community will discover tomorrow.

Addressing the honest objections

A credible proposal addresses its weaknesses directly.

Contracts are disabled in the production build.

This is true and important to acknowledge. Finalised Eiffel builds disable contract checking for performance. However, contracts do their most valuable work during development. Every test run, every CI execution, every fuzz input runs with contracts enabled. The logic errors that became libexpat CVEs would have been caught as contract violations before they ever reached a production build. The production binary benefits from having been developed under contract — the contracts are the process, not just the runtime check. Additionally, void safety is always active regardless of contract mode, and SPECIAL array bounds can be checked selectively on security-critical paths.
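One way to read "checked selectively" is sketched below: a precondition documents and tests the obligation during development, while an ordinary guard, which is plain code rather than an assertion, remains in the finalised binary on paths that handle untrusted offsets. The class XP_BYTE_WINDOW and its features are hypothetical and only illustrate the pattern.

```eiffel
class
	XP_BYTE_WINDOW
		-- Hypothetical sketch: a byte buffer with one contract-checked
		-- accessor and one accessor whose guard survives finalisation.

create
	make

feature {NONE} -- Initialization

	make (n: INTEGER)
			-- Allocate `n' zeroed bytes.
		require
			non_negative: n >= 0
		do
			create area.make_filled ({NATURAL_8} 0, n)
		ensure
			count_set: count = n
		end

feature -- Access

	count: INTEGER
			-- Number of bytes in the window.
		do
			Result := area.count
		end

	item (i: INTEGER): NATURAL_8
			-- Byte at zero-based offset `i'.
			-- The precondition is monitored only when assertion checking is on.
		require
			valid_offset: valid_offset (i)
		do
			Result := area [i]
		end

	last_byte: NATURAL_8
			-- Byte fetched by the most recent successful `read_untrusted'.

feature -- Status report

	valid_offset (i: INTEGER): BOOLEAN
			-- Is `i' a valid zero-based offset?
		do
			Result := 0 <= i and i < count
		end

	out_of_range: BOOLEAN
			-- Did the most recent `read_untrusted' reject its offset?

feature -- Basic operations

	read_untrusted (i: INTEGER)
			-- Fetch the byte at offset `i' into `last_byte' if `i' is valid;
			-- otherwise set `out_of_range'. The guard is ordinary code, so it
			-- stays in the binary however assertions are configured.
		do
			out_of_range := not valid_offset (i)
			if not out_of_range then
				last_byte := area [i]
			end
		ensure
			rejected_or_read: out_of_range or else last_byte = item (i)
		end

feature {NONE} -- Implementation

	area: SPECIAL [NATURAL_8]
			-- Storage.

end
```

Internal callers that have already established the offset use `item' and pay nothing in a finalised build; code that receives offsets derived from input uses `read_untrusted' and pays for one comparison.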

The Eiffel runtime adds size overhead.

Bundling the Eiffel runtime within the shared library will make xpact.so larger than libexpat.so. The exact figure depends on finalisation and dead code removal, but a reasonable estimate is that xpact.so will be in the 600KB-1MB range compared to libexpat.so's 200-300KB. In absolute terms this is small: about 1MB at most. In an era where JavaScript frameworks routinely ship 5-10MB bundles, this is not a serious practical objection. Furthermore, with GC disabled during parsing and SCOOP not used in the minimal single-threaded build, the runtime components actually exercised are a small fraction of the full runtime.

The Eiffel developer pool is small.

This is the most significant long-term constraint. xpact needs contributors, maintainers, and reviewers. The Eiffel community is small. However, this is partly a consequence of Eiffel's visibility problem — and xpact is specifically intended to address that. A successful, visible, widely-deployed xpact library is the kind of project that attracts developers who did not previously know Eiffel existed. The Python binding in particular opens the door to a large developer community discovering Eiffel through a tool they use. The goal is not to build xpact with the current Eiffel community alone but to use xpact to grow it.

The maintainability argument is actually a partial answer to that concern: a codebase that is readable, well-contracted, and supported by mature tooling is easier for new contributors to pick up than a mechanically transpiled unsafe Rust codebase. Lower barrier to contribution partially compensates for a smaller initial pool.

Can xpact really match libexpat's performance?

This is an empirical question that requires a benchmark rather than a theoretical argument. What can be said is that the architectural choices — SPECIAL arrays for the hot path, GC disabled during parsing, string pool recycling — are specifically designed to make the performance case competitive. The experience from Eiffel-Loop with similar patterns suggests the performance story is credible. Publishing benchmark results early and honestly, including where xpact is slower and why, is the right approach.
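The combination of a preallocated SPECIAL buffer and a paused collector can be sketched as follows. GC_PAUSE_DEMO and its routine names are hypothetical; the only library features relied on are `collecting', `collection_off' and `collection_on' from the EiffelBase class MEMORY, plus SPECIAL itself. The scan stands in for a parser's inner loop and is not xpact code.

```eiffel
class
	GC_PAUSE_DEMO
		-- Hypothetical sketch: pause garbage collection around a scan
		-- over a preallocated SPECIAL buffer.

create
	make

feature {NONE} -- Initialization

	make
			-- Run the scan with collection paused, then restore the previous state.
		local
			memory: MEMORY
			was_collecting: BOOLEAN
			buffer: SPECIAL [CHARACTER_8]
		do
				-- Allocate everything before the hot section starts.
			create buffer.make_filled (' ', 64)
			buffer [0] := '<'
			buffer [10] := '<'
			buffer [20] := '<'

			create memory
			was_collecting := memory.collecting
			memory.collection_off
			tag_open_count := occurrences_of ('<', buffer)
			if was_collecting then
				memory.collection_on
			end
			print (tag_open_count.out + "%N")
		end

feature -- Access

	tag_open_count: INTEGER
			-- Number of '<' characters found by the last scan.

feature {NONE} -- Implementation

	occurrences_of (c: CHARACTER_8; buffer: SPECIAL [CHARACTER_8]): INTEGER
			-- Number of times `c' appears in `buffer'.
			-- Allocates nothing, so a paused collector cannot cause memory growth here.
		local
			i, n: INTEGER
		do
			n := buffer.count
			from i := 0 until i = n loop
				if buffer [i] = c then
					Result := Result + 1
				end
				i := i + 1
			end
		ensure
			bounded: 0 <= Result and Result <= buffer.count
		end

end
```

The sketch restores the collector only on the normal path. A production version would also need a rescue clause, or an equivalent mechanism, so that an exception raised inside the hot section does not leave collection switched off.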

Beyond xpact: other candidates worth considering

xpact is proposed as the highest-priority first project — the one that best satisfies all the criteria simultaneously. But the methodology applies to other targets, and the Eiffel community should be thinking about a pipeline of projects rather than a single bet.

The criteria — widely deployed, logic-error CVE history, no Rust rewrite, achievable scope, specification-driven domain, permissive licence — can be applied systematically.

libyaml is a C YAML parser with a CVE history that includes integer overflows and heap corruption. YAML is widely used in configuration files across the DevOps ecosystem. It is small — comparable to libexpat in scope. No serious Rust rewrite exists as a drop-in replacement.

A DNS stub resolver is another candidate. DNS resolution is specification-driven, security-critical, present on every networked device, and the existing C implementations have accumulated significant CVE histories. A clean Eiffel implementation with formal contracts on the packet parsing logic would be compelling.
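A small sketch of what "formal contracts on the packet parsing logic" could mean in practice: reading a big-endian 16-bit field from a received packet, with the bounds obligation stated as a precondition. DNS_FIELD_READER and its features are hypothetical; the 12-byte header size and the position of the question count at offset 4 follow the DNS message format in RFC 1035.

```eiffel
class
	DNS_FIELD_READER
		-- Hypothetical sketch, not an existing library class.

feature -- Access

	uint16_at (packet: SPECIAL [NATURAL_8]; offset: INTEGER): NATURAL_16
			-- Big-endian 16-bit value stored at `offset' and `offset' + 1.
		require
			offset_non_negative: offset >= 0
			two_bytes_available: offset <= packet.count - 2
		do
			Result := (packet [offset].to_natural_16 |<< 8) | packet [offset + 1].to_natural_16
		ensure
			high_byte: (Result |>> 8).to_natural_8 = packet [offset]
			low_byte: (Result & {NATURAL_16} 0xFF).to_natural_8 = packet [offset + 1]
		end

	question_count (packet: SPECIAL [NATURAL_8]): NATURAL_16
			-- QDCOUNT field of the fixed DNS header.
		require
			header_complete: packet.count >= Header_size
		do
			Result := uint16_at (packet, 4)
		end

feature -- Constants

	Header_size: INTEGER = 12
			-- Size in bytes of the fixed DNS message header.

end
```

The precondition on `question_count' is what lets the call to `uint16_at' satisfy its own precondition: a packet shorter than the header is rejected before any field is read.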

A JSON parser is a smaller scope project — potentially appropriate as a learning exercise or proof of concept before tackling a full XML parser. JSON's grammar is simpler than XML's, the CVE surface is smaller, but the ecosystem is enormous and a well-packaged Eiffel JSON library with Python bindings would get immediate traction.

The important principle is that each project should follow the same pattern: C-compatible interface where applicable, contracts derived from the specification, CVE regression tests as the correctness baseline, and a Python binding as the community-facing distribution channel.

What success would look like for Eiffel

It is worth being concrete about what a successful xpact project would change for Eiffel's visibility.

Phase 1: credible release. xpact passes the complete libexpat test suite. The benchmark results are published — honestly, including areas where performance is not yet competitive and why. The contract annotations are visible in the public repository. This is achievable by a small team and represents the minimum credible statement.

Phase 2: Python package. The PyPI package ships. Python developers can install xpact and use it as an alternative XML parsing backend. Security-conscious Python teams have a concrete choice to make. This is where community discovery begins.

Phase 3: security community engagement. The essay arguing that xpact's DbC methodology would have prevented specific historical CVEs is published. The argument is presented at a developer conference such as PyCon or linux.conf.au, or in a dedicated security track. The comparison with sudo-rs's logic-error CVEs makes the argument timely and concrete.

Phase 4: OSS-Fuzz integration. xpact joins libexpat in the OSS-Fuzz continuous fuzzing infrastructure. Any vulnerabilities found are fixed quickly and transparently. A clean OSS-Fuzz record over time is a durable credibility signal.

Phase 5: downstream adoption. A Linux distribution packages xpact as an alternative to libexpat. A Python project of note switches its XML processing to xpact and publishes the reasoning. These are the events that generate the kind of coverage that introduces Eiffel to developers who have never encountered it.

None of this requires Eiffel to become the next Python or Rust. It requires one well-executed, well-documented, well-argued project to demonstrate that Eiffel is a serious choice for real systems programming work. The visibility benefit accrues to the whole language from a single flagship project.

Conclusion: correctness is the next frontier

The software industry spent the last decade discovering that memory safety matters and that programming language choice can address it. The insight was correct and the progress is real. Rust has made systems traditionally written in C and C++ meaningfully safer, and the trend toward memory-safe systems languages will continue.

But the sudo-rs CVEs of 2025 illustrate a truth that the memory safety framing obscures: most bugs are not memory bugs. Most CVEs are logic errors — missing validations, incorrect state transitions, absent bounds checks, unenforced invariants. These bugs are invisible to the borrow checker. They are not solved by ownership types. They require a different tool.

That tool is Design by Contract. Bertrand Meyer understood this in 1988. Eiffel has embodied it ever since. The industry is only now approaching the point where the memory safety problem is sufficiently addressed that attention will turn to the correctness problem — the problem Eiffel was built to solve.

It is worth pausing to appreciate what Eiffel brings to this task. It would be hard to find another language that simultaneously offers:

Formal correctness guarantees through Design by Contract
Static null safety through void safety
Near-C performance through SPECIAL arrays
Native C interoperability through compile-to-C code generation, with support for x86-64, ARM64, Apple Silicon, and Raspberry Pi from the same source base
MSVC and GCC portability from the same source base
Managed concurrency through SCOOP (native build)
.NET deployment through CIL compilation (separate target, see foot note)
Decades of IDE investment in EiffelStudio
Selective GC control for performance-critical sections

No single one of these is unique to Eiffel. The combination in a single language, producing multiple deployment targets from one codebase, is.

xpact is a proposal for how Eiffel enters that conversation with a concrete, deployable, demonstrable answer to a real security problem affecting real systems. Not a language advocacy argument. Not a theoretical proof. A library. A drop-in replacement. A test suite with contracts. A Python package. A benchmark.

The billion users are already there, embedded in the systems that carry libexpat. The security concern is documented in a public CVE list that keeps growing. The architectural answer has existed in Eiffel for more than thirty years.

The essay you are reading is an argument. xpact is the proof.

Code listings for C_STRING_8 and benchmark

The following listings present the C_STRING_8 class and the benchmark class discussed in the xpact vs expat section. Both are available in the Eiffel-Loop library under the MIT licence.

C_STRING_8 class

class
	C_STRING_8

inherit
	MANAGED_POINTER
		rename
			item as area,
			share_from_pointer as make_shared
		export
			{NONE} all
			{C_STRING_8} area
			{ANY} count
		undefine
			is_equal
		end

	COMPARABLE
		undefine
			copy
		end

	STRING_HANDLER
		undefine
			copy, is_equal
		end

	DEBUG_OUTPUT
		undefine
			copy, is_equal
		end

create
	make_from_string, make_shared, make_empty

convert
	make_from_string ({STRING_8})

feature {NONE} -- Initialization

	make_empty
		do
			make (0)
		end

	make_from_string (s: STRING_8)
			-- Initialize buffer with the contents of `s'.
		do
			make_from_pointer (s.area.base_address, s.count)
		ensure
			count_set: count = s.count
		end

feature -- Comparison

	is_less alias "<" (other: like Current): BOOLEAN
			-- Is current string lexicographically less than `other'?
		do
			Result := c_strcmp_n (area, count, other.area, other.count) < 0
		end

feature -- Access

	item alias "[]" (i: INTEGER): CHARACTER_8
			-- Character at position `i'.
		require
			valid_index: valid_index (i)
		do
			Result := read_character_8 (area, i - 1)
		end

feature -- Measurement

	index_of (c: CHARACTER_8; start_index: INTEGER): INTEGER
			-- Position of first occurrence of `c' at or after `start_index';
			-- 0 if none.
		require
			start_large_enough: start_index >= 1
			start_small_enough: start_index <= count + 1
		local
			i, l_count: INTEGER; l_area: like area
		do
			l_area := area; l_count := count
			if start_index <= l_count then
				from
					i := start_index - 1
				until
					i = l_count or else read_character_8 (l_area, i) = c
				loop
					i := i + 1
				end
				if i < l_count then
						-- We add +1 due to the area starting at 0 and not at 1
					Result := i + 1
				end
			end
		ensure
			same_as_string_8: Result = to_string.index_of (c, start_index)
		end

	occurrences (c: CHARACTER_8): INTEGER
			-- Number of times `c' appears in `area'
		local
			i, l_count: INTEGER; l_area: POINTER
		do
			l_area := area; l_count := count
			from i := 0 until i = l_count loop
				if read_character_8 (l_area, i) = c then
					Result := Result + 1
				end
				i := i + 1
			end
		ensure
			same_as_string_8: Result = to_string.occurrences (c)
		end

feature -- Status report

	starts_with (other: C_STRING_8): BOOLEAN
			-- Does `area' start with the same bytes as `other.area'?
		do
			if other.count <= count then
				Result := memory_compare (area, other.area, other.count)
			end
		ensure
			same_as_string: Result = to_string.starts_with (other.to_string)
		end

	valid_index (i: INTEGER): BOOLEAN
			-- Is `i' within the bounds of the string?
		do
			Result := (i > 0) and (i <= count)
		ensure
			definition: Result = (1 <= i and i <= count)
		end

feature -- Conversion

	to_string, debug_output: STRING_8
		local
			i, l_count: INTEGER; l_area: POINTER
		do
			create Result.make (count)
			Result.set_count (count)
			if attached Result.area as area_out then
				l_area := area; l_count := count
				from i := 0 until i = l_count loop
					area_out [i] := read_character_8 (l_area, i)
					i := i + 1
				end
			end
		ensure then
			round_trip: is_equal (new_string (Result))
		end

feature -- Duplication

	substring (start_index, end_index: INTEGER): like Current
			-- substring with shared character buffer containing all characters at indices
			-- between `start_index' and `end_index'
		local
			l_count: INTEGER
		do
			if (1 <= start_index) and (start_index <= end_index) and (end_index <= count) then
				l_count := end_index - start_index + 1
				create Result.make_shared (area + (start_index - 1), l_count)
			else
				create Result.make_empty
			end
		ensure
			substring_count: Result.count = end_index - start_index + 1 or Result.count = 0
			first_code: Result.count > 0 implies Result [1] = item (start_index)
			recurse: Result.count > 0 implies Result.substring (2, Result.count) ~ substring (start_index + 1, end_index)
		end

	new_string (str: STRING_8): like Current
		do
			Result := str
		end

feature {NONE} -- Implementation

	frozen c_strcmp_n (p1: POINTER; n1: INTEGER; p2: POINTER; n2: INTEGER): INTEGER
			-- Lexicographic comparison of `n1' bytes at `p1' with `n2' bytes at `p2'.
			-- Returns negative if p1 < p2, zero if equal, positive if p1 > p2.
		external
			"C inline use <string.h>"
		alias
			"[
				int n = ($n1 < $n2) ? $n1 : $n2;
				int cmp = memcmp($p1, $p2, n);
				if (cmp != 0) return cmp;
				return ($n1 < $n2) ? -1 : ($n1 > $n2) ? 1 : 0;
			]"
		end

	frozen memory_compare (p1, p2: POINTER; n: INTEGER): BOOLEAN
			-- True if first `n' bytes at `p1' and `p2' are identical.
		external
			"C inline use <string.h>"
		alias
			"return (memcmp ($p1, $p2, $n) == 0);"
		end

	frozen read_character_8 (a_area: POINTER; i: INTEGER): CHARACTER_8
			-- Character at offset `i' in buffer `a_area'.
		require
			valid_index: valid_index (i + 1)
		external
			"C inline"
		alias
			"return ((EIF_CHARACTER_8 *)$a_area)[$i];"
		end

end
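A short usage sketch may help when reading the listing. It relies only on features that appear in the C_STRING_8 class above (`make_from_string', `index_of', `substring', `to_string') and assumes that class is available in the same system; the class name C_STRING_8_DEMO is hypothetical.

```eiffel
class
	C_STRING_8_DEMO
		-- Hypothetical usage sketch for the C_STRING_8 listing.

create
	make

feature {NONE} -- Initialization

	make
			-- Extract the first comma-separated field of a line without copying it.
		local
			line, field: C_STRING_8
			comma: INTEGER
		do
			create line.make_from_string ("alpha,beta")
			comma := line.index_of (',', 1)
			if comma > 0 then
					-- `field' shares the character buffer owned by `line'.
				field := line.substring (1, comma - 1)
				print (field.to_string + "%N")
			end
		end

end
```

Because `substring' creates its result with `make_shared', the field is a view onto the original buffer rather than a copy. The caller therefore has to keep `line' alive for as long as `field' is in use, which is the price of the zero-copy design.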

Benchmark class

class
	STRING_8_VS_C_STRING_8

inherit
	STRING_BENCHMARK_COMPARISON

create
	make

feature -- Access

	Description: STRING = "STRING_8 VS C_STRING_8 basic routines"

feature -- Basic operations

	execute
		do
			if attached new_string_list as string_list
				and then attached new_c_string_list as c_string_list
			then
				compare ("starts_with", <<
					["SPECIAL starts_with", agent special_starts_with (string_list, The)],
					["C buffer starts_with", agent c_buffer_starts_with (c_string_list, C_the)]
				>>)
				compare ("occurrences", <<
					["SPECIAL occurrences", agent special_occurrences (string_list, 'a')],
					["C buffer occurrences", agent c_buffer_occurrences (c_string_list, 'a')]
				>>)
				compare ("CSV line parsing", <<
					["SPECIAL parse_csv", agent special_parse_csv (string_list)],
					["C buffer parse_csv", agent c_buffer_parse_csv (c_string_list)]
				>>)
			end
		end

feature {NONE} -- Benchmark routines

	c_buffer_occurrences (title_list: LIST [C_STRING_8]; c: CHARACTER_8)
		local
			count: INTEGER
		do
			across 0 |..| 100 as n loop
				across title_list as list loop
					count := count + list.item.occurrences (c)
				end
			end
		end

	c_buffer_parse_csv (title_list: LIST [C_STRING_8])
		local
			count: INTEGER
		do
			across 0 |..| 100 as n loop
				across title_list as list loop
					count := count + new_c_string_csv_list (list.item).count
				end
			end
		end

	c_buffer_starts_with (title_list: LIST [C_STRING_8]; str: C_STRING_8)
		local
			count: INTEGER
		do
			across 0 |..| 100 as n loop
				across title_list as list loop
					if list.item.starts_with (str) then
						count := count + 1
					end
				end
			end
		end

	special_occurrences (title_list: LIST [STRING_8]; c: CHARACTER_8)
		local
			count: INTEGER
		do
			across 0 |..| 100 as n loop
				across title_list as list loop
					count := count + list.item.occurrences (c)
				end
			end
		end

	special_parse_csv (title_list: LIST [STRING_8])
		local
			count: INTEGER
		do
			across 0 |..| 100 as n loop
				across title_list as list loop
					count := count + new_csv_list (list.item).count
				end
			end
		end

	special_starts_with (title_list: LIST [STRING_8]; str: STRING_8)
		local
			count: INTEGER
		do
			across 0 |..| 100 as n loop
				across title_list as list loop
					if list.item.starts_with (str) then
						count := count + 1
					end
				end
			end
		end

feature {NONE} -- Contract Support

	new_split_list (str: STRING_8; c: CHARACTER_8): LIST [STRING_8]
			-- allow object comparison
		do
			Result := str.split (c)
			Result.compare_objects
		end

feature {NONE} -- List Factory

	new_c_string_csv_list (str: C_STRING_8): ARRAYED_LIST [C_STRING_8]
		local
			prev_i, next_i: INTEGER
		do
			create Result.make (str.occurrences (',') + 1)
			Result.compare_objects
			from until Result.full loop
				next_i := str.index_of (',', prev_i + 1)
				if next_i = 0 then
					next_i := str.count + 1
				end
				Result.extend (str.substring (prev_i + 1, next_i - 1))
				prev_i := next_i
			end
		ensure
			same_as_split: across new_split_list (str.to_string, ',') as list all
				list.item ~ Result [list.cursor_index].to_string
			end
		end

	new_c_string_list: ARRAYED_LIST [C_STRING_8]
		do
			if attached Hexagram.English_titles as titles then
				create Result.make (titles.count)
				across titles as list loop
					Result.extend (list.item)
				end
			end
		ensure
			same_as_new_string_list: across new_string_list as list all
				list.item.starts_with (The) implies Result [list.cursor_index].starts_with (C_the)
			end
		end

	new_csv_list (str: STRING_8): ARRAYED_LIST [STRING_8]
		local
			prev_i, next_i: INTEGER
		do
			create Result.make (str.occurrences (',') + 1)
			Result.compare_objects
			from until Result.full loop
				next_i := str.index_of (',', prev_i + 1)
				if next_i = 0 then
					next_i := str.count + 1
				end
				Result.extend (str.substring (prev_i + 1, next_i - 1))
				prev_i := next_i
			end
		ensure
			same_as_split: Result ~ new_split_list (str, ',')
		end

	new_string_list: ARRAYED_LIST [STRING_8]
		do
			create Result.make_from_array (Hexagram.English_titles.to_array)
		end

feature {NONE} -- Constants

	C_the: C_STRING_8
		once
			Result := The.string
		ensure
			same_string: Result.to_string ~ The
		end

	The: STRING_8
		once
			Result := "The"
		end

end

Foot notes

.NET deployment: SCOOP concurrency and .NET deployment are alternative targets from the same Eiffel source base — SCOOP is not currently supported in the .NET build. The native build delivers SCOOP-based pipeline parallelism; the .NET build delivers idiomatic CLR delegate-based callbacks. Each target is optimised for its deployment context.

However, the Eiffel-Loop library provides EL_PROCEDURE_DISTRIBUTER and EL_FUNCTION_DISTRIBUTER as a practical alternative — agent-based thread pool distribution built on ISE's classic thread.ecf. (See Concurrency-demo) These abstractions are not experimental — they are production-tested in the Eiffel-View repository publisher, a tool that performs a partial parse of the entire Eiffel-Loop codebase using distributed threads every time source code is added or changed, generating the static HTML for eiffel-loop.com. The native xpact build uses SCOOP for pipeline parallelism; the .NET build can use Eiffel-Loop's battle-tested threading abstractions or the CLR's own task parallel library.

References

Meyer, Bertrand. Object-Oriented Software Construction, 2nd ed. Prentice Hall, 1997.
US National Cybersecurity Strategy, Office of the National Cyber Director, 2023.
CISA, NSA et al. The Case for Memory Safe Roadmaps, 2023.
sudo-rs CVE disclosures, July 2025.
Ubuntu 25.10 release notes, October 2025.