Every AI product now claims it learns. But look closely and what most of them mean is that somewhere a text file is getting longer. The agent hits a problem, works something out, and appends the lesson to an instructions file (CLAUDE.md, AGENTS.md, a “memory” store) that gets stuffed into the prompt on every subsequent call. That is the industry’s default memory architecture: a diary the model re-reads, at your expense, every single time it does anything.
The problem is that this architecture degrades as it grows. In 2025, researchers at Chroma measured what they named “context rot” across 18 frontier models, including GPT-4.1 and Claude 4: accuracy falls continuously as prompt length grows, with drops of 20 to 50 percent on long inputs. An earlier paper, Lost in the Middle (2023), showed that models reliably miss information buried in the middle of a long context. Put those together and the standard “learning” architecture has a design flaw you can state in one sentence: the more the product knows, the worse it behaves.
When a shiny new AI hammer comes along, every problem looks like a nail. I designed WPLoadTester 7 to use AI for what it does best, leaving everything else to boring, deterministic, testable algorithms. When our AI Assistant solves a problem, it doesn’t write the lesson into a prompt. It writes the lesson into a rule, hands the rule to a deterministic expert system, and gets out of the way. The expert system applies everything it has accumulated to every future recording before the AI is invoked at all. The result is a product where the relationship runs backward from the rest of the industry: the more you use the AI, the less it runs.
Here is the mental model behind that decision. When I evaluate any AI-and-data problem, I place it on two axes. One axis: is the data structured (fields, tables, name-value pairs) or unstructured (raw text, HTML, a wall of HTTP traffic)? The other: is the processing deterministic (same input, same output, every time) or non-deterministic (statistical, sampled, never quite the same twice)?
The bottom-right quadrant, unstructured data processed non-deterministically, is the interesting one. Before large language models, nothing handled unstructured data non-deterministically in a useful way; now it is the hottest quadrant in software, and every vendor is racing to move their whole product into it. That is the mistake. The bottom-right quadrant is where hard problems get solved, because an LLM can read a wall of raw HTTP traffic and reason about what it means. It is a terrible place for solved problems to live, because everything in it costs tokens per use and carries a small probability of coming back different. An LLM’s ability to give different answers to the same question looks like creativity when generating images, but it is a liability for correlations, math, or processing large datasets.
The rubric gives you the rule: create knowledge in the expensive quadrant, store it in the cheap one. A prompt-file memory violates this rule. It creates knowledge in the expensive quadrant and then leaves it there, paying rent on every call.
WPLoadTester’s cheap quadrant has a name: Application State Management, or ASM. It is a rules-based expert system we have been refining since 1999, and it exists because of a problem every load tester knows. You record a user session as HTTP traffic, you replay it with a thousand virtual users, and it breaks immediately, because the recording is full of values that were only valid for the original session: JSESSIONID cookies, CSRF tokens, ASP.NET VIEWSTATE blobs, OAuth Bearer tokens. Finding every dynamic value, tracing where it originates, and wiring the extraction used to take days of an engineer staring at headers.
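To make the correlation problem concrete, here is a minimal Python sketch. It is not WPLoadTester code: the traffic strings, the csrf_token field name, and the correlate function are all hypothetical, chosen only to show why a recorded value breaks on replay and what the fix looks like.

```python
import re

# Hypothetical recorded traffic: the server issued a token, the browser sent it back.
recorded_response = '<input type="hidden" name="csrf_token" value="a1b2c3">'
recorded_request = "POST /cart csrf_token=a1b2c3&item=42"

# Hypothetical extractor: capture whatever token the server issues this time.
CSRF_PATTERN = re.compile(r'name="csrf_token" value="([^"]+)"')

def correlate(live_response: str, request: str, recorded_value: str) -> str:
    """Swap the stale recorded value for the one found in the live response."""
    match = CSRF_PATTERN.search(live_response)
    if match is None:
        raise ValueError("token not found in live response")
    return request.replace(recorded_value, match.group(1))

# On replay the server hands each virtual user a different token.
live_response = '<input type="hidden" name="csrf_token" value="z9y8x7">'
replayed = correlate(live_response, recorded_request, "a1b2c3")
print(replayed)  # POST /cart csrf_token=z9y8x7&item=42
```

One field is easy. The days of work come from finding every such value in a real recording and tracing each one back to the response that issued it.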
ASM automates the well-known cases. It ships with over 300 detection rules encoding 27 years of correlation work: framework rules for ASP.NET, Spring, Rails, and PHP; protocol rules for cookies and cache headers; 14 rules for OAuth token flows alone. It runs automatically after every recording and handles roughly 95 percent of the common patterns before a human, or an AI, looks at anything.
The AI Assistant covers the long tail: the proprietary framework, the unusual token scheme, the field that only reveals itself as a runtime failure during replay. This is exactly the bottom-right-quadrant work the rubric prescribes, reasoning over raw, unstructured traffic to figure out what a never-before-seen application is doing. On a recent case, the assistant spent four attempts refining a regex to extract a JWT from an escaped Next.js streaming payload. It ran the same loop a senior engineer would, but finished in seconds.
Here is the part where our AI strategy really pays off. When the assistant cracks a problem like that, it doesn’t just patch the test case. It writes the solution as a new detection rule: a plain properties file with an extraction pattern and a context scope, saved straight into ASM’s rule store. The next recording you make gets that rule applied during ASM’s up-front scan. Zero tokens. Identical, verified behavior. No context to rot.
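The shape of that handoff can be sketched in a few lines of Python. This is an illustration under assumptions, not ASM's implementation: the Rule fields, the scan and process functions, and the stubbed solver are hypothetical stand-ins for the rule store, the up-front scan, and the AI Assistant.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    # Hypothetical fields; ASM's real rule format is not shown here.
    name: str     # the dynamic value this rule finds
    pattern: str  # extraction regex with one capture group
    scope: str    # which part of the response to search

def scan(rules, response):
    """Deterministic pass: apply every stored rule, spending no tokens."""
    found = {}
    for rule in rules:
        match = re.search(rule.pattern, response.get(rule.scope, ""))
        if match:
            found[rule.name] = match.group(1)
    return found

def process(rules, response, needed, ai_solver):
    """Run the rules first; call the expensive solver only for unsolved values."""
    found = scan(rules, response)
    if needed not in found:
        rules.append(ai_solver(response, needed))  # the lesson becomes a rule
        found = scan(rules, response)
    return found.get(needed)

ai_calls = 0

def stub_ai_solver(response, needed):
    """Stand-in for the LLM: counts invocations and returns a fixed rule."""
    global ai_calls
    ai_calls += 1
    return Rule(needed, r'"jwt":"([^"]+)"', "body")

rules = []
first = process(rules, {"body": '{"jwt":"aaa.bbb.ccc"}'}, "jwt", stub_ai_solver)
second = process(rules, {"body": '{"jwt":"ddd.eee.fff"}'}, "jwt", stub_ai_solver)
print(first, second, ai_calls)  # aaa.bbb.ccc ddd.eee.fff 1
```

The second recording gets its value from the stored rule, so the solver count stays at one.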
A rule does not get into the store just because the AI wrote it. The assistant tests the extraction in isolation against the recorded traffic, applies it to the test case, then replays the whole scenario end to end to confirm the fix holds in practice. It usually takes a few tries, a small tweak and another replay, before everything passes. Only then does the rule earn its permanent spot.
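As a rough sketch of that gate, assuming a candidate rule is just a regex and using a stubbed replay function in place of a real end-to-end run, the loop looks something like this. The function names, the escaped payload, and the three candidate patterns are all hypothetical.

```python
import re

def extracts_in_isolation(pattern: str, recorded_body: str) -> bool:
    """Check 1: the pattern captures a non-empty value from recorded traffic."""
    compiled = re.compile(pattern)
    if compiled.groups < 1:
        return False
    match = compiled.search(recorded_body)
    return match is not None and bool(match.group(1))

def first_passing_rule(candidates, recorded_body, replay):
    """Return (attempt number, pattern) for the first candidate that passes
    both the isolated extraction and the replay, or None if none does."""
    for attempt, pattern in enumerate(candidates, start=1):
        if not extracts_in_isolation(pattern, recorded_body):
            continue  # fails on the recording, so no point replaying
        if replay(pattern):
            return attempt, pattern
    return None

# Hypothetical escaped payload: the quotes arrive backslash-escaped.
body = r'{\"token\":\"aaa.bbb.ccc\"}'

candidates = [
    r'"token":"([^"]+)"',       # ignores the escaping, matches nothing
    r'token\\":\\"(.+)',        # matches, but captures trailing junk
    r'token\\":\\"([^\\"]+)',   # stops at the closing escaped quote
]

def stub_replay(pattern: str) -> bool:
    """Stand-in for an end-to-end replay: the server accepts only the exact token."""
    return re.search(pattern, body).group(1) == "aaa.bbb.ccc"

result = first_passing_rule(candidates, body, stub_replay)
print(result[0])  # 3
```

The second candidate is the instructive one: it passes in isolation and still fails the replay, which is why the isolated check alone is not enough to admit a rule.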
To be clear, this is not machine learning. No model weights change. The AI does not get smarter; the system around it accumulates knowledge, in a form you can open in a text editor and read.
The research world has been here. Voyager (2023) had an LLM agent build a permanent library of Minecraft skills. ExpeL (2023) extracted reusable insights from an agent’s successes and failures. DreamCoder (2021) grew libraries of program abstractions before LLMs made it fashionable. We did not invent the idea of an AI that writes down what it learned. The difference is where the learning goes: every one of those systems retrieves its accumulated knowledge back into the prompt, where it costs tokens and competes for the model’s attention. Ours compiles it out of the prompt entirely, into a deterministic engine that runs before the model. The research field calls this general direction neuro-symbolic AI. I call it not paying for tokens to solve the same problem over and over again.
An OAuth login flow that takes two hours of manual correlation typically configures in about four minutes with the assistant. That is the first encounter. The second encounter is handled automatically by the expert system’s up-front scan. Your AI spend concentrates on problems nobody has seen before, which is the only work an LLM should be doing anyway.
Meanwhile the prompt-file architecture pays in the other direction: every lesson learned makes every future call slightly larger, slightly slower, and, per the context-rot data, slightly less reliable.
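The arithmetic is easy to sketch. The numbers below are made up for illustration, not measurements: a 1,000-token base prompt, 200 tokens per stored lesson, and one genuinely new problem in every ten recordings. The model also ignores output tokens and any accuracy effects, and it generously assumes the compiled-rule design calls the model only for new problems.

```python
def prompt_file_tokens(calls: int, base: int, lesson_tokens: int, new_every: int) -> int:
    """Every call pays for the base prompt plus every lesson stored so far."""
    total = 0
    lessons = 0
    for call in range(1, calls + 1):
        total += base + lessons * lesson_tokens
        if call % new_every == 0:
            lessons += 1  # a new lesson is appended to the file
    return total

def compiled_rule_tokens(calls: int, base: int, new_every: int) -> int:
    """The model runs only on new problems; stored rules cost no tokens."""
    return (calls // new_every) * base

# Hypothetical workload: 100 recordings.
diary = prompt_file_tokens(calls=100, base=1000, lesson_tokens=200, new_every=10)
rules = compiled_rule_tokens(calls=100, base=1000, new_every=10)
print(diary, rules)  # 190000 10000
```

The exact figures do not matter. What matters is the shape: the first total grows with both usage and accumulated knowledge, while the second grows only with the number of new problems.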
The AI Assistant is the most impressive thing in WPLoadTester 7, and the entire architecture is designed to need it less every week you use it. Both the assistant and that design are useful, but I think the design is the better feature. The trend I see in AI adoption is that the more people use it, the more they come to rely on it. I would rather give you an AI implementation that you need less and less.

Founder of Web Performance, Inc
B.S. Electrical & Computer Engineering
The Ohio State University