Base Models, Harnesses, and the Human Objective in AI Knowledge Creation

Author / Director: Hulki Okan Tabak

Final publish-ready revision drafted with ChatGPT 5.5 Pro

Source corpus: five first-stage model responses + two synthesis essays — 21 May 2026

Final method: comparative synthesis + wisepersons review + extended literary-masters polish + editorial-pass cleanup

Abstract / Summary

This essay answers a question that began in a room where Hulki Okan Tabak, Turan, and Oya were present and contributing: in AI-assisted knowledge creation, what matters more — the base large language model, or the prompt, context, and harness around it? Turan’s position was that the base LLM is the basic source: the trained model contains the compressed world, the priors, the scale, the data, and the latent competence from which future knowledge will be drawn. Hulki’s counter-position was that once frontier models have broadly similar world models, the decisive difference shifts toward prompt, context, tools, memory, evaluators, workflow, and human intent — the system that decides which possible answer becomes actual.

The final answer is not a fixed percentage. The base model sets the horizon of possible thought and the quality of candidate ideas. Post-training supplies temperament and policy. The prompt supplies orientation. Context supplies evidence. Tools supply contact with the world. The harness supplies procedure. The evaluator supplies discipline. The human or institution supplies purpose, judgment, and warrant. Knowledge is created when these layers form a loop of conjecture, criticism, correction, and stabilization.

For readers who will not read the whole essay: Turan is right at the ceiling and at generational jumps; no harness can reliably extract what the model cannot represent. Hulki is right at the margin among comparable frontier systems; the same model can become shallow, brilliant, grounded, or wrong depending on context and harness. Oya’s presence in the originating discussion matters because the question was never merely technical; it was born in a human setting of shared inquiry. The strongest thesis is: the LLM proposes, the harness searches, the evaluator disciplines, and the human objective decides what is worth knowing.

This publish-ready version also documents its own final revision method. After five first-pass model responses and two synthesis essays, the article was reviewed through a wisepersons panel, polished through an extended literary-masters panel, and cleaned through a 15-criterion editorial pass. Those protocols are used here as AI-assisted critical lenses, not as claims that the historical figures named in those panels participated. Their role is methodological: to make provenance, standpoint, structure, voice, and final cleanliness visible.

1. The question from the room

The question did not begin as a benchmark. It began as conversation. In a room where Hulki Okan Tabak, Turan, and Oya were present and contributing, a disagreement emerged about the origin of knowledge in large language model systems. Later, through deeper discussion with Turan, the disagreement took a more precise form. Is the base LLM — the trained model itself — the basic and more important source of future knowledge creation? Or, once several frontier models have reached similar scale and have trained on overlapping data, does the real differentiator become the prompt, the context, and the harness around the model?

Four terms need to be made plain. The base model is the trained network: the weights that have compressed patterns from language, code, images, scientific papers, websites, books, and other training sources into a vast latent map. The prompt is the instruction, question, role, or frame given to the model at the moment of use. The context is the evidence supplied at inference time: documents, examples, retrieved passages, a codebase, a database, the conversation so far. The harness is the machinery that wraps the model: retrieval pipelines, tool access, memory, code execution, search, multi-agent debate, reflection, formal verifiers, evaluators, and procedures for iteration.

Turan’s position can be stated generously. The base model is not a decorative component; it is the substrate of possible cognition. If a model has not learned the deep structure of a domain, no prompt can make it genuinely master that domain. If a model has not absorbed Turkish literary nuance, a translator harness will be scaffolding an empty room. If a model lacks mathematical abstraction, no amount of role-play will make it prove theorems reliably. The model defines the ceiling.

The counter-position can also be stated generously. In practical knowledge work, especially when comparing frontier models in the same general capability band, the model often contains more latent competence than naive use can elicit. The decisive question becomes not only what the model knows, but what the system makes it do: which evidence it sees, which persona it adopts, which tools it uses, which failures it observes, which hypotheses it tests, and which standards reject its first answer. The harness defines the path.

The initial question asked for a percentage of influence. That is exactly where the trap lies. A model-plus-harness system is not additive, so influence cannot be divided into shares that sum to one hundred. It is closer to a product, in which one failed factor collapses the whole. If the model is too weak, the system fails. If the context is wrong, it hallucinates. If the evaluator is absent, it cannot distinguish plausible prose from warranted knowledge. If the human objective is confused, the system may optimize the wrong thing beautifully. Knowledge does not live in one component; it appears when components are coupled in a loop.
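The contrast between an additive and a product-like system can be made concrete with a toy calculation. This is a minimal sketch, assuming each layer can be given a hypothetical quality score between 0 and 1; the layer names and numbers are illustrative and not measurements from the corpus.

```python
from math import prod

# Hypothetical layer scores in [0, 1]. The numbers are illustrative, not measured.
layers = {
    "model": 0.9,
    "context": 0.8,
    "evaluator": 0.9,
    "objective": 0.0,  # a confused human objective
}

def additive_score(scores):
    """Average the layers: a failed layer only lowers the total."""
    return sum(scores.values()) / len(scores)

def coupled_score(scores):
    """Multiply the layers: a failed layer collapses the whole system."""
    return prod(scores.values())

additive = additive_score(layers)  # stays well above zero
coupled = coupled_score(layers)    # exactly 0.0
```

Under the additive reading, three strong layers appear to compensate for a failed objective. Under the coupled reading they do not, which is the sense in which a percentage split misdescribes the system.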

2. How this essay was made

The method of this essay is part of its argument. Hulki Okan Tabak posed the same question to five public-facing advanced language model systems as they were available to the project on 21 May 2026: ChatGPT, Claude, DeepSeek, Gemini, and Grok. Those five responses were then given to two models, ChatGPT and Claude, for second-stage synthesis. This final essay uses all seven documents as a comparative corpus. It is not a neutral experiment, but it is a revealing intellectual procedure.

The procedure resembles the very harness it is trying to understand. A single model response can feel complete because there is no counter-voice inside the document. Five responses make omissions visible. One model emphasizes the ceiling of the base model; another emphasizes prompt sensitivity; another emphasizes retrieval and guardrails; another emphasizes the Popperian loop of conjecture and criticism; another turns toward artificial epistemology. The second-stage syntheses then compare, compress, and challenge the first layer. The final operation is not simply more model. It is model plus corpus, model plus comparative pressure, model plus authorial direction, model plus constraints of length and publishability.

For this publish-ready edition, the article then received three further directed passes. First, the wisepersons panel was used as a designed-disagreement review of the article’s architecture. Its effect was to make the reader path clearer, add the missing abstract, make the provenance more explicit, protect the role of Oya in the originating setting, strengthen the distinction between method and evidence, and require that the final version disclose its own revision protocols. Second, because that review altered the document’s structure, the extended literary-masters panel was used as a polish pass: to cut inflated transitions, keep the reader inside the argument, preserve disagreement rather than smoothing it into consensus, prefer concrete language to catalogues, and make the ending carry the essay rather than explain it twice. Third, the editorial-pass skill was used as the final cleanup gate: typography, consistency, cross-reference sanity, duplicate detection, verbal tics, ending strength, front/back matter, and publish-ready formatting.

These named panels should be understood correctly. They are not claims that the historical people whose names organize the protocols participated in this essay. They are AI-assisted critical frames: a way of forcing multiple kinds of editorial attention into the drafting process. Ranganathan-like attention asks whether the reader can find the path; Avram-like attention asks whether the metadata and provenance survive; Haraway-like attention asks whose standpoint is being hidden; Kahle-like attention asks what will remain legible if the platform changes; Morrison-like attention asks whether the essay has a conscience and not only a structure. The literary panel then asks a different question: not merely whether the argument is correct, but whether it breathes as prose.

The method must also be scrutinized. These seven source documents are not independent witnesses. They are outputs of related systems trained on overlapping public knowledge and shaped by unknown system prompts, alignment policies, tool availability, context windows, default temperatures, and corporate styles. Their agreement should not be mistaken for proof. A chorus of aligned assistants may converge because they share insight; they may also converge because helpfulness training rewards the diplomatic synthesis: both sides are right, the truth is a symbiosis, here is a balanced metaphor.

This is why the method is interpretive rather than experimental. The corpus is useful for surfacing the space of arguments, metaphors, cautions, and regime distinctions. It cannot settle the percentage. That would require controlled ablations: the same tasks, with base models and harness components varied one factor at a time while everything else is held fixed, and results measured against external ground truth across domains with and without cheap verifiers. The present essay does something different. It reads the seven documents as a miniature literature, preserves their strongest contributions, criticizes their blind spots, and tries to produce a final answer that is more than an average of fluent opinions.

The result should be read accordingly. It is not a benchmark report. It is a synthetic essay generated through a directed human-AI process, under the authorship and direction of Hulki Okan Tabak, using multiple model outputs as evidence, provocation, and constraint. Its own final revision passes are part of the record because, in this subject, method is never outside the argument.

3. What the seven documents agree on

The first striking result is convergence. The five initial model responses, and the two second-stage syntheses, all reject a simple model-versus-harness split. Gemini gives the most accessible metaphor: the base model is the engine; the prompt and harness are the steering wheel. DeepSeek frames the relation as a dyad: telescope and astronomer, instrument and procedure, potential and actuality. Grok describes artificial epistemology as enacted at the interface between model, prompt, context, tools, and human. ChatGPT introduces the regime map: model, context, harness, evaluator, and social validation dominate in different task regimes. Claude introduces the sharpest correction: knowledge creation is not a component but a loop.

All seven grant Turan’s strongest point. The base model sets the ceiling of latent competence. It determines which abstractions are available, which analogies come naturally, which languages and domains are represented, which styles can be simulated, which chains of reasoning are reachable, and which hypotheses the system is likely to propose. Scaling-law research supports this broad point: model performance has historically improved in predictable ways with compute, data, and model size, while compute-optimal training can change capability dramatically even at lower parameter counts [R1, R2]. Post-training also matters: instruction tuning and reinforcement learning from human feedback can change whether a system follows user intent, refuses certain outputs, or presents itself as cautious, terse, expansive, or speculative [R3].

All seven also grant the counter-position’s strongest point. Once the base model is sufficiently capable, the surrounding system can radically change realized performance. Chain-of-thought prompting, self-consistency, Tree of Thoughts, ReAct, retrieval-augmented generation, and multi-agent debate all show that the same model can behave differently under different inference procedures [R4-R9]. RAG is especially important because it changes the temporal and epistemic status of the system: a frozen model can be connected to current or private evidence, so the effective knowledge source is no longer only parametric memory [R7]. The model may contain possible thoughts; the harness determines which thoughts are asked for, grounded, checked, and repeated.

The documents also agree that answer generation is not the same as knowledge creation. A one-shot answer can be plausible, elegant, and wrong. Knowledge requires contact with constraint: a proof checker, a test suite, an experiment, a database, a source trail, an adversarial critic, a community of readers, or at least a deliberately designed process that tries to kill the system’s favorite illusion. The shared slogan, in one form or another, is: models propose; harnesses dispose. But this slogan remains incomplete until we ask who decides which proposals are worth making.

4. What the seven documents reveal by disagreeing

The differences among the responses are almost as revealing as the agreements. They differ first in voice. ChatGPT is systematic and citation-heavy. Claude is critical and self-interrogating. Gemini is compressed and metaphor-forward. DeepSeek is dialectical and historical. Grok is philosophical and expansive. These differences are not trivial. They are exactly the kind of variation one would expect if frontier models share much of the same public ontology but diverge in post-training, system prompts, institutional style, and safety-policy temperament. The corpus itself demonstrates the thesis: the shared floor plan looks like convergent world modeling; the prose texture looks like steering.

They differ more deeply in what they think is missing. The first Claude essay insists that knowledge creation bundles together at least three activities: eliciting latent capability, recombining known concepts into locally novel arrangements, and generating genuinely new warranted claims. This matters because the model-versus-harness balance differs across the three. The second Claude synthesis adds two cautions: first, that large harness gains are easiest to measure in domains with cheap verifiers, such as coding and mathematics; second, that human purpose is underweighted in the other essays. ChatGPT’s second synthesis gives the strongest architectural map: the epistemic stack includes pre-training, post-training, prompt, context, tools, harness, evaluator, and world. Grok is strongest on the philosophical claim that knowledge is enacted rather than merely stored. DeepSeek is strongest on the co-authorship frame and on the idea that value migrates from engine to ecosystem as models commoditize. Gemini is strongest on clarity: without an engine the vehicle does not move; without steering it goes nowhere useful.

One methodological warning from the second Claude synthesis deserves to survive: consensus among aligned assistants is seductive and weak. Citation density is not the same as truth. A beautiful metaphor is not an argument. Balanced synthesis can be a trained reflex. The strongest model response is not necessarily the one that sounds most reasonable; it may be the one that exposes the hidden premise the others shared.

That hidden premise is that the key contest is between model and harness. The better analysis has more layers. A practical LLM system is not just base model plus user prompt. It includes pre-training, post-training, hidden instructions, policy layers, context retrieval, tool orchestration, memory, sampling settings, evaluators, human judgment, and social uptake. The word “LLM” often names a packaged system while pretending to name only weights. Some of what Turan calls the model is already a frozen harness. Some of what the counter-position calls harness could later be trained into the model. The boundary is movable.

5. The case for the base model

The case for the base model begins with a simple fact: a harness cannot evaluate candidates that never appear. In open-ended search spaces, the proposal distribution matters enormously. A weak model may propose shallow variants, miss abstractions, misunderstand evidence, or use tools incompetently. A stronger model does not merely answer better; it samples better regions of possibility.

This is why base-model training is not just storage. The model is a compressed inheritance of human language, science, mathematics, culture, style, code, and procedure. It shapes what the system notices. It determines whether a prompt for a philosophical essay yields cliché, synthesis, or genuine conceptual pressure. It determines whether a code agent can read an error trace and infer the deeper bug. It determines whether a research assistant can distinguish a plausible abstract from a coherent hypothesis. It determines what analogies are available before the harness starts searching.

The strongest evidence for this side is that scale, data, training recipe, and post-training have repeatedly shifted the frontier. Kaplan et al. found empirical scaling relationships between loss, model size, dataset size, and compute [R1]. Hoffmann et al.’s Chinchilla result showed that many large models were undertrained and that a smaller model trained on more data could outperform much larger undertrained models [R2]. InstructGPT showed that alignment and human feedback can make a smaller model preferred to a much larger unaligned model on the authors’ prompt distribution [R3]. These results do not say that the model is everything. They say that the substrate has real causal force.

The base model is especially decisive at the frontier: advanced mathematics, theoretical science, complex long-horizon planning, literary style at a high level, multilingual nuance, and domains where the evaluator is weak or delayed. In these regimes, the model must supply taste, abstraction, and conjecture before any harness can test. If the substrate lacks the relevant representation, the harness may become elaborate theater.

This is the deepest version of Turan’s point. The harness realizes capability; it does not abolish the need for capability. The model sets what can be thought, or at least what can be reached with reasonable effort. If the marble is not in the quarry, no sculptor can carve it.

6. The case for context and harness

The case for the harness begins with an equally simple fact: most model capability is not realized by naive prompting. A capable model can produce shallow output when asked shallowly, can hallucinate when deprived of evidence, can fail to use long context robustly, can stop too early, and can be persuasive without being checked. The harness turns a model from a single-pass generator into a procedure.

Prompting changes the mode of computation. Chain-of-thought prompting showed that eliciting intermediate reasoning steps can improve performance on arithmetic, commonsense, and symbolic tasks [R4]. Self-consistency showed that sampling multiple reasoning paths and selecting the most consistent answer can improve reasoning further [R5]. Tree of Thoughts generalized this into search over intermediate states, showing large gains in tasks requiring exploration and backtracking [R6]. ReAct joined reasoning to action, allowing the model to gather external information and update its plan [R8]. Multi-agent debate showed that multiple model instances arguing across rounds can improve factuality and reasoning in some settings [R9].
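Self-consistency is the simplest of these procedures to sketch: sample several answers and keep the one that recurs most often. The example below is a minimal illustration under stated assumptions. The function fake_sampler is a hypothetical stand-in for a model call and simply replays canned answers; a real system would sample full reasoning paths from a model and extract the final answer from each.

```python
from collections import Counter
from itertools import cycle

def self_consistent_answer(sample_answer, question, n_samples=5):
    """Sample several answers and return the most frequent one with its vote share."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_samples

# Hypothetical sampler: replays canned final answers instead of calling a model.
_canned = cycle(["18", "18", "17", "18", "20"])

def fake_sampler(question):
    return next(_canned)

winner, share = self_consistent_answer(fake_sampler, "a word problem", n_samples=5)
# winner is "18", chosen by three of the five sampled answers
```

The procedure spends more inference on the same model and the same question. Nothing new enters the system; the gain comes from aggregation alone.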

Context changes the evidence. Retrieval-augmented generation gives the model access to non-parametric memory, current documents, private corpora, and provenance [R7]. This matters because a base model is temporally frozen and probabilistic. A model trained before a fact cannot know it unless the fact is supplied. A model without access to a company’s internal documents cannot reason accurately about those documents unless context brings them into view. In such cases, the context is not decoration; it becomes part of the effective world model for the task.
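How context becomes part of the effective world model can be shown with a toy retrieval step. This is a minimal sketch, assuming a word-overlap ranking in place of a real retriever; the memo identifiers, memo texts, and prompt wording are all hypothetical.

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (a toy stand-in for a retriever)."""
    query_words = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        overlap = len(query_words & set(text.lower().split()))
        scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for overlap, doc_id in scored[:k] if overlap > 0]

def build_prompt(query, documents, k=2):
    """Place the retrieved passages, with their ids as provenance, ahead of the question."""
    ids = retrieve(query, documents, k)
    evidence = "\n".join(f"[{doc_id}] {documents[doc_id]}" for doc_id in ids)
    return f"Answer using only the sources below.\n{evidence}\nQuestion: {query}"

# Hypothetical private memos the base model has never seen.
documents = {
    "memo-1": "the launch date moved to october",
    "memo-2": "the cafeteria menu changes weekly",
    "memo-3": "budget review for the launch is pending",
}
prompt = build_prompt("when is the launch date", documents)
```

Whatever the model answers next is bounded by what this step selected. A poor ranking function would hand a strong model the wrong memo.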

Tools change the epistemic loop. Code execution lets the model test programs. Search lets it import new evidence. Calculators reduce arithmetic error. Simulators expose hypotheses to constraints. Proof assistants formalize correctness. Data-analysis tools transform raw data into observations. Without these tools, the model is often limited to fluent conjecture. With them, it can observe failure, update, retry, and sometimes discover.

The best examples of AI-assisted discovery are not naked models. FunSearch pairs a pretrained LLM that proposes code with an automated evaluator; through iteration, candidate solutions can evolve into new mathematical or algorithmic findings [R12, R13]. AlphaGeometry combines neural guidance with symbolic deduction and synthetic data to solve olympiad geometry problems [R14]. AlphaProof and AlphaGeometry 2 reached silver-medal standard on IMO 2024 problems, but they did so through specialized formal systems and substantial search rather than ordinary chat [R15]. AlphaEvolve combines LLM-generated code with evolutionary search and automated evaluators for algorithmic and infrastructure improvements [R16]. These systems are not simply larger minds. They are discovery loops.
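The shared shape of these systems, a proposer coupled to an automated evaluator, can be sketched in a few lines. This is not the implementation of FunSearch, AlphaEvolve, or any other cited system. It is a toy hill-climbing loop under loud assumptions: the function propose mutates a tuple at random where a real system would call an LLM, and evaluate scores against a hypothetical hidden target where a real system would run code or check a proof.

```python
import random

TARGET = (3, 1, 4, 1, 5)  # hypothetical hidden optimum, known only to the evaluator

def evaluate(candidate):
    """Automated evaluator: count the positions that match the hidden target."""
    return sum(a == b for a, b in zip(candidate, TARGET))

def propose(parent, rng):
    """Hypothetical proposer: change one position at random (an LLM call in a real system)."""
    child = list(parent)
    child[rng.randrange(len(child))] = rng.randrange(10)
    return tuple(child)

def discovery_loop(steps=5000, seed=0):
    """Propose, evaluate, and keep a candidate only if it scores at least as well."""
    rng = random.Random(seed)
    best = (0, 0, 0, 0, 0)
    best_score = evaluate(best)
    for _ in range(steps):
        candidate = propose(best, rng)
        score = evaluate(candidate)
        if score >= best_score:
            best, best_score = candidate, score
    return best, best_score

best, best_score = discovery_loop()
```

The proposer here knows nothing about the target, yet the loop converges because the evaluator rejects every regression. A stronger proposer would reach the same result in fewer steps, which is where the base model re-enters the argument.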

The harness is therefore more than a filter. A filter removes. A good harness creates epistemic shape. It defines a procedure, selects evidence, creates intermediate states, forces comparison, permits failure, and supplies a route back from error. The same base model inside different harnesses can become a literature-review assistant, a coding agent, a theorem prover, a critic, a translator board, or a laboratory notebook. The model is not unchanged in practice; its capacities are reorganized by the process around it.

7. Post-training: the frozen harness inside the model

A major source of confusion is that the phrase “base LLM” hides a layered object. A model encountered by a user is rarely a raw pretrained transformer. It has been instruction-tuned, preference-optimized, safety-shaped, style-shaped, sometimes tool-trained, and wrapped in hidden system instructions. This post-training layer behaves like a harness, but it ships inside the model.

This dissolves part of the debate. Suppose two frontier systems have similar internal representations but one is trained to be cautious, cite sources, refuse certain topics, and avoid speculation, while another is trained to be terse, bold, and exploratory. Their outputs may differ greatly even if their deeper world models overlap. The difference may not be architecture in the narrow sense. It may be an internalized steering policy.

The first Claude essay called this a filter wearing the model’s clothes. ChatGPT’s second synthesis called it a hidden middle layer. Grok noted that guardrails are part of the base yet function as persistent filters. DeepSeek observed that instruction tuning blurs the line between model and prompt because the model has already internalized prompt-like demonstrations. These are not merely rhetorical points. They show that model versus harness is not a natural boundary. It is an engineering boundary between what is amortized into weights and what is left outside as runtime procedure.

Future systems will move functions back and forth across that boundary. A retrieval habit can be external today and trained into an agent tomorrow. A research protocol can be expressed as a prompt, stored as memory, distilled into weights, or enforced by a workflow. A refusal policy can be embedded in post-training or executed by a separate moderation layer. A persona can be system prompt or fine-tune. This makes the question more subtle: not model versus harness, but frozen procedure versus dynamic procedure.

Dynamic procedure has advantages: it is inspectable, replaceable, updateable, domain-specific, and easier to audit. Frozen procedure also has advantages: it is fast, integrated, and available without complex orchestration. Knowledge systems will likely use both. The practical frontier will be the art of deciding which epistemic functions belong inside weights and which should remain outside, visible, and corrigible.

8. Closed loops, open loops, and the world

The second Claude synthesis introduces a useful distinction: closed loops versus open loops. A closed loop is a model arguing with itself, sampling multiple chains of thought, running self-consistency, or staging debate among instances that have no access to external information. Such loops can improve output. They can denoise. They can suppress inconsistencies. They can spend more inference to reach better latent guesses. But they cannot, by themselves, import information the system never had.

An open loop touches something outside the model: a database, a code interpreter, a proof checker, an experiment, a search engine, a human critic, a sensor, a lab instrument, or a social process. Open loops can create knowledge in a stronger sense because they allow reality to push back. The evaluator becomes a channel through which the world says no.

This distinction clarifies what is meant by knowledge found and knowledge created. Retrieval finds knowledge the world already contains but the model may not. A RAG system can answer current or private factual questions not because the model remembered them, but because the harness imported them. Experimental and formal loops can create or establish knowledge by testing conjectures. A model proposes a program; the evaluator measures its performance. A model proposes a proof step; the formal system accepts or rejects it. A model proposes a scientific hypothesis; an experiment returns data. In each case, the loop can exceed the base model because it is no longer confined to parametric memory.

But open loops do not solve everything. They require an objective. A test suite works only because someone decided what counts as passing. A benchmark works only because someone chose tasks and metrics. A proof checker works only after a problem is formalized. A scientific experiment works only after someone decides which variable matters, which measurement is meaningful, and which anomaly deserves pursuit. The world can reject; it cannot by itself tell us which question is worth asking.

This is where many technical accounts understate the human layer. We can automate more proposal, search, and verification, but the selection of worthwhile objectives remains scarce. In open-ended creation, the most important question is often not “Can this system solve the problem?” but “Why this problem? Why this hypothesis? Why this style? Why this translation? Why this criterion of success?” The human objective is not an optional decoration at the edge of the stack. It is the shaping principle that tells the stack what kind of truth, usefulness, beauty, or explanation it is seeking.

9. The human objective

The final synthesis should add one layer to the epistemic stack: purpose. The base model sets what can be thought. The prompt sets the angle of attention. Context sets what evidence is visible. The harness sets what gets tried. The evaluator sets what survives. But the human objective sets what is worth trying at all.

This is not a sentimental claim that humans are magically creative and machines are not. It is a structural claim about open-ended search. In domains with cheap verifiers, the objective can be compact: minimize loss, pass tests, prove the theorem, maximize score, reduce latency, improve packing density. In domains without cheap verifiers — essay writing, translation, historical interpretation, philosophical synthesis, design, political judgment, aesthetic choice — the objective is not fully formalized. It lives in taste, purpose, values, audience, stakes, and worldliness.

A translation board, for example, is not neutral. The choice of critics, examples, constraints, style targets, cultural references, and revision criteria encodes an aesthetic. A multi-agent debate is not automatically wise; it is wise only if disagreement is designed around standards that matter. A research harness is not automatically scientific; it is scientific only if its conjectures are exposed to serious tests and its objective does not reward superficial novelty. A writing harness is not automatically literary; it becomes literary only when its selection pressure includes rhythm, ambiguity, necessity, and judgment.

This matters because the word “creation” points beyond production. A model can produce endless text. A harness can multiply variants. A verifier can reject failures. But creation requires an orientation toward something worth making. The human objective is not merely a prompt; it is a criterion of significance. It asks not only “What can be generated?” but “What should be brought into the world?”

In this sense, the future of AI-assisted knowledge creation is not merely technical. It is curatorial, institutional, and ethical. The most powerful systems will not be those that produce the most candidates, but those that combine strong candidate generation with good taste in objectives, rigorous contact with reality, and mechanisms for revising their own standards when the world pushes back.

10. A regime map for the debate

The original question becomes clearer if we ask where the bottleneck is.

When the task depends on latent ability that cannot be supplied externally — advanced abstraction, unfamiliar domains, deep mathematical reasoning, multilingual nuance, complex planning, scientific taste — the base model dominates. Turan is strongest here. The model defines the reachable space, and a better substrate can create a frontier jump.

When the task depends on current, private, obscure, or document-specific information, context dominates. No base model, however capable, can know a private memo it has not seen. In this regime, retrieval, source selection, and context ordering are central. The model’s job is to reason over supplied evidence rather than rely on its parametric memory.

When the task requires long-horizon action, tool use, decomposition, repair, or iteration, the harness dominates. Coding agents, research assistants, data-analysis systems, and complex workflows live here. The difference between a one-shot answer and a working solution is procedure.

When truth can be checked cheaply, the evaluator dominates. Mathematics, programming, formal verification, simulation, and certain optimization tasks benefit enormously from loops that can fail fast and retry. This is the home turf of the harness thesis. But it is also a warning: do not generalize too quickly from domains where truth is unusually convenient.

When the task is interpretive, aesthetic, social, or value-laden, human and institutional judgment dominates. The model can generate; the harness can compare; the evaluator may be weak or plural. Here knowledge stabilizes through criticism, conversation, publication, use, replication, reader response, and time.

This map also has a time axis. Within a model generation, when frontier systems are bunched together, harnesses and context often create the largest marginal gains. Between generations, when a new base model genuinely expands what is reachable, the model’s influence spikes. The percentage is not a scalar. It is a moving derivative. The right question is: for this task, at this moment, where does the next unit of effort buy the most warranted knowledge?

11. What would actually settle the debate

A publishable answer should admit what would change its mind. The current corpus gives a strong conceptual map, but not a decisive measurement. A decisive program would need at least four kinds of ablation.

First, base-model ablations: hold the prompt, context, tools, sampling, and evaluator constant while swapping only the model. This would estimate how much the substrate changes the proposal distribution and final success rate. The key is not only average accuracy but diversity and depth of candidates. In knowledge creation, a model that produces one rare promising conjecture may matter more than a model that produces many polished conventional answers.

Second, harness ablations: hold the model constant while varying retrieval quality, context ordering, tool access, scratchpad procedure, multi-agent debate, memory, and verification. This is the cleanest way to test the counter-position. SWE-bench-style coding benchmarks already gesture in this direction, but the evidence is skewed toward domains where automatic evaluation is cheap. The same study design should be extended to literature review, translation, policy analysis, historical interpretation, and scientific hypothesis generation, where the evaluator is slower, plural, or human.

Third, evaluator ablations: hold both model and prompt constant while changing what counts as success. This is rarely done carefully, but it matters most. If the evaluator rewards citation count, the system learns to sound scholarly. If it rewards novelty, it may hallucinate. If it rewards passing tests, it may overfit tests. If it rewards human delight, it may become theatrical. Different evaluators do not merely score knowledge; they create different kinds of knowledge work.

Fourth, objective ablations: hold the machinery constant while changing the human purpose. Ask the same model-harness system to optimize for truth, usefulness, elegance, safety, originality, controversy, speed, or pedagogical clarity. The resulting outputs will differ not because the model changed, but because the meaning of “better” changed. This is why the human objective belongs in the stack. It is not outside the system. It is one of the system’s causal inputs.
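The four ablations share one design rule: change a single layer and hold the rest fixed. A hedged sketch of that bookkeeping follows. Every configuration name and value is an illustrative placeholder, not a system or setting taken from the corpus.

```python
# Hypothetical baseline configuration; all names and values are illustrative.
BASELINE = {
    "model": "model_a",
    "harness": "single_pass",
    "evaluator": "unit_tests",
    "objective": "truth",
}

# Alternatives to try for each layer, one layer at a time.
ALTERNATIVES = {
    "model": ["model_b"],
    "harness": ["retrieval_plus_verification", "multi_agent_debate"],
    "evaluator": ["citation_count", "human_rating"],
    "objective": ["usefulness", "originality"],
}


def one_factor_ablations(baseline, alternatives):
    """Return (factor, config) pairs that differ from the baseline in exactly one layer."""
    runs = []
    for factor, values in alternatives.items():
        for value in values:
            config = dict(baseline)  # copy so the baseline stays untouched
            config[factor] = value
            runs.append((factor, config))
    return runs


runs = one_factor_ablations(BASELINE, ALTERNATIVES)
for factor, config in runs:
    print(factor, config)
```

The sketch only enumerates conditions. Running them, and deciding what counts as success in each, is where the real difficulty of the program lies.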

A serious research agenda would also need longitudinal tests. Model influence and harness influence change over time. When frontier models are close, harness improvements may dominate. When a new model generation opens a capability frontier, the substrate dominates again. The question should be studied as a moving field, not a permanent ratio. The best empirical answer would not say, “The model contributes X percent.” It would say: for this task family, under this verification regime, at this historical moment, the marginal return to model improvement is here, the marginal return to context improvement is here, and the marginal return to evaluator design is here.
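The idea of marginal returns that move with the regime can be shown with a toy calculation. The score function below is invented for illustration and measures nothing; it simply assumes that the model level caps what context and evaluator effort can realize. The names `marginal_returns` and `toy_score` are hypothetical.

```python
from typing import Callable, Dict

Levels = Dict[str, int]


def marginal_returns(score: Callable[[Levels], float], current: Levels) -> Dict[str, float]:
    """Finite-difference gain from one more unit of effort on each layer."""
    base = score(current)
    gains = {}
    for layer in current:
        bumped = dict(current)
        bumped[layer] += 1
        gains[layer] = score(bumped) - base
    return gains


# Invented toy score, NOT a measurement: the model level sets a ceiling,
# context and evaluator effort determine how much of it is realized.
def toy_score(levels: Levels) -> float:
    ceiling = 10 * levels["model"]
    realized = 4 * levels["context"] + 3 * levels["evaluator"]
    return float(min(ceiling, realized))


# Below the ceiling: process effort pays, a bigger model does not.
early = marginal_returns(toy_score, {"model": 3, "context": 2, "evaluator": 2})
# At the ceiling: only a stronger model moves the score.
late = marginal_returns(toy_score, {"model": 3, "context": 5, "evaluator": 4})
print(early)  # {'model': 0.0, 'context': 4.0, 'evaluator': 3.0}
print(late)   # {'model': 2.0, 'context': 0.0, 'evaluator': 0.0}
```

The same three layers yield opposite rankings at two operating points, which is why a single universal percentage cannot summarize them.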

That is the standard this essay cannot meet but can name. Until such studies exist, the honest stance is neither agnosticism nor premature certainty. It is regime-aware confidence: we know enough to say that the base model sets the horizon; we know enough to say that context and harness often dominate realized performance; we do not yet know enough to assign universal percentages.

12. Practical implications for knowledge workers

The debate has practical consequences. A writer, researcher, translator, software engineer, or institution deciding how to use AI should not ask only “Which model is best?” That question matters, but it is incomplete. The better operational question is: “What epistemic stack do I need for this kind of work?”

For factual work, invest first in evidence. Build retrieval over the right corpus. Prefer source quality over context quantity. Make provenance visible. Do not assume that a long context window means the model will use the context well; the “Lost in the Middle” study warns that placement and salience matter [R10]. A model with a smaller context but a better retrieval-and-ranking pipeline can outperform a larger context used carelessly.
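One way to act on the placement warning is sketched below, under stated assumptions. Passages are ranked by a crude word-overlap score, then arranged so the strongest sit at the start and end of the context and the weakest in the middle. Both the scoring rule and the edge-placement heuristic are illustrative choices, not a method prescribed by [R10] or by this essay, and the function names are hypothetical.

```python
def rank_passages(query: str, passages: list[str]) -> list[str]:
    """Rank passages by a crude relevance score: lowercase words shared with the query."""
    query_words = set(query.lower().split())

    def overlap(passage: str) -> int:
        return len(query_words & set(passage.lower().split()))

    # sorted() is stable, so ties keep their original order.
    return sorted(passages, key=overlap, reverse=True)


def place_at_edges(ranked: list[str]) -> list[str]:
    """Put the strongest passages at the start and end, the weakest in the middle."""
    front, back = [], []
    for i, passage in enumerate(ranked):
        (front if i % 2 == 0 else back).append(passage)
    return front + back[::-1]


# Made-up passages for illustration only.
passages = [
    "the cat sat",
    "budget memo for q3 travel",
    "travel policy",
    "weather report",
    "q3 budget",
]
ranked = rank_passages("q3 travel budget", passages)
context_order = place_at_edges(ranked)
print(context_order)
```

A real pipeline would replace word overlap with a proper retriever and keep source identifiers attached to each passage, so that provenance stays visible in the final answer.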

For creative and interpretive work, invest in standards of criticism. Because there is no cheap verifier, the harness must manufacture forms of resistance: comparison, adversarial review, multiple perspectives, style probes, recurrence checks, audience-specific readings, and revision passes. The aim is not to make art mechanical, but to prevent fluency from passing as necessity. In such domains, the human objective and the critic harness are inseparable.

For scientific and technical discovery, invest in feedback loops that reach outside the model. Give the model ways to test, not merely ways to talk. Code execution, formal verification, simulations, external databases, and experimental interfaces matter because they let the world answer. The most important question becomes: what kind of feedback can the system receive, and how quickly can it use failure productively?

For organizations, the implication is governance. If knowledge increasingly emerges from epistemic stacks, then accountability cannot stop at the named model. One must audit training data where possible, post-training behavior, retrieval sources, tool permissions, logging, evaluator incentives, user prompts, and publication workflows. A flawed harness can make a good model dangerous; a strong harness can make a merely adequate model useful; a hidden policy layer can make two systems with similar substrates behave as if they inhabit different worlds.

For individuals, the implication is craft. Prompt engineering in the narrow sense may become less important as models become better at understanding ordinary instructions. But context engineering, workflow design, source criticism, evaluator design, and taste will become more important. The future skill is not memorizing magic phrases. It is designing conditions under which a model’s first plausible answer is not allowed to be its last.

This is perhaps the most actionable answer to the original debate. If one is buying access to a model, choose the strongest substrate available within cost and safety limits. If one is trying to create knowledge, spend disproportionate effort on the surrounding process: the corpus, the question, the tools, the critics, the tests, the objective, and the revision loop. The model gives breadth. The process gives discipline. The human objective gives direction.

13. What this experiment teaches

The experiment teaches five things.

First, it confirms that frontier model outputs can converge in substance while diverging in temperament. The responses share a central architecture: no fixed percentage, model as substrate, harness as realization, knowledge as loop. Yet they differ in style, emphasis, caution, metaphor, and self-critique. That pattern supports the idea that model worlds may overlap while post-training and interface design produce visible behavioral differences.

Second, comparison itself creates epistemic pressure. Each single response sounds plausible alone. Together, their omissions become legible. Gemini makes the topic accessible but risks flattening it into metaphor. DeepSeek gives a rich dyad but sometimes leans on claims that need stronger qualification. Grok opens the philosophical frame but sometimes overextends from technical observations to epistemology. ChatGPT gives the most systematic map but can sound too clean. Claude supplies the strongest critique of harness evidence and the human-objective layer but also writes from a self-conscious contrarian posture. The final essay is better because the corpus disagrees.

Third, the method is a harness and not an oracle. The process used here — many candidate answers, comparison, synthesis, revision, final publication — is itself a model of AI-assisted knowledge work. It did not produce certainty. It produced a more articulated question, a richer map of positions, and a more falsifiable final claim. That is a modest but real epistemic gain.

Fourth, the final revision passes show that a harness can improve a document without changing its key logic. The wisepersons pass did not overturn the thesis; it made the setting, provenance, standpoint, and method harder to ignore. The extended literary-masters pass did not add ornamental style; it tightened rhythm, removed over-explanation, and restored the essay’s readerly force after the structural changes. The editorial pass did not create knowledge; it removed the small frictions that make knowledge less publishable.

Fifth, the procedure shows why human direction remains central. Hulki Okan Tabak did not merely ask for an answer; he designed a process. He selected the initial question, chose the participating systems, supplied the source corpus, identified the need for synthesis, added constraints of length and publishability, corrected the setting to include Oya’s contribution, required a colophon, requested independent critical contribution, and then ordered final review and cleanup passes. This direction changed the output. The author/director role is not ornamental. It is the purpose layer made visible.

14. Final thesis

The final answer can now be stated compactly.

Turan is right that the base model is the deep source of possibility. It compresses cultural, linguistic, mathematical, scientific, and procedural priors into a proposal distribution. It sets the ceiling of latent competence. It determines the quality of candidate ideas. No harness can reliably extract knowledge from representations that are absent.

The counter-position is right that, among sufficiently capable and increasingly convergent models, the marginal determinant of useful knowledge creation often shifts to prompt, context, harness, tools, memory, and evaluators. The model may contain possible thoughts, but the harness decides which thoughts are elicited, grounded, tested, compared, revised, and stabilized.

Both positions are incomplete because knowledge creation is not located in a component. It is located in a loop. A model can generate. A prompt can orient. Context can ground. A harness can iterate. An evaluator can reject. A community can validate. A human objective can decide what is worth pursuing. Knowledge appears when conjectures are forced to survive contact with constraint and when a purpose decides that the surviving conjecture matters.

The future therefore does not belong simply to the largest model or the cleverest prompt. It belongs to epistemic stacks: systems that combine capable models with high-quality context, disciplined tools, transparent procedures, adversarial criticism, reliable evaluators, and human purposes strong enough to guide search. In the near term, many practical gains will come from better harnesses because frontier models are increasingly close in capability and because workflow design is still young. At the frontier, base models will continue to create jumps. In domains with cheap verifiers, evaluator-driven loops will dominate. In domains without cheap verifiers, the crucial invention will be synthetic criticism: ways of making disagreement, evidence, and judgment operational without pretending that all truth is a unit test.

The final aphorism should therefore be neither engine versus steering wheel nor model versus prompt. The base model is the horizon of possible thought. The post-training layer is the frozen steering already inside the machine. The context is the evidence brought into view. The prompt is the angle of attention. The harness is the method of search. The evaluator is the discipline of reality. The human objective is the reason to search. And knowledge is what remains after the loop has had a chance to kill its own best illusions.

Primary response corpus

[P1] Claude. The Model, the Harness, and the Loop: Where knowledge actually gets created in the age of converging world-models. Uploaded response document, 21 May 2026.

[P2] Gemini. The Engine vs. The Steering Wheel: An Essay on Knowledge Creation in Large Language Models. Uploaded response document, 21 May 2026.

[P3] ChatGPT. LLMs, Context, Harnesses, and Knowledge Creation: An essay on base models, prompt engineering, context engineering, and epistemic systems. Uploaded response document, 21 May 2026.

[P4] DeepSeek. The Dyad of Discovery: Base Models, Scaffolds, and the Creation of Knowledge. Uploaded response document, 21 May 2026.

[P5] Grok. The Architecture of Artificial Epistemology: Base Models, Prompts, Context, and the Harness in Knowledge Creation. Uploaded response document, 21 May 2026.

[P6] ChatGPT. The Epistemic Stack: Base Models, Context, Harnesses, and Where Knowledge Is Made. Second-stage synthesis, 21 May 2026.

[P7] Claude. Who Carves the Marble? On base models, harnesses, and what five language models agreeing about knowledge creation reveals — and conceals. Second-stage synthesis, 21 May 2026.

Revision and review protocols

[S1] ai-skills-wisepersons-panel. Uploaded skill document, 21 May 2026. Used as a designed-disagreement architectural review for provenance, standpoint, metadata, preservation, classification, and editorial conscience.

[S2] extended-masters-session. Uploaded skill document, 21 May 2026. Used here not for a Vera architectural decision but as a final literary polish protocol after the wisepersons review changed the article’s architecture.

[S3] editorial-pass. Uploaded skill document, 21 May 2026. Used as the final 15-criterion cleanup gate for typography, consistency, integration, verbal tics, duplicate detection, endings, front/back matter, and publish-ready verification.

Selected research references

[R1] Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.

[R2] Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training Compute-Optimal Large Language Models. arXiv:2203.15556.

[R3] Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. arXiv:2203.02155.

[R4] Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.

[R5] Wang, X., Wei, J., Schuurmans, D., et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171.

[R6] Yao, S., Yu, D., Zhao, J., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.

[R7] Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401.

[R8] Yao, S., Zhao, J., Yu, D., et al. (2022/2023). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.

[R9] Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., & Mordatch, I. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325.

[R10] Liu, N. F., Lin, K., Hewitt, J., et al. (2023/2024). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172; TACL 2024.

[R11] Huh, M., Cheung, B., Wang, T., & Isola, P. (2024). The Platonic Representation Hypothesis. ICML 2024 / arXiv:2405.07987.

[R12] Romera-Paredes, B., Barekatain, M., Novikov, A., et al. (2024). Mathematical Discoveries from Program Search with Large Language Models. Nature 625, 468–475.

[R13] Google DeepMind. (2023). FunSearch: Making new discoveries in mathematical sciences using large language models. Research blog accompanying Nature publication.

[R14] Trinh, T. H., Wu, Y., Le, Q. V., He, H., & Luong, T. (2024). Solving Olympiad Geometry without Human Demonstrations. Nature 625, 476–482.

[R15] Google DeepMind. (2024). AI achieves silver-medal standard solving International Mathematical Olympiad problems. Research blog, 25 July 2024.

[R16] Novikov, A., Vu, N., Eisenberger, M., et al. (2025). AlphaEvolve: A Coding Agent for Scientific and Algorithmic Discovery. arXiv:2506.13131.

[R17] Ciernik, L., Linhardt, L., Morik, M., et al. (2024). Training Objective Drives the Consistency of Representational Similarity across Datasets. arXiv:2411.05561.

[R18] Groeger, F., Wen, S., & Brbic, M. (2026). Revisiting the Platonic Representation Hypothesis: An Aristotelian View. arXiv:2602.14486.

Colophon

This final publish-ready essay was directed and authored by Hulki Okan Tabak and drafted with ChatGPT 5.5 Pro on 21 May 2026. The source material consisted of seven AI-generated documents: five first-stage responses to the same question by ChatGPT, Claude, DeepSeek, Gemini, and Grok, followed by second-stage synthesis essays by ChatGPT and Claude. The final method was comparative and critical: preserve convergence, identify divergence, scrutinize the procedure, add independent argument, and shape the result into a publishable essay.

The originating discussion began in a room where Hulki Okan Tabak, Turan, and Oya were present and contributing. The subsequent deeper debate with Turan sharpened the central opposition between the base-model thesis and the prompt/context/harness thesis. Oya’s presence and contribution belong to the origin of the question and are credited here accordingly. Thanks are due to Turan for the debate that made the question precise, and to Oya for being part of the setting in which it first became alive.

After the initial final synthesis, this publish-ready edition received three final passes at Hulki’s direction: a wisepersons review, an extended literary-masters polish, and a 15-criterion editorial cleanup. These are named AI-assisted review protocols, not claims of participation by the historical figures whose names organize the panels. Their function was to make the document’s own epistemic stack explicit: authorial direction, multi-model corpus, critical review, literary polish, editorial verification, and final publication.

The method itself is part of the object it describes. Multiple models generated candidate answers; the author/director selected, corrected, constrained, and commissioned synthesis; two models produced competing second-stage interpretations; this final essay used those documents as context and subjected them to comparison; then further review protocols revised the architecture, prose, and final cleanliness. The artifact is therefore not a model’s isolated answer. It is a human-directed, multi-model epistemic stack: question, corpus, comparison, critique, synthesis, revision, polish, editorial verification, and publication.

Where Knowledge Is Made was originally published in Coinmonks on Medium, where people are continuing the conversation by highlighting and responding to this story.
