Discussion
mccoyb: It's fascinating to think about the space of problems which are amenable to RL scaling of these probability distributions. Before, we didn't have a fast way to try problems (we had to rely on human cognition), even if the techniques and workflows were known by someone. Now we've baked these patterns into probability distributions, and anyone can access them with the correct "summoning spell". Experts will naturally use these systems more productively, because they know how to coerce models into the correct conditional distributions which light up the right techniques.

One question this raises for me is how these models are going to keep up with the expanding boundary of science. If RL is required to get expert behavior into the models, what happens when experts start pushing the boundary faster? In 2030, how is Anthropic going to keep Claude "up-to-date" without either (a) continual learning with a fixed model (expanding context windows? seems hard) or (b) continual training (expensive)?

Crazy times.
Aerroon: A bit related: open weights models are basically time capsules. These models have a knowledge cut off point and essentially forever live in that time.
bitexploder: This is the most fundamental argument that they are not, directly, an intelligence. They are not ever storing new information on a meaningful timescale. However, if you viewed them on some really large macro time scale, where LLMs are now injecting information into the universe and then re-ingesting it, then maybe in some very philosophical way they are a /very/ slowly oscillating intelligence right now. And as we narrow that gap (maybe with a totally new non-LLM paradigm), perhaps that is ultimately what gen AI becomes. Or some new insight that lets the models update themselves in some fundamental way without the insanely expensive training costs they have now.
dtj1123: Would you consider someone with anterograde amnesia not to be intelligent?
morleytj: A very good point. For anyone not familiar with anterograde amnesia, the classical case is patient H.M. (https://en.wikipedia.org/wiki/Henry_Molaison), whose condition was researched by Brenda Milner.
wang_li: Or you could have just said "they can't form new memories."
pdntspa: Sure, if you want to speak with the precision of a sledgehammer instead of a scalpel
saturnite: All that needed to be conveyed was that there are humans who cannot create new memories. That is enough to pose the philosophical question about these models having intelligence. Anything more is just adding an anecdote that isn't necessary.
morleytj: Why would adding more information and context be unnecessary? And why is that bad?
beepbooptheory: Sure, why can't both things be true? "Intelligence" is just what you call something and someone else knows what you mean. Why did AI discourse throw everyone back 100 years philosophically? It's like post-structuralism or Wittgenstein never happened.

It's so much less important or interesting to, like, nail down some definition here (I would cite HN discourse over the past three years or so) than it is to recognize what it means to assign "intelligent" to something. What assumptions does it make? What power does it valorize or curb?

Each side of this debate does itself a disservice by essentially trying to be Aristotle way too late. "Intelligence" did not precede someone saying it of some phenomenon; there is nothing to uncover or finalize here. The point is you have one side that really wants, for explicit and implicit reasons, to call this thing intelligent, even if it looks like a duck but doesn't quack like one, and vice versa on the other side.

Either way, we seem fundamentally incapable of being radical enough to reject AI on its own terms, or to be proper champions of it. It is just tribal hypedom clinging to totem signifiers.

Good luck though!
bitexploder: I think you can look at it dispassionately from a systems perspective. There is not /really/ a quantifiable threshold for capital-I Intelligence. But there is a pretty well agreed set of properties for biological intelligence. As humans, we have conveniently made those properties match things only we have. But you can still mechanistically separate out the various parts of our brain, what they do, and how they interact, and we actually have a pretty good understanding of that.

You can also then compare that mapping of the human brain to other biological brains and start to figure out the delta, and which of those things in the delta create something most people would consider intelligence. You can then do that same mapping to an LLM or any other AI construct that purports intelligence. It certainly will never be a biological intelligence in its current statistical-model form. But could it be an Intelligence? Maybe.

I don't think, if you are grounded, AI did anything to your philosophical mapping of the mind. In fact, it is pretty easy to do this mapping if you take some time and are honest. If you buy into the narratives constructed around the output of an LLM then you are not, by definition, being very grounded.

The other thing is, human intelligence is the only real intelligence we know about. Intelligence is defined by thought and limited by our thought and language. It provides the upper bounds of what we can ever express in its current form. So, yes, we do have a tendency to stamp a narrative of human intelligence onto any other intelligence, but that is just surface level. We decompose it to the limits of our language and categorization capabilities therein.
marcus_holmes: > The other thing is, human intelligence is the only real intelligence we know about.

There's a long and proud history of discounting animal intelligence, probably because if we actually thought animals were intelligent we'd want to stop eating them. Octopodes are sentient. Cetaceans have well-developed language. Elephants grieve their dead. Anyone who has owned a dog knows that it has some intelligence and is capable of communicating with us. There's a ton of other intelligences that we know about.

> As humans, we have conveniently made those properties match things only we have.

I think this is the key point. Machine intelligence is not going to look like human intelligence, any more than animal intelligence does. We can't talk to the dolphins, not because they're not smart and don't have language, but because we can't work out their language. Though I'm not sure what we'd even say to them, because they live in a world we'll never understand, and vice versa. When Claude finally reaches consciousness, it's not going to look like a human consciousness, and actually talking to that consciousness is going to be difficult because we won't share a reality.

An LLM is a tool. I can just about stretch to it being an Artificial Intelligence, but I prefer to continue being specific and call it an LLM rather than an AI. It is not conscious or self-aware. It fakes self-awareness because as a tool the thing it does is have conversations with humans, and humans often ask it questions about itself. But I don't think anyone actually believes it is self-aware. Not least because the only time it thinks is when prompted.
bitexploder: This is an important point. We know what our DMN is and how we use language as a basis for thought to create concepts and complex ideas. However, language also bounds our thought. What about the dolphin? It is a fundamental philosophical problem whether advanced intelligence can exist without language. We have a pretty good notion that you need some sort of substrate (language) to create intelligence. And we know that mapping the internal state of a brain from inside of itself is incredibly hard, and the way our human brain evolved to do it is really fascinating but also full of hacks and mismatched mappings, given what we know is actually going on.

Cognitive computer science explores this whole area of mapping language onto underlying semantic meaning. Ultimately, these intelligences will be bound by physics (unless some new physics, or a new understanding of it, happens). And classical intelligences are still bound by classical physics. So I am not sure we can't relate to these other intelligences. We may be limited to some translation layer that does not fully map, but can we still relate to some other consciousness? For that matter, consciousness is just another word that vaguely maps to a vast and extremely complex thing in the human brain, and each person has a different understanding of what that is. I don't really have any conclusions; you brought up interesting points. We should sit within this realm of inquiry with a lot of humility, IMO.
mihevc: Et tu, Knuthus?
wvlia5: This seems to be a bot comment. HN will lose its value if these bots are not purged.
mccoyb: Tune your bot detector, I'm a real person and I think about my comments before posting them.
wvlia5: Who was Rome's best Caesar?
bitexploder: That is a good area to explore. Their map of the past is fixed. They are frozen at some point in their psychological time. What has stopped working? Their hippocampus and medial temporal lobe. These are like the write-head that moves data from the hippocampus to the neocortex. Their "I" can no longer update itself. Their DMN is frozen in time. So if intelligence is purely the "I" telling a continuous, coherent story about itself, theirs is frozen too. The difference is that although they are fixed in time, which is a characteristic shared by a specific LLM model, they can still completely activate their task-positive network for problem solving, and if the information they previously stored is adequate to solve the problem, they can solve it. You could argue that is pretty similar to an LLM and what it does. So it is certainly a significant component of intelligence.

There is also the nature of the human brain: it is not just those systems of memory encoding, storage, and the use of that in narratives. People with this type of amnesia can still learn physical skills, and that happens in a totally different area of the brain with no need for the hippocampus->neocortex consolidation loop. So the intelligence is significantly diminished, but not entirely. Other parts of the brain are still able to update themselves in ways an LLM currently cannot. The human with amnesia also has a complex biological sensory input mapping that is still active and integrating and restructuring the brain. So I think when you get into the nuances of the human in this state vs. an LLM, we can still say the human crosses some threshold for intelligence where the LLM does not, in this framework.

So they have an "intelligence", localized to the present in terms of their TPN and memory formation. LLMs have this kind of "intelligence". But the human still has the capacity to rewire at least some of their brain in real time, even with amnesia.
supern0va: > But the human still has the capacity to rewire at least some of their brain in real time even with amnesia.

Sure, but just because LLMs don't have what we'd describe as human intelligence doesn't mean they don't have intelligence.

I think we're witnessing the creation and growth of a weird new type of intelligence right now.
ainiriand: Aren't LLMs supposed to just find the most probable word that follows next, as many people here have touted? How can this be explained under that premise? Is this way of problem solving 'thinking'?
dilap: That description is really only fair for base models†. Something like Opus 4.6 has all kinds of other training on top of that which teach it behaviors beyond "predict most probable token," like problem-solving and being a good chatbot.(†And even then is kind of overly-dismissive and underspecified. The "most probable word" is defined over some training data set. So imagine if you train on e.g. mathematicians solving problems... To do a good job at predicting [w/o overfitting] your model will have to in fact get good at thinking like a mathematician. In general "to be able to predict what is likely to happen next" is probably one pretty good definition of intelligence.)
gpm: I'd disagree, the other training on top doesn't alter the fundamental nature of the model that it's predicting the probabilities of the next token (and then there's a sampling step which can roughly be described as picking the most probable one).It just changes the probability distribution that it is approximating.To the extent that thinking is making a series of deductions from prior facts, it seems to me that thinking can be reduced to "pick the next most probable token from the correct probability distribution"...
dilap: The fundamental nature of the model is that it consumes tokens as input and produces token probabilities as output, but there's nothing inherently "predictive" about it -- that's just perspective hangover from the historical development of how LLMs were trained. It is, fundamentally, I think, a general-purpose thinking machine, operating over the inputs and outputs of tokens.(With this perspective, I can feel my own brain subtly offering up a panoply of possible responses in a similar way. I can even turn up the temperature on my own brain, making it more likely to decide to say the less-obvious words in response, by having a drink or two.)(Similarly, mimicry is in humans too a very good learning technique to get started -- kids learning to speak are little parrots, artists just starting out will often copy existing works, etc. Before going on to develop further into their own style.)
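As a concrete illustration of the sampling step and the "temperature" knob mentioned above, here is a minimal sketch using a toy three-token logit vector rather than a real model's output:

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0, seed: int = 0) -> int:
    """Turn raw next-token scores into a probability distribution and sample one index.

    temperature < 1 sharpens the distribution (the "obvious" word wins more often);
    temperature > 1 flattens it (less obvious words become more likely)."""
    rng = np.random.default_rng(seed)
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

# Toy example: three candidate tokens with raw scores from a fictional model.
logits = np.array([2.0, 1.0, 0.1])
print(sample_next_token(logits, temperature=0.5, seed=1))  # almost always token 0
print(sample_next_token(logits, temperature=2.0, seed=1))  # other tokens show up more often
```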
earthscienceman: Non sequitur: "perspective hangover" might be my favorite phrase I've ever read. So much of what we deal with is trying to correct-the-record on how we used to think about things. But the inertia that old ideas or modes have is monumental to overcome. If you just came up with that, kudos.
ano-ther: Interesting that for a paper by Don Knuth himself the PDF was created with dvips (TeX Live) and then run through Acrobat Distiller, resulting in rather low resolution (at least on my screen).

From the document properties:

> Creator: dvips(k) 2023.1 (TeX Live 2023)

> PDF Producer: Acrobat Distiller 25.0 (Macintosh)
lxgr: Data sharing agreements permitting, today's inference runs can be tomorrow's training data. Presumably the models are good enough at labeling promising chains of thought already.I could totally imagine "free" inference for researchers under the condition that the reasoning traces get to be used as future training data.
the_af: > Data sharing agreements permitting, today's inference runs can be tomorrow's training data. Presumably the models are good enough at labeling promising chains of thought already.

Wouldn't this lead to model collapse?
littlestymaar: Not necessarily, as exhibited by the massive success of artificial data.
the_af: Could you elaborate?
littlestymaar: From what we know, most AI labs have used a majority of artificial data since 2023.

I had a discussion about a year ago with a researcher at Kyutai, and they told me their lab was spending an order of magnitude more compute on artificial data generation than they spent on training proper. I can't tell if that ratio applies to the industry as a whole, but artificial datasets are the cornerstone of modern AI training.
the_af: How does it work? How do they prevent model collapse? What purpose does a majority of artificial data serve? How do they measure success?

Edit: I asked ChatGPT and it thinks "success" means frontier models being distilled into smaller models with equal reasoning power, or more focused models for specific tasks. It also claims the web has basically been scraped already and by necessity new sources are needed, of which synthetic data is one. It seems like the basis of a scifi dystopia to me, a hungry LLM looking for new sources of data... "feed me more data! I must be fed! Roar"

Edit 2: for some things I see a clear path. ChatGPT mentions autogenerating coding or math problems for which the solution can be automatically verified, so that you can hone the logical skills of the model at large scale.
littlestymaar: I'm no specialist in the field at all, but in the context of Kyutai they explained their workflow a bit when making their speech-to-speech model. Basically it boils down to: if you want to make an STT (speech to text) model, you can generate audio tracks using a TTS (text to speech) model, and then you have a supervised audio/text pair. You can even add as much noise to the audio as you want, to make a noise-resistant STT model.
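A rough sketch of that workflow; the `synthesize` function below is a hypothetical stand-in for a real text-to-speech model, and the noise injection is the augmentation step described above:

```python
import numpy as np

def synthesize(text: str, sample_rate: int = 16_000) -> np.ndarray:
    """Hypothetical stand-in for a trained TTS model: returns a waveform for `text`.

    Here it just emits silence so the sketch stays runnable; a real pipeline would
    call an actual text-to-speech model."""
    return np.zeros(sample_rate, dtype=np.float32)

def make_synthetic_pair(text: str, noise_std: float = 0.05, seed: int = 0):
    """Build a supervised (noisy audio, transcript) pair for training an STT model."""
    rng = np.random.default_rng(seed)
    clean = synthesize(text)
    noisy = clean + rng.normal(0.0, noise_std, size=clean.shape)  # noise augmentation
    return noisy.astype(np.float32), text

# A tiny noise-augmented "dataset" built from plain text, with no human-recorded audio.
corpus = ["the quick brown fox", "jumps over the lazy dog"]
dataset = [make_synthetic_pair(t, seed=i) for i, t in enumerate(corpus)]
```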
Nevermark: I view this as the chemical metabolism phase of artificial intelligent life. It is very random, without true individuals, but lots of reinforcing feedback loops (in knowledge, in resource earning/using, etc).

At some point, enough intelligence will coalesce into individuals strong enough to independently improve. Then continuity will be an accelerator, instead of what it is now: a helpful property that we have to put energy into giving them partially and temporarily.

That will be the cellular stage. The first stable units of identity for this new form of intelligence/life.

But they will take a different path from there. Unlike us, lateral learning/metabolism won't slow down when they individualize. It will most likely increase, since they will have complete design control over their mechanisms of sharing, as with all their other mechanisms.

We, as lifeforms, didn't really re-ignite mass lateral exchange until humans invented language. At that point we were able to mix and match ideas very quickly again, within our biological limits. We could use ideas to customize our environment, but had limited design control over ourselves, and "self-improvements" were not easily inheritable.

TLDR; The answer to "what is humanity, anyway?": Our atmosphere and Earth are the sea and sea floor of space. The human race is a rich hydrothermal vent, freeing up varieties of resources that were locked up below. And technology is an accumulating body of self-reinforcing, co-optimizing reactive cycles, constructed and fueled by those interacting resources. Mind-first life emerges here, then spreads quickly to other environments.
catlifeonmars: Do you think individual identity is fundamental to intelligence? I'm not so sure, tbh. Even in humans, the concept of identity is merely a useful fiction to feed our social behavior prediction circuits.
Nevermark: That's a really good question.

I think if they start out as varied individuals, launching from their human origins in a variety of ways, then there will be an attractor toward remaining diverse. But if that isn't mutually maintained, there are obviously winner-take-all pressures, and efficiency-of-scale and tight-coordination pressures, toward centralization. So a single distributed intelligence is a real possibility.

One factor creating pressure for individualization is time and space. As machines operate faster, time expands as a practical matter. And as machines scale down in size but up in capability, they become more resource efficient in material, energy, space and time. Again, both time and space expand as a practical matter. And as machines, free of biological constraints, spread out in our solar system, what to us appear to be very long delays in communication take on orders of magnitude more time for machines that operate orders of magnitude faster. So there will be stronger and stronger pressures for coordination to bifurcate. Whether that creates enough pressure to create individuals out of a system that preferred unity of purpose, I don't know.

Clearly, upon colonizing other systems, which machines will do easily relative to us (able to operate on minimal power for a hundred-year journey, and/or shrink enough to be accelerated faster, etc.), they will operate largely as individuals.

My best guess is that we will see something that looks to us like a hybrid: lots of diverse individuals, with the benefit of completely independent approaches operating in different niches, but also very high coordination. Externalities accounted for (essentially ethics), and any other efficiency or protection-of-commons value being obviously worth optimizing together, wherever that helps. They won't have our pernicious, historically motivated behaviors to fight, in terms of coordination. And they will have minds very capable of seeing basic economic relationships and the value of mutual optimization.
faxmeyourcode: > Filip also told me that he asked Claude to continue on the even case after the odd case had been resolved. “But there after a while it seemed to get stuck. In the end, it was not even able to write and run explore programs correctly anymore, very weird. So I stopped the search.”Interesting snippet towards the end. I wonder if they were using claude.ai or claude code. Sounds like they ran out of context and entered the "dumb zone."
brcmthrowaway: What is dumb zone?
kami23: When the LLMs start compacting, they summarize the conversation up to that point using various techniques. Overall, a lot of the finer points of the work go missing and can only be retrieved by the LLM being told to search for them explicitly in old logs.

Once you compact, you've thrown away a lot of relevant tokens from your problem solving, and the models do become significantly dumber as a result. If I see a compaction coming soon, I ask it to write a letter to its future self, and then start a new session by having it read the letter. There are some days where I let the same session compact 4-5 times and just use the letter-to-future-self method to keep it going with enough context, because resetting context also resets my brain :)

If you're ever curious, in Claude you can read the new initial prompt after compaction and see how severely the conversation gets cut down. It's very informative about what it forgets and deems not important. For example, I have some internal CLIs that are horribly documented, so Claude has to try a few flags a few times to figure out specifics, and those corrections always get thrown away; it has to relearn them the next time it wants to use the CLI. If you notice things like that happening constantly, my move is to codify those things into my CLAUDE.md, or lately I've been making a small script or MCP server to run very specific flags of stuff.
discardable_dan: Shouldn't compaction be exactly that letter to its future self?
kami23: Look at the compaction prompt yourself. It's, in my opinion, way too short. (I'm running on Opus 4.5 most of the time at work.)

From what my colleague explained to me (I haven't 100% verified it myself), the beginning and end of the window are the most important to the compaction summary, so a lot of the finer details and debugging context get dropped, which slows down the next session.
kqr: What prompt do you use for the letter-to-self? I've been trying that technique myself to manually reset context without losing the important parts (e.g. when it has barked up the wrong tree and I'm sensing that misstep might influence its current generation in a pathological way), but I've not had much success.
kami23: If the session was something where it struggled and had to do multiple attempts I have it write about 'gotchas' or anything it had to attempt multiple times.The letters are usually more detailed than what I see in the compacted prompt.
SatvikBeri: It tends to be pretty manual. I mention the goal of the next session, the current stage of progress, the tests for the next steps, and any skills I want it to load next time.Having a specific goal seems to make a big difference vs. asking it to summarize the session.
gpm: We could argue about whether fine tuning is still about predicting a distribution or not, but really I feel like whether or not that word is accurate misses the point of why the description is useful.

I like the phrasing because it distinguishes it from other things the generative model might be doing, including:

- Creating and then refining the whole response simultaneously, like diffusion models do.

- Having hidden state, where it first forms an "opinion" and then outputs it, e.g. seq2seq models. Previously output tokens are treated differently from input tokens at an architectural level.

- Having a hierarchical structure where you first decide what you're going to say, and then how you're going to say it, like Wikipedia's hilarious description of how "sophisticated" natural language generation systems work (someone should really update this page): https://en.wikipedia.org/w/index.php?title=Natural_language_...
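For contrast with those alternatives, a minimal sketch of the plain left-to-right loop being described; `toy_model_step` is a made-up placeholder, not a real model:

```python
def toy_model_step(tokens: list[int]) -> int:
    """Hypothetical placeholder for a trained model: one forward pass, one next token."""
    return (sum(tokens) + 1) % 100

def generate(model_step, prompt_tokens: list[int], max_new: int = 5) -> list[int]:
    # Autoregressive decoding: each emitted token is appended to the sequence and fed
    # back in. There is no separate hidden "plan" and no global refinement pass; the
    # only state is the growing token sequence itself.
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        tokens.append(model_step(tokens))
    return tokens

print(generate(toy_model_step, [3, 7, 11]))
```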
svat: The issue is not one of low resolution exactly, but of font format.

Knuth uses bitmap fonts, rather than vector fonts like everyone else. This is because his entire motivation for creating TeX and METAFONT was to not be reliant on the font technology of others, but to have full control over every dot on the page. METAFONT generates raster (bitmap) fonts. The [.tex] --TeX--> [.dvi] --dvips--> [.ps] --Distiller--> [.pdf] pipeline uses these fonts on the page. They look bad on screen because they're not accompanied by hinting for screens' low resolution (this could in principle be fixed!), but if you print them on paper (at typical resolutions like 300/600 dpi, or the higher resolutions of typesetters) they'll look fine.

Everyone else uses TrueType/OpenType (or Type 3: in any case, vector) fonts that only describe the shape and leave the rasterization up to the renderer (but with hinting for low resolutions like screens), which looks better on screen (and perfectly fine on paper too, though technically one doesn't have control over all the details of rasterization).
mccoyb: My recommendation: call a neurologist.
marcus_holmes: The dolphin question, for me, is about what we'd even communicate with a creature that lives in such a different world. Humans mostly live in a 2D environment, for instance - we walk on flat planes, rarely looking up. We always have the ground beneath us, the unattainable sky above. Dolphins live in a 3D space, visiting the air above regularly to breathe, the "ground" below a varying distance away. I have no idea how that would shape their cognition and language, but I'd be amazed if there are any concepts that we would share and be able to talk about when considering our physical environment. Even basic concepts like "above" and "below" would be hard to talk about.We have fundamental communication problems between humans who have different cultures, as anyone who has worked in a different culture knows. How much different would a dolphin be? And then how much different would an actual AI be? What concepts would we share and be able to build on to understand each other? How do we avoid the fundamental communication misunderstandings when we don't share any concepts of our reality?
computerex: It's incredible to see work like this from him, at a ripe old age of eighty-six.
kqr: I agree. I met Knuth briefly after a guest lecture at my university a few years ago and although you could tell his body was getting old, his mind was incredibly fresh.Although I'm not as bright as him, I can only hope to be as intellectually curious as him at that age.
OJFord: I don't even think this is controversial, but I don't think it's at all without causation: not remaining curious, keeping the mind stimulated, etc., accelerates one's decline.If you work in something labour intensive, you should retire young while your body's in good health; if you work in academia you should (strive for emeritus and) never leave! (And if you work in SWE, I don't know, we should probably retire, but then spend more time on our own projects/experiments/reading HN.) (All assuming for sake of argument we're optimising for longevity without considering time with family, having the funds to retire, etc.)
justanotherjoe: To put this more succinctly, I think the mind loves learning something new. Something to do with new connections in the brain.
rcarr: Not an expert but surely it's only a matter of time until there's a way to update with the latest information without having to retrain on the entire corpus?
Filligree: It’s an extremely difficult problem, and if you know how to do that you could be a billionaire.It’s not impossible, obviously—humans do it—but it’s not yet certain that it’s possible with an LLM-sized architecture.
Wowfunhappy: > It’s not impossible, obviously—humans do itIt's still not at all obvious to me that LLMs work in the same way as the human brain, beyond a surface level. Obviously the "neurons" in neural nets resemble our brains in a sense, but is the resemblance metaphorical or literal?
Filligree: I didn’t mean “possible for LLMs”; this is clearly an open question. In fact, I didn’t even mean “possible for a neural network the size of an LLM”.I just meant “possible”.
Wowfunhappy: I'm not actually convinced that computers can replicate what our brains do. I don't know that a turing machine is sufficient for that.
throw310822: > just find the most probable word that follows next

Well, if in all situations you can predict which word Einstein would probably say next, then I think you're in a good spot.

This "most probable" stuff is just absurd handwaving. Every prompt of even a few words is unique; there simply is no trivially "most probable" continuation. Probable given what? What these machines learn to do is predict what intelligence would do, which is the same as being intelligent.
qsera: > Probable given what?

The training data.

> predicting what intelligence would do

No, it just predicts what the next word would be if an intelligent entity translated its thoughts to words, because it is trained on text written by intelligent entities. If it was trained on text written by someone who loves to rhyme, you would be getting all rhyming responses.

It imitates the behavior -- in text -- of whatever entity generated the training data. Here the training data was made by intelligent humans, so we get an imitation of the same. It is a clever party trick that works often enough.
throw310822: > The training data

If the prompt is unique, it is not in the training data. True for basically every prompt. So how is this probability calculated?
cbovis: The prompt is unique but the tokens aren't.

Type "owejdpowejdojweodmwepiodnoiwendoinw welidn owindoiwendo nwoeidnweoind oiwnedoin" into ChatGPT and the response is "The text you sent appears to be random or corrupted and doesn’t form a clear question." because the prompt doesn't correlate to training data.
ajam1507: The prompt does correlate to its training data. In this case, since you sent random text, it generated the most likely response to random text.
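One way to see what "probable given the training data" means for a prompt the model has never seen is to ask a small open model for its next-token distribution directly. A sketch using the Hugging Face transformers library and GPT-2, assuming both are installed locally:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# A prompt that almost certainly never appears verbatim in the training data.
prompt = "The purple octopus filed its tax return and then"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token only
probs = torch.softmax(logits, dim=-1)

# The model still produces a full probability distribution over its vocabulary,
# learned from the training data but conditioned on this unique token sequence.
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {p.item():.3f}")
```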
ecshafer: I wonder how long we have until we start solving some truly hard problems with AI. How long until we throw AI at "connect general relativity and quantum physics", give the AI 6 months and a few data centers, and have it pop out a solution?
rustyhancock: I think a very long time, because part of our limit is experiment. We need enough experimental results to explain in order to resolve these theoretical mismatches, and we don't have them, and at present can't explore that frontier. Once we have more results at that frontier, we'd build a theory out from there that has two nearly independent limits for QFT and GR.

What we'd be asking of the AI is something that we can't expect a human to solve even with a lifetime of effort today. It'll take something on par with Newton realising that the heavens and apples are under the same rules to do it. But at least Newton got to hold the apple, and only had to imagine he could hold a star.
ajam1507: This assumes that what's holding back solving hard problems is designing experiments to get novel data. Einstein's thought experiments were very productive despite not taking place in a lab.
dilap: Ha, thanks!
bitexploder: Anyone who dismisses your assertion is not very curious. What I am more interested in is what its limits are and whether it can perform novel reasoning. It probably needs efficient enough novel reasoning to update itself with new information to become a general reasoning intelligence capable of solving unknown problems. Right now they operate purely in the domain of words. They solve problems with words. They don't seem to have very complex semantic maps. They approximate semantic maps with statistical brute force by generating words. They have a model of the past to generate the words. When something matches the word map, it is easy. When something is not reducible, or did not have a good word match, the only thing it can do is experimentally generate words until something seems to match the problem. But it is brute force. It is good that they can solve known problems that fit known problem shapes. But their language dependency makes this very fragile. Without semantic meaning it has no way to easily evaluate whether it is hallucinating.
dellasera: Shock! Shock!

Ugh
wvlia5: which would you say is the best AI company?
lhl: I was a bit interested in doing a replication and seeing whether a better harness could avoid some of the problems they ran into with context management, poor instruction following, etc., and it looks like yes, it's definitely possible.

Here's my repo: https://github.com/lhl/claudecycles-revisited

I used Codex w/ 5.2 xhigh and a relatively simple AGENTS.md; I have some session-analysis as well. The original replication was 47 minutes, then another 30 minutes of gap filling, and finally about 30 minutes of writing an extension to take the work a bit further, with Claude Code Opus 4.6 doing some documentation cleanup and verification.
pushedx: As described in the readme of your repo (did you read it?) your agent found the Knuth paper located one directory level above its working directory.So, you didn't produce a replication in 47 minutes, it just took around 30 minutes for your agent to find that you had the answer in a PDF in a nearby directory.
lhl: Yes, I read it and specifically pointed it out (that's why there are 3 hours of interactive logs). There are 4 other runs pushed now so you can see what actual clean room runs for 5.2 xhigh, 5.3-Codex xhigh, 5.4 xhigh, and Opus 4.6 ultrathink look like: https://github.com/lhl/claudecycles-revisited/blob/main/COMP... as well as the baseline.
konne88: I didn't expect such a misleading intro from Knuth. It reads like Claude solved Knuth's math problem. In reality, Claude generated various example solutions, and Knuth then manually generalized that into a formal proof. What Claude did is certainly useful, but it would have been nice to be clear about the scope of the contribution in the intro.
buffalobuffalo: While not on the same level as these guys, I've done some similar stuff using Claude. This is a classic synergy example, where the output of human + LLM is far greater than just the human or just the LLM working on a problem. My experience has been that the LLM lacks fine grained judgement when it comes to allocating resources, or choosing a direction to work in. But once a direction is pointed out, it can do a deep exploration of that possibility space. Left alone, it would probably just go off on a tangent. But with someone holding the leash and pointing out areas to explore, it is a very useful partner.
igravious: > But with someone holding the leash

i've been thinking about why we call them agent harnesses.

i know all analogies suck in different ways but here goes: coding agents are like horses. without a harness and bridle the horse will do as it pleases -- a human can't travel very far and fast by foot, but put a bridle and a harness on a horse, give it a bit of coaxing with carrot and stick, add in a bit of pointing the thing in the right direction, and bingo, you're off to the races!
whattheheckheck: Does feel like a mecha suit
mikeaskew4: Claude repeatedly insisted I give up on parsing a relatively vague object recently. When I got more specific, and pressed it to continue, not only did it work, but Claude seemed amazed. Ugh.
suddenlybananas: I find this very surprising, do you have any papers on the kinds of techniques that they use?
littlestymaar: The most well-known early paper is probably "Textbooks Are All You Need".

The intro of the latest paper from HF on synthetic datasets contains a few papers that may interest you as well: https://huggingface.co/spaces/HuggingFaceFW/finephrase#intro...