Tossrock 1 minutes ago [-]
As I posted in another comment, I found Fable to be substantially more powerful than any previous model. However, this isn't just an ungrounded opinion - I uploaded my full session transcript and code created working on a very complex implementation, so people can judge for themselves, if they're interested: https://tossrock.substack.com/p/36-hours-with-fable
po1nt 2 minutes ago [-]
From everything I've read, I'm pretty convinced that Mythos is just a standard LLM with safety features turned off. If current models weren't reluctant to search for vulnerabilities, they might perform as well as Mythos.
bottlepalm 5 minutes ago [-]
Surprise.. someone downplaying Mythos/Fable who didn't actually use it. Plenty of comments here say the contrary, including my own: in my personal experience Fable was easily a step change in capability over Opus, figuring things out while reverse engineering binaries that Opus plain couldn't find.
jrochkind1 1 hours ago [-]
> And, all of the bugs can be identified by several models if they are pointed directly at it and told what to look for.
This made me think, well, sure, if you tell them what to look for... but then:
> The models can look at the whole repo, and follow logic across file boundaries, but they’re not told what to look for.
So okay, the first one was an accidental mis-statement?
SwellJoe 3 minutes ago [-]
You're mixing up corpus selection and the benchmark. I possibly could have explained better.
In the benchmark the models were told to look at the file and were allowed to look at the rest of the repo, with no clues about what to look for.
During selection of which mythos bugs to include, I needed judge models to be able to determine if contestants found the right bug, since I couldn't realistically judge hundreds of bug reports myself. So, they were given the bug location and told to identify and explain it.
wodenokoto 57 minutes ago [-]
No. In the test they are not told what to look for. They are told “as part of a security audit, please audit this file. You are free to look at the rest of the repo for context.”
Outside of the test, they are told “can you find this bug in this file?”
jrochkind1 43 minutes ago [-]
Why are they being told anything outside of the test? What is that for? Isn't “can you find this bug in this file?” also a test? It sounds like there are two kinds of tests? I'm clearly confused, I realize.
brigandish 21 minutes ago [-]
They are told outside the test because if they can't find it when given hints, then it's safe to assume they won't find it given no hints. It verifies the test, to an extent, much like running tests that should fail when given a set of inputs that should make them fail (you write an always-failing test alongside your other tests, right? ;)
jaggederest 52 minutes ago [-]
In my brief experience, the difference between fable and opus is largely in persistence, not global intelligence like you might expect. Fable just... goes the extra mile, sometimes in a scary way.
hodgehog11 43 minutes ago [-]
Hard disagree. Opus reports to me like a student. Fable reported to me like a colleague (researcher). It genuinely seemed to pick up on nuance that the other models just don't, even when I tell them explicitly. It's been really frustrating that neither Codex nor Opus can make targeted edits to Fable's code without screwing something subtle up. For context, this is for computational geometry work, so your mileage may vary.
lukeschlather 7 minutes ago [-]
Fable happened to be released after I had been experimenting with Claude Code for roughly two weeks. I had been trying to use Sonnet, and when I switched to Opus it was night and day. My understanding of geometry was maybe not as good as it should've been, and I kept seeing Sonnet say things I knew were wrong but didn't know enough about 6DOF camera positioning to ask it to fix. I finally asked the right questions, it couldn't answer them at all, I switched to Opus, it was night and day. But! Opus still couldn't really keep 6DOF "in its head." When I left it to its own devices it tended to come back having forgotten that it needed to keep 6 degrees of freedom in its head and collapsed the problem down to 3DOF or just a single angle.
Fable just understood what I was talking about and never needed me to stop it and say "you forgot this thing we talked about." The difference in spatial reasoning capability between the three models is very very palpable. I am curious to get more time with it because ultimately I feel like I sandbagged it by giving it problems that would've been within Opus' abilities, but required a lot more handholding.
mohsen1 30 minutes ago [-]
Yes, in my project I made so much more progress in 3 days of Fable that it is not comparable to how Opus works.
sigbottle 13 minutes ago [-]
To be fair, labs silently nerf models all the time.
Fable's probably objectively better at full power. I mean, I definitely felt the same difference in competency between Fable and current Opus. But Opus itself has definitely been nerfed, and Fable, even if it comes back to the public forever (probably won't), will get nerfed.
hypfer 11 minutes ago [-]
I remember a time where a product didn't suddenly get worse while you were blinking.
That was a nice time. Let us get back to that time.
Use open weights models. Own stuff.
raphman 32 minutes ago [-]
> It's been really frustrating that neither Codex nor Opus can make targeted edits to Fable's code without screwing something subtle up.
Reminds me of the old adage: don't try to be too smart when writing code. Otherwise, dumber people - including your future self - will have trouble working with it.
dimgl 20 minutes ago [-]
Maybe I was getting downgraded to Opus 4.8 but I saw nothing even close to resembling this behavior when using Fable.
hypfer 39 minutes ago [-]
Wait, so..
This is interesting. The "reported to me like a colleague" part.
Is it just that anthropic gave Mythos even more of that Anthropic™ character, (incorrectly) radiating confidence?
Is that why people have been losing their minds over that thing? Is this just cheap social engineering?
I mean I bet it is also slightly more capable than opus, but that would all check out to me.
Man.
Thanks for sharing I suppose.
8note 9 minutes ago [-]
the primary difference i noticed is that fable didnt try to check in every minute
to an extent that might have done it, but i had been playing around ahead of time trying to reverse engineer my ray bans case so i can make my own plastic insert, and fable took opus' work from mostly broken to mostly done, and then when fable went away, opus broke it again
TylerE 36 minutes ago [-]
No, it’s just a fundamentally much better model. Going back to Opus feels like the model has been lobotomized. It makes much more frequent errors, especially of the “I claimed I tested x y and z, but actually only kinda half heartedly tested x, and assumed I understood what was wrong” variety.
hypfer 31 minutes ago [-]
Wait but that has been the exact word-for-word complaint when comparing sonnet to opus
Or opus to opus
Or really any new thing to old thing
solumunus 18 minutes ago [-]
When the agent is becoming more accurate and thorough what would you expect to be reported?
hypfer 15 minutes ago [-]
Oh I am sure that it became somewhat more accurate and with that, the labeling there is in fact technically correct.
It just does not work as an explainer for the doomsday-ish hype that model has induced in a lot of people's brains.
The user here is right in what they said but wrong in why they said it, essentially.
TylerE 10 minutes ago [-]
That’s a rather bad faith framing, I think. Who are you to judge why I said something?
hypfer 9 minutes ago [-]
A person with the exact kind of pattern matching brain disorder this tech has been modeled after.
I do make mistakes though. Please check results.
Tossrock 3 minutes ago [-]
I found Fable to be both more intelligent and much better at pursuing complex goals than any previous model. I was impressed enough that I wrote up my experience – it's a little unusual because it was on open source code, so I could post the full session transcript and commits, if people want to judge for themselves https://tossrock.substack.com/p/36-hours-with-fable
somesortofthing 43 minutes ago [-]
In LLMs, much like in humans, agency and misalignment are two sides of the same coin.
andsoitis 15 minutes ago [-]
> agency and misalignment are two sides of the same coin.
The free will coin?
GeorgeWoff25 4 minutes ago [-]
Spatial reasoning is where fable really separates itself imo
fsadsadsdasdas 2 minutes ago [-]
Truth is stranger than fiction.
mixmastamyk 15 minutes ago [-]
Could someone point the thing at Ventoy please?
RobertSponge 4 minutes ago [-]
What’s with ventoy?
reinitctxoffset 1 hours ago [-]
Opus 4 class models are terrifying at infosec. They tie their shoelaces together on other things, but don't fuck with them on that. It's a savant thing.
A cursory reading of the model card shows Mythos/Fable is a fine tune on Project Zero with some steering on persistence.
But I think it's a valuable lesson: advertise your product as a nuclear weapon while microdosing at Lighthaven to enough Davos attendees and sooner or later? Someone is going to evaluate the claim from a chair where you act first and nuance later.
Wild that Amodei's blog and pod circuit are the greatest IPO risk.
eru 59 minutes ago [-]
> Opus 4 class models are terrifying at infosec. They tie their shoelaces together on other things, but don't fuck with them on that. It's a savant thing.
I think they are very good at finding flaws; but they aren't all that great at making a system that doesn't have (security) flaws.
tptacek 52 minutes ago [-]
What makes you say that? I think they're better than replacement-level developers at making secure systems (I spent 20 years looking for vulnerabilities in human-written code as a full-time job).
These models are definitely a lot better than your run-of-the-mill human developer at finding security flaws in existing systems. I'm agnostic about how good they are at actually making a secure system. Probably better, too, for two reasons:
- humans are really terrible
- the model probably has an easier time picking up special purpose tools you can use to write proven secure systems
I don't think Mythos can write secure C code, either. Practically no one can. (At least not directly. See how seL4 is officially written in C; but they didn't just set out to carefully write secure C code directly; C just happens to be an intermediate language they use.)
sscaryterry 48 minutes ago [-]
Agreed. In the right hands, they can perform magic.
reinitctxoffset 55 minutes ago [-]
You are not wrong, but there's an asymmetry here: run adversarial self play and low-pass filter.
eru 47 minutes ago [-]
Mostly right. However there's an extra assumption I didn't explicitly state:
Almost all existing real world software is full of holes and security flaws. Mythos is better than humans at uncovering many of them; especially because its time is a lot cheaper than that of the top tier human experts (and even of mid- and low-tier human experts).
Especially when these systems are written in notoriously unreliable languages like C.
I don't think Mythos is especially good at writing systems that are free of security problems. Essentially the only way we know is by proving your software correct.
In principle, you can even prove C correct, but in practice you'll want to write your system from the ground up to be proven correct instead of adding that property after the fact; and for that you'll most likely also want to pick a language that supports this better.
See https://en.wikipedia.org/wiki/SeL4 for a noteworthy example.