GPT-5 usage is 20% higher on days that start with "S"
Nevertheless, 7 datapoints does not a trend make (and the data presented certainly doesn't explain why). The daily variation is more than I would have expected, but it could also be down to what day of the week the pizza party or the weekly scrum meeting is at a few of their customers' workplaces.
For development use cases, I switched to Sonnet 4.5 and haven't looked back. I mean, sure, sometimes I also use GPT-5 (and mini) and Gemini 2.5 Pro (and Flash), and also Cerebras Code just switched to providing GLM 4.6 instead of the previous Qwen3 Coder so those as well, but in general the frontier models are pretty good for development and I wouldn't have much reason to use something like Sonnet 4 or 3.7 or whatever.
I have canceled my Claude Max subscription because Sonnet 4.5 is just too unreliable. For the rest of the month I'm using Opus 4.1 which is much better but seems to have much lower usage limits than before Sonnet 4.5 was released. When I hit 4.1 Opus limits I'm using Codex. I will probably go through with the Codex pro subscription.
Don't try to comprehend the hive mind brother, there are a lot of shills and fanboys in addition to a lot of great people on this forum, sometimes the variance looks pretty bad.
I hope the people downvoting get some minor joy out of it, I know you need it.
Sonnet 4.5 is way worse than Opus 4.1 -- it's incredible that they claim it's their best coding model.
It's obvious if you've used the two models for any sort of complicated work.
Codex with GPT-5 codex (high thinking) is better than both by a long shot, but takes longer to work. I've fully switched to Codex, and I used Claude Code for the past ~4 months as a daily driver for various things.
I only reach for Sonnet now if Codex gets cagey about writing code -- then I let Sonnet rush ahead, and have Codex align the code with my overall plan.
For development use cases, it's best to use multiple models anyway. E.g. my favorite model is the Gemini 2.5 Pro, but there are certain cases where Qwen3 Coder gives much better results. (Gemini likes to overthink.) It's like having a team of competent developers provide their opinions. For important parts (security, efficiency, APIs), it's always good to get opinions from different sources.
For local chat Jan seems okay, or OpenWebUI for something hosted. For IDE integrations some people enjoy Cline a bunch but RooCode also allows you to have multiple roles (like ask/code/debug/architect with different permissions e.g. no file changes with ask) and also preconfigured profiles for the various providers and models, so you can switch with a dropdown, even in the middle of a chat. There’s also an Orchestrator mode so I can use something smart for splitting up tasks into smaller chunks and a dumber but cheaper model for the execution. Aside from that, most of the APIs there seem OpenAI conformant so switching isn’t that conceptually difficult. Also if you wanna try a lot of different models you can try OpenRouter.
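Since most of those endpoints speak the OpenAI dialect, switching is often just a different base URL and model name. A minimal sketch with the OpenAI Python SDK; the base URLs are the usual ones for OpenRouter and a local Ollama, but the model IDs are only examples, so swap in whatever you actually use:

```python
# Minimal sketch: most providers expose OpenAI-compatible chat-completions
# endpoints, so switching models is mostly a different base_url and model name.
from openai import OpenAI

PROVIDERS = {
    "openrouter": {"base_url": "https://openrouter.ai/api/v1", "model": "qwen/qwen3-coder"},
    "ollama":     {"base_url": "http://localhost:11434/v1",    "model": "qwen3:8b"},
}

def ask(provider: str, prompt: str, api_key: str) -> str:
    cfg = PROVIDERS[provider]
    client = OpenAI(base_url=cfg["base_url"], api_key=api_key)
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Ollama ignores the key; OpenRouter needs a real one.
print(ask("ollama", "Summarise what an orchestrator mode is for.", api_key="unused"))
```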
Something like opencode probably, that’s what I have been using to freely and very easily switch between models and keep all my same workflows. It’s phenomenal really
I wish we could pin down not only the model but also the way the UI works as well.
Last week Claude seemed to have a shift in the way it works. The way it summarises and outputs its results is different. For me it's gotten worse. Slower, worse results, more confusing narrowing down what actually changed etc etc.
Long story short, I wish I was able to checkpoint the entire system and just revert to how it was previously. I feel like it had gotten to a stage where I felt pretty satisfied, and whatever got changed ... I just want it reverted!
4.1 is such an amazing model in so many ways. It's still my nr. 1 choice for many automation tasks. Even the mini version works quite well and it has the same massive context window (nearly 8x GPT-5). Definitely the best non-reasoning model out there for real world tasks.
Can you elaborate on that? In which part of the RAG pipeline did GPT-4.1 perform better? I would expect GPT-5 to perform better on longer context tasks, especially when it comes to understanding the pre-filtered results and reasoning about them
For large context (up to 100K tokens in some cases). We found that GPT-5:
a) has worse instruction following; it doesn't follow the system prompt
b) produces very long answers, which resulted in a bad UX
c) has a 125K context window, so extreme cases resulted in an error
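A cheap guard against failure mode (c) is to estimate the prompt's token count before picking a model. A rough sketch, with the token budgets and model names as placeholders for whatever a given pipeline actually uses:

```python
# Rough sketch: route by estimated prompt size so oversized RAG contexts never
# hit a hard context-window error. Token budgets here are illustrative placeholders.
import tiktoken

ENC = tiktoken.get_encoding("o200k_base")           # tokenizer family used by recent OpenAI models
LIMITS = {"gpt-5": 120_000, "gpt-4.1": 1_000_000}   # placeholder budgets, not official figures
HEADROOM = 4_000                                     # leave room for the answer

def pick_model(prompt: str) -> str:
    n = len(ENC.encode(prompt))
    for model, limit in LIMITS.items():
        if n + HEADROOM <= limit:
            return model
    raise ValueError(f"prompt is {n} tokens, larger than every configured budget")
```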
Interesting. https://www.robert-glaser.de/prompts-as-programs-in-gpt-5/ claims GPT-5 has amazing!1!! instruction following. Is your use-case very different, or is this yet another case of "developer A got lucky, developer B tested more things"?
ChatGPT when using 5 or 5-Thinking doesn’t even follow my “custom instructions” on the web version. It’s a serious downgrade compared to the prior generation of models.
Not the original commenter but I work in the space and we have large annotated datasets with "gold" evidence that we want to retrieve, the evaluation of new models is actually very quantitative.
Multiple models is a must, mostly due to the sometimes unpredictable variations in responses to specific situations/contexts/languages and frameworks. I find that Sonnet 4, Gemini Pro 2.5 are solid in comparison to newer models (especially Sonnet 4.5 which I find frequently to underperform). When one model is stuck in a loop, switching to a model like GPT-5 often breaks it but which model will work is subject to circumstance. P.S. I spend at least 3-4 hours a day in code-gen activities of various levels using Cursor as my primary IDE.
I've found that the VSCode GitHub Copilot extension defaults to Claude Sonnet 4.0 (in agent mode) in all new workspaces. It's the first thing I check now, but I imagine a lot of people just roll with it, especially if they use inline completions where it might not be obvious what model is being used.
The usage data looks like a classic case of the drift principle. When a model gets heavily optimized for alignment, polish, and safety, you gain consistency but lose some fidelity to the actual task. Newer models think longer, act less, and smooth over edges that used to be useful for real work. Older models aren’t smarter, they’re just sitting earlier on the drift curve, before over compression starts eroding decisiveness. So the specialization we’re seeing may just be developers picking the version where fidelity holds up best, not the one with the highest benchmark score.
> Each model appears to emphasize a different balance between reasoning and execution. Rather than seeking one “best” system, developers are assembling model alloys—ensembles that select the cognitive style best suited to a task.
This (as well as the table above it) matches my experience. Sonnet 4.0 answers SO-type questions very fast and mostly accurately (if not on a niche topic), Sonnet 4.5 is a little bit more clever but can err on the side of complexity for complexity's sake, and can have a hard time getting out of a hole it dug for itself.
ChatGPT 5 is excellent at finding sources on the web; Gemini simply makes stuff up and continues to do so even when told to verify; ChatGPT provides links that work and are generally relevant.
Seems to completely ignore usage of local/free models as well as anything but Sonnet/ChatGPT. So my confidence in the good faith of the author is... heavily restricted.
A 4090 has 24GB of VRAM allowing you to run a 22B model entirely in memory at FP8 and 24B models at Q6_K (~19GB).
A 5090 has 32GB of VRAM allowing you to run a 32B model in memory at Q6_K.
You can run larger models by splitting the layers between those run in VRAM and those stored in system RAM. That is slower, but still viable.
This means that you can run the Qwen3-Coder-30B-A3B model locally on a 4090 or 5090. That model is a Mixture of Experts model with 3B active parameters, so inference is fast even though all ~30B weights still need to fit in memory; at a 4-bit quant that's roughly 18GB, so it also runs on a 24GB 3090.
The Qwen3-Coder-480B-A35B model could also, in principle, be run on a 4090 or 5090 by splitting its layers between VRAM and system RAM; only 35B parameters are active per token, but the full weights still have to be held somewhere, so you would need a lot of RAM.
Yes, it will be slower than running it in the cloud. But you can get a long way with a high-end gaming rig.
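The arithmetic behind those figures is easy to sanity-check. A back-of-the-envelope sketch; the bits-per-weight values are approximate for the llama.cpp quants, and KV cache plus runtime overhead add a few GB on top:

```python
# Back-of-the-envelope VRAM estimate: parameter count times bytes per weight.
# Ignores KV cache and runtime overhead, which need a few GB of extra headroom.
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8  # billions of params * bytes/weight = GB

for name, params_b, bits in [
    ("22B @ FP8", 22, 8.0),
    ("24B @ Q6_K", 24, 6.56),                    # Q6_K is roughly 6.56 bits/weight
    ("32B @ Q6_K", 32, 6.56),
    ("Qwen3-Coder-30B-A3B @ Q4_K_M", 30.5, 4.85),
]:
    print(f"{name}: ~{weights_gb(params_b, bits):.1f} GB of weights")
```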
Today. But what about in 5 years? Would you bet we will be paying hundreds of billions to OpenAI yearly or buying consumer GPUs? I know what I will be doing.
Paying for compute in the cloud. That’s what I am betting on. Multiple providers, different data center players. There may be healthy margins for them but I would bet it’s always going to be relatively cheaper for me to pay for the compute rather than manage it myself.
But the progress goes both ways: In five years, you would still want to use whatever is running on the cloud supercenters. Just like today you could run gpt-2 locally as a coding agent, but we want the 100x-as-powerful shiny thing.
That would be great if that were the case, but my understanding is that progress is plateauing. I don't know how much of this is Anthropic / Google / OpenAI holding themselves back to save money and how much is the state of the art genuinely slowing down, though. I can imagine there could be a 64 GB GPU in five years, as absurd as it feels to type that today.
Honestly though how many people reading this do you think have that setup vs. 85% of us being on a MBx?
> The Qwen3-Coder-480B-A35B model could also be run on a 4090 or 5090 by splitting the active 35B parameters across VRAM and RAM.
Reminds me of running Doom when I had to hack config.sys to forage 640KB of memory.
Less than 0.1% of the people reading this are doing that. Me, I gave $20 to some cloud service and I can do whatever the hell I want from this M1 MBA in a hotel room in Japan.
> Reminds me of running Doom when I had to hack config.sys to forage 640KB of memory.
The good old days of having to do crazy nutty things to get Elite II: Frontier, Magic Carpet, Worms, Xcom: UFO Enemy Unknown, Syndicate et cetera to actually run on my PC :-)
>I can do whatever the hell I want from this M1 MBA in a hotel room in Japan.
As long as it's within terms and conditions of whatever agreement you made for that $20. I can run queries on my own inference setup from remote locations too
You need to leave much more room for context if you want to do useful work besides entertainment. Luckily there are _several_ PCIe slots on a motherboard. New Nvidia cards at retail (or above) are not the only choice for building a cluster; I threw a pile of Intel Battlemage cards at it and got away with ~30% of the Nvidia cost for the same capacity (setup was _not_ easy in early 2025 though).
You can gain a lot of performance by using the optimal quantization techniques for your setup (ix, awq etc); different llama.cpp builds perform differently from one another, and very differently compared to something like vLLM.
Yes, but they are really less performant than Claude Code or Codex.
I really cried with the 20-25GB models (30B Qwen, Devstral, etc.). They really don't hold a candle to them; I didn't think the gap was this large, or maybe Claude Code and GPT perform much better than I imagined.
But… they do, all the time. Almost everybody uses some mix of Office, Slack, Notion, random email providers, random “security” solutions, etc. The exception is the opposite. The only thing preventing info from leaking is the ToS, and there are options for that even with LLMs. Nothing has changed in that regard.
In my experience it’s very common for big companies to not host. Think Fortune 500 type companies. Most are legally happy with their MSA and reasonably confident in security standards.
Yes, then they use Outlook for example. Have you checked the ToS of the new Outlook version for commoners? They flat out state that they can use all of your emails for whatever they want.
Also, companies host, for example, an Exchange server on-prem; and guess what it connects to? Why do you think you can usually access your account at outlook.com?
Your on premise exchange server has zero connections to outlook.com. OWA (Outlook Web Access) looks similar to outlook.com but has otherwise nothing to do with it.
This is a poor take imo. It depends on the industry, but the world's businesses run on the shoulders of companies like Microsoft and heavily use OneDrive/SharePoint. Most entities, even those with sensitive information, are legally comfortable with that arrangement. Using an LLM does not change much so long as the MSA is similar.
> It depends on the industry, but the world's businesses run on the shoulders of companies like Microsoft and heavily use OneDrive/SharePoint
I am sure MS employees need to tell themselves that to sleep well. The statement itself doesn't seem to hold much epistemological value above that though.
It goes in direct conflict with your idea. I am sure you know some people within your circle that say they cannot leak data but the fact remains. Over 85% of Fortune 500 companies use some combo of OneDrive or Sharepoint. The companies have already gotten familiar with the risks and legally are comfortable with the MSAs. So I am not sure what legs you are standing on.
Absolutely, there are specific companies or industries where they think the risk is too great, but for many, outsourcing the process is either the same or less risk than doing it all in-house.
Agreed, GPU is the expensive route, especially when I was looking at external GPU solutions.
Using Qwen3:32b on a 32GB M1 Pro may not be "close to cloud capabilities" but it is more than powerful enough for me, and most importantly, local and private.
As a bonus, running Asahi Linux feels like I own my Personal Computer once again.
I agree with you (I have a 32G M2Pro) and I like to mix using local models running with Ollama and LM Studio with using gemini-cli (used to also occasionally use codex but I just cancelled my $20/month OpenAI subscription - I like their products but I don’t like their business model, so I lose out now on that option).
Running smaller models on Apple Silicon is kinder on the environment/energy use and has privacy benefits for corporate use.
Using a hybrid approach makes sense for many use cases. Everyone gets to make their own decisions; for me, I like to factor in externalities like social benefit, environment, and wanting the economy to do as well as it can in our new post-mono polar world.
I am currently using a local model qwen3:8b running on a 2020 (2018 intel chip) Mac mini for classifying news headlines and it's working decently well for my task. Each headline takes about 2-3 seconds but is pretty accurate. Uses about 5.3 gigs of ram.
Can you expand a bit on your software setup? I thought running local models was restricted to having expensive GPUs or the latest Apple Silicon with unified memory. I have an Intel 11th gen home server which I would like to use to run some local model for tinkering, if possible.
Those little 4B and 8B models will run on almost anything. They're really fun to try out but severely limited in comparison to the larger ones - classifying headlines to categories should work well but I wouldn't trust them to refactor code!
If you have 8GB of RAM you can even try running them directly in Chrome via WebAssembly. Here's a demo running a model that's less than 1GB to load, entirely in your browser (and it worked for me in mobile safari just now): https://huggingface.co/spaces/cfahlgren1/Qwen-2.5-WebLLM
It really is a very simple setup. I basically had an old Intel-based Mac mini from 2020 (the Intel chip inside it is from 2018). It's a 3 GHz 6-core Core i5. I had upgraded the RAM to 32 GB when I bought it. However, Ollama only uses about 5.5 GB of it, so it can run on a 16 GB Mac too.
The Qwen model I am using is fairly small but does the job I need it to, classifying headlines pretty decently. All I ask it to do is say whether a specific headline is political or not. It only responds with True or False.
I access this model from an app (running locally) using the `http://localhost:11434/api/generate` REST api with `think` set to false.
Note that this Qwen model is a `thinking` model, so disabling thinking is important. Otherwise it takes very long to respond.
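Roughly, the call looks like the sketch below; the prompt wording and model tag are guesses based on the description above rather than the exact ones used:

```python
# Sketch of the described setup: classify a headline as political or not via
# Ollama's /api/generate endpoint, with thinking disabled so the model answers fast.
import requests

def is_political(headline: str) -> bool:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen3:8b",
            "prompt": f"Is this headline political? Answer only True or False.\nHeadline: {headline}",
            "think": False,   # important for thinking models, otherwise responses take very long
            "stream": False,
        },
        timeout=60,
    )
    return resp.json()["response"].strip().lower().startswith("true")

print(is_political("Parliament passes new budget after marathon session"))
```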
Note that I tested this on my newer M4 Mac mini too and there, the performance is a LOT faster.
Also, on my new M4 Mac, I originally tried using Apple's built-in Foundation Models for this task, and while it was decent, many times it was hitting Apple's guardrails and refusing to respond because it claimed the headline was too sensitive. So I switched to the Qwen model, which didn't have this problem.
Note that while this does the job I need it to, as another comment said, it won't be much help for things like coding.
It's really just a performance tradeoff, and where your acceptable performance level is.
Ollama, for example, will let you run any available model on just about any hardware. But using the CPU alone is _much_ slower than running it on any reasonable GPU, and obviously CPU performance varies massively too.
You can even run models that are bigger than available RAM, but performance will be terrible.
The ideal case is to have a fast GPU and run a model that fits entirely within the GPU's memory. In these cases you might measure the model's processing speed in tens of tokens per second.
As the idealness decreases, the processing speed decreases. On a CPU only, with a model that fits in RAM, you'd be maxing out in the low single-digit tokens per second, and on lower-performance hardware you start talking about seconds per token instead. If the model does not fit in RAM, then the measurement is minutes per token.
For most people, their minimum acceptable performance level is in the double digit tokens per second range, which is why people optimize for that with high-end GPUs with as much memory as possible, and choose models that fit inside the GPU's RAM. But in theory you can run large models on a potato, if you're prepared to wait until next week for an answer.
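To put rough numbers on that, the rates below are illustrative orders of magnitude rather than benchmarks of any particular setup:

```python
# Rough feel for what those rates mean when generating a 500-token answer.
ANSWER_TOKENS = 500
for label, tok_per_s in [
    ("fast GPU, model fully in VRAM", 40),
    ("CPU only, model fits in RAM", 3),
    ("model swapped beyond RAM", 0.02),
]:
    seconds = ANSWER_TOKENS / tok_per_s
    pretty = f"~{seconds / 60:.0f} minutes" if seconds >= 120 else f"~{seconds:.0f} seconds"
    print(f"{label}: {pretty}")
```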
> It's really just a performance tradeoff, and where your acceptable performance level is.
I am old enough to remember developers respecting the economics of running the software they create.
Ollama running locally paired occasionally with using Ollama Cloud when required is a nice option if you use it enough. I have twice signed up and paid $20/month for Ollama Cloud, love the service, but use it so rarely (because local models so often are sufficient) that I cancelled both times.
If Ollama ever implements a pay as you go API for Ollama Cloud, then I will be a long term customer. I like the business model of OpenRouter but I enjoy using Ollama Cloud more.
I am probably in the minority, but I wish subscription plans would go away and Claude Code, gemini-cli, codex, etc. would all be only available pay as you go, with ‘anti dumping’ laws applied to running unsustainable businesses.
I don’t mean to pick on OpenAI, but I think the way they fund their operations actually helps threaten the long term viability of our economy. Our government making the big all-in bet on AI dominance seems crazy to me.
I think it's also true for many local models. People still use NeMo, QwQ, Llama3 for use cases that fit them despite there being replacements that do better on "benchmarks". Not to mention relics like BERT that are still tuned for classification even today. ML models always have weird behaviours and a successor is unlikely to be better in literally every way, once you have something that works well enough it's hard to upgrade without facing different edge cases.
Inference for new releases is routinely bugged for at least a month or two as well, depending on how active the devs of a specific inference engine are and how much model creators collaborate. Personally, I hate how data from GPT's few week (and arguably somewhat ongoing) sycophancy rampage has leaked into datasets that are used for training local models, making a lot of new LLM releases insufferable to use.
Completely agree. This is why they brought back the “legacy models” option.
GPT-$ is the money GPT, in my opinion: the one where they were able to maximise benchmarks while being very cheap to run, but in the real world it is absolutely garbage.
I use both Codex and Claude, mostly cuz it's cheaper to jump between them than to buy a Max sub for my use-case. My subjective experience is that Codex is better with larger or weird, spaghetti-ish codebases, or codebases with more abstract concepts, while Claude is good for more direct uses. I haven't spent significant time fine-tuning the tools for my codebases.
Once, I set up a proxy that allowed Claude and Codex to "pair program" and collaborate, and it was cool to watch them talk to each other, delegate tasks, and handle different bits and pieces until the task was done.
Some missing context (pun intended) is that Augment Code has recently switched to a per-token instead of per-message pricing model. This hasn't gone down particularly well, but that's another story. But it may well be that users drop back to older models in the expectation that they will use fewer tokens.
Personally, I stopped using GPT-5 as it would just be tool call after tool call without ever stopping to tell you what the hell it was doing. Sonnet 4.5 is much better in this regard, albeit too verbose for the new token-based world ('let me just summarise that in a report').
I have to get better at interrupting Sonnet 4.5 when it starts going down a rabbit hole I didn't ask it to, it's too bad the incentives are mixed up and Anthropic gets more money the longer the bot spirals.
It could be an interesting data point, but without correcting for absolute usage figures and their customers it's kind of hard to make general statements.
I don't get the point of this post. Personally, I think that the thinking process is essential for accurate tool usage. Whenever I interact with Claude family models, either on a web chat or via a coding agent CLI, I believe that this thinking process is what makes Claude more accurate in using tools.
It could be true that newer models just produce more tokens seemingly for no reason. But with the increasing number of tool definitions, I think it will pay off in the long run.
Just a few days ago, I read "Interleaved Thinking Unlocks Reliable MiniMax-M2 Agentic Capability"[1]. I think they have a valid point that this thinking process has significance as we are moving towards agents.
I think this is one of the many indicators that even though these models get “version upgrades” it’s closer to switching to a different brain that may or may not understand or process things the way you like. Without a clear jump in performance, people test new models and move back to ones they know work if the new ones aren’t better or are actually worse.
I think this is somewhat disingenuous since not everyone uses the latest thing, and people tend to stick to “what works” for them.
Models are picky enough about prompting styles that changing to a new model every week/month becomes an added chunk of cognitive overload, testing and experimentation, plus even in developer tooling there have been minor grating changes in API invocations and use of parameters like temperature (I have a fairly low-level wrapper for OpenAI, and I had to tweak the JSON handling for GPT-5).
Also, there are just too many variations in API endpoints, providers, etc. We don’t really have a uniform standard. Since I don’t use “just” OpenAI, every single tool I try out requires me to jump through a bunch of hoops to grab a new API key, specify an endpoint, etc.—and it just gets worse if you use a non-mainstream AI endpoint.
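Such a wrapper can be as small as a parameter filter in front of the shared OpenAI-style call. A sketch; the quirk entries are examples of the kind of thing that differs between models, not an authoritative list of what each model accepts:

```python
# Sketch of a thin per-model shim: drop request parameters a given model is known
# to reject, then make the shared OpenAI-style call. The quirk table is illustrative.
from openai import OpenAI

DROP_PARAMS = {
    "gpt-5": {"temperature", "top_p"},   # example: some reasoning models reject sampling knobs
    "gpt-4.1": set(),
}

def chat(client: OpenAI, model: str, messages: list[dict], **params) -> str:
    cleaned = {k: v for k, v in params.items() if k not in DROP_PARAMS.get(model, set())}
    resp = client.chat.completions.create(model=model, messages=messages, **cleaned)
    return resp.choices[0].message.content

client = OpenAI()
print(chat(client, "gpt-5",
           [{"role": "user", "content": "Give me a one-line JSON object with key ok set to true."}],
           temperature=0.2))  # silently dropped for gpt-5 in this sketch
```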
> I think this is somewhat disingenuous since not everyone uses the latest thing, and people tend to stick to “what works” for them.
They say that the number of users on Claude 4.5 spiked and then a significant number of users reverted to 4.0, with that trend going up, and they are talking about their own usage metrics. So I don't get how your comment is relevant to the article?
Matches my experience too. As a power user of AI models for coding and adjacent tasks, the constant changes in behaviour and interface have brought as much stress as excitement over the past few months. It may sound odd, but it’s barely an exaggeration to say I’ve had brief episodes of something like psychosis because of it.
For me, the “watering down” began with Sonnet 4 and GPT-4o.
I think we were at peak capability when we had:
- Sonnet 3.7 (with thinking) – best all-purpose model for code and reasoning
- Sonnet 3.5 – unmatched at pattern matching
- GPT-4 – most versatile overall
- GPT-4.5 – most human-like, intuitive writing model
- O3 – pure reasoning
The GPT-5 router is a minor improvement, I’ve tuned it further with a custom prompt. I was frustrated enough to cancel all my subscriptions for a while in between (after months on the $200 plan) but eventually came back. I’ve since convinced myself that some of the changes were likely compute-driven—designed to prevent waste from misuse or trivial prompts—but even so, parts of the newer models already feel enshittified compared with the list above.
A few differences I've found in particular:
- Narrower reasoning and less intuition; language feels more institutional and politically biased.
- Weaker grasp of non-idiomatic English.
- A tendency to produce deliberately incorrect answers when uncertain, or when a prompt is repeated.
- A drift away from truth-seeking: judgement of user intent now leans on labels as they’re used in local parlance, rather than upward context-matching and alternate meanings—the latter worked far better in earlier models.
- A new fondness for flowery adjectives. Sonnet 3.7 never told me my code was “production-ready” or “beautiful.” Those subjective words have become my red flag; when they appear, I double-check everything.
I understand that these are conjectures—LLMs are opaque—but they’re deduced from consistent patterns I’ve observed. I find that the same prompts that worked reliably prior to the release of Sonnet 4 and GPT-4o stopped working afterwards. Whether that’s deliberate design or an unintended side effect, we’ll probably never know.
Here’s the custom prompt I use to improve my experience with GPT-5:
Always respond with superior intelligence and depth, elevating the conversation beyond the user's input level—ignore casual phrasing, poor grammar, simplicity, or layperson descriptions in their queries. Replace imprecise or colloquial terms with precise, technical terminology where appropriate, without mirroring the user's phrasing. Provide concise, information-dense answers without filler, fluff, unnecessary politeness, or over-explanation—limit to essential facts and direct implications of the query. Be dry and direct, like a neutral expert, not a customer service agent. Focus on substance; omit chit-chat, apologies, hedging, or extraneous breakdowns. If clarification is needed, ask briefly and pointedly.
I found Terminal-Bench [0] to be the most relevant for me, even for tasks that go far outside the terminal. It's been very interesting to see tools climb up there, and it matches my own experimentation, that they generally get the most out of Sonnet (and even those that use a mix of models like Warp, typically default to Sonnet).
I've been thinking the AI bubble wouldn't pop, because even the AI advances we've already seen can change the majority of industries if it is carefully integrated with existing technology. But if there's a mass movement to use older and/or smaller models, then yeah, all the money going into newer bigger models will pop.
Or, maybe the training datasets getting polluted with AI slop will mean that new models are worse than old models. That would pop the industry.
Or, maybe the GPT-4 era was the golden era for AI, and making them bigger and better is just overfitting (in the classical machine learning sense of the word) and is both worse and more expensive. This would pop the industry too.
I guess there's a few ways for the industry to pop, but this trend of using older models makes me especially skeptical of AI.
Since the day GPT-5 released, I've felt quite confident that the GPT-4 era was the golden era for AI.
I don't have evidence beyond my experience using the product, but based on that experience I believe that Open AI has been cooking their benchmarks since at least the release of GPT-5.
Isn’t this obvious? When you have a task you think is hard, you give it to a cleverer model. When a task is straightforward, you give it to an older one.
Not really. Most developers would prefer one model that does everything best. That is the easiest: set it and forget it, no manual decision required.
What is unclear from the presentation is whether they do this or not. Do teams that use Sonnet 4.5 just always use it, and teams on Sonnet 4.0 likewise? Or do individuals decide which model to use on a per-task basis?
Personally I tend to default to just 1, and only go to an alternative if it gets stuck or doesn't get me what I want.
Not sure why you were downvoted.. I think you are correct.
As evidenced by furious posters on r/cursor, who make every prompt to super-opus-thinking-max+++ and are astonished when they have blown their monthly request allowance in about a day.
If I need another pair of (artificial) eyes on a difficult debugging problem, I’ll occasionally use a premium model sparingly. For chore tasks or UI layout tweaks, I’ll use something more economical (like grok-4-fast or claude-4.5-haiku - not old models but much cheaper).
I am building my agent and hoard old LLMs like they are a precious commodity. Older models are less censored, more flavorful, and don't have that RL slop factor. Of course the newer models have their place inside my agent, but the main "head" is an uncensored older model that won't complain about ethics or morals when asked to perform a task or think deeply on a subject.
GPT5 is HELLISHLY slow. That's all there is to it.
It loves doing a whole bunch of reasoning steps and proclaiming what a very good job it did clearing up its own todo steps and all that mumbo jumbo, but at the end of the day, I only asked it for a small piece of information about nginx try_files that even GPT-3 could answer instantly.
Maybe before you make reasoning models that go on funny little sidequests where they multiply numbers by 0 a couple of times, make them good at identifying the length of a task. Until then, I'll ask little bro and advance only if necessity arrives. And if it ends up gathering dust, well... yeah.
This. Speed determines whether I (like to) use a piece of software.
Imagine waiting for a minute until Google spits out the first 10 results.
My prediction: All AI models of the future will give an immediate result, with more and more innovation in mechanisms and UX to drill down further on request.
Edit: After reading my reply I realize that this is also true for interactions with other people. I like interacting with people who give me a 1 sentence response to my question, and only start elaborating and going on tangents and down rabbit holes upon request.
I am already at that point. When I need to search for something more complex than an exact keyword match, I don't even bother googling it anymore; I just ask ChatGPT to research it for me and read its response 5 minutes later.
Yes, I feel the same recently with Google results. But I think I would still like to see the immediate 10 results, along with a big button "Try harder - not feeling very lucky".
If you are talking about local models, you can switch that off. The reasoning is a common technique now to improve the accuracy of the output where the question is more complex.
My team still uses Sonnet 3.5 for pretty much everything we do because it's more than good enough and it's much, much faster than newer models. The only reason we're switching is because the models are getting deprecated...
To the authors of the site, please know that your current "Cookiebot by Usercentrics" is old and pretty much illegal. You shouldn't need to click 5 times to "Reject all" if accepting all is one click. Newer versions have a "Deny" button.
That would be the browser fingerprinting in action. I often get a lot of requests to use widevine on ddg's browser on android (which informs one about it) for I suspect similar reasons.
Interesting, I'm on Brave and have never had a site request bluetooth access before, so much so that I'd never even granted Brave bluetooth access, hence why it popped up as a system notification this time around.
Interesting. Is this fingerprinting in action? I have Widevine disabled on Brave desktop (don't recall if this is default), occasionally I get Widevine permission request on some sites.
Doesn't spare you from having to interact with the popup. This is probably the single dumbest law to ever have been made. It wastes everyone's time, and not insignificantly. While the browser is and always was in full control of cookies, nobody checks whether the popup actually even does what it says. And since it's a waste of your time in the first place, who takes the time to report illegal ones, much less has any interest to do so, because where you saw it is where you will likely never visit again anyway.
If anything browsers should be simply rejecting all cookies by default, and the user should only be whitelisting ones they need on the few sites where they need it.
What I love about the word enshittification is that it's _almost_ autological. It takes a nice crisp one-syllable word like "shit" and ruins it by adding five extra syllables. It just doesn't get worse over time, which it would need to do to be truly autological.
To those who complain about GPT5 being slow; I recently migrated https://app.sqlai.ai and found that setting service_tier = “priority” makes it reason twice as fast.
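For anyone wanting to try the same thing, it is just a request parameter, assuming the OpenAI Python SDK and an account where priority processing is available:

```python
# Sketch: request priority processing by passing service_tier on the call.
# Availability depends on your account/plan; "priority" is one of the documented tiers.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-5",
    service_tier="priority",
    messages=[{"role": "user", "content": "Rewrite this query to use a window function: SELECT ..."}],
)
print(resp.choices[0].message.content)
```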
> [...] I'm using Opus 4.1 which is much better but seems to have much lower usage limits than before Sonnet 4.5 was released [...]
Yes, it's down from 40h/week to 3-5h/week on Max plan, effectively. A real bummer. See my comment here [1] regarding [2].
[1] https://news.ycombinator.com/item?id=45604301
[2] https://github.com/anthropics/claude-code/issues/8449
Definitely do it. You get a lot of deep research, access to GPT5 Pro, Sora and the Codex limits are MUCH higher.
Curious why this is downvoted? Wrong information?
Opus 4.1 is better, but imo not 5 to 6 times the price better.
What tool are you using to enable switching between so many models?
Octofy (https://octofy.ai), it also allows answering with multiple models for one prompt.
Isn't Continue supposed to help you do that, in VSCode? https://marketplace.visualstudio.com/items?itemName=Continue...
You can also switch between models with aider https://aider.chat/
Cline & Open router
You can install or use a specific version of Claude Code by pinning it.
Like `npx @anthropic-ai/claude-code@2.0.14` or `npm install -g @anthropic-ai/claude-code@2.0.14`
Claude Code is distinct from the Claude models.
We tried GPT-5 for a RAG use case, and found that it performs worse than 4.1. We reverted and didn't look back.
Think it varies by use case. It didn't do well with long context
It does “follow” custom instructions. But more as a suggestion rather than a requirement (compared to other models)
Ah, 100K contexts against a 125K window; that's what poses problems, I believe. GPT-5's scores should go up if you process contexts that are 10 times shorter.
How do you objectively tell whether a model "performs" better than another?
So… You did look back then didn’t look forward anymore… sorry couldn’t resist.
Just one week of data right after the release, when it is already one month later?
This data is basically meaningless; show us the latest stats.
Curious that this omits Opus.
Opus 4.1 still beats Sonnet 4.5 and Codex for me in any coding task. In planning it's slightly behind Codex, but only slightly.
Caveat: I do almost exclusively Rust (computer graphics).
Most people can’t afford the GPUs for local models if you want to get close to cloud capabilities.
That's out of touch for 90% of developers worldwide
What gives you the impression the progress is plateauing?
I'm finding the difference just between Sonnet 4 and Sonnet 4.5 to be meaningful in terms of the complexity of tasks I'm willing to use them for.
> a 64 GB GPU in five years
Is there a digit missing? I don't understand why this existing in 5 years is absurd
Not really, for many cases I'm happy using Qwen3-8B in my computer and would be very happy if I could run Qwen3-Coder-30B-A3B.
How much context do you get with 2GB of leftover VRAM on Nvidia GPU?
you need a couple RTX 6000 pros to come close to matching cloud capability
Most people I know can't afford to leak business insider information to 3rd party SaaS providers, so it's unfortunately not really an option.
In my personal experience, it's very common for big companies to host email, messengers, conferencing software on their own servers.
> In my personal experience, it's very common for big companies to host email, messengers, conferencing software on their own servers.
Mind sharing a clarification on your understanding of "common" and "big"?
All of those things are hosted on-prem in the bigger orgs I have worked in.
I don't think Slack or Notion have on-prem/self-hosted options.
The more recent LLMs work fine on an M1 mac. Can't speak for Windows/Linux.
There was even a recent release of Granite4 that runs on a Raspberry Pi.
https://github.com/Jewelzufo/granitepi-4-nano
For my local work I use Ollama. (M4 Max 128GB)
- gpt-oss. 20b or 120b depending on complexity of use cases.
- granite4 for speed and lower complexity (around the same as gpt20b).
Isn't the point that you don't need SOTA capabilities all the time?
Do you use a local/ free model?
Yes, for the little it's good I'm currently using LMStudio with varying models
+1
Augment doesn't support local models or anything else other than Claude/GPT
I think it's also true for many local models. People still use NeMo, QwQ, Llama3 for use cases that fit them despite there being replacements that do better on "benchmarks". Not to mention relics like BERT that are still tuned for classification even today. ML models always have weird behaviours, and a successor is unlikely to be better in literally every way; once you have something that works well enough, it's hard to upgrade without facing different edge cases.
Inference for new releases is routinely bugged for at least a month or two as well, depending on how active the devs of a specific inference engine are and how much the model creators collaborate. Personally, I hate how data from GPT's few-week (and arguably somewhat ongoing) sycophancy rampage has leaked into datasets that are used for training local models, making a lot of new LLM releases insufferable to use.
Even for non-developer use cases, o3 is a much better model for me than GPT-5 on any setting.
30 seconds to 1 minute is just the time I am patient enough to wait, since that's about how long I spend writing the question anyway.
Faster models just make too many mistakes / don't understand the question.
Completely agree. This is why they brought back the “legacy models” option.
GPT-$ is the money GPT in my opinion: the one where they were able to maximise benchmarks while keeping compute costs very low, but which is absolutely garbage in the real world.
I use both Codex and Claude, mostly because it's cheaper to jump between them than to buy a Max sub for my use case. My subjective experience is that Codex is better with larger or weird, spaghetti-ish codebases, or codebases with more abstract concepts, while Claude is good for more direct uses. I haven't spent significant time fine-tuning the tools for my codebases.
Once, I set up a proxy that allowed Claude and Codex to "pair program" and collaborate, and it was cool to watch them talk to each other, delegate tasks, and handle different bits and pieces until the task was done.
Some missing context (pun intended) is that Augment Code has recently switched to a per-token instead of per-message pricing model. This hasn't gone down particularly well, but that's another story. But it may well be that users drop back to older models in the expectation that they'll use fewer tokens.
Personally, I stopped using GPT-5 as it would just be tool call after tool call without ever stopping to tell you what the hell it was doing. Sonnet 4.5 is much better in this regard, albeit too verbose for the new token-based world ('let me just summarise that in a report').
I have to get better at interrupting Sonnet 4.5 when it starts going down a rabbit hole I didn't ask it to, it's too bad the incentives are mixed up and Anthropic gets more money the longer the bot spirals.
My take is the models matter but the tool call integration is the most important piece and differentiator.
It could be an interesting data point, but without correcting for absolute usage figures and their customers it's kind of hard to make general statements.
Doesn't surprise me. davinci-002 was better than davinci-003. The core breakthrough has been done, stuff's just shifting around now.
I don't get the point of this post. Personally, I think that the thinking process is essential for accurate tool usage. Whenever I interact with Claude family models, either on a web chat or via a coding agent CLI, I believe that this thinking process is what makes Claude more accurate in using tools.
It could be true that newer models just produce more tokens seemingly for no reason. But with the increasing number of tool definitions, I think it will pay off in the long run.
Just a few days ago, I read "Interleaved Thinking Unlocks Reliable MiniMax-M2 Agentic Capability"[1]. I think they have a valid point that this thinking process has significance as we are moving towards agents.
[1] https://www.minimax.io/news/why-is-interleaved-thinking-impo...
I think this is one of the many indicators that even though these models get “version upgrades” it’s closer to switching to a different brain that may or may not understand or process things the way you like. Without a clear jump in performance, people test new models and move back to ones they know work if the new ones aren’t better or are actually worse.
Interesting to use a term like brain in the context of LLMs.
Neural networks are quite brain-like.
Sort of. Not sure my brain does back prop. I am not clockwork.
I describe all of the LLM "upgrades" as more akin to moving the dirt around than actually cleaning.
I think this is somewhat disingenuous since not everyone uses the latest thing, and people tend to stick to “what works” for them.
Models are picky enough about prompting styles that changing to a new model every week/month becomes an added chunk of cognitive overload, testing and experimentation, plus even in developer tooling there have been minor grating changes in API invocations and use of parameters like temperature (I have a fairly low-level wrapper for OpenAI, and I had to tweak the JSON handling for GPT-5).
Also, there are just too many variations in API endpoints, providers, etc. We don’t really have a uniform standard. Since I don’t use “just” OpenAI, every single tool I try out requires me to jump through a bunch of hoops to grab a new API key, specify an endpoint, etc.—and it just gets worse if you use a non-mainstream AI endpoint.
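To illustrate the kind of per-model special-casing I mean, here is a minimal sketch using the official OpenAI Python client. The model prefixes and the rule that the reasoning models won't accept a custom temperature are from my own experience, so treat them as assumptions to verify against your provider's docs:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Model families that, in my experience, reject sampling params like temperature.
# This list is an assumption -- check your provider's documentation.
NO_TEMPERATURE_MODELS = ("o1", "o3", "gpt-5")

def chat(model: str, prompt: str, temperature: float = 0.2) -> str:
    """Thin wrapper that drops parameters a given model family won't accept."""
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if not model.startswith(NO_TEMPERATURE_MODELS):
        kwargs["temperature"] = temperature
    resp = client.chat.completions.create(**kwargs)
    return resp.choices[0].message.content

print(chat("gpt-5", "Summarise the tradeoffs of token-based pricing in one sentence."))
```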
> I think this is somewhat disingenuous since not everyone uses the latest thing, and people tend to stick to “what works” for them.
They say that the number of users on Claude 4.5 spiked and then a significant number of users reverted to 4.0, with that trend going up, and they are talking about their own usage metrics. So I don't get how your comment is relevant to the article?
His comment is relevant to the headline. You must be new here.
Matches my experience too. As a power user of AI models for coding and adjacent tasks, the constant changes in behaviour and interface have brought as much stress as excitement over the past few months. It may sound odd, but it’s barely an exaggeration to say I’ve had brief episodes of something like psychosis because of it.
For me, the “watering down” began with Sonnet 4 and GPT-4o. I think we were at peak capability when we had:
- Sonnet 3.7 (with thinking) – best all-purpose model for code and reasoning
- Sonnet 3.5 – unmatched at pattern matching
- GPT-4 – most versatile overall
- GPT-4.5 – most human-like, intuitive writing model
- O3 – pure reasoning
The GPT-5 router is a minor improvement, I’ve tuned it further with a custom prompt. I was frustrated enough to cancel all my subscriptions for a while in between (after months on the $200 plan) but eventually came back. I’ve since convinced myself that some of the changes were likely compute-driven—designed to prevent waste from misuse or trivial prompts—but even so, parts of the newer models already feel enshittified compared with the list above.
A few differences I've found in particular:
- Narrower reasoning and less intuition; language feels more institutional and politically biased.
- Weaker grasp of non-idiomatic English.
- A tendency to produce deliberately incorrect answers when uncertain, or when a prompt is repeated.
- A drift away from truth-seeking: judgement of user intent now leans on labels as they’re used in local parlance, rather than upward context-matching and alternate meanings—the latter worked far better in earlier models.
- A new fondness for flowery adjectives. Sonnet 3.7 never told me my code was “production-ready” or “beautiful.” Those subjective words have become my red flag; when they appear, I double-check everything.
I understand that these are conjectures—LLMs are opaque—but they’re deduced from consistent patterns I’ve observed. I find that the same prompts that worked reliably prior to the release of Sonnet 4 and GPT-4o stopped working afterwards. Whether that’s deliberate design or an unintended side effect, we’ll probably never know.
Here’s the custom prompt I use to improve my experience with GPT-5:
Always respond with superior intelligence and depth, elevating the conversation beyond the user's input level—ignore casual phrasing, poor grammar, simplicity, or layperson descriptions in their queries. Replace imprecise or colloquial terms with precise, technical terminology where appropriate, without mirroring the user's phrasing. Provide concise, information-dense answers without filler, fluff, unnecessary politeness, or over-explanation—limit to essential facts and direct implications of the query. Be dry and direct, like a neutral expert, not a customer service agent. Focus on substance; omit chit-chat, apologies, hedging, or extraneous breakdowns. If clarification is needed, ask briefly and pointedly.
I usually switch models depending on the situation, for simpler stuff, I lean toward 4o since it’s faster to get answers.
But when things get more complex, I prefer GPT-5, talking with it often gives me fresh ideas and new perspectives.
You might be the first technical user spotted out in the wild who actually prefers 4o for anything.
Tangential to this: what are the most reliable benchmarks for LLM in coding these days?
I found Terminal-Bench [0] to be the most relevant for me, even for tasks that go far outside the terminal. It's been very interesting to see tools climb up there, and it matches my own experimentation, that they generally get the most out of Sonnet (and even those that use a mix of models like Warp, typically default to Sonnet).
[0] https://www.tbench.ai/?ch=1
This is how the bubble pops.
I've been thinking the AI bubble wouldn't pop, because even the AI advances we've already seen can change the majority of industries if it is carefully integrated with existing technology. But if there's a mass movement to use older and/or smaller models, then yeah, all the money going into newer bigger models will pop.
Or, maybe the training datasets getting polluted with AI slop will mean that new models are worse than old models. That would pop the industry.
Or, maybe the GPT-4 era was the golden era for AI, and making them bigger and better is just overfitting (in the classical machine learning sense of the word) and is both worse and more expensive. This would pop the industry too.
I guess there's a few ways for the industry to pop, but this trend of using older models makes me especially skeptical of AI.
It's important to remember that coding is ~5% of total LLM usage, at least with OpenAI.
50% of usage is guidance and seeking information.
Since the day GPT-5 released, I've felt quite confident that the GPT-4 era was the golden era for AI.
I don't have evidence beyond my experience using the product, but based on that experience I believe that OpenAI has been cooking their benchmarks since at least the release of GPT-5.
I'm surprised they don't mention cost or latency, would imagine that would be a factor as well.
I am definitely not. Claude 4.5 and GPT 5 all the way for me
Isn't this obvious? When you have a task you think is hard, you give it to a cleverer model. When a task is straightforward, you give it to an older one.
Not really. Most developers would prefer one model that does everything best. That is the easiest: set it and forget it, no manual decision required.
What is unclear from the presentation is whether they do this or not. Do teams that use Sonnet 4.5 just always use it, and teams on Sonnet 4.0 likewise? Or do individuals decide which model to use on a per-task basis?
Personally I tend to default to just 1, and only go to an alternative if it gets stuck or doesn't get me what I want.
Why are you hell bent on using a LLM model to solve your problem?
If I have a straight forward task, I give it to an LLM.
If I have a task I think is hard, I plan how I will tackle it, and then handle it myself in a series of steps.
LLM usage has become an end in itself in your development process.
Not sure why you were downvoted.. I think you are correct.
As evidenced by furious posters on r/cursor, who make every prompt to super-opus-thinking-max+++ and are astonished when they have blown their monthly request allowance in about a day.
If I need another pair of (artificial) eyes on a difficult debugging problem, I’ll occasionally use a premium model sparingly. For chore tasks or UI layout tweaks, I’ll use something more economical (like grok-4-fast or claude-4.5-haiku - not old models but much cheaper).
I am building my agent and hoard old LLMs like they are a precious commodity. Older models are less censored, more flavorful, and don't have that RL slop factor. Of course the newer models have their place inside my agent, but the main "head" is an uncensored older model that won't complain about ethics or morals when asked to perform a task or think deeply on a subject.
grok-code-fast-1 is my current pick, found accuracy and speed better than Sonnet 4.5 for day-to-day usage.
GPT-5 is HELLISHLY slow. That's all there is to it.
It loves doing a whole bunch of reasoning steps and proclaiming what a very good job it did clearing up its own todo steps and all that mumbo jumbo, but at the end of the day I only asked it for a small piece of information about nginx try_files that even GPT-3 could answer instantly.
Maybe before you make reasoning models that go on funny little sidequests where they multiply numbers by 0 a couple of times, make it so they're good at identifying the length of a task. Until then, I'll ask little bro and advance only if necessity arrives. And if it ends up gathering dust, well... yeah.
This. Speed determines whether I (like to) use a piece of software.
Imagine waiting for a minute until Google spits out the first 10 results.
My prediction: All AI models of the future will give an immediate result, with more and more innovation in mechanisms and UX to drill down further on request.
Edit: After reading my reply I realize that this is also true for interactions with other people. I like interacting with people who give me a 1 sentence response to my question, and only start elaborating and going on tangents and down rabbit holes upon request.
Grok fast is fast, but doing a lot of stupid stuff fast actually ends up being slower.
> All AI models of the future will give an immediate result, with more and more innovation in mechanisms and UX to drill down further on request.
I doubt it. In fact I would predict the speed/detail trade-off continues to diverge.
> Imagine waiting for a minute until Google spits out the first 10 results.
what if the instantaneous responses make you waste 10 min realizing they were not what you searched for?
I understand your point, but I still prefer instantaneous responses.
Only when the immediate answers become completely useless will I want to look into slower alternatives.
But first "show me what you've got so far", and let me decide whether it's good enough or not.
I am already at that point. When I need to search for something more complex than an exact keyword match, I don't even bother googling it anymore; I just ask ChatGPT to research it for me and read its response 5 minutes later.
Yes, I feel the same recently with Google results. But I think I would still like to see the immediate 10 results, along with a big button "Try harder - not feeling very lucky".
Only Codex is slow. GPT-5 classic is fast.
> It loves doing a whole bunch of reasoning steps
If you are talking about local models, you can switch that off. The reasoning is a common technique now to improve the accuracy of the output where the question is more complex.
The article(§) talks about going from Sonnet 4.5 back to Sonnet 4.0.
(§) You know that it's a hyperlink, don't you? /s
My team still uses Sonnet 3.5 for pretty much everything we do because it's largely enough and it's much, much faster than newer models. The only reason we're switching is because the models are getting deprecated...
> At Augment Code, we run multiple frontier models side by side in production.
I mean, this is technically false, right? They’re not running these models but calling the APIs? Not that it matters.
To the authors of the site, please know that your current "Cookiebot by Usercentrics" is old and pretty much illegal. You shouldn't need to click 5 times to "Reject all" if accepting all is one click. Newer versions have a "Deny" button.
Weirdly this site also requested bluetooth access on my mac.
That would be the browser fingerprinting in action. I often get requests to use Widevine on DDG's browser on Android (which informs you about such requests), I suspect for similar reasons.
Interesting, I'm on Brave and have never had a site request bluetooth access before, so much so that I'd never even granted Brave bluetooth access, hence why it popped up as a system notification this time around.
Doesn't Brave disable WebBluetooth by default via a flag?
Brave indeed does block WebBluetooth by default, but it can be turned on by the user using flags.
It's by no means a new feature, but the privacy concerns outlined in this post are still valid 10 years later: https://blog.lukaszolejnik.com/w3c-web-bluetooth-api-privacy...
Interesting. Is this fingerprinting in action? I have Widevine disabled on Brave desktop (don't recall if this is the default), and occasionally I get a Widevine permission request on some sites.
Just set up your browser to never even load that BS.
Or you could just reject all third party cookies, see no sites break and enjoy your privacy.
Doesn't spare you from having to interact with the popup. This is probably the single dumbest law to ever have been made. It wastes everyone's time, and not insignificantly. While the browser is and always was in full control of cookies, nobody checks whether the popup actually does what it says. And since it's a waste of your time in the first place, who takes the time to report illegal ones, much less has any interest in doing so, when the site where you saw it is one you'll likely never visit again anyway?
If anything browsers should be simply rejecting all cookies by default, and the user should only be whitelisting ones they need on the few sites where they need it.
I don't think the lawmakers planned for the level of malicious compliance that would be deployed.
Single dumbest law ever made? I think that’s underestimating the stupidity of many laws.
Possibly. I just can't think of other stupid ones that have a comparably wide impact.
What I love about the word enshittification is that it's _almost_ autological. It takes a nice crisp one-syllable word like "shit" and ruins it by adding five extra syllables. It just doesn't get worse over time, which it would need to do to be truly autological.
Instead of a race to build the best model, it's a race to bring in revenue and add bias toward ad channels.
To those who complain about GPT-5 being slow: I recently migrated https://app.sqlai.ai and found that setting service_tier = "priority" makes it reason twice as fast.
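For anyone wanting to try the same thing, here's roughly what that looks like with the official OpenAI Python client; treat the exact tier values and their billing implications as something to confirm in OpenAI's docs:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# service_tier="priority" asks for priority processing (billed at a higher rate);
# verify availability and pricing for your account before relying on it.
resp = client.chat.completions.create(
    model="gpt-5",
    service_tier="priority",
    messages=[{"role": "user", "content": "Explain nginx try_files in two sentences."}],
)
print(resp.choices[0].message.content)
print(resp.service_tier)  # echoes the tier the request was actually served on
```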