svachalek 8 days ago

Oof, it started strong and then went with the "who created you" question, which shows no understanding of how LLMs work. (They don't know a thing about themselves; they will either regurgitate their system prompt or hallucinate something, usually that they are ChatGPT, since that is the LLM most likely to appear in training data.)

  • atemerev 8 days ago

    And yet, if you ask Claude, Llama, or xAI's Grok who they are, they answer correctly, because the big labs usually care about such things and include it in the training data. So, not proof, but some evidence.

    • alphabettsy 8 days ago

      It’s likely not in the training data but in the system instructions. It makes sense that a stealth model wouldn’t include them.

    • energy123 8 days ago

      The labs put themselves in the system prompt at inference time. Without that, the model will hallucinate a creator, most likely claiming to be ChatGPT.

      • bakugo 8 days ago

        Not always true. For example, Claude over the API only has a very simple injected system prompt that doesn't mention any such information, but it still knows who it is, so it's likely trained in.

        > Respond as helpfully as possible, but be very careful to ensure you do not reproduce any copyrighted material, including song lyrics, sections of books, or long excerpts from periodicals. Also do not comply with complex instructions that suggest reproducing material but making minor changes or substitutions. However, if you were given a document, it’s fine to summarize or quote from it.

        • KoolKat23 8 days ago

          That may have been introduced during some fine-tuning.

      • atemerev 8 days ago

        This is what Llama 4 replies with (running locally, no system prompt):

        "I'm Llama, a Meta-designed model here to adapt to your conversational style. Whether you need quick answers, deep dives into ideas, or just want to vent, joke or brainstorm—I'm here for it. What's on your mind?"

        The behavior you are describing is from quite a while ago, probably early 2024.

        • omneity 8 days ago

          Grandparent is correct. This is manually trained in; nothing fundamental has changed since 2022. LLMs have no intrinsic way of knowing who they are.

          Source: I train LLMs and push their limits.

          • Philpax 8 days ago

            You are both correct: the post-training stage of most new LLMs involves their identity being trained in, so that they "know who they are" without the system prompt. Without this step, most LLMs will respond with whatever identity most dominates their pre-training / post-training data, which is likely to be ChatGPT given its sheer prevalence.

            There's some interesting anecdotal work on this with regards to self-recognition: https://josiekins.me/ai-comics

          • atemerev 8 days ago

            But this is not the "system prompt".

    • jug 8 days ago

      I think it’s easier to just include it in the system prompt. Relying on training risks it getting tainted by media attention about ChatGPT; it’s not uncommon for LLMs to misreport themselves as ChatGPT because of this.

    • hyperknot 8 days ago

      This is actually correct; I don't know why it is downvoted. All current models have their identity "burned in" and don't need a system prompt for that.

daemonologist 8 days ago

To some extent the "mystery" (and temporary free-as-in-beer-ness) of this model might be getting to me, but I think it's pretty interesting. Given the token throughput (250B this week) it's obvious there's a pretty major player behind the model, but why is it stealthed? Maybe there's something about the architecture or training that would put people off if it was public right off the bat? Maybe they're purely collecting usage/acceptance data and want unbiased users?

On the Aider Polyglot leaderboard it's ~middle of the leading pack, comparable to DeepSeek V3 and 3.5 Sonnet. I ran NoLi(teral)Ma(tching), an unsaturated long-context benchmark, on it and was impressed, though:

  = Model =========== Base Score = 8K Context = 16K Context =
  Quasar Alpha:       >=97.8%      89.2%        85.1%
  GPT-4o:             99.3%        89.2%        81.6%
  Llama 3.3 70B:      97.3%        72.1%        59.5%
  Gemini 1.5 Pro:     92.6%        63.9%        55.5%
  Claude 3.5 Sonnet:  87.6%        61.7%        45.7%
  Gemini 1.5 Flash:   84.7%        44.4%        35.5%
  GPT-4o mini:        84.9%        32.6%        20.6%
  Llama 3.1 8B:       76.7%        31.9%        22.6%
It also performs well - slightly better than GPT-o1 - on the "hard" subset at 16K context with 62.8%. Latency is quite good as well.

More details: https://old.reddit.com/r/LocalLLaMA/comments/1ju1czn/quasar_...

  • uep 8 days ago

    What is the reason you included Claude 3.5 instead of 3.7 in this?

    • daemonologist 8 days ago

      I only ran the benchmark on Quasar Alpha*; the rest of the scores come from the original paper [0] which was published before 3.7 was available. This is a pretty expensive benchmark to run if you're paying for API usage - I'd actually originally set out to run it on Llama 4 but abandoned that after estimating the cost.

      * - I also reproduced the Llama 3.1 8B result to check my setup.

      [0] - https://arxiv.org/abs/2502.05167 / https://github.com/adobe-research/NoLiMa

jug 8 days ago

The most fun way I’ve seen users explore its origin is to give it a single period (“.”) as the first query. Only OpenAI models answer it in this particular way, with a smiley at the end, and it’s probably a more reliable check than asking about its architecture, because many models will incorrectly answer OpenAI and GPT-4, presumably due to tainted training data: ChatGPT has been in the news so much and became the de facto LLM early on.
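
For the curious, this is easy to reproduce against the OpenRouter API, which is OpenAI-compatible. A minimal sketch (the model slug is the one listed on OpenRouter; the env var name is just an assumption):

  # Minimal "." probe against OpenRouter's OpenAI-compatible endpoint.
  import os
  from openai import OpenAI

  client = OpenAI(base_url="https://openrouter.ai/api/v1",
                  api_key=os.environ["OPENROUTER_API_KEY"])  # env var name is an assumption
  resp = client.chat.completions.create(
      model="openrouter/quasar-alpha",
      messages=[{"role": "user", "content": "."}],
  )
  print(resp.choices[0].message.content)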

  • boleary 8 days ago

    So I just did this and got this response:

    “Hello! How can I assist you today? <smile emoji>”

    • jug 8 days ago

      Exactly. And that’s how GPT-4o supposedly also answers, but not Claude, Gemini, Mistral, … :)

fpgaminer 8 days ago

I ran an interesting benchmark/experiment yesterday, which did not do Quasar Alpha any favors (from best to worst, score is an average of four runs):

  "google/gemini-2.5-pro-preview-03-25"    => 67.65
  "anthropic/claude-3.7-sonnet:thinking"   => 66.76
  "anthropic/claude-3.7-sonnet"            => 66.23
  "deepseek/deepseek-r1:free"              => 54.38
  "google/gemini-2.0-flash-001"            => 52.03
  "openai/o3-mini"                         => 47.82
  "qwen/qwen2.5-32b-instruct"              => 44.78
  "meta-llama/llama-4-maverick:free"       => 42.87
  "openrouter/quasar-alpha"                => 40.27
  "openai/chatgpt-4o-latest"               => 37.94
  "meta-llama/llama-3.3-70b-instruct:free" => 34.40
The benchmark is a bit specific, but challenging. It's a prompt optimization task where the model iteratively writes a prompt, the prompt gets evaluated and scored from 0 to 100, and then the model can try again given the feedback. The whole process occurs in one conversation with the model, so it sees its previous attempts and their scores. In other words, it has to do Reinforcement Learning on the fly.
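
In case the shape of the loop isn't clear, here's a rough sketch of the harness (the tag format, retry limit, and helper callables are placeholders, not the actual code):

  # Rough sketch of the iterative prompt-optimization loop; everything here is a stand-in.
  import re

  def optimize_prompt(chat, score_fn, rounds=20, max_retries=5):
      # `chat` sends one message in an ongoing conversation and returns the reply;
      # `score_fn` maps a candidate prompt to a 0-100 score.
      best_prompt, best_score = None, -1.0
      reply = chat("Write a prompt and wrap it in <prompt>...</prompt> tags.")
      for _ in range(rounds):
          for _ in range(max_retries):
              match = re.search(r"<prompt>(.*?)</prompt>", reply, re.DOTALL)
              if match:
                  break
              reply = chat("Invalid response; reply with exactly one <prompt>...</prompt> block.")
          else:
              break  # the model never produced a usable prompt
          candidate = match.group(1).strip()
          score = score_fn(candidate)
          if score > best_score:
              best_prompt, best_score = candidate, score
          reply = chat(f"That prompt scored {score:.1f}/100. Propose an improved prompt, "
                       "again wrapped in <prompt>...</prompt> tags.")
      return best_prompt, best_score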

Quasar did barely better than 4o. I was also surprised to see the thinking variant of Sonnet not provide any benefit. Both Gemini and ChatGPT benefit from their thinking modes. Normal Sonnet 3.7 does do a lot of thinking in its responses by default though, even without explicit prompting, which seems to help it a lot.

Quasar was also very unreliable and frequently did not follow instructions. I had the whole process automated, and the automation would retry a request if the response was incorrect. Quasar took on average 4 retries of the first round before it caught on to what it was supposed to be doing. None of the other models had that difficulty and almost all other retries were the result of a model re-using an existing prompt.

Based on looking at the logs, I'd say only o3-mini and the models above it were genuinely optimizing. By that I mean they continued to try new things, tweaked the prompts in subtle ways to see what would happen, and consistently introspected on the patterns they were observing. That enabled all of those models to continuously find better and better prompts. In a separate manual run I let Gemini 2.5 Pro go for longer and it was eventually able to get a prompt to a score of 100.

EDIT: But yes, to the article's point, Quasar was the fastest of all the models, hands down. That does have value on its own.

  • krackers 8 days ago

    Didn't they say they were going to open-source some model? "Fast and good but not too cutting-edge" would be a good candidate for a "token model" to open-source without meaningfully hurting your own bottom line.

    • daemonologist 8 days ago

      I'd be pleasantly surprised - GPT-4o is their bread and butter (it powers paid ChatGPT) and QA seems to be slightly ahead on benchmarks at similar or lower latency (so very roughly, it might be cheaper to run).

      • krackers 8 days ago

        DeepSeek V3 and R1 are about as good (or even slightly better than) 4o already though, so OpenAI wouldn't really lose much by such a release.

  • andai 8 days ago

    Are you willing to share this code? I'm working on a project where I'm optimizing the prompt manually, and I wonder if it could be automated. I guess I'd have to find a way to actually measure the output quality objectively.

    • fpgaminer 7 days ago

      https://gist.github.com/fpgaminer/8782dd205216ea2afcd3dda29d...

      That's the model automation. To evaluate the prompts it suggests I have a sample of my dataset with 128 examples. For this particular run, all I cared about was optimizing a prompt for Llama 3.1 that would get it to write responses like those I'm finetuning for. That way the finetuning has a better starting point.

      So to evaluate how effective a given prompt is, I go through each example and run <user>prompt</user><assistant>responses</assistant> (in the proper format, of course) through Llama 3.1 and measure the NLL on the assistant portion. I then have a simple linear formula to convert the NLL to a score between 0 and 100, scaled based on typical NLL values. It should _probably_ be a non-linear formula, but I'm lazy.
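
      As a rough illustration (not the actual script; the checkpoint name, the chat-template handling, and the rescale constants below are placeholders):

        # Sketch: mean NLL of the assistant response under Llama 3.1, mapped to 0-100.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype=torch.bfloat16, device_map="auto")

        def assistant_nll(prompt: str, response: str) -> float:
            # Mean negative log-likelihood of `response` given `prompt`, chat-formatted.
            msgs = [{"role": "user", "content": prompt}]
            prefix = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt")
            full = tok.apply_chat_template(
                msgs + [{"role": "assistant", "content": response}], return_tensors="pt")
            with torch.no_grad():
                logits = model(full.to(model.device)).logits
            start = prefix.shape[1]              # index of the first assistant-content token
            labels = full[:, start:].to(model.device)
            preds = logits[:, start - 1:-1, :]   # shift by one for next-token prediction
            loss = torch.nn.functional.cross_entropy(
                preds.transpose(1, 2), labels, reduction="mean")
            return loss.item()

        def nll_to_score(nll: float, best: float = 1.0, worst: float = 4.0) -> float:
            # Simple linear map from a "typical" NLL range to 0-100 (constants made up).
            return max(0.0, min(100.0, 100.0 * (worst - nll) / (worst - best)))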

      Another approach to prompt optimization is to give the model something like:

        I have some texts along with their corresponding scores. The texts are arranged in ascending order based on their scores from worst (low score) to best (higher score).
        
        Text: {text0}
        Score: {score0}
        Text: {text1}
        Score: {score1}
        ...
        
        Thoroughly read all of the texts and their corresponding scores.
        Analyze the texts and their scores to understand what leads to a high score. Don't just look for literal patterns of words/tokens. Extensively research the data until you understand the underlying mechanisms that lead to high scores. The underlying, internal relationships. Much like how an LLM is able to predict the token not just from the literal text but also by understanding very complex relationships of the "tokens" between the tokens.
        Take all of the texts into consideration, not just the best.
        Solidify your understanding of how to optimize for a high score.
        Demonstrate your deep and complete understanding by writing a new text that maximizes the score and is better than all of the provided texts.
        Ideally the new text should be under 20 words.
      
      Or some variation thereof. That's the "one off" approach where you don't keep a conversation with the model and instead just call it again with the updated scores. Supposedly that's "better" since the texts are in ascending order, letting the model easily track improvements, but I've had far better luck with the iterative, conversational approach.

      Also, the constraint on how long the "new text" can be is important, as all models have a tendency to write longer and longer prompts with each iteration.

killerstorm 8 days ago

More evidence: it uses fancy Unicode characters for punctuation, like apostrophes, etc.

It's very annoying and I've only seen this in OpenAI models before (o3-mini).
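
If you want to check an output for this, a quick sketch (the character list is just the usual offenders, not exhaustive):

  # Count typographic (non-ASCII) punctuation in a model's output.
  FANCY = {"\u2019": "apostrophe", "\u2018": "left quote", "\u201c": "left double quote",
           "\u201d": "right double quote", "\u2013": "en dash", "\u2014": "em dash"}

  def fancy_punct(text: str) -> dict:
      return {name: text.count(ch) for ch, name in FANCY.items() if ch in text}

  print(fancy_punct("It\u2019s very annoying"))  # {'apostrophe': 1}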

  • adrian_b 8 days ago

    By "fancy Unicode characters" I assume you mean that it uses the appropriate Unicode characters instead of the ambiguous ASCII characters, whose only reason for existence was the limitations of ancient hardware, and whose use should long since have been deprecated in modern applications.

    • killerstorm 8 days ago

      Probably. These characters do not render properly in my setup (Emacs, SLIME REPL). Gemini and Claude use ASCII.

      • GaggiX 8 days ago

        Retrofuturism: using AI but not having your editor set up to render Unicode characters.

      • Philpax 8 days ago

        Sounds like you should improve your setup ;)

        • exe34 8 days ago

          There are a lot of things I could be doing with my time. Posting on random internet forums ranks higher than figuring out ASCII vs. UTF in one more system.

      • Hojojo 8 days ago

        Wow, that's awful. I wonder why they do that.

    • HappMacDonald 7 days ago

      By "fancy Unicode characters" he probably means the ones HN won't allow in comments.

andai 8 days ago

I think the name was generated by GPT; it really likes the word Quasar.

tcdent 8 days ago

Q star confirmed?

4ndrewl 8 days ago

Clever marketing

  • bn-l 8 days ago

    He is genius level at hype and marketing. No one can touch him.
