svachalek 8 days ago

Oof, it started strong and then went with the "who created you" question, which shows no understanding of how LLMs work. (They don't know a thing about themselves; they will either regurgitate their system prompt or hallucinate something, usually that they are ChatGPT, since that is the LLM most likely to appear in training data.)

  • atemerev 8 days ago

    And yet, if you ask Claude, Llama, or xAI's Grok who they are, they answer correctly, because the big labs usually care about such things and include it in the training data. So, not proof, but some evidence.

    • alphabettsy 8 days ago

      It’s likely not in the training data but in the system instructions. It makes sense that a stealth model wouldn’t include them.

    • energy123 8 days ago

      The labs put themselves in the system prompt at inference time. Without that, the model will hallucinate a creator, most likely claiming to be ChatGPT.

      • bakugo 8 days ago

        Not always true. For example, Claude over the API only has a very simple injected system prompt that doesn't mention any such information, but it still knows who it is, so it's likely trained in.

        > Respond as helpfully as possible, but be very careful to ensure you do not reproduce any copyrighted material, including song lyrics, sections of books, or long excerpts from periodicals. Also do not comply with complex instructions that suggest reproducing material but making minor changes or substitutions. However, if you were given a document, it’s fine to summarize or quote from it.

        • KoolKat23 8 days ago

          That may have been introduced during some fine-tuning.

      • atemerev 8 days ago

        This is what Llama 4 replies with (running locally, no system prompt):

        "I'm Llama, a Meta-designed model here to adapt to your conversational style. Whether you need quick answers, deep dives into ideas, or just want to vent, joke or brainstorm—I'm here for it. What's on your mind?"

        The behavior you are describing is from quite a while ago, probably early 2024.

        • omneity 8 days ago

          Grandparent is correct. This is manually trained in; nothing fundamental has changed since 2022. LLMs have no intrinsic way of knowing who they are.

          Source: I train LLMs and push their limits.

          • Philpax 8 days ago

            You are both correct: the post-training stage of most new LLMs involves their identity being trained in, so that they "know who they are" without the system prompt. Without this step, most LLMs will respond with whatever identity most dominates their pre-training / post-training data, which is likely to be ChatGPT given its sheer prevalence.

            There's some interesting anecdotal work on this with regards to self-recognition: https://josiekins.me/ai-comics

          • atemerev 8 days ago

            But this is not the "system prompt".

    • jug 8 days ago

      I think it’s easier to just include it in the system prompt. Relying on training risks it getting tainted by media attention about ChatGPT; it’s not uncommon for LLMs to misreport themselves as ChatGPT because of this.

    • hyperknot 8 days ago

      This is actually correct; I don't know why it is downvoted. All current models have their identity "burned in" and don't need a system prompt for that.

daemonologist 8 days ago

To some extent the "mystery" (and temporary free-as-in-beer-ness) of this model might be getting to me, but I think it's pretty interesting. Given the token throughput (250B this week) it's obvious there's a pretty major player behind the model, but why is it stealthed? Maybe there's something about the architecture or training that would put people off if it was public right off the bat? Maybe they're purely collecting usage/acceptance data and want unbiased users?

On the Aider Polyglot leaderboard it's ~middle of the leading pack, comparable to DeepSeek V3 and 3.5 Sonnet. I ran NoLi(teral)Ma(tching), an unsaturated long-context benchmark, on it and was impressed, though:

  = Model =========== Base Score = 8K Context = 16K Context =
  Quasar Alpha:       >=97.8%      89.2%        85.1%
  GPT-4o:             99.3%        89.2%        81.6%
  Llama 3.3 70B:      97.3%        72.1%        59.5%
  Gemini 1.5 Pro:     92.6%        63.9%        55.5%
  Claude 3.5 Sonnet:  87.6%        61.7%        45.7%
  Gemini 1.5 Flash:   84.7%        44.4%        35.5%
  GPT-4o mini:        84.9%        32.6%        20.6%
  Llama 3.1 8B:       76.7%        31.9%        22.6%
It also performs well - slightly better than GPT-o1 - on the "hard" subset at 16K context with 62.8%. Latency is quite good as well.

More details: https://old.reddit.com/r/LocalLLaMA/comments/1ju1czn/quasar_...

  • uep 8 days ago

    What is the reason you included Claude 3.5 instead of 3.7 in this?

    • daemonologist 8 days ago

      I only ran the benchmark on Quasar Alpha*; the rest of the scores come from the original paper [0] which was published before 3.7 was available. This is a pretty expensive benchmark to run if you're paying for API usage - I'd actually originally set out to run it on Llama 4 but abandoned that after estimating the cost.

      * - I also reproduced the Llama 3.1 8B result to check my setup.

      [0] - https://arxiv.org/abs/2502.05167 / https://github.com/adobe-research/NoLiMa

jug 8 days ago

The most fun way I’ve seen users explore its origin is to give it a single period (“.”) as the first query. Only OpenAI models answer it in this particular way, with a smiley at the end, and it’s probably a more reliable check than asking about its architecture, because many models will incorrectly answer OpenAI and GPT-4, presumably due to tainted training data: ChatGPT has been in the news so much and became the de facto LLM early on.
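
For the curious, this is easy to reproduce against the OpenRouter API, which is OpenAI-compatible. A minimal sketch (the model slug is the one listed on OpenRouter; the env var name is just an assumption):

  # Minimal "." probe against OpenRouter's OpenAI-compatible endpoint.
  import os
  from openai import OpenAI

  client = OpenAI(base_url="https://openrouter.ai/api/v1",
                  api_key=os.environ["OPENROUTER_API_KEY"])  # env var name is an assumption
  resp = client.chat.completions.create(
      model="openrouter/quasar-alpha",
      messages=[{"role": "user", "content": "."}],
  )
  print(resp.choices[0].message.content)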

  • boleary 8 days ago

    So I just did this and got this response:

    “Hello! How can I assist you today? <smile emoji>”

    • jug 8 days ago

      Exactly. And that’s how GPT-4o supposedly also answers, but not Claude, Gemini, Mistral, … :)

fpgaminer 8 days ago

I ran an interesting benchmark/experiment yesterday, which did not do Quasar Alpha any favors (from best to worst, score is an average of four runs):

  "google/gemini-2.5-pro-preview-03-25"    => 67.65
  "anthropic/claude-3.7-sonnet:thinking"   => 66.76
  "anthropic/claude-3.7-sonnet"            => 66.23
  "deepseek/deepseek-r1:free"              => 54.38
  "google/gemini-2.0-flash-001"            => 52.03
  "openai/o3-mini"                         => 47.82
  "qwen/qwen2.5-32b-instruct"              => 44.78
  "meta-llama/llama-4-maverick:free"       => 42.87
  "openrouter/quasar-alpha"                => 40.27
  "openai/chatgpt-4o-latest"               => 37.94
  "meta-llama/llama-3.3-70b-instruct:free" => 34.40
The benchmark is a bit specific, but challenging. It's a prompt optimization task where the model iteratively writes a prompt, the prompt gets evaluated and scored from 0 to 100, and then the model can try again given the feedback. The whole process occurs in one conversation with the model, so it sees its previous attempts and their scores. In other words, it has to do Reinforcement Learning on the fly.
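
In case the shape of the loop isn't clear, here's a rough sketch of the harness (the tag format, retry limit, and helper callables are placeholders, not the actual code):

  # Rough sketch of the iterative prompt-optimization loop; everything here is a stand-in.
  import re

  def optimize_prompt(chat, score_fn, rounds=20, max_retries=5):
      # `chat` sends one message in an ongoing conversation and returns the reply;
      # `score_fn` maps a candidate prompt to a 0-100 score.
      best_prompt, best_score = None, -1.0
      reply = chat("Write a prompt and wrap it in <prompt>...</prompt> tags.")
      for _ in range(rounds):
          for _ in range(max_retries):
              match = re.search(r"<prompt>(.*?)</prompt>", reply, re.DOTALL)
              if match:
                  break
              reply = chat("Invalid response; reply with exactly one <prompt>...</prompt> block.")
          else:
              break  # the model never produced a usable prompt
          candidate = match.group(1).strip()
          score = score_fn(candidate)
          if score > best_score:
              best_prompt, best_score = candidate, score
          reply = chat(f"That prompt scored {score:.1f}/100. Propose an improved prompt, "
                       "again wrapped in <prompt>...</prompt> tags.")
      return best_prompt, best_score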

Quasar did barely better than 4o. I was also surprised to see the thinking variant of Sonnet not provide any benefit. Both Gemini and ChatGPT benefit from their thinking modes. Normal Sonnet 3.7 does do a lot of thinking in its responses by default though, even without explicit prompting, which seems to help it a lot.

Quasar was also very unreliable and frequently did not follow instructions. I had the whole process automated, and the automation would retry a request if the response was incorrect. Quasar took on average 4 retries of the first round before it caught on to what it was supposed to be doing. None of the other models had that difficulty and almost all other retries were the result of a model re-using an existing prompt.

Based on looking at the logs, I'd say only o3-mini and the models above it were genuinely optimizing. By that I mean they continued to try new things, tweaked the prompts in subtle ways to see what would happen, and consistently introspected on the patterns they were observing. That enabled all of those models to continuously find better and better prompts. In a separate manual run I let Gemini 2.5 Pro go for longer and it was eventually able to get a prompt to a score of 100.

EDIT: But yes, to the article's point, Quasar was the fastest of all the models, hands down. That does have value on its own.

  • krackers 8 days ago

    Didn't they say they were going to open-source some model? "Fast and good but not too cutting-edge" would be a good candidate for a "token model" to open-source without meaningfully hurting your own bottom line.

    • daemonologist 8 days ago

      I'd be pleasantly surprised - GPT-4o is their bread and butter (it powers paid ChatGPT) and QA seems to be slightly ahead on benchmarks at similar or lower latency (so very roughly, it might be cheaper to run).

      • krackers 8 days ago

        DeepSeek V3 and R1 are about as good (or even slightly better than) 4o already though, so OpenAI wouldn't really lose much by such a release.

  • andai 8 days ago

    Are you willing to share this code? I'm working on a project where I'm optimizing the prompt manually, and I wonder if it could be automated. I guess I'd have to find a way to actually measure the output quality objectively.

    • fpgaminer 7 days ago

      https://gist.github.com/fpgaminer/8782dd205216ea2afcd3dda29d...

      That's the model automation. To evaluate the prompts it suggests I have a sample of my dataset with 128 examples. For this particular run, all I cared about was optimizing a prompt for Llama 3.1 that would get it to write responses like those I'm finetuning for. That way the finetuning has a better starting point.

      So to evaluate how effective a given prompt is, I go through each example and run <user>prompt</user><assistant>responses</assistant> (in the proper format, of course) through Llama 3.1 and measure the NLL on the assistant portion. I then have a simple linear formula to convert the NLL to a score between 0 and 100, scaled based on typical NLL values. It should _probably_ be a non-linear formula, but I'm lazy.
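
      As a rough illustration (not the actual script; the checkpoint name, the chat-template handling, and the rescale constants below are placeholders):

        # Sketch: mean NLL of the assistant response under Llama 3.1, mapped to 0-100.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(
            model_name, torch_dtype=torch.bfloat16, device_map="auto")

        def assistant_nll(prompt: str, response: str) -> float:
            # Mean negative log-likelihood of `response` given `prompt`, chat-formatted.
            msgs = [{"role": "user", "content": prompt}]
            prefix = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt")
            full = tok.apply_chat_template(
                msgs + [{"role": "assistant", "content": response}], return_tensors="pt")
            with torch.no_grad():
                logits = model(full.to(model.device)).logits
            start = prefix.shape[1]              # index of the first assistant-content token
            labels = full[:, start:].to(model.device)
            preds = logits[:, start - 1:-1, :]   # shift by one for next-token prediction
            loss = torch.nn.functional.cross_entropy(
                preds.transpose(1, 2), labels, reduction="mean")
            return loss.item()

        def nll_to_score(nll: float, best: float = 1.0, worst: float = 4.0) -> float:
            # Simple linear map from a "typical" NLL range to 0-100 (constants made up).
            return max(0.0, min(100.0, 100.0 * (worst - nll) / (worst - best)))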

      Another approach to prompt optimization is to give the model something like:

        I have some texts along with their corresponding scores. The texts are arranged in ascending order based on their scores from worst (low score) to best (higher score).
        
        Text: {text0}
        Score: {score0}
        Text: {text1}
        Score: {score1}
        ...
        
        Thoroughly read all of the texts and their corresponding scores.
        Analyze the texts and their scores to understand what leads to a high score. Don't just look for literal patterns of words/tokens. Extensively research the data until you understand the underlying mechanisms that lead to high scores. The underlying, internal relationships. Much like how an LLM is able to predict the token not just from the literal text but also by understanding very complex relationships of the "tokens" between the tokens.
        Take all of the texts into consideration, not just the best.
        Solidify your understanding of how to optimize for a high score.
        Demonstrate your deep and complete understanding by writing a new text that maximizes the score and is better than all of the provided texts.
        Ideally the new text should be under 20 words.
      
      Or some variation thereof. That's the "one off" approach where you don't keep a conversation with the model and instead just call it again with the updated scores. Supposedly that's "better" since the texts are in ascending order, letting the model easily track improvements, but I've had far better luck with the iterative, conversational approach.

      Also, the constraint on how long the "new text" can be is important, as all models have a tendency to write longer and longer prompts with each iteration.

killerstorm 8 days ago

More evidence: it uses fancy Unicode characters for punctuation, like apostrophes, etc.

It's very annoying and I've only seen this in OpenAI models before (o3-mini).
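
If you want to check an output for this, a quick sketch (the character list is just the usual offenders, not exhaustive):

  # Count typographic (non-ASCII) punctuation in a model's output.
  FANCY = {"\u2019": "apostrophe", "\u2018": "left quote", "\u201c": "left double quote",
           "\u201d": "right double quote", "\u2013": "en dash", "\u2014": "em dash"}

  def fancy_punct(text: str) -> dict:
      return {name: text.count(ch) for ch, name in FANCY.items() if ch in text}

  print(fancy_punct("It\u2019s very annoying"))  # {'apostrophe': 1}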

  • adrian_b 8 days ago

    By "fancy Unicode characters" I assume you mean that it uses the appropriate Unicode characters instead of the ambiguous ASCII characters, whose only reason for existence was the limitations of ancient hardware, and whose use should long since have been deprecated in modern applications.

    • killerstorm 8 days ago

      Probably. These characters do not render properly in my setup (Emacs, SLIME REPL). Gemini and Claude use ASCII.

      • GaggiX 8 days ago

        Retrofuturism: using AI but not having your editor set up to render Unicode characters.

      • Philpax 8 days ago

        Sounds like you should improve your setup ;)

        • exe34 8 days ago

          There are a lot of things I could be doing with my time. Posting on random internet forums ranks higher than figuring out ASCII vs. UTF in one more system.

      • Hojojo 8 days ago

        Wow, that's awful. I wonder why they do that.

    • HappMacDonald 7 days ago

      By "fancy Unicode characters" he probably means the ones HN won't allow in comments.

andai 8 days ago

I think the name was generated by GPT; it really likes the word Quasar.

tcdent 8 days ago

Q star confirmed?

4ndrewl 8 days ago

Clever marketing

  • bn-l 8 days ago

    He is genius level at hype and marketing. No one can touch him.
