Show HN: Magnitude – open-source, AI-native test framework for web apps

github.com

179 points by anerli 6 months ago


Hey HN, Anders and Tom here - we’ve been building an end-to-end testing framework powered by visual LLM agents to replace traditional web testing.

We know there's a lot of noise about different browser agents. If you've tried any of them, you know they're slow, expensive, and inconsistent. That's why we built an agent specifically for running test cases and optimized it just for that:

- Pure vision instead of an error-prone "set-of-marks" system (the colorful boxes you see in browser-use, for example)

- Use a tiny VLM (Moondream) instead of OpenAI/Anthropic computer use for dramatically faster and cheaper execution

- Use two agents: one for planning and adapting test cases and one for executing them quickly and consistently.

The idea is the planner builds up a general plan which the executor runs. We can save this plan and re-run it with only the executor for quick, cheap, and consistent runs. When something goes wrong, it can kick back out to the planner agent and re-adjust the test.
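For readers who want the control flow spelled out, here is a minimal sketch of the save-and-replay loop described above. It is not Magnitude's actual API: `Planner`, `Executor`, `Action`, `planCache`, and `runStep` are all hypothetical names, and the model calls are replaced by plain synchronous interfaces so the logic stays visible.

```typescript
// Hypothetical types for illustration only; none of these come from Magnitude.
type Action = { kind: "click" | "type"; target: string; text?: string };
type Plan = Action[];

interface Planner {
  // Stands in for the slow, expensive planning model.
  plan(step: string): Plan;
}

interface Executor {
  // Stands in for the fast, cheap executing model; returns false on failure.
  run(action: Action): boolean;
}

// Saved plans, keyed by the natural-language step description.
const planCache = new Map<string, Plan>();

function runStep(
  step: string,
  planner: Planner,
  executor: Executor
): { ok: boolean; usedPlanner: boolean } {
  // Fast path: replay the saved plan with the executor only.
  const cached = planCache.get(step);
  if (cached && cached.every((action) => executor.run(action))) {
    return { ok: true, usedPlanner: false };
  }

  // No saved plan, or the replay failed: kick back out to the planner.
  // A real implementation would also need to reset any page state that a
  // partially replayed plan left behind; this sketch ignores that.
  const fresh = planner.plan(step);
  const ok = fresh.every((action) => executor.run(action));
  if (ok) {
    planCache.set(step, fresh); // save the adjusted plan for later runs
  }
  return { ok, usedPlanner: true };
}
```

With this shape, the planner is only paid for on the first run of a step and again whenever the saved plan stops working.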

It’s completely open source. Would love to have more people try it out and tell us how we can make it great.

Repo: https://github.com/magnitudedev/magnitude

NitpickLawyer - 6 months ago

> The idea is the planner builds up a general plan which the executor runs. We can save this plan and re-run it with only the executor for quick, cheap, and consistent runs. When something goes wrong, it can kick back out to the planner agent and re-adjust the test.

I've recently been thinking about testing/QA w/ VLMs + LLMs. One area that I haven't seen explored (but should 100% be feasible) is to have the first run be LLM + VLM, and then have the LLM(s?) write repeatable "cheap" tests w/ traditional libraries (Playwright, Puppeteer, etc.). On every run you do the "cheap" traditional checks; if any fail, go with the LLM + VLM again and see what broke, and only fail the test if both fail. Makes sense?
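A rough sketch of that two-tier idea, under the assumption that both tiers can be wrapped as simple pass/fail callbacks. `Check`, `Verdict`, and `runTieredTest` are made-up names for illustration; in practice the cheap check would wrap a generated Playwright or Puppeteer script and the agent check would wrap the LLM + VLM run.

```typescript
// Hypothetical names for illustration; not part of any existing library.
type Check = () => boolean;

// "pass-needs-regen" means the cheap script broke but the agent still passed,
// so the generated script is stale and should be regenerated.
type Verdict = "pass" | "pass-needs-regen" | "fail";

function runTieredTest(cheapCheck: Check, agentCheck: Check): Verdict {
  let cheapOk = false;
  try {
    cheapOk = cheapCheck();
  } catch {
    // A thrown error (e.g. a selector that no longer exists) counts as a failure.
    cheapOk = false;
  }
  if (cheapOk) {
    return "pass"; // the expensive agent is never invoked on the happy path
  }
  // Cheap check failed: ask the agent whether the app or the script broke.
  return agentCheck() ? "pass-needs-regen" : "fail";
}
```

The test only fails when both tiers fail; a cheap-tier failure on its own is treated as a signal to regenerate the script.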

tobr - 6 months ago

Interesting! My first concern is - isn’t this the ultimate non-deterministic test? In practice, does it seem flaky?

SparkyMcUnicorn - 6 months ago

This is pretty much exactly what I was going to build. It's missing a few things, so I'll either be contributing or forking this in the future.

I'll need a way to extract data as part of the tests, like screenshots and page content. This will allow supplementing the tests with non-magnitude features, as well as add things that are a bit more deterministic. Assert that the added todo item exactly matches what was used as input data, screenshot diffs when the planner fallback came into play, execution log data, etc.

This isn't currently possible from what I can see in the docs, but maybe I'm wrong?

It'd also be ideal if it had an LLM-free executor mode to reduce costs and increase speed (caching outputs, or maybe using the accessibility tree instead of a VLM). That would also cover cases where the planner should not kick in automatically.

o1o1o1 - 6 months ago

Thanks for sharing, this looks interesting.

However, I do not see a big advantage over Cypress tests.

The article mentions shortcomings of Cypress (and Playwright):

> They start a dev server with bootstrapping code to load the component and/or setup code you want, which limits their ability to handle complex enterprise applications that might have OAuth or a complex build pipeline.

The simple solution is to containerise the whole application (including whatever OAuth provider is used), which then allows you to simply launch the whole thing and then run the tests. Most apps (especially in enterprise) should already be containerised anyway, so most of the time we can just go ahead and run any tests against them.

How is SafeTest better than that when my goal is to test my application in a real world scenario?

chrisweekly - 6 months ago

This looks pretty cool, at least at first glance. I think "traditional web testing" means different things to different people. Last year, the Netflix engineering team published "SafeTest"[1], an interesting hybrid / superset of unit and e2e testing. Have you guys (Magnitude devs) considered incorporating any of their ideas?

1. https://netflixtechblog.com/introducing-safetest-a-novel-app...

arendtio - 6 months ago

It looks pretty cool. One thing that has bothered me a bit with Playwright is audio input. With modern AI applications, speech recognition is often integrated, but with Playwright, using voice as an input does not seem straightforward. Given that Magnitude has an AI focus, adding a feature like that would be great:

  test('can log in and see correct settings')
    .step('log in to the app')
      .say('my username is user@example.com')

grbsh - 6 months ago

I know Moondream is cheap / fast and can run locally, but is it good enough? In my experience testing things like Computer Use, anything but the large LLMs has been so unreliable as to be unworkable. But maybe you guys are doing something special to make it work well in concert?

dimal - 6 months ago

> Pure vision instead of error prone "set-of-marks" system (the colorful boxes you see in browser-use for example)

One benefit of not using pure vision is that it's a strong signal to developers to make pages accessible. This would let them off the hook.

Perhaps testing both paths separately would be more appropriate. I could imagine a different AI agent attempting to navigate the page through accessibility landmarks. Or even different agents that simulate different types of disabilities.

retreatguru - 6 months ago

Any advice about using AI to write test cases? For example, recording a video while using an app and converting that to test cases. Seems like it should work.

jcmontx - 6 months ago

Does it only work for node projects? Can I run it against a Staging environment without mixing it with my project?

pandemic_region - 6 months ago

Bang me sideways, "AI-native" is a thing now? What does that even mean?

badmonster - 6 months ago

How does Magnitude differentiate between the planner and executor LLM roles, and how customizable are these components for specific test flows?

aoeusnth1 - 6 months ago

Why not make the strong model compile a non-ai-driven test execution plan using selectors / events? Is Moondream that good?

sergiomattei - 6 months ago

Hi, this looks great! Any plans to support Azure OpenAI as a backend?