Getting a Gemini API key is an exercise in frustration
ankursethi.com
845 points by speckx 4 months ago
I was recently (vibe)-coding some games with my kid, and we wanted some basic text-to-speech functionality. We tested Google's Gemini models in-browser, and they worked great, so we figured we'd add them to the app. Some fun learnings:
1. You can access those models via three APIs: the Gemini API (which, it turns out, is only for prototyping and returned errors 30% of the time), the Vertex API (much more stable but lacking some functionality), and the TTS API (which performed very poorly despite offering the same models). They also have separate keys (at least Gemini vs Vertex). A sketch of the Gemini API TTS call follows this list.
2. Each of those APIs supports different parameters (things like language, whether you can pass a style prompt separate from the words you want spoken, etc). None of them offered the full combination we wanted.
3. To learn this, you have to spend a couple of hours reading API docs, or alternatively just have Claude Code read the docs, try all the different combinations, and figure out what works and what doesn't (with the added risk that it might hallucinate something).
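For context, here's roughly what the TTS call looks like through the Gemini API with the current @google/genai SDK - a minimal sketch based on its speech-generation docs, so treat the model and voice names as examples that may have changed:

    import { GoogleGenAI } from "@google/genai";

    const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

    const response = await ai.models.generateContent({
      model: "gemini-2.5-flash-preview-tts", // example TTS-capable model
      contents: "Say cheerfully: ready, set, go!",
      config: {
        responseModalities: ["AUDIO"],
        speechConfig: {
          voiceConfig: { prebuiltVoiceConfig: { voiceName: "Kore" } },
        },
      },
    });

    // The audio comes back as base64-encoded PCM on the first candidate.
    const audio = response.candidates?.[0]?.content?.parts?.[0]?.inlineData?.data;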
Some other fun things you'll find:
- The models perform differently when called via the API vs in the Gemini UI.
- The Gemini API will randomly fail about 1% of the time; retry logic is basically mandatory (a minimal sketch follows this list).
- API performance is heavily influenced by the whims of Google: we've observed response times anywhere between 30 seconds and 4 minutes for the same query, depending on how Google is feeling that day.
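Here's roughly the retry wrapper we ended up needing (illustrative TypeScript; the backoff numbers are just what worked for us, and fn is whichever SDK call you're wrapping):

    // Retry with exponential backoff plus jitter, so concurrent retries
    // don't all hit the API at the same moment.
    async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 4): Promise<T> {
      let lastError: unknown;
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
          return await fn();
        } catch (err) {
          lastError = err;
          if (attempt === maxAttempts - 1) break; // out of attempts
          const delayMs = 1000 * 2 ** attempt + Math.random() * 250; // ~1s, 2s, 4s...
          await new Promise((resolve) => setTimeout(resolve, delayMs));
        }
      }
      throw lastError;
    }

    // Usage, with callGemini standing in for your actual SDK call:
    // const result = await withRetry(() => callGemini(prompt));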
> The Gemini API will randomly fail about 1% of the time; retry logic is basically mandatory.
That is sadly true across the board for AI inference API providers. OpenAI and Anthropic API stability usually suffers around launch events. Azure OpenAI/Foundry serving regularly has 500 errors for certain time periods.
For any production feature with high uptime guarantees, I would right now strongly advise picking a model you can get from multiple providers and building failover between clouds.
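A minimal sketch of what that failover can look like (hypothetical shape, not any real library; each provider is wrapped to the same function signature):

    type Completion = (prompt: string) => Promise<string>;

    // Try each configured provider in order, falling through on failure.
    async function completeWithFailover(
      providers: Completion[],
      prompt: string,
    ): Promise<string> {
      const errors: unknown[] = [];
      for (const provider of providers) {
        try {
          return await provider(prompt);
        } catch (err) {
          errors.push(err); // keep each failure reason for debugging
        }
      }
      throw new AggregateError(errors, "all providers failed");
    }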
Yeah, at $WORK we use various LLM APIs to analyze text; it's not heavy usage in terms of tokens, but maybe 10K calls per day. We've found that response times vary a lot, sometimes going over a minute for simple tasks, and random failures happen. Retry logic is definitely mandatory, and it's good to have multiple providers ready. We're abstracting calls across three different APIs (openai, gemini and mistral; btw, we're getting pretty good results with mistral!) so we can switch workloads quickly if needed.
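Given that response-time spread, we also cap how long we wait for any single call; a minimal sketch (the timeout value is arbitrary):

    // Race the API call against a timer so a hung request can't stall a batch.
    async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
      let timer!: ReturnType<typeof setTimeout>;
      const timeout = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
      });
      try {
        return await Promise.race([promise, timeout]);
      } finally {
        clearTimeout(timer); // don't leak the timer when the call wins the race
      }
    }

    // Usage, with callProvider standing in for your actual API call:
    // const result = await withTimeout(callProvider(prompt), 60_000);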
I've been impressed by ollama running locally for my work, involving grouping short text snippets by semantic meaning, using embeddings, as well as summarization tasks. Depending on your needs, a local GPU can sometimes beat the cloud. (I get no failures and consistent response times with no extra bill.) Obviously YMMV, and not ideal for scaling up unless you love hardware.
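For the curious, the local setup is just a couple of HTTP calls; a sketch against Ollama's REST API (the embedding model name is only an example - use whatever you've pulled):

    // Fetch an embedding from a local Ollama server (default port 11434).
    async function embed(text: string): Promise<number[]> {
      const res = await fetch("http://localhost:11434/api/embeddings", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model: "nomic-embed-text", prompt: text }),
      });
      if (!res.ok) throw new Error(`Ollama error: ${res.status}`);
      const { embedding } = await res.json();
      return embedding;
    }

    // Cosine similarity, for grouping snippets by semantic meaning.
    function cosine(a: number[], b: number[]): number {
      let dot = 0, na = 0, nb = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }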
It'd be kinda nice if they exposed whatever queuing is going on behind the scenes, so you could at least communicate that to your users.
IIRC this is almost exactly the use case for OpenRouter, down to provider fallback https://openrouter.ai/docs/guides/best-practices/uptime-opti...
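From my reading of those docs, the fallback is just a "models" list in the request body; a sketch (model IDs are examples):

    // One request, several candidate models: OpenRouter falls through to the
    // next model in the list when a provider fails.
    const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        models: ["google/gemini-2.5-flash", "openai/gpt-4o-mini"],
        messages: [{ role: "user", content: "Hello" }],
      }),
    });
    const data = await res.json();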
I have also had some super weird stuff in my output (2.5-flash).
I'm passing docs for bulk inference via Vertex, and a small number of returned results will include gibberish in Japanese.
I had this last night from Flash Lite! My results were interspersed with random snippets of legible, non-gibberish English. It was like my results had got jumbled up with someone else's.
I get this a lot too; it has made most of the Gemini models essentially unusable for agent-esque tasks for me. I tested with 2.5 Pro and it still devolved into random gibberish pretty frequently.
I’ve gotten Arabic randomly in Claude Code. Programming is becoming more and more like magic.
"The models perform differently when called via the API vs in the Gemini UI."
This shouldn't be surprising; the model != the product. It's the same way GPT-4o behaves differently from the ChatGPT product, even when ChatGPT is using GPT-4o.
> The models perform differently when called via the API vs in the Gemini UI.
This difference between API and UI responses is common across all the big players (Claude, GPT models, etc.)
The consumer chat interfaces are designed for a different experience than a direct API call, even if pinging the same model.
Even funnier: sometimes Pro 3 answers a previous message in my chat, just producing a duplicate answer in different words. Retrying helps, but…
The way the models behave in Vertex AI Studio vs the API is unforgivable. Totally different.
Also, usage and billing take a DAY to update. On top of that, there are no billing caps or credit-based billing. They put the entire burden on users to make sure they don't end up with a mega bill.
> there are no billing caps or credit-based billing.
Was really curious about that when I saw this in the posted article:
> I had some spare cash to burn on this experiment,
Hopefully the article's author is fully aware of the real risk of giving Alphabet his CC details on a project which has no billing caps.
there's prob a couple ppl out there with an Amex Black parked on a cloud acct, lol
Usage updates much quicker in the AI Studio UI, near realtime (but can take ~5 min in edge cases).
We are working on billing caps along with credits right now. Billing caps will land first in Jan!
Trying to implement their gRPC API from their specs and protobufs for Live is an exercise in immense frustration and futility. I wanted to call it from Elixir; even with our strong AI, I wasted days and then gave up.
We are updating the API to be REST-centric. Very fair feedback. See the new Interactions API we just shipped, which is very REST-centric; all future work we do will be REST-centric too : )
Oh man let me add onto that!
4. If you read about a new Gemini model, you might want to use it - but are you using @google/genai, @google/generative-ai (wow, finally deprecated), or @google-ai/generativelanguage? Silly mistake, but when nano banana dropped, it was highly confusing that image gen was available through only one of these.
5. Gemini supports video! But that video first has to be uploaded to "Google GenAI Drive", which then splices it into 1 FPS images and feeds them to the LLM. No option to increase the FPS, so if you want anything done properly, you'll have to splice it yourself and upload it to generativelanguage.googleapis.com, which is only accessible using their GenAI SDK. Don't ask which one; I'm still not sure.
6. Nice, it works. Let's try using live video. Open the docs, you get it mentioned a bunch of times but 0 documentation on how to actually do it. Only suggestions for using 3rd party services. When you actually find it in the docs, it says "To see an example of how to use the Live API in a streaming audio and video format, run the "Live API - Get Started" file in the cookbooks repository". Oh well, time to read badly written python.
7. How about we try generating a video - open up AI Studio, and see only Veo 2 available among the video models. But open up the "Build" section, and I can have Gemini 3 build me a video generation tool that will use Veo 3 via the API just by clicking on the example. But wait, why can't we use Veo 3 in AI Studio with the same API key?
8. Every Veo 3 extended video has absolutely garbled sound and there is nothing you can do about it, or maybe there is, but by this point I'm out of willpower to chase down edgy edge cases in their docs.
9. Let's just mention one semi-related thing - some things in the Cloud come with default policies that are absurdly limiting, which means you have to create a resource/account, then update the policies related to whatever you want to do, at which point it tells you these are _old policies_ and you should edit the new ones instead, but those are impossible to find.
10. Now that we've set up our accounts, our AI tooling, and our permissions, we write the code, which takes less time than all of the previous actions on this list. Now, you want to test it on Android? Well, you can:
- A. Test it with your account by manually signing in to emulators, local or cloud, which means passing 2FA every time if you want to automate this and constantly risking your account's security or a ban.
- B. Create a Google account for testing, add it to Licensed Testers on the Play Store, invite it to internal testing, and wait 24-48 hours to be able to use it. Then, if you try to automate testing, struggle with mocking a whole Google account login process that uses non-deterministic logic to show a random pop-up every time. Then do the same thing for the purchase process, ending up with a giant script that clicks through the options.
11. Congratulations, you made it this far and are able to deploy your app to beta. Now, find 12 testers to actively use your app for free, continuously, for 14 days to prove it's not a bad app.
At this point, Google is actively preventing you from shipping at every step, causing more and more issues the deeper down the stack you go.
12. Release your first version.
13. Get your whole google account banned.
14. Ask why it was banned and they respond with something like "oh you know what you did".
Hi there! I am the PM for Veo on the Gemini API. I wanted to check with you on Point 8 - getting garbled sound when extending the video. Veo 3.1 is limited to the last 24 frames (1s of video) for the extension feature, so sometimes dialog and audio lack continuity. We are working on this limitation. If you are experiencing a different issue altogether, would you be able to share the prompt so I can debug on my end? Thank you!
> 4. If you read about a new Gemini model, you might want to use it - but are you using @google/genai, @google/generative-ai (wow, finally deprecated), or @google-ai/generativelanguage? Silly mistake, but when nano banana dropped, it was highly confusing that image gen was available through only one of these.
Yeah, I hear you, open to suggestions to make this more clear, but it is google/genai going forward. Switching packages sucks.
> Gemini supports video! But that video first has to be uploaded to "Google GenAI Drive", which then splices it into 1 FPS images and feeds them to the LLM. No option to increase the FPS, so if you want anything done properly, you'll have to splice it yourself and upload it to generativelanguage.googleapis.com, which is only accessible using their GenAI SDK. Don't ask which one; I'm still not sure.
We have some work ongoing (should launch in the next 3-4 weeks) which will let you reference files (video included) from links directly so you don't need to upload to the File API. We do also support custom FPS: https://ai.google.dev/gemini-api/docs/video-understanding#cu...
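For anyone landing here later, the rough shape of that in @google/genai (a sketch; double-check the field names against the doc above, and the model name is just an example):

    import { GoogleGenAI } from "@google/genai";

    const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

    // Placeholder: the URI returned when you upload the video via the File API.
    const fileUri = "https://generativelanguage.googleapis.com/v1beta/files/your-file-id";

    const response = await ai.models.generateContent({
      model: "gemini-2.5-flash",
      contents: [
        {
          role: "user",
          parts: [
            {
              fileData: { fileUri, mimeType: "video/mp4" },
              videoMetadata: { fps: 5 }, // sample at 5 FPS instead of the default 1
            },
            { text: "Describe what happens in this clip." },
          ],
        },
      ],
    });
    console.log(response.text);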
> 6. Nice, it works. Let's try using live video. Open the docs, you get it mentioned a bunch of times but 0 documentation on how to actually do it. Only suggestions for using 3rd party services. When you actually find it in the docs, it says "To see an example of how to use the Live API in a streaming audio and video format, run the "Live API - Get Started" file in the cookbooks repository". Oh well, time to read badly written python.
Just pinged the team, we will get a live video example added here: https://ai.google.dev/gemini-api/docs/live?example=mic-strea... should have it live Monday, not sure why that isn't there, sorry for the miss!
> 7. How about we try generating a video - open up AI Studio, and see only Veo 2 available among the video models. But open up the "Build" section, and I can have Gemini 3 build me a video generation tool that will use Veo 3 via the API just by clicking on the example. But wait, why can't we use Veo 3 in AI Studio with the same API key?
We are working on adding Veo 3.1 into the drop down, I think it is being tested by QA right now, pinged the team to get ETA, should be rolling out ASAP though, sorry for the confusing experience. Hoping this is fixed by Monday EOD!
> 8. Every Veo 3 extended video has absolutely garbled sound and there is nothing you can do about it, or maybe there is, but by this point I'm out of willpower to chase down edgy edge cases in their docs.
Checking on this, haven't used extend a lot but will see if there is something missing we can clarify.
On some of the later points, I don't have enough domain expertise to weigh in, but I will forward them to folks on the Android / Play side to see what we can do to streamline things!
Thank you for taking the time to write up this feedback : ) hoping we can make the product better based on this.
I didn't catch in the updates that custom FPS had been released, amazing. It seems the limit is just 20MB, but you can use custom splitting for larger files.
Trying to split all the videos into frames was a PITA, mostly due to weird inputs from different Android phones requiring handling of all kinds of edge cases; uploading each one to the Upload API with retry also added lag and complexity, so doing it all in one go will save me both time and nerves (and tokens).
Thanks for listening and for all the great work you do; since you came in, the experience has improved by an immeasurable amount.
Will take a pass with the team to see what we can do to tighten up this experience, very valid feedback on the confusion between the three APIs.
The odd thing about all of this (well, I guess it's not odd, just ironic), is that when Google AdWords started, one of the notable things about it was that anyone could start serving or buying ads. You just needed a credit-card. I think that bought Google a lot of credibility (along with the ads being text-only) as they entered an already disreputable space: ordinary users and small businesses felt they were getting the same treatment as more faceless, distant big businesses.
I have a friend who says Google's decline came when they bought DoubleClick in 2008 and suffered a reverse takeover: their customers shifted from being Internet users to being other, similarly sized corporations.
I have had way too many arguments over the years with product and sales people at my job on the importance of instant self-signup. I want to be able to just pay and go, without having to talk to people or wait for things.
I know part of it is that sales wants to be able to price discriminate and wants to be able to use their sales skills on a customer, but I am never going to sign up for anything that makes me talk to someone before I can buy.
The number one rule of business that should just be passively reiterated to everyone working in any type of transactional field:
1. Never make it hard for people to give you money.
Parking apps don’t seem to care much for that. They know you’ll jump through their shoddy UIs and data collection because they have a local monopoly. Often with physical payment kiosks removed and replaced with “download our shitty app!” notices.
They get paid more if you get a parking ticket.
i'm currently disputing a bill with a parking company. there's a kiosk at the movie theater served by the parking lot, so that you can get free parking if you see a movie. the kiosk has an option for you to describe your car if you forgot your license plate number. i did that and they sent me a bill for unpaid parking.
customer service is unable to acknowledge why that feature is offered and can only assert that if you park you gotta pay. after threatening to complain to the BBB and my state AG they have graciously offered to drop the ticket to $25.
thank you for listening to me vent :)
The RyanAir model: technically legal, but actively playing a zero-sum game against their consumers' diligence.
At least in my country they face no competition. For a given location, only one app will work.
Plenty of people on here looking to disrupt a market with tech...c'mon guys, get on it
Edit: On second thought, there is a perverse incentive at work (and probably one of the "lowest friction" ways to get money), which is issuing government enforced fines.
The crappy apps that replaced parking meters are the people who disrupted the existing market with tech
Huh, where I live you can often use many different parking apps, and the one I tried is very simple and user-friendly.
Start app, wait for gps, turn time wheel, press start.
Turn time wheel? How do you know in advance how long you stay? Where I live, you start and when you leave, you click stop. You also get reminders in case you forgot to stop.
Not GP, but I guess I'm using the same app. You guess (and then it gives you the price up front). Ten minutes before it expires, it asks if you want to extend. There might also have been a detect-if-you-drive-away-and-stop feature (I don't recall).
Mostly these days all paid parking has registration cameras, and it just starts and stops parking for you automatically. However, there are about 3 apps that compete here, so you need a profile with all of them for this to work, and you also need to enable this in all of the apps.
There is no way this is not a degradation compared to a physical meter accepting cash plus whatever. My country doesn't really have parking apps yet, and paying for parking is never a source of friction.
> There is no way this is not a degradation compared to a physical meter accepting cash plus whatever.
Well you can extend the parking time while not at your car. That is a big plus.