Running prompts against images, PDFs, audio and video with Google Gemini
I’m still working towards adding multi-modal support to my LLM tool. In the meantime, here are notes on running prompts against images, PDFs, audio and video files from the command line using the Google Gemini family of models.
Update: I integrated the research from this TIL into my LLM tool, which can now run multi-modal prompts against Gemini like this:
llm -m gemini-1.5-flash "describe this image" -a image.jpg
The model name goes in the URL - here I’m using gemini-1.5-flash-8b-latest, Google’s cheapest and fastest model.
Model values you can use are:
gemini-1.5-flash-8b-latest - the cheapest and fastest model, $0.04/million input tokens, 0.001 cents per image
gemini-1.5-flash-latest - the one in the middle, $0.07/million input tokens, 0.0019 cents per image
gemini-1.5-pro-latest - the most powerful model, $1.25/million input tokens, 0.0323 cents per image
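As a minimal sketch of what "the model name goes in the URL" means: the helper below builds a request URL from a model name. The function name `gemini_url` is hypothetical, and the base URL and `v1beta` path are my assumption about the public Gemini REST endpoint rather than something shown in this section. The API key is supplied separately and is left out here.

```shell
# Hypothetical helper: build the generateContent URL for a given model name.
# Assumption: the public Gemini REST API lives under this host and v1beta path.
gemini_url() {
    local model="$1"
    echo "https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent"
}

# Swap the model name to target a different model.
gemini_url "gemini-1.5-flash-8b-latest"
```

Switching to another model is then just a matter of passing one of the other values from the list above.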
It’s hard to overstate how cheap these models are. An input image is charged at 258 tokens, so the price per image is measured in fractions of a cent. Those numbers above really are correct: even through Gemini Pro an image costs less than 1/30th of a cent, and the other two models are cheaper still.
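The per-image arithmetic can be sketched like this: 258 tokens multiplied by the price per million input tokens, converted to cents. The helper name `image_cost_cents` is hypothetical. The listed prices look rounded, so results for the two cheaper models can differ from the figures above in the last digit.

```shell
# Tokens charged per input image, as stated in the text.
TOKENS_PER_IMAGE=258

# Hypothetical helper: print the cost of one image in cents,
# given a price in dollars per million input tokens.
image_cost_cents() {
    # dollars per image = tokens * price / 1,000,000; times 100 gives cents,
    # which simplifies to tokens * price / 10,000.
    LC_ALL=C awk -v price="$1" -v tokens="$TOKENS_PER_IMAGE" \
        'BEGIN { printf "%.5f\n", tokens * price / 10000 }'
}

image_cost_cents 1.25   # gemini-1.5-pro-latest
```

For the $1.25/million price this prints 0.03225, in line with the 0.0323 cents figure quoted for Gemini Pro.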
You get charged for output tokens too, which vary depending on the length of the response. Use my LLM pricing calculator to explore those.
The output of a prompt includes a usage section that shows you exactly how many tokens you spent. Here’s example output for the prompt “extract text from this image” against this image:
I got Claude to write me a script to automate this process. Here’s how you can use it:
export GOOGLE_API_KEY='... your key here ...'

prompt-gemini 'extract text from this image' example-handwriting.jpg
It accepts PNG, JPG, GIF or PDF files, automatically sending the correct mimeType to the API. Note that PDFs with multiple pages are charged differently - I tried a 19 page PDF and it cost 12842 tokens, suggesting around 675 tokens per page.
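Picking the mimeType from the filename could look something like the sketch below. This is an illustration rather than the script's actual code: the function name `mime_type_for` is hypothetical, and I'm assuming the decision is based purely on the file extension.

```shell
# Hypothetical helper: map a file extension to the MIME type sent to the API.
mime_type_for() {
    local ext
    # Lowercase the extension so .JPG and .jpg are treated the same.
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case "$ext" in
        png)      echo "image/png" ;;
        jpg|jpeg) echo "image/jpeg" ;;
        gif)      echo "image/gif" ;;
        pdf)      echo "application/pdf" ;;
        *)        echo "Unsupported file type: $1" >&2; return 1 ;;
    esac
}

mime_type_for example-handwriting.jpg
```

Anything outside the four supported types returns a non-zero exit status, so a calling script can stop before making an API request.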
You can also add a -m option to specify a different model:
prompt-gemini 'extract text from this image' example-handwriting.jpg -m pro
Shortcuts pro, flash and 8b are supported - it defaults to the cheapest 8b model.
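One way to expand those shortcuts into full model names is a simple case statement. This is a sketch, not necessarily how the script does it: `resolve_model` is a hypothetical name, and passing unknown values through unchanged is my assumption.

```shell
# Hypothetical helper: expand a -m shortcut into a full model name.
# With no argument it falls back to the cheapest 8b model.
resolve_model() {
    case "${1:-8b}" in
        pro)   echo "gemini-1.5-pro-latest" ;;
        flash) echo "gemini-1.5-flash-latest" ;;
        8b)    echo "gemini-1.5-flash-8b-latest" ;;
        # Assumption: anything else is treated as a full model name.
        *)     echo "$1" ;;
    esac
}

resolve_model pro
```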
By default it outputs the full JSON response, so you can see things like the "usageMetadata" block. To output just the raw returned text add -r:
prompt-gemini 'extract text from this image' example-handwriting.jpg -r

Example handwriting
Let's try this out
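If you keep the full JSON output, individual fields can be pulled out with jq. The sketch below assumes jq is installed and that the response follows the usual Gemini shape, with the text under `candidates[0].content.parts[0].text` and a `totalTokenCount` inside `usageMetadata`. Both function names are hypothetical.

```shell
# Print only the generated text from a Gemini JSON response read on stdin.
response_text() {
    jq -r '.candidates[0].content.parts[0].text'
}

# Print the total token count from the "usageMetadata" block.
total_tokens() {
    jq -r '.usageMetadata.totalTokenCount'
}

# Example usage (requires a valid GOOGLE_API_KEY):
#   prompt-gemini 'extract text from this image' example-handwriting.jpg | total_tokens
```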
Here’s the script - save it somewhere on your path and run chmod 755 prompt-gemini to make it executable:
#!/bin/bash

# Check if GOOGLE_API_KEY is set
if [ -z "$GOOGLE_API_KEY" ]; then
    echo "Error: GOOGLE_API_KEY environment variable is not set" >&2