Running prompts against images, PDFs, audio and video with Google Gemini

I’m still working towards adding multi-modal support to my LLM tool. In the meantime, here are notes on running prompts against images, PDFs, audio and video files from the command line using the Google Gemini family of models.

Update: I integrated the research from this TIL into my LLM tool, which can now run multi-modal prompts against Gemini like this:

```bash
llm -m gemini-1.5-flash "describe this image" -a image.jpg
```

See "You can now run prompts against images, audio and video in your terminal using LLM" for details.

Using curl

Here’s the initial recipe I figured out using curl.

The Gemini models take a JSON document sent via POST that looks like this:

```json
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "Extract text from this image"
        },
        {
          "inlineData": {
            "data": "... base 64 encoded image data ...",
            "mimeType": "image/png"
          }
        }
      ]
    }
  ]
}
```

So the first challenge is to construct that document, including the base64 encoded image.

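On macOS you can encode a file using base64 -i image.png. GNU base64 on Linux takes the filename as a plain argument and wraps its output at 76 characters by default, so use base64 -w 0 image.png there to get a single line.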
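Since base64 behaves differently across platforms, a small wrapper can hide the difference. The sketch below is one way to do it: it feeds the file on standard input and strips any line breaks afterwards. The function name encode_file is made up for illustration and is not part of any tool.

```bash
#!/bin/bash
# encode_file is a hypothetical helper name used only in this sketch.
# Reading from stdin avoids the -i flag, which means "input file" on macOS
# but "ignore garbage when decoding" in GNU base64.
encode_file() {
    # tr removes the newlines that GNU base64 inserts every 76 characters
    base64 < "$1" | tr -d '\n'
}
```

Calling encode_file image.png should then print the encoded data as a single line on either platform, which is what a JSON string value needs.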

So we can create the JSON document like this:

```bash
cat <<EOF > input.json
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "Extract text from this image"
        },
        {
          "inlineData": {
            "data": "$(base64 -i image.png)",
            "mimeType": "image/png"
          }
        }
      ]
    }
  ]
}
EOF
```

This creates an input.json file containing the base64 encoded image, ready to be sent to the Gemini API.
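One caveat with the heredoc approach: any text substituted into it lands in the JSON unescaped, so a prompt held in a shell variable that contains double quotes or a line break would produce an invalid document. Here is a minimal sketch of escaping such a value first. The name json_escape is illustrative, and the function only handles backslashes, double quotes and newlines, not tabs or other control characters.

```bash
#!/bin/bash
# json_escape is a hypothetical helper: it escapes a string so it can sit
# inside a JSON string literal. Only \ " and newline are handled.
json_escape() {
    printf '%s\n' "$1" |
        sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
        awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

prompt='Extract the text, then say "done"'

# The escaped prompt is now safe to substitute into the heredoc
cat <<EOF
{"text": "$(json_escape "$prompt")"}
EOF
```

The backslash substitution runs first so that the backslashes added for the quotes are not doubled a second time.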

Now we can send it using curl:

```bash
export GOOGLE_API_KEY='... your key here ...'

curl -s "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-8b-latest:generateContent?key=$GOOGLE_API_KEY" \
  -H 'Content-Type: application/json' \
  -X POST \
  -d @input.json
```

The model name goes in the URL - here I’m using gemini-1.5-flash-8b-latest, Google’s cheapest and fastest model.

Model values you can use are:

gemini-1.5-flash-8b-latest - the cheapest and fastest model, $0.04/million input tokens, 0.001 cents per image
gemini-1.5-flash-latest - the one in the middle, $0.07/million input tokens, 0.0019 cents per image
gemini-1.5-pro-latest - the most powerful model, $1.25/million input tokens, 0.0323 cents per image

It’s hard to overstate how cheap these models are. An input image is charged at 258 tokens, so the price per image is measured in fractions of a cent. Those numbers above really are correct: an image sent through Gemini Pro costs less than 1/30th of a cent, and the other two models are even cheaper.
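The per-image figures follow directly from the 258 token charge. As a rough sketch of the arithmetic (image_cost_cents is a name invented for this example), the cost in cents is 258 multiplied by the price per million input tokens, divided by a million, times 100:

```bash
#!/bin/bash
# image_cost_cents is a hypothetical helper for illustration.
# $1 is the price in dollars per million input tokens.
image_cost_cents() {
    # LC_ALL=C keeps the decimal separator a dot regardless of locale
    LC_ALL=C awk -v price="$1" 'BEGIN { printf "%.5f\n", 258 * price / 1000000 * 100 }'
}

image_cost_cents 1.25   # Gemini 1.5 Pro: prints 0.03225
```

That 0.03225 is the 0.0323 cents quoted for Pro above, before rounding. The other models work the same way with their own prices.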

You get charged for output tokens too, which vary depending on the length of the response. Use my LLM pricing calculator to explore those.

The output of a prompt includes a usage section that shows you exactly how many tokens you spent. Here’s example output for the prompt “extract text from this image” against this image:

Image: rough handwriting in black marker on a white card that reads "Example handwriting" and "Let's try this out".

```json
{
  "candidates": [
    {
      "content": {
        "parts": [
          {
            "text": "Example handwriting\nLet's try this out"
          }
        ],
        "role": "model"
      },
      "finishReason": "STOP",
      "safetyRatings": [
        {
          "category": "HARM_CATEGORY_HATE_SPEECH",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_HARASSMENT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
          "probability": "NEGLIGIBLE"
        }
      ],
      "avgLogprobs": -0.000025986179631824296
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 264,
    "candidatesTokenCount": 9,
    "totalTokenCount": 273
  },
  "modelVersion": "gemini-1.5-flash-8b-001"
}
```

Total cost: 0.0011 cents.
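If you only care about the token counts, jq can pull them out of a saved response. This is a small sketch that assumes the curl output was written to a file. The summarize_usage name and the response.json filename are both invented for the example.

```bash
#!/bin/bash
# summarize_usage is a hypothetical helper; $1 is a file holding a saved
# generateContent response like the one shown above.
summarize_usage() {
    jq -r '.usageMetadata
        | "\(.promptTokenCount) prompt + \(.candidatesTokenCount) output = \(.totalTokenCount) total tokens"' "$1"
}
```

For the response above, summarize_usage response.json would print "264 prompt + 9 output = 273 total tokens".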

Using a Bash script

I got Claude to write me a script to automate this process. Here’s how you can use it:

```bash
export GOOGLE_API_KEY='... your key here ...'

prompt-gemini 'extract text from this image' example-handwriting.jpg
```

It accepts PNG, JPG, GIF, PDF, MP3 or MP4 files, automatically sending the correct mimeType to the API. Note that PDFs with multiple pages are charged differently: I tried a 19 page PDF and it cost 12842 tokens, suggesting around 676 tokens per page.

You can also add a -m option to specify a different model:

```bash
prompt-gemini 'extract text from this image' example-handwriting.jpg -m pro
```

Shortcuts pro, flash and 8b are supported - it defaults to the cheapest 8b model.

By default it outputs the full JSON response, so you can see things like the "usageMetadata" block. To output just the raw returned text add -r:

```bash
prompt-gemini 'extract text from this image' example-handwriting.jpg -r
```

```
Example handwriting
Let's try this out
```

Here’s the script - save it somewhere on your path and run chmod 755 prompt-gemini to make it executable:

```bash
#!/bin/bash

# Check if GOOGLE_API_KEY is set
if [ -z "$GOOGLE_API_KEY" ]; then
    echo "Error: GOOGLE_API_KEY environment variable is not set" >&2
    exit 1
fi

# Default model and options
model="8b"
prompt=""
image_file=""
jq_filter="."

# Parse arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        -m)
            # Guard against a missing value, which would otherwise loop forever
            if [ $# -lt 2 ]; then
                echo "Error: -m requires a model name" >&2
                exit 1
            fi
            model="$2"
            shift 2
            ;;
        -r)
            jq_filter=".candidates[0].content.parts[0].text"
            shift
            ;;
        *)
            if [ -z "$prompt" ]; then
                prompt="$1"
            elif [ -z "$image_file" ]; then
                image_file="$1"
            fi
            shift
            ;;
    esac
done

# Validate prompt
if [ -z "$prompt" ]; then
    echo "Error: No prompt provided" >&2
    echo "Usage: prompt-gemini \"prompt\" [image_file] [-m model] [-r]" >&2
    exit 1
fi

# JSON-encode the prompt (including its surrounding quotes) so that quotes,
# backslashes and newlines in it cannot break the request body
prompt_json=$(printf '%s' "$prompt" | jq -Rs .)

# Map model names to full model strings
case $model in
    "8b"|"flash-8b")
        model_string="gemini-1.5-flash-8b-latest"
        ;;
    "flash")
        model_string="gemini-1.5-flash-latest"
        ;;
    "pro")
        model_string="gemini-1.5-pro-latest"
        ;;
    *)
        model_string="gemini-1.5-$model"
        ;;
esac

# Create temporary file
temp_file=$(mktemp)
trap 'rm -f "$temp_file"' EXIT

# Determine mime type if image file is provided
if [ -n "$image_file" ]; then
    if [ ! -f "$image_file" ]; then
        echo "Error: File '$image_file' not found" >&2
        exit 1
    fi

    # Get file extension and convert to lowercase
    ext=$(echo "${image_file##*.}" | tr '[:upper:]' '[:lower:]')

    case $ext in
        png)
            mime_type="image/png"
            ;;
        jpg|jpeg)
            mime_type="image/jpeg"
            ;;
        gif)
            mime_type="image/gif"
            ;;
        pdf)
            mime_type="application/pdf"
            ;;
        mp3)
            mime_type="audio/mp3"
            ;;
        mp4)
            mime_type="video/mp4"
            ;;
        *)
            echo "Error: Unsupported file type .$ext" >&2
            exit 1
            ;;
    esac

    # Create JSON with image data
    # (tr strips the line wrapping that GNU base64 adds on Linux)
    cat <<EOF > "$temp_file"
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": $prompt_json
        },
        {
          "inlineData": {
            "data": "$(base64 -i "$image_file" | tr -d '\n')",
            "mimeType": "$mime_type"
          }
        }
      ]
    }
  ]
}
EOF
else
    # Create JSON without image data
    cat <<EOF > "$temp_file"
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": $prompt_json
        }
      ]
    }
  ]
}
EOF
fi

# Make API request with jq filter
curl -s "https://generativelanguage.googleapis.com/v1beta/models/$model_string:generateContent?key=$GOOGLE_API_KEY" \
    -H 'Content-Type: application/json' \
    -X POST \
    -d @"$temp_file" | jq "$jq_filter" -r
```

How I got Claude to write the Bash script

Here’s the prompt I fed to Claude to create this, starting with the Bash + curl recipe I had already figured out:

```bash
cat <<EOF > input.json
{
  "contents": [
    {
      "role": "user",
      "parts": [
        {
          "text": "Extract text from this imaage"
        },
        {
          "inlineData": {
            "data": "$(base64 -i output_0.png)",
            "mimeType": "image/png"
          }
        }
      ]
    }
  ]
}
EOF

curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-8b-latest:generateContent?key=$GOOGLE_API_KEY" \
  -H 'Content-Type: application/json' \
  -X POST \
  -d @input.json | jq
```
Turn this into a Bash script that runs like this:
```bash
prompt-gemini "this is the prompt"
prompt-gemini "This is the prompt" blah.png
prompt-gemini "This is the prompt" blah.pdf
prompt-gemini "this is the prompt" -m pro
```
It should exit with an error if GOOGLE_API_KEY is not set

It should use a temporary file for input.json which is deleted on completion

If no file was provided it should skip the inlineData bit

It should use the correct mimeType for PNG or PDF or JPG or JPEG or GIF depending on the file extension

The -m option should follow the following rules: it defaults to 8b, or it can be:

8b => gemini-1.5-flash-8b-latest (the default)
flash-8b => gemini-1.5-flash-8b-latest
flash => gemini-1.5-flash-latest
pro => gemini-1.5-pro-latest

Any other value should be passed used directly in the gemini-1.5-flash:generateContent portion of the URL

Here’s the full Claude transcript.

Then I added the -r option by pasting in the previous script and prompting:

Modify this script to add an extra -r option which, if present, causes the final line to pipe through jq like this:
```bash
... | jq '.candidates[0].content.parts[0].text' -r
```

Claude transcript here.