Using Tesseract.js to OCR every image on a page

Pasting this code into a DevTools console should load Tesseract.js from a CDN, loop through every image loaded by that page (every PNG, GIF, JPG or JPEG), run OCR on them and output the result to the DevTools console.

There’s one major catch: JavaScript has to be allowed to read the image content, so the images need to be served either from the same origin as the page or from a different origin with a permissive CORS policy.

Very few sites do this! It worked on www.google.com for me, where it successfully OCRed the Google logo as containing the text “Google”.
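The same-origin half of that rule can be checked from the URL alone, which gives a rough idea of which images are likely to be readable. This is only a sketch: `isSameOrigin` is a hypothetical helper, not part of Tesseract.js, and it cannot tell whether a cross-origin server sends permissive CORS headers.

```javascript
// Hypothetical helper: true when imageUrl has the same scheme, host and port as pageUrl.
function isSameOrigin(imageUrl, pageUrl) {
  try {
    // Relative image URLs are resolved against the page URL
    return new URL(imageUrl, pageUrl).origin === new URL(pageUrl).origin;
  } catch (e) {
    // URLs that cannot be parsed are treated as not readable
    return false;
  }
}
```

In the DevTools console the second argument would typically be `location.href`. A `false` result does not mean OCR will fail, only that it depends on the other server's CORS policy.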

// Load Tesseract.js v2.1.0 from the unpkg CDN
var s = document.createElement("script");
s.src = "https://unpkg.com/tesseract.js@v2.1.0/dist/tesseract.min.js";
// Attach the handler before appending the script so the load event is not missed
s.onload = async () => {
  // Every resource the page has loaded, filtered down to image URLs
  const imageUrls = performance.getEntries().map(f => f.name).filter(
    n => n.includes('.jpg') || n.includes('.gif') || n.includes('.png') || n.includes('.jpeg')
  );
  // Create a worker and prepare it for English text
  const worker = Tesseract.createWorker();
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  for (const url of imageUrls) {
    console.log(url);
    const { data: { text } } = await worker.recognize(url);
    console.log(text);
  }
  // Free the worker once every image has been processed
  await worker.terminate();
};
document.head.appendChild(s);
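The `includes()` filter in that snippet matches the extension anywhere in the URL and is case-sensitive, so it would pick up something like `/photo.png.html` and miss `/logo.PNG`. A stricter check could parse the URL and test only the end of the path. This is a minimal sketch, and `hasImageExtension` is a hypothetical name rather than anything provided by the browser or Tesseract.js.

```javascript
// Hypothetical helper: true when the URL path ends in a common image extension.
function hasImageExtension(url) {
  let pathname;
  try {
    pathname = new URL(url).pathname;
  } catch (e) {
    // Non-URL entry names such as "first-paint" end up here
    return false;
  }
  // Case-insensitive match on the end of the path; the query string is ignored
  return /\.(png|gif|jpe?g)$/i.test(pathname);
}
```

It could be dropped into the snippet as `performance.getEntries().map(f => f.name).filter(hasImageExtension)`. Images served from URLs without a file extension would still be missed by either approach.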