OCR

tesseract in Python

#!/usr/bin/env python3

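from PIL import Image  # only needed when passing Image objects instead of paths
import pytesseract

# filelist.txt is a plain-text file with one image path per line;
# tesseract processes the listed images as a single batch.
# image_to_data returns TSV with per-word bounding boxes and confidences.
print(pytesseract.image_to_data('filelist.txt'))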
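
The TSV printed above can be filtered down to the confidently recognised words. A minimal sketch, assuming the standard tesseract TSV header (which includes conf and text columns); words_from_tsv and the min_conf threshold of 60 are hypothetical names and values chosen for illustration:

```python
def words_from_tsv(tsv_text, min_conf=60.0):
    """Return (word, confidence) pairs from tesseract TSV output.

    min_conf is an arbitrary illustrative threshold, not a tesseract default.
    """
    lines = tsv_text.splitlines()
    if not lines:
        return []
    header = lines[0].split("\t")
    words = []
    for line in lines[1:]:
        # maxsplit keeps a stray tab inside the last (text) column intact
        fields = line.split("\t", len(header) - 1)
        if len(fields) != len(header):
            continue  # skip malformed rows
        row = dict(zip(header, fields))
        text = row["text"].strip()
        conf = float(row["conf"])  # -1 marks page/block/line rows without a word
        if text and conf >= min_conf:
            words.append((text, conf))
    return words
```

Usage would be along the lines of words_from_tsv(pytesseract.image_to_data('filelist.txt')), with the threshold tuned to the scan quality.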

compressed pdf

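  • first compress the page images in place, e.g. mogrify -quality 40 *.jpg (this overwrites the originals, so work on copies)
  • then run tesseract on 'filelist.txt' (one image path per line) with pdf output, e.g. tesseract filelist.txt outfile pdf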
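
The two steps above can be sketched as a small Python driver. This is only an illustration, assuming mogrify (ImageMagick) and tesseract are on the PATH; write_filelist, build_commands, compress_and_ocr and the default names filelist.txt and outfile are hypothetical choices for this sketch:

```python
import subprocess
from pathlib import Path


def write_filelist(images, list_path="filelist.txt"):
    """Write one image path per line, the list format tesseract accepts as input."""
    Path(list_path).write_text("".join(f"{p}\n" for p in images))
    return list_path


def build_commands(images, list_path="filelist.txt", out_base="outfile", quality=40):
    """Return the argv lists for the compress step and the OCR-to-PDF step."""
    images = [str(p) for p in images]
    return [
        # mogrify rewrites the files in place, so run it on copies
        ["mogrify", "-quality", str(quality), *images],
        # the trailing 'pdf' selects tesseract's searchable-PDF output
        ["tesseract", str(list_path), out_base, "pdf"],
    ]


def compress_and_ocr(images, list_path="filelist.txt", out_base="outfile", quality=40):
    """Compress the images, then OCR them into a single out_base.pdf."""
    write_filelist(images, list_path)
    for cmd in build_commands(images, list_path, out_base, quality):
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the subprocess calls makes it easy to print and inspect the commands before anything is overwritten.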

OCRAD

brew install ocrad
# convert comes with ImageMagick (brew install imagemagick)
convert 14.16.12.png img.ppm
ocrad img.ppm > output.txt

or

brew install netpbm
pngtopnm filename.png | ocrad
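
The same pipe can be driven from Python. A rough sketch, assuming pngtopnm and ocrad are installed and on the PATH; pipeline_commands and ocr_png are hypothetical helper names:

```python
import subprocess


def pipeline_commands(png_path):
    """Return the argv lists for 'pngtopnm FILE | ocrad'."""
    return [["pngtopnm", str(png_path)], ["ocrad"]]


def ocr_png(png_path):
    """Pipe pngtopnm's PNM output into ocrad and return the recognised text."""
    convert_cmd, ocr_cmd = pipeline_commands(png_path)
    convert = subprocess.Popen(convert_cmd, stdout=subprocess.PIPE)
    # ocrad is given no file argument, so it reads the image from stdin
    ocr = subprocess.run(
        ocr_cmd, stdin=convert.stdout, capture_output=True, text=True, check=True
    )
    convert.stdout.close()
    if convert.wait() != 0:
        raise RuntimeError("pngtopnm failed")
    return ocr.stdout
```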

tesseract

brew install tesseract
brew install tesseract-lang

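# -l nld selects the Dutch language model (installed by tesseract-lang)
# the last argument selects the output format: outfile.tsv or outfile.pdf
tesseract NL-UtHUA_A356828_000002.jpg outfile -l nld tsv
tesseract NL-UtHUA_A356828_000002.jpg outfile -l nld pdf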
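
Before running with -l nld it can help to confirm the language data is actually installed. A small sketch built on tesseract --list-langs; the header line format is an assumption about the tool's output, and parse_langs and has_language are hypothetical helper names:

```python
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    # Assumed: the listing starts with a header line beginning
    # 'List of available languages'; every other line is a language code.
    return [ln for ln in lines if not ln.lower().startswith("list of")]


def has_language(code):
    """Return True if tesseract reports the given language code as installed."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Older tesseract versions print the list on stderr, so check both streams
    return code in parse_langs(result.stdout + result.stderr)
```

For example, has_language("nld") should return True once tesseract-lang is installed.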

other