How OCR works in pdftoxlsx
When you upload a PDF, pdftoxlsx first checks whether the file contains selectable text or only scanned images. If the PDF is image-based (a scan or photograph), pdftoxlsx runs its OCR engine to recognize characters, numbers, and table structure. The OCR step identifies transaction rows, separates the date, description, and amount columns, and reconstructs the table layout. No configuration is needed.
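The text-versus-image check described above can be approximated with a simple heuristic. The following is only a sketch under stated assumptions: it is not pdftoxlsx's actual detection logic, the function name `needs_ocr` and the 20-character threshold are hypothetical, and it assumes you have already extracted the text of each page with a PDF library of your choice.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Guess whether a PDF is image-based from its extracted page text.

    page_texts: one string per page, produced by any PDF text extractor
        (hypothetical input; this is not a pdftoxlsx API).
    min_chars_per_page: assumed threshold, chosen for illustration only.
    """
    if not page_texts:
        # Nothing extractable at all: treat the file as image-only.
        return True
    # Count pages whose embedded text is empty or nearly empty.
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    # Route to OCR when more than half of the pages look like images.
    return sparse > len(page_texts) / 2


# A digital statement page carries real text; a scan yields little or none.
print(needs_ocr(["01/02/2024 Coffee shop 4.50\n01/03/2024 Rent 1200.00"]))
print(needs_ocr(["", "  \n", ""]))
```

The majority rule is a design choice for this sketch: some scanned files carry a few stray text objects (for example a scanner watermark), so a single non-empty page should not be enough to skip OCR.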
Scan quality recommendations
Scan quality directly affects OCR accuracy. For best results, scan at 300 DPI or higher (a common default on modern scanners). Use color or grayscale mode and avoid pure black-and-white (1-bit) scanning, which loses detail on thin fonts and decimal points. Make sure the page lies flat and is aligned in the scanner, because skewed pages reduce accuracy. If you scan with a phone camera, use a document scanning app (not the regular camera) and make sure the lighting is even, with no shadows across the text.
Common issues with scanned statements
Skewed or rotated pages. pdftoxlsx can handle slight rotation (up to 5 degrees) automatically. For heavily skewed scans, straighten the page in your scanning app before uploading.
Low DPI scans. Scans below 200 DPI may produce errors in decimal amounts and dates. Re-scan at 300 DPI if possible.
Handwritten annotations. Handwritten notes, stamps, or signatures on the statement may interfere with OCR. pdftoxlsx ignores most annotations, but heavy handwriting over transaction rows can cause errors.
Multi-column layouts. Some statements have complex multi-column layouts. OCR handles standard two-column (debit/credit) layouts well. Unusual layouts may require manual review of a few rows.
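To judge whether a scan falls within the 5-degree rotation limit mentioned above, you can measure the tilt of one printed line. This is a minimal sketch, assuming you read off the pixel coordinates of two points on the same text baseline in an image viewer; the function names are hypothetical and nothing here reflects how pdftoxlsx measures skew internally.

```python
import math

MAX_AUTO_SKEW_DEG = 5.0  # rotation limit stated in this guide


def skew_degrees(x1, y1, x2, y2):
    """Angle in degrees between the line through two points and the horizontal."""
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
    # Fold into [-90, 90] so the order of the two points does not matter.
    if angle > 90:
        angle -= 180
    elif angle < -90:
        angle += 180
    return angle


def within_auto_correction(x1, y1, x2, y2):
    """True if the measured tilt is within the stated 5-degree limit."""
    return abs(skew_degrees(x1, y1, x2, y2)) <= MAX_AUTO_SKEW_DEG


# A baseline that drifts 70 px over 2000 px is tilted by roughly 2 degrees.
print(round(skew_degrees(100, 500, 2100, 570), 2))
print(within_auto_correction(100, 500, 2100, 570))
# A 300 px drift over the same width exceeds the limit.
print(within_auto_correction(100, 500, 2100, 800))
```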
Tips for better OCR results
1. Use the original PDF from your bank whenever possible - digital PDFs convert more accurately than scans.
2. If you must scan, use a flatbed scanner at 300+ DPI in grayscale or color.
3. Scan one page at a time and ensure each page is straight.
4. Remove paper clips, staples, and sticky notes before scanning.
5. After conversion, spot-check amounts against the original statement - OCR on scanned documents typically achieves 95-98% accuracy versus 99%+ for digital PDFs.
Frequently asked questions
How can I tell if my PDF is scanned or digital?
Open the PDF and try to select text with your mouse. If you can highlight individual words, it is a digital PDF with embedded text. If clicking and dragging selects nothing or selects the entire page as an image, it is a scanned PDF. pdftoxlsx detects this automatically and applies OCR when needed.
What DPI should I scan my bank statement at?
Scan at 300 DPI or higher for optimal OCR accuracy. 300 DPI is a widely used standard for document scanning and works well with the font sizes typically found in bank statements. Higher DPI (400-600) can help on statements with very small print, but for most statements it mainly increases file size without major accuracy gains.
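The trade-off between DPI and file size follows from simple arithmetic: pixel dimensions scale linearly with DPI, so the pixel count scales with its square. The sketch below assumes a US Letter page (8.5 x 11 inches) and 8-bit grayscale with no compression; real scanned PDFs are compressed and will be much smaller, so treat the megabyte figures as relative, not absolute. The helper names are made up for this example.

```python
def pixel_size(width_in, height_in, dpi):
    """Pixel dimensions of a page scanned at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_megabytes(width_in, height_in, dpi, bytes_per_pixel=1):
    """Uncompressed image size in MB (1 byte per pixel = 8-bit grayscale)."""
    w, h = pixel_size(width_in, height_in, dpi)
    return w * h * bytes_per_pixel / 1_000_000


def effective_dpi(pixel_width, page_width_in):
    """DPI implied by an image's pixel width and the physical page width."""
    return pixel_width / page_width_in


# US Letter page at three common scan resolutions.
for dpi in (200, 300, 600):
    w, h = pixel_size(8.5, 11, dpi)
    print(f"{dpi} DPI: {w} x {h} px, about {raw_megabytes(8.5, 11, dpi):.1f} MB raw")

# Checking an existing scan: a 1700 px wide Letter page is only 200 DPI.
print(effective_dpi(1700, 8.5))
```

Going from 300 to 600 DPI quadruples the pixel count, which is why the extra resolution is only worth it for very small print. `effective_dpi` is useful for phone photos, where the app reports pixel dimensions but not DPI.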
Can pdftoxlsx handle a photo of a bank statement taken with my phone?
Yes, as long as the image is reasonably clear with good lighting. Use a document scanning app such as the scanner in Apple Notes, the scan feature in Google Drive, or Adobe Scan - these apps auto-crop, straighten, and enhance the image. Avoid regular camera photos taken at an angle or with shadows, as these significantly reduce OCR accuracy.
Is OCR accuracy as good as digital PDF conversion?
OCR on well-scanned documents typically achieves 95-98% accuracy, while digital PDF conversion achieves 99%+. The difference comes from image artifacts, font rendering, and minor alignment issues inherent in scanning. Always spot-check a few transactions after OCR conversion, especially decimal amounts and dates.
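To see why spot-checking matters, it helps to turn the percentages into a rough count of wrong values. This is an illustrative calculation only: the guide does not say whether accuracy is measured per character or per field, so the sketch assumes it applies per extracted field (date, description, amount), and the 100-row statement is a made-up example.

```python
def expected_errors(rows, fields_per_row, accuracy):
    """Expected number of wrong fields, assuming per-field accuracy (an assumption)."""
    return rows * fields_per_row * (1 - accuracy)


# Hypothetical statement: 100 transactions with 3 fields each.
for accuracy in (0.95, 0.98, 0.99):
    errors = expected_errors(100, 3, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors:.0f} fields to correct")
```

Under these assumptions, a scanned 100-row statement could contain somewhere between roughly 6 and 15 wrong fields, compared with about 3 or fewer for a digital PDF, so checking amounts and dates first is a sensible use of review time.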
Convert your scanned statement now
No signup. Files deleted in 1 hour. GDPR compliant.