Difference between PDF and PowerPoint OCR - ABBYY

We ran the following test:
Convert a PowerPoint slide to an image
Run OCR on the image
When the OCR output is PDF, the results are reasonable; however, when the OCR output is PowerPoint, the results are much worse. Why are there differences, and how can we get good results in PowerPoint?


Check if image contains text

I want to detect and extract text in natural images, if they contain any, like Google Vision does. I found a library on GitHub that detects the regions of an image that contain text and then runs OCR on them. I want it to be faster: before text detection and extraction, I want to check whether an image contains text or not.
I know I can run OCR on it but I want it to be faster than that. If it contains text then it should OCR, if not it should discard the image. Any ideas?
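One cheap pre-filter that could run before any detector is an edge-density check, since text tends to produce many sharp intensity transitions. The sketch below is only a crude heuristic under stated assumptions: the image is already grayscale and given as a list of rows of 0-255 integers, and the function names and both thresholds are illustrative, not taken from any library. It will also fire on textured images without text, so it can only discard obviously flat images.

```python
def edge_density(gray, diff_threshold=40):
    """Fraction of horizontally adjacent pixel pairs whose intensity differs sharply."""
    pairs = 0
    edges = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            pairs += 1
            if abs(left - right) >= diff_threshold:
                edges += 1
    return edges / pairs if pairs else 0.0


def might_contain_text(gray, min_density=0.05):
    # Assumed threshold; it would need tuning on real images.
    return edge_density(gray) >= min_density
```

Images for which `might_contain_text` returns False could be discarded, and only the rest passed on to the slower detection and OCR steps.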

Problem reading cursive text from an image using Azure Cognitive Services

I am having trouble reading text from an image when the text is written in cursive.
When I zoom in, I can see that the text has missing pixels.
[Image: cursive text with missing pixels when zoomed in]

Cropping PDF files cannot crop out text for text extraction (textract and pdfminer)

I’m using the Python library PyPDF2 to crop many PDF files, to cut out the useless information at the top and bottom of academic papers (e.g. page numbers and journal information at the bottom). Then I used the library textract to extract the text from the cropped PDF files into txt files. However, the output txt files still contain the cropped-out information despite the cropping.
This also applies to pdfminer, another text extraction library (not OCR). It seems that for text extraction, as opposed to OCR, the text cannot be eliminated by simply cropping. Can anyone explain why this is the case? Any idea on how else to eliminate useless information in PDF files for text extraction?
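As far as I understand, cropping with PyPDF2 only changes the page's visible box, while the text itself stays in the page's content stream, which text extractors read in full. One possible workaround is to filter the extracted text by position instead. The sketch below assumes each text box has already been obtained as an (x0, y0, x1, y1, text) tuple in PDF points with the origin at the bottom-left (pdfminer's layout objects expose a `bbox` of this shape). `keep_body_text` and the margin values are hypothetical.

```python
def keep_body_text(boxes, page_height, top_margin=60, bottom_margin=60):
    """Return text from boxes lying entirely between the top and bottom margins."""
    lower = bottom_margin
    upper = page_height - top_margin
    kept = [
        (y1, x0, text)
        for x0, y0, x1, y1, text in boxes
        if y0 >= lower and y1 <= upper
    ]
    # Sort top-to-bottom (larger y first), then left-to-right.
    kept.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in kept]
```

Boxes that touch the header or footer band are dropped, so page numbers and journal lines would not reach the txt output as long as the margins are chosen to match the layout.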

Using tesseract to extract text AND pictures (photos, charts and tables) referenced in the output text

I have a PDF composed of scanned images, both pictorial and textual.
By pictorial, I mean photos, charts and tables.
As final output, I want an EPUB readable on e-readers, so that:
text can be selected and highlighted, the font is zoomable, I can add notes (if supported by the device), and TOC references are in place
pictorial images are preserved
For handling TOC references and textual properties,
I manage to obtain a good txt file that I can manually edit and mark up, to be finally processed into an EPUB.
For the pictorial images, I can't find a way to preserve them, i.e. extract them into an image catalog AND reference them in the txt.
I want to avoid creating and referencing images manually; I want to automate it. Am I missing some option in the tesseract documentation?
Could I extract and reference pictorial images with tesseract?
As an alternative, which steps, scripts (preferably Python) or tools could be used to extract and reference pictures from a searchable PDF created with tesseract?
I looked at the tesseract documentation, but could not find this use case. I tried:
tesseract mybook.tif mybookImages get.images
but it did not produce any images.
Process the source PDF into TIFF using Ghostscript:
gs -q -r600x600 -dNOPAUSE -sDEVICE=tiffg4 -dBATCH -sOutputFile=mybook.tif sourcePDF.pdf -c quit
Process the TIFF into txt: I can apply markup to the txt and edit it, but I lose the pictorial images!
tesseract -o -l eng mybook.tif mybook
Process the TIFF into a searchable PDF: the PDF is searchable, but the font is rendered noisily, and I can't edit the text or apply markup
tesseract -o -l eng mybook.tif mybook pdf
Eventually, use AsciiDoc to streamline editing the markup and image references in the text, and then convert to EPUB... but how can I extract images from the TIFF or searchable PDF while keeping references to them in the final txt?
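The referencing half of the problem can be sketched independently of the OCR tool. The example below assumes some layout-analysis step has already split the pages into ordered blocks, each either recognised text or an image region. That step, and the cropping of each region into a file, are not shown. The block format, `build_asciidoc`, and the file-naming scheme are all hypothetical. It emits AsciiDoc with an `image::` block macro wherever a picture occurred, plus the list of image file names to be filled in.

```python
def build_asciidoc(blocks, image_dir="images"):
    """Turn ordered ("text", str) / ("image", None) blocks into AsciiDoc plus an image catalog."""
    lines = []
    catalog = []
    for kind, payload in blocks:
        if kind == "text":
            lines.append(payload.strip())
        elif kind == "image":
            # Hypothetical naming scheme: sequential figure numbers.
            name = f"{image_dir}/figure-{len(catalog) + 1:03d}.png"
            catalog.append(name)
            lines.append(f"image::{name}[Figure {len(catalog)}]")
        else:
            raise ValueError(f"unknown block kind: {kind!r}")
    return "\n\n".join(lines) + "\n", catalog
```

The returned catalog tells the cropping step which files to write, so the references in the text and the extracted images stay in sync.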

Scanned image to Excel

I have a scanned image. I tried to convert it to an Excel file using an online tool, but it does not give clear text. Is there any tool available to convert the image to Excel?
As per comments above - some links to commercial OCR packages out there: