
I have a great many images of text that I have created to obtain texts and abstracts from older literature (copyrights expired) for which pdf files are not available. Because of the sheer number, I have simply been using my iPhone to take quick photos of pages from works in research libraries.

While this allows me to read the text and place the pages into a single pdf file, running TextRecognize on these sometimes slightly rotated, binarized images has proven difficult: the photographed pages are often not strictly planar, so lines of text frequently appear curved.

Simple rotation by a few degrees does not correct these images well enough for TextRecognize to generate useful results, since TextRecognize assumes that lines of text are straight rather than slightly or moderately curved, and that columns of pixels (text characters) are aligned. Hence, several related posts on rotated text are not of much help in such cases.

Obviously, this is overkill for a few lines of text that can simply be typed in manually, like the one presented in the following example. However, many abstracts are much longer, and there are hundreds or perhaps thousands that must be read and converted to text.

The following code, applied to the second page of a pdf file of aggregated page images, demonstrates the problem:

abstract = ImageTake[photos[[2]], {970, 1220}, {480, 2300}] 

This generates the following image of an abstract, here converted to a .png file:

[image: photo2.png]

As is evident from the image of the abstract, lines of text are slightly curved owing to the fact that the photographed pages were not completely flat.

Is there an approach that would allow the lines of text in the image to be fit with, say, a parabolic function, which could then be used to rectify the image into straight lines of text, so that TextRecognize applied to the resulting image reads the text more accurately?

I'm stumped as to how to proceed from here.
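For concreteness, here is a rough sketch of the kind of parabolic rectification I have in mind, applied to the abstract image produced above. It is only a sketch: the names (binQ, pts, curve, straightened) are illustrative, the single shared quadratic baseline is an assumption, and the sign of the correction may well need flipping; I have not gotten this to produce good results.

(* sketch only: fit one shared quadratic baseline to the letter centroids and undo that
   curvature with ImageTransformation; the shared-parabola assumption and the sign of the
   correction are guesses that would need checking against the actual image *)
binQ = ColorNegate@Binarize[abstract];
binQ = DeleteSmallComponents[binQ, 100];
pts = Values@ComponentMeasurements[binQ, "Centroid"];
fit = Fit[pts, {1, x, x^2}, x];      (* curvature shared by all text lines *)
curve[x_] = fit - (fit /. x -> 0);   (* vertical deviation relative to the left edge *)
straightened = ImageTransformation[abstract,
   Function[p, {p[[1]], p[[2]] + curve[p[[1]]]}],
   DataRange -> Full];
TextRecognize[straightened]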

  • If you just need to OCR the image and get text, try an LLM such as ChatGPT-4o or Gemini. They all have APIs for sending an image through web requests and getting the result back, which can be programmed into Mathematica (a rough sketch of this appears after these comments). – Commented Jul 22, 2024 at 5:32
  • I'll look into that, but it leaves open the question of how ChatGPT might implement the straightening, which is not particularly informative when the goal is Mathematica code that mimics such an unspecified algorithm. Using an LLM to solve simple problems also has the disadvantage of generating a very large carbon footprint, so a less damaging approach would be desirable for such a relatively basic task. – Commented Jul 26, 2024 at 17:24
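For reference, here is a minimal, unverified sketch of the web-request approach suggested in the first comment. The helper name ocrWithLLM is hypothetical, and it assumes an OpenAI-style chat-completions endpoint with an API key stored in the OPENAI_API_KEY environment variable.

(* hypothetical helper: send a page image to a multimodal LLM endpoint and return the
   transcription; assumes an OpenAI-style API and an OPENAI_API_KEY environment variable *)
ocrWithLLM[img_Image] := Module[{b64, req, resp},
  b64 = BaseEncode[ExportByteArray[img, "PNG"]];
  req = HTTPRequest["https://api.openai.com/v1/chat/completions", <|
     "Method" -> "POST",
     "Headers" -> {"Authorization" -> "Bearer " <> Environment["OPENAI_API_KEY"]},
     "ContentType" -> "application/json",
     "Body" -> ExportString[<|
        "model" -> "gpt-4o",
        "messages" -> {<|
           "role" -> "user",
           "content" -> {
             <|"type" -> "text", "text" -> "Transcribe the text in this image."|>,
             <|"type" -> "image_url",
               "image_url" -> <|"url" -> "data:image/png;base64," <> b64|>|>}|>}|>, "JSON"]|>];
  resp = ImportString[URLRead[req]["Body"], "RawJSON"];
  resp["choices"][[1]]["message"]["content"]]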

1 Answer


In 14.1, TextRecognize appears to work somewhat well (but not perfectly) on the image as-is:

img = Import["https://i.sstatic.net/EtV56JZP.png"];
TextRecognize[img]
(* Yellowtail rockfish sebastones sarous aggregated along the southern edge of the Atoria Canyon, probably in response to the increased availability of mesopelagic food. The myctophid Stenobrachius leucopsarus was the dominant food organism. *)

Here is also my attempt to straighten the text rows. It is not very good, but it may be a useful starting point for further investigation if you are interested.

I started by noticing that MorphologicalComponents does a pretty good job at recognizing the individual letters (note that I delete the dots above the lowercase i's, because they are too far from the central line of each row):

img = Import["https://i.sstatic.net/EtV56JZP.png"];
bin = ColorNegate@Binarize[img];
(* for now, ignore dots over i's because they are too far above the row centers to work well *)
bin = DeleteSmallComponents[bin, 100];
MorphologicalComponents[bin] // Colorize

[output: colorized image of the individual letter components]

Since MorphologicalComponents finds the individual letters, we can grab the "Centroid" of each letter using ComponentMeasurements. Plotting these centroids, we see the 5 distinct text rows present in the image:

(* if we plot the centroids, we see 5 distinct rows of letters (as we expect) *)
centroids = ComponentMeasurements[bin, "Centroid"];
ListPlot[centroids // Values, PlotRange -> All]

[output: ListPlot of the letter centroids, showing 5 slightly curved rows of points]

We now use ClusteringComponents with the "SpanningTree" method to cluster the centroids (hopefully by row). Note that I have to specify the number of rows, stored here in nColumns (despite the name, it counts the text rows); I was unable to get this to work with an automatic cluster-number specification (the "DBSCAN" method comes close, but I don't show that here):

(* we can use ClusteringComponents with the SpanningTree method to get a good row separation *)
nColumns = 5;
cVals = centroids // Values;
clusters = ClusteringComponents[cVals, nColumns, 1, Method -> "SpanningTree"];

We now gather the centroids by their cluster, and plot. We see that ClusteringComponents has successfully clustered the letters by row:

(* gather the centroid values by their cluster index *)
together = Thread[{cVals, clusters}];
gathered = GatherBy[together, Last];
gathered = Map[Most[#] &, gathered, 2];
gathered = Flatten[#, 1] & /@ gathered;
(* sort rows L to R *)
gatheredSorted = SortBy[#, First] & /@ gathered;
(* the plot shows this worked pretty well *)
ListPlot[gatheredSorted, PlotRange -> All]

[output: ListPlot of the centroids grouped by cluster, one color per row]

We now grab the sub-image of each component (letter) and group the sub-images according to their cluster assignments, sorting them left to right within each row:

(* we can now gather the component keys by cluster index *)
cKeys = centroids // Keys;
keyTogether = Thread[{cKeys, clusters}];
keyGathered = GatherBy[keyTogether, Last];
keyGathered = Flatten /@ Map[Most, keyGathered, 2];
(* order L to R within each row *)
rowOrderings = Ordering /@ gathered;
keyGathered = Table[keyGathered[[i, rowOrderings[[i]]]], {i, nColumns}];
(* grab the sub-images *)
imgs = ComponentMeasurements[bin, "Image"] // Values;
rowImgs = Part[imgs, #] & /@ keyGathered;

We then record the dimensions of each sub-image, and ImagePad them appropriately so that they all have the same width as the original image, and a chosen height (I just chose 1/5 the original image height here). We first create our padding matrices:

(* record the dimensions of each sub-image *)
width = First@ImageDimensions[bin];
wList = Map[First@ImageDimensions@# &, rowImgs, {-1}];
hList = Map[Last@ImageDimensions@# &, rowImgs, {-1}];
goalH = Round[Last@ImageDimensions@bin/5];
(* create padding matrices for ImagePad *)
lPads = gatheredSorted[[All, All, 1]];
rPads = width - (lPads + wList);
colPads = MapThread[Transpose@{#1, #2} &, {lPads, rPads}];
uPads = Floor[(goalH - hList)/2];
dPads = goalH - (uPads + hList);
rowPads = MapThread[Transpose@{#1, #2} &, {uPads, dPads}];
padMats = MapThread[Transpose@{#1, #2} &, {colPads, rowPads}];

Then ImagePad and Total each row. We finally ImageAssemble all the rows together to get our final image:

(* create padded sub-images and Total each row together *)
finalRowImgs = Total /@ Table[MapThread[ImagePad, {rowImgs[[i]], padMats[[i]]}], {i, nColumns}];
(* and finally assemble the rows together *)
assembled = ImageAssemble[List /@ finalRowImgs]

[output: the reassembled image with the text rows straightened]

And TextRecognize:

TextRecognize[assembled]
(* "e ecological 1 y Yllowal rockfish (Sebastodes fardus) aggregated along the southern edge of the At ta Canyon Probably in response to the increased avalability of mesopelagic food Te myctophed stenobrachus leucopsarus was the domnant food orgams m 1060" *)

A few letters are assigned to the wrong rows (notice that the "e" in "Yellowtail" is misplaced), and some of the text is wrong. But playing with the ClusteringComponents Method settings and method sub-options may yield better results.
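If you do experiment further, here is a minimal, unverified sketch of swapping the clustering method on the same centroids; "KMeans" is just one alternative to try, and the "DBSCAN" method mentioned earlier chooses the cluster count automatically but would need its sub-options tuned to this image.

(* unverified sketch: the same clustering call as above with a different Method;
   the resulting cluster indices would feed into the same gathering steps *)
clustersKM = ClusteringComponents[cVals, nColumns, 1, Method -> "KMeans"];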

