Status with OCR friendly form


In my previous two posts I discussed my progress on the OCR friendly web form. To summarize where things stand: with this new feature, a SAHANA web page is regenerated into a form that can be fed directly into the OCR module, which extracts the data according to its own data extraction mechanism. In my first approach I mainly used CSS and images to lay out the required components when the page is printed. That approach had two limitations: the browser setting for printing background images and colors had to be enabled manually, and I had to include considerably larger images with the CSS. I'm now developing a mechanism that uses JavaScript and CSS, leaving the images behind, to generate the required form from the web page. This functionality will only be available on pages where the user needs to give input values to the system.

In this approach I traverse the DOM to extract the required XHTML elements and their respective values (such as legend, label, input fields, select and textarea) and regenerate them on the same page according to the OCR friendly layout. I'm currently identifying and categorizing the elements according to their purpose and importance, so that I can come up with a suitable label layout. I'm attaching two images to show how a page will look when the feature is applied.
Normal web page

Sahana FOSS Disaster Management System


Source: http://demo.sahana.lk

Generated OCR friendly form
Sahana FOSS Disaster Management System XForm Library
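
The DOM traversal described above could be sketched roughly as follows. This is only an illustration and not the actual code of the feature: the names collectFormElements and FORM_TAGS and the shape of the returned records are made up for this example, and the sketch assumes that each node exposes tagName, children, textContent and (for inputs) value, as DOM elements do.

```javascript
// Tag names of interest, taken from the element list mentioned in the post.
// FORM_TAGS and collectFormElements are hypothetical names for this sketch.
const FORM_TAGS = new Set(["legend", "label", "input", "select", "textarea"]);

// Depth-first walk over a node tree. Returns a flat list of
// { tag, text, value } records in document order.
function collectFormElements(root) {
  const found = [];

  const visit = (node) => {
    const tag = (node.tagName || "").toLowerCase();
    if (FORM_TAGS.has(tag)) {
      found.push({
        tag,
        // legend and label mainly carry text
        text: (node.textContent || "").trim(),
        // input, select and textarea carry a value; null when absent
        value: node.value !== undefined ? node.value : null,
      });
    }
    // Array.from also accepts an HTMLCollection, so this works in a browser
    for (const child of Array.from(node.children || [])) {
      visit(child);
    }
  };

  visit(root);
  return found;
}
```

In a browser this might be called as collectFormElements(document.body), and the resulting list could then be used to rebuild the elements in the OCR friendly layout. How the elements are categorized and laid out is a separate step that this sketch does not cover.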
