Parallax Advanced Research and Air Force Research Laboratory develop novel machine learning technology to transcribe complex documents

May 20, 2021

FOR IMMEDIATE RELEASE: May 20, 2021

Dayton, Ohio -- Parallax Advanced Research and the Air Force Research Laboratory Autonomy Capability Team 3, or ACT3, are developing a novel machine learning technology named Detextron. The technology extracts information from complex documents containing signatures, lines of text, and images, then processes and transcribes that information into a readable report. Detextron will first be used by ACT3 customers, such as the U.S. Office of Information Policy, to fulfill Freedom of Information Act requests.

From layout to format to the type of content contained within them, documents can be complex. They can include a mixture of typed and handwritten text as well as embedded and scanned items such as images, audio, and video. For many, being inundated by documents is part of a usual workday, and reviewing those documents and organizing their contents into concise, structured data is an increasingly daunting task.

Detextron might provide a rapid solution. The technology is a neural network being developed by Parallax Advanced Research, a nonprofit research organization, and the Air Force Research Laboratory, both headquartered in Dayton, Ohio. Leading the development from Parallax is Software Developer and Natural Language Processing Expert Vahid Eyorokon.

“The question is: how do you extract cohesive information from a complex document? That's the question I'm solving,” Eyorokon said. “As I started exploring answers, I realized that we could create a model that could do three things: detect, classify, and segment information within documents, which can contain a combination of images, handwritten text, and typed text, and extract the content into a cohesive report.”

The model functions in three steps: detection, classification, and segmentation. In detection, an end user uploads documents into Detextron, and the program selects their content on screen by drawing a bounding box around each line of text and script. In classification, the algorithm distinguishes whether the selected content is text, script, or an image. In segmentation, the algorithm uses a polygon to outline each item and identifies the pixels that correspond to text, script, or an image.
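The relationship between the three outputs can be illustrated with a small sketch. This is not Detextron's code or its neural network; it is a toy example, with hypothetical names (`Region`, `regions_from_instances`), that assumes a model has already assigned each pixel to a numbered item and each item to a class. It only shows how a bounding box (detection), a class label (classification), and a pixel set (segmentation) describe the same item.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Region:
    """One item found on a page (hypothetical structure, for illustration only)."""
    box: Tuple[int, int, int, int]   # detection: (left, top, right, bottom), inclusive
    label: str                       # classification: "text", "script", or "image"
    pixels: List[Tuple[int, int]]    # segmentation: (row, col) pixels of the item


def regions_from_instances(grid: List[List[int]],
                           labels: Dict[int, str]) -> List[Region]:
    """Derive a box, label, and pixel list for every numbered item in a grid.

    grid holds 0 for background and a positive item id elsewhere;
    labels maps each item id to its class name.
    """
    pixels_by_id: Dict[int, List[Tuple[int, int]]] = {}
    for r, row in enumerate(grid):
        for c, item_id in enumerate(row):
            if item_id != 0:
                pixels_by_id.setdefault(item_id, []).append((r, c))

    regions = []
    for item_id in sorted(pixels_by_id):
        pixels = pixels_by_id[item_id]
        rows = [p[0] for p in pixels]
        cols = [p[1] for p in pixels]
        box = (min(cols), min(rows), max(cols), max(rows))
        regions.append(Region(box, labels[item_id], pixels))
    return regions


# A tiny 4x5 "page": item 1 is a typed line, item 2 is a handwritten mark.
toy_grid = [
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 2, 2, 2, 0],
    [0, 0, 2, 0, 0],
]
toy_labels = {1: "text", 2: "script"}
regions = regions_from_instances(toy_grid, toy_labels)
```

In this toy page the handwritten mark is not rectangular, so its bounding box covers more pixels than the mark itself, which is why a pixel-level outline adds information beyond the box.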

Next, Detextron will be further refined to process, classify, and transcribe handwritten text exactly as it’s written. Handwriting analysis is complicated because every individual has a unique way of writing and a person’s handwriting evolves over time. Furthermore, signatures are considered sensitive information, which makes identification more complex because signature datasets are harder to collect and develop. In addition, Detextron will eventually be able to generate reports on documents in languages other than English and on false documents, which are created to seem authentic and factual but are not.

Although there are other versions of this technology, the model created by Eyorokon is unique. “Earlier models didn't have this level of granularity. For example, they would identify paragraphs instead of lines of a text or they didn't perform as well. I created this new model from scratch. The issue, however, lies in the dataset because you must have the right kind of dataset to be able to train this model,” Eyorokon said.

For this dataset to be effective, every single line of text in a document must be labeled. Therefore, Eyorokon and his team developed a method for automatically generating bounding boxes and segmentation data, which was then used to train Detextron. The resulting dataset was generated in days, as opposed to the years that manual labeling would have required.

“The dataset itself was synthetically generated since hand labeling would have taken years,” said Eyorokon. “The process creates artificial documents with lines of generated text and synthetic handwriting. By controlling the way the content is assembled, i.e., how the text and script are inserted into a blank image, I can automatically generate an artificial ‘scan,’ which is an image of a document, and produce the labeled data needed to train Detextron.”
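The idea of getting labels for free by controlling assembly can be sketched in a few lines. The example below is a simplified illustration, not the team's pipeline: the function name `make_synthetic_page`, its parameters, and the page dimensions are all assumptions, and each line is drawn as a filled rectangle rather than rendered text or synthetic handwriting. Because the program chooses where each line goes, it knows the exact bounding box and class without any manual labeling.

```python
import random
from typing import Dict, List, Tuple


def make_synthetic_page(lines: List[Tuple[str, str]],
                        width: int = 40,
                        line_height: int = 3,
                        margin: int = 2,
                        seed: int = 0):
    """Build a toy 'scan' plus its labels.

    lines is a list of (content, kind) pairs, where kind is "text" or "script".
    Each character is drawn as one column of ink, a stand-in for real rendering.
    Returns (page, annotations): page is a 2D list of 0/1 pixels, and
    annotations records the kind, content, and inclusive box of every line.
    """
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    height = margin + len(lines) * (line_height + margin)
    page = [[0] * width for _ in range(height)]
    annotations: List[Dict] = []

    top = margin
    for content, kind in lines:
        line_width = min(len(content), width - 2 * margin)
        # Randomize the horizontal position to vary the layout.
        left = rng.randint(margin, width - margin - line_width)
        for r in range(top, top + line_height):
            for c in range(left, left + line_width):
                page[r][c] = 1
        annotations.append({
            "kind": kind,
            "content": content,
            "box": (left, top, left + line_width - 1, top + line_height - 1),
        })
        top += line_height + margin
    return page, annotations


page, labels = make_synthetic_page([("Hello world", "text"), ("J. Doe", "script")])
```

Changing the seed or the input lines yields a new page with matching labels, which is the property that lets a labeled dataset be produced in bulk.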

Eyorokon used various sources to build the dataset, including a separate neural network, available from an open-source GitHub repository, that generates handwritten text. The last dataset he generated contained 800,000 images, 124 million lines of text, 5 million lines of synthesized script, and 75,000 dynamic document layouts.

According to Eyorokon, Detextron is likely to evolve and expand. “The need for such a tool isn’t exclusive to the government; rather, it’s applicable within private and public sectors, too. I see Detextron eventually becoming a self-service web application that anyone can use. That way, it becomes almost agnostic to the documents an end user provides.”

Detextron is one of many artificial intelligence projects conducted by Parallax Advanced Research scientists that deliver innovative solutions to academic, industry and government clients across the United States. Learn more about Parallax’s novel research and development projects by visiting www.parallaxresearch.org.

###

About Parallax Advanced Research

Parallax is an advanced research institute that tackles global challenges by accelerating innovation and developing technology and solutions through strategic partnerships with government, industry and academia across Ohio and the Nation. Together with academia, Parallax accelerates innovation that leads to new breakthroughs. Together with government, Parallax tackles critical global challenges and delivers new solutions. Together with industry, Parallax develops groundbreaking ideas and speeds them to market.