Textract

Amazon Textract detects and analyzes text in input documents, extracting the relevant information. It detects both the text itself and the relationships between different parts of the text. Given a receipt or invoice, for example, it can detect the relevant items, such as products, prices and dates, as well as how those elements relate to one another. So it might know that a particular line item has a specific price and that this price includes a specific amount of tax.
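
As a rough illustration of how those relationships surface, the sketch below walks a dictionary shaped like an AnalyzeExpense response and groups the fields of each line item together. The sample data is invented and trimmed to the keys the code reads, and the helper name line_items is hypothetical.

```python
def line_items(expense_response):
    """Collect each receipt line item as a {field_type: text} dict."""
    items = []
    for doc in expense_response.get("ExpenseDocuments", []):
        for group in doc.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                fields = {}
                for field in item.get("LineItemExpenseFields", []):
                    # Type.Text names the field (e.g. ITEM, PRICE);
                    # ValueDetection.Text holds the value read from the page.
                    field_type = field.get("Type", {}).get("Text", "OTHER")
                    fields[field_type] = field.get("ValueDetection", {}).get("Text", "")
                items.append(fields)
    return items


# Invented sample, trimmed to the keys used above (an assumption for illustration).
sample_expense = {
    "ExpenseDocuments": [{
        "LineItemGroups": [{
            "LineItems": [
                {"LineItemExpenseFields": [
                    {"Type": {"Text": "ITEM"}, "ValueDetection": {"Text": "Coffee"}},
                    {"Type": {"Text": "PRICE"}, "ValueDetection": {"Text": "3.50"}},
                ]},
            ],
        }],
    }],
}

print(line_items(sample_expense))
```

Each returned dictionary keeps the product and its price together, which is the kind of pairing described above.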

It also generates metadata, for example where on the page that text occurs.
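
To make the location metadata concrete, here is a minimal sketch that reads the bounding box of each detected line from a Textract-style response. Bounding box values are ratios of the page width and height rather than pixels. The sample response is invented, and the helper name lines_with_positions is hypothetical.

```python
def lines_with_positions(response):
    """Return (text, left, top) for every LINE block in a Textract-style response."""
    results = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]
        results.append((block["Text"], box["Left"], box["Top"]))
    return results


# Invented sample, trimmed to the keys used above (an assumption for illustration).
sample_blocks = {
    "Blocks": [
        {"BlockType": "PAGE",
         "Geometry": {"BoundingBox": {"Left": 0.0, "Top": 0.0, "Width": 1.0, "Height": 1.0}}},
        {"BlockType": "LINE", "Text": "Total: 12.00", "Confidence": 99.1,
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.8, "Width": 0.3, "Height": 0.05}}},
    ]
}

for text, left, top in lines_with_positions(sample_blocks):
    print(f"{text!r} starts at left={left}, top={top}")
```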

For particular types of documents it offers specific types of analysis. For generic documents it might identify names, addresses and birthdates, while for receipts it might identify prices, vendors, line items and dates. For identity documents it can also abstract certain fields: a driver's license has a driver's license ID and a passport has a passport ID, and the product can recognize both and abstract them into a single document ID field.
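
The field abstraction for identity documents can be sketched as follows, assuming a response shaped like the one AnalyzeID returns, where the ID of either document type is reported under the normalized type DOCUMENT_NUMBER. The two sample responses and their values are invented, and the helper name document_numbers is hypothetical.

```python
def document_numbers(analyze_id_response):
    """Return the normalized DOCUMENT_NUMBER value of each identity document."""
    numbers = []
    for doc in analyze_id_response.get("IdentityDocuments", []):
        for field in doc.get("IdentityDocumentFields", []):
            if field.get("Type", {}).get("Text") == "DOCUMENT_NUMBER":
                numbers.append(field.get("ValueDetection", {}).get("Text", ""))
    return numbers


# Invented samples: one stands in for a driver's license, one for a passport.
license_response = {"IdentityDocuments": [{"IdentityDocumentFields": [
    {"Type": {"Text": "FIRST_NAME"}, "ValueDetection": {"Text": "JANE"}},
    {"Type": {"Text": "DOCUMENT_NUMBER"}, "ValueDetection": {"Text": "D1234567"}},
]}]}
passport_response = {"IdentityDocuments": [{"IdentityDocumentFields": [
    {"Type": {"Text": "DOCUMENT_NUMBER"}, "ValueDetection": {"Text": "P7654321"}},
]}]}

# The same lookup works for both document types.
print(document_numbers(license_response))
print(document_numbers(passport_response))
```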

Textract can be used from the console UI or from the APIs, so it can be integrated with any applications that you develop or architect. It can also be integrated with other AWS products and services, including other machine learning products.
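
As one possible way to call the API from an application, the sketch below uses the boto3 SDK for Python and the DetectDocumentText operation. It assumes boto3 is installed and that AWS credentials and a region are already configured. The function names detect_lines and main, and the file name receipt.png, are hypothetical.

```python
def detect_lines(client, image_bytes):
    """Call DetectDocumentText and return the detected lines of text."""
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


def main(path):
    # Imported here so detect_lines can be exercised with a stub client
    # on a machine that has no AWS SDK or credentials.
    import boto3

    client = boto3.client("textract")
    with open(path, "rb") as f:
        for line in detect_lines(client, f.read()):
            print(line)


# Example call (hypothetical file name; needs AWS credentials):
# main("receipt.png")
```

Passing the client in as a parameter is a design choice for this sketch: it keeps the parsing logic separate from the AWS call so it can be tested with a stand-in object.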

Input document types:

JPEG
PNG
PDF
TIFF

The outputs are:

extracted text
the structure of that text
any analysis which can be performed

Relevant information can be extracted from generic documents, identity documents, and receipts or invoices.

For most documents the product can operate synchronously, returning results in real time. Large documents are processed asynchronously.
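
The asynchronous flow can be sketched roughly as: start a job against a document stored in S3, wait until the job is no longer in progress, then collect the paginated results. This is a minimal polling sketch using the boto3 operations StartDocumentTextDetection and GetDocumentTextDetection; the function name, bucket and key arguments, and polling defaults are assumptions for illustration. Textract can also publish a completion notification to an SNS topic, which avoids polling.

```python
import time


def run_async_text_detection(client, bucket, key, poll_seconds=5.0, max_polls=60):
    """Start an asynchronous text detection job and return all result blocks."""
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state or we give up.
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if result["JobStatus"] == "FAILED":
        raise RuntimeError(f"Textract job {job_id} failed")

    # Results for large documents are paginated; follow NextToken to the end.
    blocks = list(result.get("Blocks", []))
    while "NextToken" in result:
        result = client.get_document_text_detection(
            JobId=job_id, NextToken=result["NextToken"]
        )
        blocks.extend(result.get("Blocks", []))
    return blocks
```

With a real client this would be called as run_async_text_detection(boto3.client("textract"), "my-bucket", "scans/report.pdf"), where the bucket and key are placeholders.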

Textract is pay-per-usage, with custom pricing available for large volumes.