Text Service

Images and media are a core part of the DLCS platform, but searchability and discoverability of content can be enhanced greatly by incorporating text handling into the DLCS platform. The DLCS offers a number of core text-based services.

Benefits of the DLCS Text Service

The DLCS Text Service provides, for any typeset or typewritten image:

  1. Full text (per image)
  2. Full text (for multi-page objects)
  3. Coordinates (for search hit highlighting, or annotation)
  4. Integration with IIIF Content Search via the Search Service
  5. Integration with natural language processing and entity extraction services via the Semantic Extraction Service

Simply registering image-based content via the DLCS ingest and orchestration service makes the text on that content, and the information extracted from it, available to drive a full range of search and discovery services.

Technical Overview

[Diagram: Starsky components and connections]

Text Server: Image-level services

Text to Search Service indexer

After indexing (see DLCS Ingest Architecture) the supplied or created text metadata (hOCR, ALTO, or plaintext), the plaintext for the image is placed onto an Amazon SQS queue to be processed by the indexer.

Message format:

{
  "image_uri": "plaintext"
}

In this way, any image and associated text data processed via the Text Service becomes searchable via the Search Service.
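For illustration, here is a minimal sketch of a producer placing such a message on the queue using boto3. The queue name is hypothetical (the real name is deployment-specific), and the message format above is read as a mapping of the image URI to its plaintext:

import json
import boto3

# Hypothetical queue name; the real name is deployment-specific.
QUEUE_NAME = "dlcs-text-to-search"

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName=QUEUE_NAME)

def publish_plaintext(image_uri, plaintext):
    # Read the message format above as: the image URI, mapped to its plaintext.
    queue.send_message(MessageBody=json.dumps({image_uri: plaintext}))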

Coordinates service

DLCS services that make use of text (the Search Service and the Semantic Extraction Service) rely on the Text Server’s Coordinates service to return information about the coordinate location of text, which can be used to create annotations on images or to highlight text in search results.
Given an image URI and the character offsets of word tokens within the plaintext of that image, the Coordinates service returns the coordinates of a box surrounding the word or phrase that was supplied.
Example request:

{
  "images": [{
    "imageURI": "https://dlcs.io/iiif-img/50/1/0dc92f88-bf92-4d2f-9172-42451e5c14d6",
    "width": "1024",
    "height": "768",
    "positions": [
      [
        "1175",
        "1178",
        "1186",
        "1190",
        "1195",
        "1201",
        "1206"
      ]
    ]
  }]
}

This request asks for the coordinates of a single instance of a seven-word phrase within the given image. The positions are the character offsets, within the plaintext of the image, of the first letter of each word in the phrase; the supplied width and height give the coordinate space at which the full image will be displayed (though the client may then crop that image).
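As a sketch of how those character offsets might be derived, the following hypothetical helper finds each occurrence of a phrase in an image's plaintext (in which, per the plaintext service described below, words are separated by single spaces) and records the offset of the first letter of each word:

def phrase_offsets(plaintext, phrase):
    """Return, for each occurrence of phrase, the character offsets of the
    first letter of each of its words within the plaintext."""
    occurrences = []
    start = plaintext.find(phrase)
    while start != -1:
        offsets, pos = [], start
        for word in phrase.split():
            offsets.append(pos)
            pos += len(word) + 1  # words are separated by a single space
        occurrences.append(offsets)
        start = plaintext.find(phrase, start + 1)
    return occurrences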
Example response:

{
  "images": [{
    "image_uri": "https://dlcs.io/iiif-img/50/1/0dc92f88-bf92-4d2f-9172-42451e5c14d6",
    "phrases": [
      [{
        "count": 1,
        "xywh": "836,395,22,6"
      }, {
        "count": 6,
        "xywh": "106,405,469,8"
      }]
    ]
  }]
}

This response gives two bounding boxes, meaning the phrase wraps over two lines. The counts indicate that one word appears on the first line and the remaining six appear on the second. The returned x, y, width and height boxes have been scaled from the image that was used for OCR, based on the width and height supplied in the request.

Returned results can be used to generate annotations on images or highlights for search results.
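For example, a returned box can be expressed as a IIIF Image API pct: region (independent of the underlying image's pixel size), or attached to a canvas as a highlight annotation. A minimal sketch, assuming the canvas shares the coordinate space of the width and height supplied in the request:

def pct_region(xywh, width, height):
    """Convert a box in the requested coordinate space to a IIIF Image API
    pct: region, which is independent of the source image's pixel size."""
    x, y, w, h = (int(v) for v in xywh.split(","))
    return "pct:{:.2f},{:.2f},{:.2f},{:.2f}".format(
        100 * x / width, 100 * y / height, 100 * w / width, 100 * h / height)

def highlight_annotation(canvas_uri, xywh):
    # A minimal IIIF Presentation 2 annotation targeting the box on its canvas.
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@type": "oa:Annotation",
        "motivation": "oa:highlighting",
        "on": "{}#xywh={}".format(canvas_uri, xywh),
    }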

Plaintext service

The plaintext service returns the normalised plaintext of a single image: all of the words captured in the text metadata, separated by single spaces. This output is used as the input to other services, for example the Semantic Extraction Service. The response format is similar to the message format for the feed described above.
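A sketch of a client call, assuming a hypothetical endpoint path and a response that, like the indexer message, maps the image URI to its plaintext:

import requests

TEXT_SERVER = "https://text.dlcs.example"  # hypothetical base URL

def plaintext_for(image_uri):
    resp = requests.get(TEXT_SERVER + "/plaintext",
                        params={"imageURI": image_uri})
    resp.raise_for_status()
    # Assumed response shape: the image URI mapped to its plaintext.
    return resp.json()[image_uri]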

Line-level transcription annotations

This service returns the text and bounding box of each line of text within an image. This output is used by river-annotations to provide line-level annotations that can be used for text highlighting, or to feed a future crowd-sourcing service.
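A sketch of how such output might be turned into a IIIF Presentation 2 annotation list; the (text, xywh) response shape assumed here is an illustration, not the service's documented format:

def line_annotation_list(canvas_uri, lines):
    """Build an annotation list from (text, xywh) pairs for one image."""
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@type": "sc:AnnotationList",
        "resources": [{
            "@type": "oa:Annotation",
            "motivation": "sc:painting",
            "resource": {
                "@type": "cnt:ContentAsText",
                "format": "text/plain",
                "chars": text,
            },
            "on": "{}#xywh={}".format(canvas_uri, xywh),
        } for text, xywh in lines],
    }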

Confidence

Some OCR formats include information about the confidence the OCR software has in the quality of its output. Currently the Text Server can return confidence information for hOCR text sources, which is the format generated by the internal OCR service.

This service exposes the transcription confidence, where available, for images that have been processed by the text pipeline. The service accepts a list of image URIs and returns the confidence, as a percentage, for each one.

This service can be used to identify typewritten text within larger corpora of images, or as part of quality assurance and testing.
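For example, a quality-assurance script might flag images whose transcription confidence falls below a threshold. A sketch, with a hypothetical endpoint and an assumed response mapping each image URI to a percentage (or null where none is available):

import requests

def low_confidence_images(image_uris, threshold=60.0):
    resp = requests.post("https://text.dlcs.example/confidence",  # hypothetical
                         json={"images": image_uris})
    resp.raise_for_status()
    return [uri for uri, pct in resp.json().items()
            if pct is not None and pct < threshold]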

Text Server: IIIF-level services

Multi-page plaintext (IIIF manifest)

By wrapping the image-level service, the river-plaintext service accepts a IIIF manifest URI, which it downloads and examines. The image-level service is called for the first image of each canvas in the first sequence, and the results are streamed back to the client, effectively providing a text transcription of the entire manifest.

The service can be used, for example, to return full-text for an entire book.
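A sketch of the equivalent client-side walk over a IIIF Presentation 2 manifest, reusing the hypothetical plaintext_for helper sketched above:

import requests

def manifest_plaintext(manifest_uri):
    """Concatenate the plaintext of the first image on each canvas of the
    first sequence of a IIIF Presentation 2 manifest."""
    manifest = requests.get(manifest_uri).json()
    pages = []
    for canvas in manifest["sequences"][0]["canvases"]:
        image_uri = canvas["images"][0]["resource"]["service"]["@id"]
        pages.append(plaintext_for(image_uri))
    return "\n\n".join(pages)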

Roadmap: PDF Generation

The combination of plain text, coordinates, and images can be used to generate PDFs from ingested DLCS content.
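As an illustration of the technique (not the eventual implementation), a sketch using reportlab: draw each page image, then overlay the words invisibly at their coordinates so the PDF is selectable and searchable. The (text, x, y) word tuples are assumed inputs:

from reportlab.pdfgen import canvas as pdf_canvas

def add_page(pdf, image_path, words, page_w, page_h):
    """Draw the page image, then its text layer in invisible render mode."""
    pdf.drawImage(image_path, 0, 0, width=page_w, height=page_h)
    text = pdf.beginText()
    text.setTextRenderMode(3)  # PDF render mode 3: invisible (searchable) text
    for word, x, y in words:
        text.setTextOrigin(x, y)
        text.textOut(word)
    pdf.drawText(text)
    pdf.showPage()

pdf = pdf_canvas.Canvas("book.pdf")
# add_page(pdf, "page-001.jpg", words_for_page_1, 595, 842)  # A4 in points
pdf.save()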

For more detail on the DLCS roadmap, see: DLCS Roadmap.

Roadmap: HTML Generation

This planned service will return a static HTML version of the text content of a DLCS object.

For more detail on the DLCS roadmap, see: DLCS Roadmap.

More Information