@@ -1,36 +1,36 @@
---
-title: Extract text from images
+title: Extract text from images by using AI enrichment
titleSuffix: Azure AI Search
description: Use Optical Character Recognition (OCR) and image analysis to extract text, layout, captions, and tags from image files in Azure AI Search pipelines.
author: HeidiSteen
ms.author: heidist
ms.service: azure-ai-search
ms.topic: how-to
-ms.date: 01/10/2024
+ms.date: 10/15/2024
ms.custom:
- devx-track-csharp
- ignite-2023
---
-# Extract text and information from images in AI enrichment
+# Extract text and information from images by using AI enrichment
Through [AI enrichment](cognitive-search-concept-intro.md), Azure AI Search gives you several options for creating and extracting searchable text from images, including:
+ [OCR](cognitive-search-skill-ocr.md) for optical character recognition of text and digits
+ [Image Analysis](cognitive-search-skill-image-analysis.md) that describes images through visual features
+ [Custom skills](#passing-images-to-custom-skills) to invoke any external image processing that you want to provide
-Through OCR, you can extract text from photos or pictures containing alphanumeric text, such as the word "STOP" in a stop sign. Through image analysis, you can generate a text representation of an image, such as "dandelion" for a photo of a dandelion, or the color "yellow". You can also extract metadata about the image, such as its size.
+By using OCR, you can extract text from photos or pictures containing alphanumeric text, such as the word *STOP* in a stop sign. Through image analysis, you can generate a text representation of an image, such as *dandelion* for a photo of a dandelion, or the color *yellow*. You can also extract metadata about the image, such as its size.
This article covers the fundamentals of working with images and describes several common scenarios, such as working with embedded images, custom skills, and overlaying visualizations on original images.
-To work with image content in a skillset, you'll need:
+To work with image content in a skillset, you need:
> [!div class="checklist"]
> + Source files that include images
> + A search indexer, configured for image actions
> + A skillset with built-in or custom skills that invoke OCR or image analysis
-> + A search index with fields to receive the analyzed text output, plus output field mappings in the indexer that establish association.
+> + A search index with fields to receive the analyzed text output, plus output field mappings in the indexer that establish association
Optionally, you can define projections to accept image-analyzed output into a [knowledge store](knowledge-store-concept-intro.md) for data mining scenarios.
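If you use a knowledge store, a file projection is a common way to save the normalized images themselves. The following is a minimal sketch only, not a complete definition; the connection string and container name are placeholders, and the property belongs in the skillset definition.
```json
"knowledgeStore": {
  "storageConnectionString": "<your Azure Storage connection string>",
  "projections": [
    {
      "tables": [],
      "objects": [],
      "files": [
        {
          "storageContainer": "normalized-images",
          "source": "/document/normalized_images/*"
        }
      ]
    }
  ]
}
```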
@@ -41,13 +41,13 @@ Image processing is indexer-driven, which means that the raw inputs must be in a
+ Image analysis supports JPEG, PNG, GIF, and BMP
+ OCR supports JPEG, PNG, BMP, and TIF
-Images are either standalone binary files or embedded in documents (PDF, RTF, and Microsoft application files). A maximum of 1000 images can be extracted from a given document. If there are more than 1000 images in a document, the first 1000 are extracted and then a warning is generated.
+Images are either standalone binary files or embedded in documents, such as PDF, RTF, or Microsoft application files. A maximum of 1,000 images can be extracted from a given document. If there are more than 1,000 images in a document, the first 1,000 are extracted and then a warning is generated.
Azure Blob Storage is the most frequently used storage for image processing in Azure AI Search. There are three main tasks related to retrieving images from a blob container:
+ Enable access to content in the container. If you're using a full access connection string that includes a key, the key gives you permission to the content. Alternatively, you can [authenticate using Microsoft Entra ID](search-howto-managed-identities-data-sources.md) or [connect as a trusted service](search-indexer-howto-access-trusted-service-exception.md).
-+ [Create a data source](search-howto-indexing-azure-blob-storage.md) of type "azureblob" that connects to the blob container storing your files.
++ [Create a data source](search-howto-indexing-azure-blob-storage.md) of type *azureblob* that connects to the blob container storing your files. A sketch of this definition follows this list.
+ Review [service tier limits](search-limits-quotas-capacity.md) to make sure that your source data is under maximum size and quantity limits for indexers and enrichment.
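For reference, a minimal *azureblob* data source definition might look like the following sketch. The data source name, container name, and connection string are placeholders.
```json
{
  "name": "my-image-datasource",
  "type": "azureblob",
  "credentials": {
    "connectionString": "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;"
  },
  "container": {
    "name": "my-image-container"
  }
}
```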
@@ -58,11 +58,11 @@ Azure Blob Storage is the most frequently used storage for image processing in A
After the source files are set up, enable image normalization by setting the `imageAction` parameter in indexer configuration. Image normalization helps make images more uniform for downstream processing. Image normalization includes the following operations:
+ Large images are resized to a maximum height and width to make them uniform.
-+ For images that have metadata on orientation, image rotation is adjusted for vertical loading.
++ For images that have metadata that specifies orientation, image rotation is adjusted for vertical loading.
Metadata adjustments are captured in a complex type created for each image. You can't opt out of the image normalization requirement. Skills that iterate over images, such as OCR and image analysis, expect normalized images.
-1. [Create or Update an indexer](/rest/api/searchservice/indexers/create) to set the configuration properties:
+1. [Create or update an indexer](/rest/api/searchservice/indexers/create-or-update) to set the configuration properties:
```json
{
@@ -80,23 +80,23 @@ Metadata adjustments are captured in a complex type created for each image. You
1. Set `dataToExtract` to `contentAndMetadata` (required).
-1. Verify that the `parsingMode` is set to default (required).
+1. Verify that the `parsingMode` is set to *default* (required).
- This parameter determines the granularity of search documents created in the index. The default mode sets up a one-to-one correspondence so that one blob results in one search document. If documents are large, or if skills require smaller chunks of text, you can add Text Split skill that subdivides a document into paging for processing purposes. But for search scenarios, one blob per document is required if enrichment includes image processing.
+ This parameter determines the granularity of search documents created in the index. The default mode sets up a one-to-one correspondence so that one blob results in one search document. If documents are large, or if skills require smaller chunks of text, you can add the Text Split skill, which subdivides a document into pages for processing purposes. But for search scenarios, one blob per document is required if enrichment includes image processing.
-1. Set `imageAction` to enable the *normalized_images* node in an enrichment tree (required):
+1. Set `imageAction` to enable the `normalized_images` node in an enrichment tree (required):
+ `generateNormalizedImages` to generate an array of normalized images as part of document cracking.
- + `generateNormalizedImagePerPage` (applies to PDF only) to generate an array of normalized images where each page in the PDF is rendered to one output image. For non-PDF files, the behavior of this parameter is similar as if you had set "generateNormalizedImages". However, note that setting "generateNormalizedImagePerPage" can make indexing operation less performant by design (especially for large documents) since several images would have to be generated.
+ + `generateNormalizedImagePerPage` (applies to PDF only) to generate an array of normalized images where each page in the PDF is rendered to one output image. For non-PDF files, the behavior of this parameter is similar to setting `generateNormalizedImages`. However, setting `generateNormalizedImagePerPage` can make the indexing operation less performant by design (especially for large documents) because several images must be generated.
1. Optionally, adjust the width or height of the generated normalized images:
- + `normalizedImageMaxWidth` (in pixels). Default is 2000. Maximum value is 10000.
+ + `normalizedImageMaxWidth` in pixels. Default is 2,000. Maximum value is 10,000.
- + `normalizedImageMaxHeight` (in pixels). Default is 2000. Maximum value is 10000.
+ + `normalizedImageMaxHeight` in pixels. Default is 2,000. Maximum value is 10,000.
- The default of 2000 pixels for the normalized images maximum width and height is based on the maximum sizes supported by the [OCR skill](cognitive-search-skill-ocr.md) and the [image analysis skill](cognitive-search-skill-image-analysis.md). The [OCR skill](cognitive-search-skill-ocr.md) supports a maximum width and height of 4200 for non-English languages, and 10000 for English. If you increase the maximum limits, processing could fail on larger images depending on your skillset definition and the language of the documents.
+ The default of 2,000 pixels for the maximum width and height of normalized images is based on the maximum sizes supported by the [OCR skill](cognitive-search-skill-ocr.md) and the [image analysis skill](cognitive-search-skill-image-analysis.md). The [OCR skill](cognitive-search-skill-ocr.md) supports a maximum width and height of 4,200 pixels for non-English languages, and 10,000 pixels for English. If you increase the maximum limits, processing could fail on larger images depending on your skillset definition and the language of the documents.
+ Optionally, [set file type criteria](search-blob-storage-integration.md#PartsOfBlobToIndex) if the workload targets a specific file type. Blob indexer configuration includes file inclusion and exclusion settings. You can filter out files you don't want.
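For example, the following configuration sketch limits processing to PDFs and common image formats. The extension lists are illustrative only; adjust them to match your content.
```json
"parameters": {
  "configuration": {
    "indexedFileNameExtensions": ".pdf,.png,.jpg,.jpeg,.bmp",
    "excludedFileNameExtensions": ".csv,.json,.xml"
  }
}
```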
@@ -113,7 +113,7 @@ Metadata adjustments are captured in a complex type created for each image. You
### About normalized images
-When `imageAction` is set to a value other than "none", the new *normalized_images* field contains an array of images. Each image is a complex type that has the following members:
+When `imageAction` is set to a value other than *none*, the new `normalized_images` field contains an array of images. Each image is a complex type that has the following members:
| Image member | Description |
|--------------------|-----------------------------------------|
@@ -123,10 +123,10 @@ When `imageAction` is set to a value other than "none", the new *normalized_imag
| originalWidth | The original width of the image before normalization. |
| originalHeight | The original height of the image before normalization. |
| rotationFromOriginal | Counter-clockwise rotation in degrees that occurred to create the normalized image. A value between 0 degrees and 360 degrees. This step reads the metadata from the image that is generated by a camera or scanner. Usually a multiple of 90 degrees. |
-| contentOffset | The character offset within the content field where the image was extracted from. This field is only applicable for files with embedded images. The *contentOffset* for images extracted from PDF documents is always at the end of the text on the page it was extracted from in the document. This means images appear after all text on that page, regardless of the original location of the image in the page. |
+| contentOffset | The character offset within the content field from which the image was extracted. This field is only applicable to files with embedded images. For images extracted from PDF documents, the `contentOffset` is always at the end of the text on the page from which the image was extracted. This means images appear after all text on that page, regardless of the original location of the image on the page. |
| pageNumber | If the image was extracted or rendered from a PDF, this field contains the page number in the PDF it was extracted or rendered from, starting from 1. If the image isn't from a PDF, this field is 0. |
- Sample value of *normalized_images*:
+ Sample value of `normalized_images`:
```json
[
@@ -151,16 +151,16 @@ This section supplements the [skill reference](cognitive-search-predefined-skill
1. Add templates for OCR and Image Analysis from the portal, or copy the definitions from the [skill reference](cognitive-search-predefined-skills.md) documentation. Insert them into the skills array of your skillset definition.
-1. If necessary, [include multi-service key](cognitive-search-attach-cognitive-services.md) in the Azure AI services property of the skillset. Azure AI Search makes calls to a billable Azure AI services resource for OCR and image analysis for transactions that exceed the free limit (20 per indexer per day). Azure AI services must be in the same region as your search service.
+1. If necessary, [include a multi-service key](cognitive-search-attach-cognitive-services.md) in the Azure AI services property of the skillset (a sketch of this property follows these steps). Azure AI Search makes calls to a billable Azure AI services resource for OCR and image analysis for transactions that exceed the free limit (20 per indexer per day). Azure AI services must be in the same region as your search service.
-1. If original images are embedded in PDF or application files like PPTX or DOCX, you'll need to add a Text Merge skill if you want image output and text output together. Working with embedded images is discussed further on in this article.
+1. If original images are embedded in PDF or application files like PPTX or DOCX, you need to add a Text Merge skill if you want image output and text output together. Working with embedded images is discussed later in this article.
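For reference, attaching the key (step 2 in the preceding list) uses the `cognitiveServices` property of the skillset. The key value and description in this sketch are placeholders.
```json
"cognitiveServices": {
  "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
  "description": "An Azure AI services resource in the same region as the search service",
  "key": "<your multi-service key>"
}
```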
Once the basic framework of your skillset is created and Azure AI services is configured, you can focus on each individual image skill, defining inputs and source context, and mapping outputs to fields in either an index or knowledge store.
> [!NOTE]
-> See [REST Tutorial: Use REST and AI to generate searchable content from Azure blobs](cognitive-search-tutorial-blob.md) for an example skillset that combines image processing with downstream natural language processing. It shows how to feed skill imaging output into entity recognition and key phrase extraction.
+> For an example skillset that combines image processing with downstream natural language processing, see [REST Tutorial: Use REST and AI to generate searchable content from Azure blobs](cognitive-search-tutorial-blob.md). It shows how to feed skill imaging output into entity recognition and key phrase extraction.
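As a starting point, a minimal OCR skill entry in the skills array might look like the following sketch. The input binding is explained in the next section, and the language code and output names shown here are typical choices rather than requirements.
```json
{
  "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
  "context": "/document/normalized_images/*",
  "defaultLanguageCode": "en",
  "detectOrientation": true,
  "inputs": [
    { "name": "image", "source": "/document/normalized_images/*" }
  ],
  "outputs": [
    { "name": "text", "targetName": "text" },
    { "name": "layoutText", "targetName": "layoutText" }
  ]
}
```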
-### About inputs for image processing
+### Inputs for image processing
As noted, images are extracted during document cracking and then normalized as a preliminary step. The normalized images are the inputs to any image processing skill, and are always represented in an enriched document tree in one of two ways:
@@ -224,11 +224,11 @@ In a skillset, Image Analysis and OCR skill output is always text. Output text i
}
```
-1. [Create or update a search index](/rest/api/searchservice/indexes/create) to add fields to accept the skill outputs.
+1. [Create or update a search index](/rest/api/searchservice/indexes/create-or-update) to add fields to accept the skill outputs.
- In the following fields collection example, "content" is blob content. "Metadata_storage_name" contains the name of the file (make sure it is "retrievable"). "Metadata_storage_path" is the unique path of the blob and is the default document key. "Merged_content" is output from Text Merge (useful when images are embedded).
+ In the following fields collection example, *content* is blob content. *Metadata_storage_name* contains the name of the file (set `retrievable` to *true*). *Metadata_storage_path* is the unique path of the blob and is the default document key. *Merged_content* is output from Text Merge (useful when images are embedded).
- "Text" and "layoutText" are OCR skill outputs and must be a string collection in order to the capture all of the OCR-generated output for the entire document.
+ *Text* and *layoutText* are OCR skill outputs and must be a string collection in order to capture all of the OCR-generated output for the entire document.
```json
"fields": [
@@ -284,7 +284,7 @@ In a skillset, Image Analysis and OCR skill output is always text. Output text i
1. [Update the indexer](/rest/api/searchservice/indexers/create-or-update) to map skillset output (nodes in an enrichment tree) to index fields.
- Enriched documents are internal. To externalize the nodes in an enriched document tree, set up an output field mapping that specifies which index field receives node content. Enriched data is accessed by your app through an index field. The following example shows a "text" node (OCR output) in an enriched document that's mapped to a "text" field in a search index.
+ Enriched documents are internal. To externalize the nodes in an enriched document tree, set up an output field mapping that specifies which index field receives node content. Enriched data is accessed by your app through an index field. The following example shows a *text* node (OCR output) in an enriched document that's mapped to a *text* field in a search index.
```json
"outputFieldMappings": [
@@ -313,15 +313,15 @@ POST /indexes/[index name]/docs/search?api-version=[api-version]
}
```
-OCR recognizes text in image files. This means that OCR fields ("text" and "layoutText") are empty if source documents are pure text or pure imagery. Similarly, image analysis fields ("imageCaption" and "imageTags") are empty if source document inputs are strictly text. Indexer execution emits warnings if imaging inputs are empty. Such warnings are to be expected when nodes are unpopulated in the enriched document. Recall that blob indexing lets you include or exclude file types if you want to work with content types in isolation. You can use these setting to reduce noise during indexer runs.
+OCR recognizes text in image files. This means that OCR fields (*text* and *layoutText*) are empty if source documents are pure text or pure imagery. Similarly, image analysis fields (*imageCaption* and *imageTags*) are empty if source document inputs are strictly text. Indexer execution emits warnings if imaging inputs are empty. Such warnings are to be expected when nodes are unpopulated in the enriched document. Recall that blob indexing lets you include or exclude file types if you want to work with content types in isolation. You can use these settings to reduce noise during indexer runs.
-An alternate query for checking results might include the "content" and "merged_content" fields. Notice that those fields include content for any blob file, even those where there was no image processing performed.
+An alternate query for checking results might include the *content* and *merged_content* fields. Notice that those fields include content for any blob file, even files that didn't undergo image processing.
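For example, a request body along the following lines returns those fields for every document. The `select` list assumes the fields defined earlier in this article.
```json
{
  "search": "*",
  "select": "metadata_storage_name, content, merged_content",
  "count": true
}
```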
### About skill outputs
-Skill outputs include "text" (OCR), "layoutText" (OCR), "merged_content", "captions" (image analysis), "tags" (image analysis):
+Skill outputs include `text` (OCR), `layoutText` (OCR), `merged_content`, `captions` (image analysis), `tags` (image analysis):
-+ "text" stores OCR-generated output. This node should be mapped to field of type `Collection(Edm.String)`. There's one "text" field per search document consisting of comma-delimited strings for documents that contain multiple images. The following illustration shows OCR output for three documents. First is a document containing a file with no images. Second is a document (image file) containing one word, "Microsoft". Third is a document containing multiple images, some without any text (`"",`).
++ `text` stores OCR-generated output. This node should be mapped to a field of type `Collection(Edm.String)`. There's one `text` field per search document consisting of comma-delimited strings for documents that contain multiple images. The following illustration shows OCR output for three documents. First is a document containing a file with no images. Second is a document (image file) containing one word, *Microsoft*. Third is a document containing multiple images, some without any text (`"",`).
```json
"value": [
@@ -350,23 +350,23 @@ Skill outputs include "text" (OCR), "layoutText" (OCR), "merged_content", "capti
]
```
-+ "layoutText" stores OCR-generated information about text location on the page, described in terms of bounding boxes and coordinates of the normalized image. This node should be mapped to field of type `Collection(Edm.String)`. There's one "layoutText" field per search document consisting of comma-delimited strings.
++ `layoutText` stores OCR-generated information about text location on the page, described in terms of bounding boxes and coordinates of the normalized image. This node should be mapped to a field of type `Collection(Edm.String)`. There's one `layoutText` field per search document consisting of comma-delimited strings.
-+ "merged_content" stores the output of a Text Merge skill, and it should be one large field of type `Edm.String` that contains raw text from the source document, with embedded "text" in place of an image. If files are text-only, then OCR and image analysis have nothing to do, and "merged_content" is the same as "content" (a blob property that contains the content of the blob).
++ `merged_content` stores the output of a Text Merge skill, and it should be one large field of type `Edm.String` that contains raw text from the source document, with embedded `text` in place of an image. If files are text-only, then OCR and image analysis have nothing to do, and `merged_content` is the same as `content` (a blob property that contains the content of the blob).
-+ "imageCaption" captures a description of an image as individuals tags and a longer text description.
++ `imageCaption` captures a description of an image as individual tags and a longer text description.
-+ "imageTags" stores tags about an image as a collection of keywords, one collection for all images in the source document.
++ `imageTags` stores tags about an image as a collection of keywords, one collection for all images in the source document.
The following screenshot is an illustration of a PDF that includes text and embedded images. Document cracking detected three embedded images: a flock of seagulls, a map, and an eagle. Other text in the example (including titles, headings, and body text) was extracted as text and excluded from image processing.
-:::image type="content" source="media/cognitive-search-concept-image-scenarios/state-of-birds-screenshot.png" alt-text="Screenshot of three images in a PDF" border="true":::
+:::image type="content" source="media/cognitive-search-concept-image-scenarios/state-of-birds-screenshot.png" alt-text="Screenshot of three images in a PDF." border="true":::
-Image analysis output is illustrated in the JSON below (search result). The skill definition allows you to specify which [visual features](cognitive-search-skill-image-analysis.md#skill-parameters) are of interest. For this example, tags and descriptions were produced, but there are more outputs to choose from.
+Image analysis output is illustrated in the following JSON (search result). The skill definition allows you to specify which [visual features](cognitive-search-skill-image-analysis.md#skill-parameters) are of interest. For this example, tags and descriptions were produced, but there are more outputs to choose from.
-+ "imageCaption" output is an array of descriptions, one per image, denoted by "tags" consisting of single words and longer phrases that describe the image. Notice the tags consisting of "a flock of seagulls are swimming in the water", or "a close up of a bird".
++ `imageCaption` output is an array of descriptions, one per image, denoted by `tags` consisting of single words and longer phrases that describe the image. Notice values such as *a flock of seagulls are swimming in the water* or *a close up of a bird*.
-+ "imageTags" output is an array of single tags, listed in the order of creation. Notice that tags repeat. There's no aggregation or grouping.
++ `imageTags` output is an array of single tags, listed in the order of creation. Notice that tags repeat. There's no aggregation or grouping.
```json
"imageCaption": [
@@ -401,27 +401,27 @@ Image analysis output is illustrated in the JSON below (search result). The skil
## Scenario: Embedded images in PDFs
-When the images you want to process are embedded in other files, such as PDF or DOCX, the enrichment pipeline extracts just the images and then pass them to OCR or image analysis for processing. Image extraction occurs during the document cracking phase, and once the images are separated, they remain separate unless you explicitly merge the processed output back into the source text.
+When the images you want to process are embedded in other files, such as PDF or DOCX, the enrichment pipeline extracts just the images and then passes them to OCR or image analysis for processing. Image extraction occurs during the document cracking phase, and once the images are separated, they remain separate unless you explicitly merge the processed output back into the source text.
[**Text Merge**](cognitive-search-skill-textmerger.md) is used to put image processing output back into the document. Although Text Merge isn't a hard requirement, it's frequently invoked so that image output (OCR text, OCR layoutText, image tags, image captions) can be reintroduced into the document. Depending on the skill, the image output replaces an embedded binary image with an in-place text equivalent. Image Analysis output can be merged at image location. OCR output always appears at the end of each page.
The following workflow outlines the process of image extraction, analysis, merging, and how to extend the pipeline to push image-processed output into other text-based skills such as Entity Recognition or Text Translation.
-1. After connecting to the data source, the indexer loads and cracks source documents, extracting images and text, and queuing each content type for processing. An enriched document consisting only of a root node (`"document"`) is created.
+1. After connecting to the data source, the indexer loads and cracks source documents, extracting images and text, and queuing each content type for processing. An enriched document consisting only of a root node (*document*) is created.
-1. Images in the queue are [normalized](#get-normalized-images) and passed into enriched documents as a [`"document/normalized_images"`](#get-normalized-images) node.
+1. Images in the queue are [normalized](#get-normalized-images) and passed into enriched documents as a [document/normalized_images](#get-normalized-images) node.
-1. Image enrichments execute, using `"/document/normalized_images"` as input.
+1. Image enrichments execute, using `"/document/normalized_images"` as input.
-1. Image outputs are passed into the enriched document tree, with each output as a separate node. Outputs vary by skill (text and layoutText for OCR, tags and captions for Image Analysis).
+1. Image outputs are passed into the enriched document tree, with each output as a separate node. Outputs vary by skill (text and layoutText for OCR; tags and captions for Image Analysis).
1. Optionally (but recommended if you want search documents to include both text and image-origin text), [Text Merge](cognitive-search-skill-textmerger.md) runs, combining the text representation of those images with the raw text extracted from the file. Text chunks are consolidated into a single large string, where the text is inserted first, followed by the OCR text output or image tags and captions.
The output of Text Merge is now the definitive text to analyze for any downstream skills that perform text processing. For example, if your skillset includes both OCR and Entity Recognition, the input to Entity Recognition should be `"document/merged_text"` (the targetName of the Text Merge skill output).
1. After all skills have executed, the enriched document is complete. In the last step, indexers refer to [output field mappings](#output-field-mappings) to send enriched content to individual fields in the search index.
-The following example skillset creates a `"merged_text"` field containing the original text of your document with embedded OCRed text in place of embedded images. It also includes an Entity Recognition skill that uses `"merged_text"` as input.
+The following example skillset creates a `merged_text` field containing the original text of your document with embedded OCRed text in place of embedded images. It also includes an Entity Recognition skill that uses `merged_text` as input.
### Request body syntax
@@ -492,7 +492,7 @@ The following example skillset creates a `"merged_text"` field containing the or
}
```
-Now that you have a merged_text field, you can map it as a searchable field in your indexer definition. All of the content of your files, including the text of the images, will be searchable.
+Now that you have a `merged_text` field, you can map it as a searchable field in your indexer definition. All of the content of your files, including the text of the images, will be searchable.
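A minimal output field mapping for this scenario might look like the following sketch, assuming the index includes a searchable field (named `merged_text` here) to receive the merged output.
```json
"outputFieldMappings": [
  {
    "sourceFieldName": "/document/merged_text",
    "targetFieldName": "merged_text"
  }
]
```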
## Scenario: Visualize bounding boxes
@@ -617,7 +617,7 @@ for value in values:
# you now have an image to work with
```
-Similarly to return an image, return a base64 encoded string within a JSON object with a `$type` property of `file`.
+Similarly, to return an image, return a base64-encoded string within a JSON object with a `$type` property of *file*.
```python
def base64EncodeImage(image):
@@ -634,11 +634,11 @@ def base64EncodeImage(image):
}
```
-## See also
+## Related content
+ [Create indexer (REST)](/rest/api/searchservice/indexers/create)
+ [Image Analysis skill](cognitive-search-skill-image-analysis.md)
+ [OCR skill](cognitive-search-skill-ocr.md)
-+ [Text merge skill](cognitive-search-skill-textmerger.md)
-+ [How to define a skillset](cognitive-search-defining-skillset.md)
-+ [How to map enriched fields](cognitive-search-output-field-mapping.md)
++ [Text Merge skill](cognitive-search-skill-textmerger.md)
++ [How to create a skillset](cognitive-search-defining-skillset.md)
++ [Map enriched output to fields](cognitive-search-output-field-mapping.md)