Search within Word, Excel and Office Documents using Azure Search


Azure Cognitive Services provides features such as language translation and search. This post looks at how to set up Azure Search to search the contents of Office documents (Word, Excel, PowerPoint) stored in Azure Blob storage.

Besides some configuration in the Azure portal, search requires building an index first, which is a three-step process:

  1. Identifying the blob data source container to search and setting it up. This defines WHERE to search.
  2. Defining the index. This defines WHAT to search. It includes setting up the "fields" to search and their properties.
  3. Defining the indexer. This defines HOW the search is set up, i.e. its configuration. It covers settings such as which file extensions must be ignored while indexing, for example image files (.png, .jpg). It also defines the indexing schedule.

All of the above can be done with REST API calls (using tools like Postman) or with the SDKs available for various programming languages, including .NET via the NuGet packages Microsoft.Azure.Search and Microsoft.Azure.Management.Search.

So let's start by setting up the Azure portal:

  1. Log in to the portal and create a new search service. Note that the name you choose becomes part of the unique URL used when calling the REST APIs. Choose your subscription, location and pricing tier.
  2. After the service is created, navigate to Keys and copy the primary admin key. This key authenticates all incoming API calls to this search service.
  3. With that configuration done, move on to creating the search index, which as mentioned above is a three-step process.

Assume the service name is blsearchpoc, so every URL below starts with https://blsearchpoc.search.windows.net. If you named your service something else, change the URLs accordingly.

Identifying the blob data source container to search and setting it up

This defines WHERE to search

First, calling via Postman as a POST request (don't forget to replace the placeholders with your own setup):

url: https://blsearchpoc.search.windows.net/datasources?api-version=2016-09-01

headers:

api-key :"Primary_admin_key_COPIED_FROM_AZURE_PORTAL"
Content-Type: "application/json"

Body:

{
 "name" : "blob-datasource",
 "type" : "azureblob",
 "credentials" : { "connectionString" : "DefaultEndpointsProtocol=https;AccountName=BLOB_ACCOUNT_NAME;AccountKey=BLOB_ACCOUNT_KEY;" },
 "container" : { "name" : "container-name", "query" : "directory_name_within_container" }
}
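If you would rather issue this REST call from code than from Postman, the sketch below shows one way to build the same request with HttpClient types. It only constructs the request and does not send it. The class and method names (DataSourceRequest, Build) are made up for illustration, and the api-version is simply the one used in this post.

```csharp
using System;
using System.Net.Http;
using System.Text;

public static class DataSourceRequest
{
    // Builds (but does not send) the POST request that registers a data source.
    // serviceName, adminKey and jsonBody are supplied by the caller;
    // jsonBody is the JSON document shown above.
    public static HttpRequestMessage Build(string serviceName, string adminKey, string jsonBody)
    {
        var uri = new Uri(
            $"https://{serviceName}.search.windows.net/datasources?api-version=2016-09-01");

        var request = new HttpRequestMessage(HttpMethod.Post, uri)
        {
            // StringContent sets the Content-Type header to application/json.
            Content = new StringContent(jsonBody, Encoding.UTF8, "application/json")
        };

        // The admin key copied from the portal goes into the api-key header.
        request.Headers.Add("api-key", adminKey);
        return request;
    }
}
```

The returned request can then be passed to HttpClient.SendAsync. The same pattern should work for the index and indexer calls later in this post by changing the path and the body.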

Using the SDK in C#:

First, create a search client:

// using Microsoft.Azure.Search;
// using Microsoft.Azure.Search.Models;

private SearchServiceClient CreateSearchServiceClient()
{
    string searchServiceName = "YOUR SERVICE NAME";
    string adminApiKey = "API KEY";
    SearchServiceClient serviceClient = new SearchServiceClient(searchServiceName, new SearchCredentials(adminApiKey));
    return serviceClient;
}

// 1. Define the blob data source
// containerName and folderName are assumed to be supplied by the caller.
SearchServiceClient serviceClient = CreateSearchServiceClient();
string dataSourceName = string.Format("blobdatasource-{0}", folderName.Replace("/", ""));
string connectionString = ConfigurationManager.AppSettings["StorageConnectionString"];
if (string.IsNullOrEmpty(connectionString))
{
    throw new Exception("No connection string for storage!! Contact admin");
}
DataSourceCredentials cred = new DataSourceCredentials(connectionString);
DataContainer datacontainer = new DataContainer(containerName, folderName);
DataSource dataSource = new DataSource(dataSourceName, DataSourceType.AzureBlob, cred, datacontainer);
// Register the data source with the search service.
serviceClient.DataSources.CreateOrUpdate(dataSource);

Defining the Index

This defines WHAT to search. It includes setting up the "fields" to search and their properties. You set the fields of the index, their data types, whether each field is searchable, sortable, the key, and so on. These fields are what you get back when retrieving search results.

url: https://blsearchpoc.search.windows.net/indexes?api-version=2016-09-01

headers:

api-key :"Primary_admin_key_COPIED_FROM_AZURE_PORTAL"
Content-Type: "application/json"

Body:

{
 "name" : "my-target-index",
 "fields": [
 { "name": "id", "type": "Edm.String", "key": true, "searchable": false },
 { "name": "content", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
 ]
}

The same index in C#, with some extra fields set up:

// 2. Define the index
string indexName = "my-target-index";
Index index = new Index(indexName, new List<Field>()
{
    new Field() { Name = "id", Type = DataType.String, IsKey = true, IsSearchable = false },
    new Field() { Name = "content", Type = DataType.String, IsSearchable = true, IsFilterable = false, IsSortable = true, IsFacetable = false },
    new Field() { Name = "uniquefilename", Type = DataType.String, IsSearchable = true, IsFilterable = false, IsSortable = true, IsFacetable = false },
    new Field() { Name = "filedisplayname", Type = DataType.String, IsSearchable = true, IsFilterable = false, IsSortable = true, IsFacetable = false },
    new Field() { Name = "uniquefileId", Type = DataType.String, IsSearchable = true, IsFilterable = false, IsSortable = true, IsFacetable = false },
    new Field() { Name = "filename", Type = DataType.String, IsSearchable = true, IsFilterable = false, IsSortable = true, IsFacetable = false }
});
serviceClient.Indexes.CreateOrUpdate(index);
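An index needs exactly one key field, and that field must be a string (Edm.String), which is why "id" is marked as the key in both versions above. As a minimal sketch, a check like the following could catch a bad definition before it is sent to the service. IndexFieldSpec and IndexSpecCheck are hypothetical helper types invented for this example; they are not part of the Azure SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified description of one index field.
public sealed class IndexFieldSpec
{
    public string Name { get; set; } = "";
    public string Type { get; set; } = "Edm.String";
    public bool Key { get; set; }
    public bool Searchable { get; set; }
}

public static class IndexSpecCheck
{
    // Returns a list of problems; an empty list means the key rules are met.
    public static List<string> Validate(IEnumerable<IndexFieldSpec> fields)
    {
        var problems = new List<string>();
        var keys = fields.Where(f => f.Key).ToList();

        if (keys.Count != 1)
        {
            problems.Add($"Expected exactly one key field but found {keys.Count}.");
        }
        else if (keys[0].Type != "Edm.String")
        {
            problems.Add($"Key field '{keys[0].Name}' must be of type Edm.String.");
        }

        return problems;
    }
}
```

For the two-field index from the REST example, Validate returns an empty list; removing the "key" flag from "id" would produce one problem.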

Defining the Indexer

This defines HOW the search is set up. The indexer defines how often the content is indexed and which file types are ignored during indexing. Most importantly, it ties together the data source and the index created in steps 1 and 2 above.

url: https://blsearchpoc.search.windows.net/indexers?api-version=2016-09-01

headers:

api-key :"Primary_admin_key_COPIED_FROM_AZURE_PORTAL"
Content-Type: "application/json"

Body:

{
 "name" : "my-blob-indexer",
 "dataSourceName" : "blob-datasource",
 "targetIndexName" : "my-target-index",
 "schedule" : { "interval" : "PT5M" }
}
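The "interval" value in the schedule is an ISO 8601 duration, so "PT5M" means every five minutes. As a small sketch, .NET's XmlConvert can translate between such strings and TimeSpan values, which is handy if the interval comes from configuration. The ScheduleInterval class name is made up for this example.

```csharp
using System;
using System.Xml;

public static class ScheduleInterval
{
    // Converts a TimeSpan into an ISO 8601 (xsd:duration) string.
    public static string ToIso8601(TimeSpan interval)
    {
        return XmlConvert.ToString(interval);
    }

    // Parses an ISO 8601 (xsd:duration) string such as "PT5M" into a TimeSpan.
    public static TimeSpan FromIso8601(string value)
    {
        return XmlConvert.ToTimeSpan(value);
    }
}
```

The SDK code further down takes a TimeSpan directly, so this conversion only matters when you build the REST body yourself.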

If the index has custom field names, field mappings must be defined so that the indexer knows how to fetch these values from the blob.

The code below assumes that each blob has metadata associated with it, which is mapped to the corresponding index field name.

// 3. Define the indexer, its schedule with datasource and index
IndexingSchedule schedule = new IndexingSchedule(TimeSpan.FromMinutes(5.0));
IndexingParameters parameters = new IndexingParameters();
// Skip image and audio files; each extension is passed as its own argument.
parameters.ExcludeFileNameExtensions(".png", ".jpeg", ".jpg", ".bmp", ".gif", ".mp3");
parameters.DoNotFailOnUnsupportedContentType();
string indexerName = string.Format("indexer-{0}", folderName.Replace("/", ""));
// Map blob metadata (source) to the custom index fields (target).
List<FieldMapping> fieldMappings = new List<FieldMapping>
{
    new FieldMapping(sourceFieldName: "metadata_uniquefilename", targetFieldName: "uniquefilename"),
    new FieldMapping(sourceFieldName: "metadata_filedisplayname", targetFieldName: "filedisplayname"),
    new FieldMapping(sourceFieldName: "metadata_storage_name", targetFieldName: "filename")
};
Indexer indexer = new Indexer(indexerName, dataSourceName, indexName, schedule: schedule, parameters: parameters, fieldMappings: fieldMappings);
serviceClient.Indexers.CreateOrUpdate(indexer);

// Adding metadata while adding the blob
// using Microsoft.WindowsAzure.Storage.Blob;

private static void AddBlobMetadata(CloudBlockBlob blockBlob, string fileDisplayName, string uniqueFileId, bool isDeleted)
{
    // Add some metadata
    blockBlob.Metadata["filedisplayname"] = fileDisplayName;
    blockBlob.Metadata["uniquefileId"] = uniqueFileId;
    blockBlob.Metadata["isDeleted"] = isDeleted.ToString();

    // Save the blob's metadata.
    blockBlob.SetMetadata();
}

You can see all the data sources, indexes and indexers in the Azure portal by navigating to the search service -> Overview.


You can also search within the documents from the Azure portal by using "search options".

Since the search call is a GET, you can easily call it using Postman. Remember to replace the search term at the end of the URL and to send the api-key header as before:

https://blsearchpoc.search.windows.net/indexes/my-target-index/docs?api-version=2016-09-01&search=ANY_SEARCH_TERM_TO_BE_USED
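Search terms typed by users often contain spaces or characters such as "&" that would break the query string if pasted into the URL as-is. The following is a minimal sketch of building the search URL with the term escaped. SearchUrl.Build is a hypothetical helper, and the api-version is the one used throughout this post.

```csharp
using System;

public static class SearchUrl
{
    // Builds the GET URL for a simple search against one index.
    // The search term is percent-encoded so spaces and '&' stay inside the value.
    public static string Build(string serviceName, string indexName, string searchTerm)
    {
        return $"https://{serviceName}.search.windows.net/indexes/{Uri.EscapeDataString(indexName)}/docs"
             + $"?api-version=2016-09-01&search={Uri.EscapeDataString(searchTerm)}";
    }
}
```

For example, Build("blsearchpoc", "my-target-index", "annual report") encodes the space in the term as %20.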

Using the SDK in C#:

// using System.Collections.Generic;
// using System.Dynamic;

public dynamic Search(string indexToSearch, string searchTerm)
{
    dynamic searchResult = new ExpandoObject();

    SearchServiceClient serviceClient = CreateSearchServiceClient();
    string indexName = indexToSearch;

    ISearchIndexClient indexClient = serviceClient.Indexes.GetClient(indexName);
    var results = indexClient.Documents.Search(searchTerm);
    searchResult.SearchResultCount = results.Results.Count;

    List<dynamic> searchResultItems = new List<dynamic>();
    foreach (var result in results.Results)
    {
        // Copy only the fields that are present on this document.
        dynamic searchResultItem = new ExpandoObject();
        if (result.Document.ContainsKey("content"))
        {
            searchResultItem.Content = result.Document["content"];
        }
        if (result.Document.ContainsKey("filename"))
        {
            searchResultItem.FileName = result.Document["filename"];
        }
        if (result.Document.ContainsKey("uniquefilename"))
        {
            searchResultItem.Uniquefilename = result.Document["uniquefilename"];
        }
        if (result.Document.ContainsKey("uniquefileId"))
        {
            // Stored under its own property so it does not overwrite Uniquefilename.
            searchResultItem.UniquefileId = result.Document["uniquefileId"];
        }
        searchResultItems.Add(searchResultItem);
    }
    searchResult.SearchResultItems = searchResultItems;
    return searchResult;
}

Azure uses Lucene indexing behind the scenes. We can even leverage Lucene indexes directly if Azure Search itself is not required. More on that in the next blog post.

 
