13 July 2004

W3C Releases Public Working Draft for Full-Text Searching of XML Text and Documents

The W3C has released the Public Working Draft for Full-Text Searching of XML Text and Documents (link to "The Cover Pages" article on the announcement). This draft is entitled XQuery 1.0 and XPath 2.0 Full-Text. To quote "The Cover Pages" article:


As defined by the draft, "full-text queries are performed on text which has been tokenized, i.e., broken into a sequence of words, units of punctuation, and spaces." The new full-text search facility is implemented by extending the XQuery and XPath languages to support a new "FTContainsExpr" expression and a new "ft:score" function.

Expressions of the type FTSelection are composed of: (1) words or combinations of words that are the search strings to be found as matches; (2) match options such as case sensitivity or an indication to use stop words; (3) Boolean operators that allow composition of an FTSelection from simpler FTSelections; (4) positional constraints such as indication of match distance or window.

The new Full-Text Working Draft endeavors to meet search requirements specified in an updated companion draft XQuery 1.0 and XPath 2.0 Full-Text Use Cases. This document provides use cases designed to "illustrate important applications of full-text querying within an XML query language. Each use case exercises a specific functionality relevant to full-text querying. An XML Schema and sample input data are provided; each use case specifies a query applied to the input data, a solution in XQuery, a solution in XPath (when possible), and the expected results."

Full-text query designed as an extension of XQuery and XPath will support several kinds of searches not possible using simple substring matching. It allows precision querying of XML documents containing "highly-structured data (numbers, dates), unstructured data (untagged free-flowing text), and semi-structured data (text with embedded tags)."

Language-based query and token-based searches are also supported; for example, find all the news items that contain a word with the same linguistic stem as the English word "mouse" — which finds occurrences of both "mouse" and "mice" together with possessive forms.

Tokenization serves as the basis for full-text search in the W3C draft. Words, spaces, and punctuation are distinguished. A "word is defined as any character, n-gram, or sequence of characters returned by a tokenizer as a basic unit to be queried; consecutive words need not be separated by either punctuation or space, and words may overlap; a phrase is a sequence of ordered words which can contain any number of words." This model "enables functions and operators which work with the relative positions of words (e.g., proximity operators). It uniquely identifies sentences and paragraphs in which words appear. Tokenization also enables functions and operators which operate on a part or the root of the word, e.g., wildcards, stemming."

The W3C XQuery and XSL Working Groups invite public comment on the two full-text query drafts.
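As a rough illustration of why tokenization with word positions matters for positional constraints such as a match window, here is a minimal Python sketch. It is not taken from the W3C draft: the function names `tokenize` and `within_window`, the regex-based tokenizer, and the lowercasing are all assumptions made for the example, and a real implementation would use a far more careful tokenizer.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens, pairing each with its word position.

    Assumption: a "word" is any run of letters, digits or underscores.
    """
    return [(pos, word.lower()) for pos, word in enumerate(re.findall(r"\w+", text))]


def within_window(text, first, second, window):
    """Return True if `first` and `second` both occur inside some span of
    `window` consecutive words (a simplified positional constraint)."""
    tokens = tokenize(text)
    first_positions = [p for p, w in tokens if w == first.lower()]
    second_positions = [p for p, w in tokens if w == second.lower()]
    # Two words fit in a window of n words when their positions differ by < n.
    return any(abs(a - b) < window for a in first_positions for b in second_positions)


sample = "The quick brown fox jumps over the lazy dog"
print(within_window(sample, "quick", "fox", 3))
print(within_window(sample, "quick", "dog", 3))
```

A plain substring search cannot answer this kind of question, because it has no notion of which word comes where; once the text is a sequence of positioned tokens, the proximity check is a simple comparison of positions.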


Putting this within the broader XQuery context:

"The mission of the XML Query Project is to provide flexible query facilities to extract data from real and virtual documents on the World Wide Web, therefore finally providing the needed interaction between the Web world and the database world. Ultimately, collections of XML files will be accessed like databases. The ambitious task of the XML Query (XQuery) Working Group is therefore to develop the first world standard for querying web documents, following the incredibly successful discussion started at the QL'98 event. However, the XML Query (XQuery) project is all-around, and also includes in its efforts not only the standard for querying XML documents, but also the next-generation standards for doing XML selection (XPath2), for doing XML serialization, for doing Full-Text Search, for providing a possible functional XML Data Model, and for providing a standard set of functions and operators for manipulating web data..." [from the XQuery page]

The development of robust searching techniques within XML documents is a crucial underlying technology for many evolving areas of distributed computing and eBusiness/eCommerce. It will be interesting to see if document-centric XML-based approaches mature the same way that relational database approaches have over the past 20 years to become a core component of nearly all systems. However, trying to capture complex natural language concepts like alternative irregular plurals automatically in search queries may be pushing things further than can actually be achieved in real-world implementations at present.
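To show why irregular plurals are awkward, here is a small Python sketch of a deliberately naive stemmer. Everything in it is an assumption for illustration: the names `naive_stem`, `matches_stem` and `IRREGULAR_FORMS` are hypothetical, the suffix rules are far cruder than any real stemming algorithm, and nothing here comes from the W3C draft.

```python
import re

# Hypothetical hand-maintained exception table; rules alone cannot derive these.
IRREGULAR_FORMS = {"mice": "mouse", "geese": "goose", "children": "child"}


def naive_stem(word, irregular=IRREGULAR_FORMS):
    """Reduce a word to a crude stem: drop possessives, then plurals."""
    word = word.lower()
    # Strip possessive endings such as "mouse's" or "dogs'".
    if word.endswith("'s"):
        word = word[:-2]
    elif word.endswith("'"):
        word = word[:-1]
    # Irregular plurals have to be looked up explicitly.
    if word in irregular:
        return irregular[word]
    # Very rough regular-plural rule: drop a single trailing "s".
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word


def matches_stem(text, query):
    """Return True if any word in `text` shares a stem with `query`."""
    words = re.findall(r"[A-Za-z']+", text)
    return any(naive_stem(w) == naive_stem(query) for w in words)


print(matches_stem("Three blind mice ran past", "mouse"))
print(naive_stem("mice", irregular={}))
```

With the exception table, "mice" and "mouse's" both reduce to "mouse"; with an empty table, the suffix rules leave "mice" untouched, so the match depends entirely on someone having listed the irregular form in advance.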

Posted by mofoghlu at July 13, 2004 10:45 AM