Automatic Recognition of Dates, Email Addresses, and Credit Card Numbers

dtSearch 7.40 includes an option to automatically recognize dates, email addresses, and credit card numbers in text during indexing. 

Dates

Date recognition looks for anything that appears to be a date, using English-language months (including common abbreviations) and numerical formats.  Examples of date formats that are recognized include:

January 15, 2006
15 Jan 06
2006/01/15
1/15/06
1-15-06
The fifteenth of January, two thousand six
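
To illustrate why mapping many written forms to one date is useful, here is a minimal Python sketch that parses several of the formats above into a single calendar date. It is not dtSearch's recognizer: the format list and the parse_date helper are hypothetical, the sketch assumes an English locale and month-first order for the numeric forms, and it does not attempt spelled-out dates.

```python
from datetime import date, datetime

# Hypothetical format list for illustration only; not dtSearch's actual rules.
FORMATS = [
    "%B %d, %Y",   # January 15, 2006
    "%d %b %y",    # 15 Jan 06
    "%Y/%m/%d",    # 2006/01/15
    "%m/%d/%y",    # 1/15/06 (month first)
    "%m-%d-%y",    # 1-15-06 (month first)
]

def parse_date(text):
    """Return a date for the first format that matches, or None if none do."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None

samples = ["January 15, 2006", "15 Jan 06", "2006/01/15", "1/15/06", "1-15-06"]
# Every sample collapses to the same calendar date.
parsed = {parse_date(s) for s in samples}
```

Because every variant reduces to the same value, a single date() expression can stand in for all of them.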

To search for a date, put "date()" around the date expression or range.  For example, to find any of the expressions above near the word "apple", search for:

    date(jan 15 2006) w/10 apple

To search for a range of dates near the word "apple", search for:

    date(jan 10 2006 to jan 20 2006) w/10 apple

A field search for a date expression is written like a field search for a word:

    DateField contains date(jan 10 2006 to jan 20 2006)

Open-ended ranges are not supported. To search for any date after or before a particular date, enter a bounded range that uses the maximum or minimum supported year as the other bound. The maximum value for a year is 2900, and the minimum value is 1000. For example, to find any date after January 10, 2006:

    DateField contains date(jan 10 2006 to jan 1 2900)
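
When search requests are generated by a program, the missing bound can be filled in automatically. The following Python sketch shows one way to do that; the date_range and fmt helpers are hypothetical names, and the choice of "jan 1" for both limits simply mirrors the example above.

```python
from datetime import date

MIN_YEAR, MAX_YEAR = 1000, 2900  # year limits stated in the text above

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def fmt(d):
    """Format a date as 'jan 10 2006', matching the examples in this article."""
    return f"{MONTHS[d.month - 1]} {d.day} {d.year}"

def date_range(start=None, end=None):
    """Build a date(...) expression; a missing bound is replaced by the year limit."""
    lo = fmt(start) if start else f"jan 1 {MIN_YEAR}"
    hi = fmt(end) if end else f"jan 1 {MAX_YEAR}"
    return f"date({lo} to {hi})"

after_query = "DateField contains " + date_range(start=date(2006, 1, 10))
before_query = "DateField contains " + date_range(end=date(2006, 1, 10))
```

Here after_query reproduces the example above, and before_query is the corresponding request for dates before January 10, 2006.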

Email Addresses

Email address recognition looks for text that follows the syntax for a valid email address (example:  sales@dtsearch.com).  This makes it possible to search for a specific email address regardless of the alphabet settings for the @ and . characters, as well as any other punctuation that may be present in an email address.  Also, this makes it possible to use the word listing functions in dtSearch to enumerate all email addresses in a document collection.

To search for an email address, put "mail()" around the address.  The * and ? wildcard characters are supported inside the parentheses.  Examples:

    mail(sales@dtsearch.com)
    mail(s*@dtsearch.com)
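
As a rough illustration of what the wildcard form selects, the Python sketch below applies the same pattern to a small list of addresses using the standard fnmatch module, in which * matches any run of characters and ? matches a single character. This only approximates the semantics: the mail_matches helper and the sample addresses are hypothetical, case-insensitive matching is an assumption, and fnmatch also gives special meaning to square brackets, which the text above does not mention.

```python
from fnmatch import fnmatchcase

def mail_matches(pattern, address):
    """Match an address against a * / ? wildcard pattern, ignoring case (assumed)."""
    return fnmatchcase(address.lower(), pattern.lower())

# Made-up sample addresses for illustration.
addresses = [
    "sales@dtsearch.com",
    "support@dtsearch.com",
    "tech@dtsearch.com",
    "sales@example.com",
]

hits = [a for a in addresses if mail_matches("s*@dtsearch.com", a)]
```

In this sample, only the two dtsearch.com addresses that begin with "s" are selected.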

Credit Card Numbers

Credit card number recognition looks for any sequence of digits that appears to satisfy the criteria for a valid credit card number issued by one of the major credit card issuers.  Credit card numbers are recognized regardless of the pattern of spaces or punctuation embedded in the number.  Examples:

    1234-5678-1234-5678
    1234567812345678
    1234 5678 1234 5678
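
The idea of ignoring embedded spaces and punctuation can be sketched in a few lines of Python. This is an illustration only, not dtSearch's implementation, and normalize_card is a hypothetical helper name.

```python
import re

def normalize_card(text):
    """Drop everything except digits so differently punctuated forms compare equal."""
    return re.sub(r"\D", "", text)

variants = ["1234-5678-1234-5678", "1234567812345678", "1234 5678 1234 5678"]
# All three written forms reduce to the same digit string.
normalized = {normalize_card(v) for v in variants}
```

All three forms above reduce to one digit string, which is why a single search expression can cover them.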

Numerical tests used by the credit card issuers for card validity are used to exclude sequences of numbers that are not credit card numbers.  However, these tests are not perfect, so the credit card number recognition feature may pick up some numbers that are not really credit card numbers.
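
One widely used numerical test for card numbers is the Luhn checksum. Whether dtSearch applies exactly this test is an assumption; the Python sketch below (with the hypothetical helper luhn_valid) is included only to show how such a test works and why it cannot rule out every false positive.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch in "0123456789"]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0

passes_test_number = luhn_valid("4111 1111 1111 1111")  # common test number
passes_placeholder = luhn_valid("1234-5678-1234-5678")  # placeholder digits
passes_all_zeros = luhn_valid("0000 0000 0000 0000")    # not a real card
```

A string of sixteen zeros passes the checksum even though it is not a real card number, which shows how a checksum alone can admit false positives. Under this particular test the placeholder digits 1234-5678-1234-5678 do not pass, so they should be read as illustrations of formatting only.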

To search for a credit card number, put "creditcard()" around the number.  Example:

    creditcard(1234*)

Enabling automatic recognition of dates, email addresses, and credit card numbers

In dtSearch Desktop, click Options > Preferences > Indexing Options, and check the box to "Automatically recognize dates in text."

In the dtSearch Engine API, set the flag dtsoTfRecognizeDates in Options.TextFlags.

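Although the check box and the flag mention only dates, they enable all three types of recognition. Currently there is no option to separately control whether dates, email addresses, and credit card numbers are recognized.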

Word lists

To list all dates, credit card numbers, or email addresses in an index, you can use the word listing functions in dtSearch Desktop (Index > List Index Contents...).  In the dtSearch Engine API, you can use ListIndexJob (.NET) or DListIndexJob (C++).

The same syntax used in search requests works in the listing functions, so if you generate a list using "creditcard(*)", you will get a list of all credit card numbers in the index.

Effect on performance

Indexing will be slower with the recognition feature enabled. 

Searching for dates, email addresses, and credit card numbers can be substantially faster because you can search for a single unique expression instead of having to search for many different variations.  For example, a single search for:

    creditcard(1234123412341234)

will find that credit card number regardless of the presence of spaces or punctuation between the digits.  Without the recognition feature, covering even the most common credit card number formats would require a much more complex search request that would take more processing time.  Similarly, it will be much faster to search for:

    date(January 15, 2005)

than to search for the many ways this date could be expressed in text.

What about phone numbers, social security numbers, etc.?

Currently these are not recognized, although we may add support for them in a future version.  There is a trade-off between completeness and false positives that gets worse as more types of numerical data are recognized.  Credit card numbers can be verified to some extent, while telephone numbers and social security numbers cannot, so adding support for these types of numbers would generate many more false positives.