dtSearch is a Unicode Consortium Official Gold Sponsor of two Unicode search symbols.

 

Unicode Support

  • dtSearch supports hundreds of different international languages through Unicode.

    USA Business Radio: Globalizing Text Search    


  • In addition to Unicode support, dtSearch offers extensive alphabet customization options.
  • See Unicode FAQ for more technical information.

Searching Unicode

  • Most search options are language-neutral. From USA Business Radio: Globalizing Text Search:
    • Of course, numeric-oriented searching like numeric range, date range, credit card identification, file hash value generation and search are generally language neutral.
    • And most word-oriented search types are as well, including unstructured natural language searching, Boolean and/or/not searching, phrase searching, proximity searching, metadata-specific searching, searches for specific Unicode emojis, etc.
    • Even fuzzy searching adjustable from 1 to 10 to sift through minor spelling or OCR errors works regardless of the underlying Unicode language.
  • dtSearch now has a drop-down to select the noise word list from over 25 European languages prior to building an index. (The noise word list is "hard-wired" into an index. Adjusting the noise word list for a different language can be helpful if you are indexing a large collection of data in a particular language.)
  • dtSearch now also has a drop-down, used at the time of search, to select stemming rules for over 25 European languages. (Stemming finds grammatical variants of a word, such as applies, applied, and applying in an English search for apply.)
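
The difference between the two settings above, a noise word list fixed at index time and stemming chosen at search time, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the noise word list, the toy suffix rules, and the function names (build_index, toy_stem, search) are all hypothetical and are not dtSearch's data or API.

```python
# Hypothetical, tiny English noise word list (illustration only).
NOISE_WORDS = {"the", "a", "an", "of", "and", "for", "in"}

def toy_stem(word: str) -> str:
    """Very rough English suffix stripping; real stemmers use far richer rules."""
    for suffix, replacement in (("ies", "y"), ("ied", "y"), ("ing", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

def build_index(docs: dict, noise_words: set) -> dict:
    """Index time: noise words are dropped here, so the choice is baked into the index."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in noise_words:
                index.setdefault(word, set()).add(doc_id)
    return index

def search(index: dict, term: str, stemming: bool = False) -> set:
    """Search time: stemming is applied per query, without rebuilding the index."""
    term = term.lower()
    if not stemming:
        return set(index.get(term, set()))
    target = toy_stem(term)
    hits = set()
    for word, doc_ids in index.items():
        if toy_stem(word) == target:
            hits |= doc_ids
    return hits

docs = {
    1: "She applied for the job",
    2: "The rule applies",
    3: "applying the patch",
    4: "an apple",
}
index = build_index(docs, NOISE_WORDS)
```

Because the noise words never enter the index, changing the list means re-indexing, while stemming can be switched on or off for each individual search.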

Chinese, Japanese and Korean Text With No Word Breaks

  • Some Chinese, Japanese, and Korean text does not include word breaks. Instead, the text appears as lines of characters with no spaces between the words.
  • Because there are no spaces separating the words on each line, dtSearch sees each line of text as a single long word.
  • To make this type of text searchable, enable automatic insertion of word breaks around Chinese, Japanese, and Korean characters, so that each character is treated as a single word.
  • dtSearch Desktop/Network: In Options > Preferences > Letters and Words, check the box to “Insert word breaks between Chinese, Japanese, and Korean characters in text.”
  • dtSearch Developer API: set dtsoTfAutoBreakCJK in Options.TextFlags.

Language Analyzer API Integration

  • In addition to the extensive alphabet customization options available across the dtSearch product line, the dtSearch Engine also includes a Language Analyzer API that can be used to integrate morphological analyzers and custom or dictionary-based word breakers into the dtSearch Engine indexing process.
  • The dtSearch Engine also includes an API for substituting a non-English language thesaurus for the existing English-language one.
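
To give a sense of what a dictionary-based word breaker plugged into an indexing process might look like, here is a minimal Python sketch. It assumes a hypothetical callback shape in which the indexer hands a block of Unicode text to an analyzer and gets back the words to index. The names make_dictionary_breaker and index_blocks are invented for this example and are not the actual Language Analyzer API, and the greedy longest-match strategy is just one simple word-breaking technique.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical analyzer contract: one block of Unicode text in, words to index out.
Analyzer = Callable[[str], List[str]]

def make_dictionary_breaker(dictionary: Iterable[str]) -> Analyzer:
    """Build a greedy longest-match word breaker from a word list."""
    words = set(dictionary)
    max_len = max((len(w) for w in words), default=1)

    def analyze(block: str) -> List[str]:
        out = []
        i = 0
        while i < len(block):
            if block[i].isspace():
                i += 1
                continue
            # Try the longest candidate first; fall back to a single character.
            for n in range(min(max_len, len(block) - i), 0, -1):
                candidate = block[i : i + n]
                if n == 1 or candidate in words:
                    out.append(candidate)
                    i += n
                    break
        return out

    return analyze

def index_blocks(blocks: Iterable[str], analyzer: Analyzer) -> Dict[str, List[int]]:
    """Stand-in for the indexer: map each returned word to the blocks it occurs in."""
    index: Dict[str, List[int]] = {}
    for block_no, block in enumerate(blocks):
        for word in analyzer(block):
            index.setdefault(word, []).append(block_no)
    return index

breaker = make_dictionary_breaker(["東京", "タワー"])
index = index_blocks(["東京タワー", "東京"], breaker)
```

Compared with breaking at every character, a dictionary-based breaker indexes whole words such as 東京, which is the kind of refinement a morphological analyzer takes much further.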

Basis Technology’s Rosette® Linguistics Platform Integration

  • The Rosette Linguistics Platform helps unlock the meaning of unstructured text by determining the language and identifying the basic linguistic features and structure. Relying on code that is unique to each particular language, Rosette provides highly accurate morphological analysis of Chinese, Japanese, Korean, and other international languages.
  • The Rosette Linguistics Platform integrates with dtSearch search functionality through the dtSearch Engine’s Language Analyzer API.  Essentially, the dtSearch Engine API passes blocks of Unicode text to the Rosette Linguistics Platform and accepts back words to index.
  • For more details on how the two products work together, including a chart detailing the different steps involved in the dtSearch Engine and Rosette API integration, please see the dtSearch and Rosette Full-Featured International Search PDF white paper.