ddc-concordance - a search engine for linguists
Introduction to DDC-Concordance
DDC-Concordance is an open-source (LGPL) search engine developed specifically to meet the needs of linguistic researchers. The following properties are particularly relevant:
- Sentence-based or document-based searches
- Statistical queries with exact results, not approximations
- In addition to classical search engine features such as boolean operators (AND, OR, NOT), left and right truncation and distance operators, ddc-concordance can also search for word forms. For example, a search for "child" finds all documents containing word forms such as child, children etc. This functionality is currently available for English, German and Russian.
- ddc-concordance can index metadata from XML documents
- Words can be indexed with searchable annotations, in particular word forms, lemmas, part-of-speech tags and semantic categories
- Interval searches, both symmetrical and directed (e.g. NEAR and FOLLOWED_BY)
- Phrase searches
- Relevance ranking operator for documents
- ddc-concordance is fast. Indexing a corpus of 100 million words takes approximately 1.5 hours. The first ten hits for simple queries are shown in about 0.2 seconds.
- ddc-concordance can handle huge corpora because of its distributed clustering architecture. The largest known corpus contains about 1 billion tokens, and we have not reached a limit yet.
- Client libraries are available for Perl, PHP, Python and C/C++ (for developers), along with ready-to-use command-line clients and a simple CGI script
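To make the word-form and interval searches listed above more concrete, here is a minimal Python sketch of the general idea. It is a toy illustration under stated assumptions, not ddc-concordance's query syntax, index format or client API: the LEMMAS table and the function names lemma, positions, near, followed_by and search are all hypothetical, and distance is simply measured as the difference between token positions within one sentence.

```python
# Toy illustration only: NOT ddc-concordance's query syntax, index, or API.
# LEMMAS is a hypothetical hand-made table standing in for real
# morphological analysis.
LEMMAS = {
    "child": "child",
    "children": "child",
    "plays": "play",
    "played": "play",
}


def lemma(token):
    """Map a token to its lemma; unknown tokens map to their lowercase form."""
    t = token.lower()
    return LEMMAS.get(t, t)


def positions(sentence, word):
    """Return the token positions whose lemma equals the lemma of `word`."""
    target = lemma(word)
    return [i for i, tok in enumerate(sentence) if lemma(tok) == target]


def near(sentence, a, b, max_dist):
    """Symmetrical interval search: a and b within max_dist tokens, any order."""
    return any(
        0 < abs(i - j) <= max_dist
        for i in positions(sentence, a)
        for j in positions(sentence, b)
    )


def followed_by(sentence, a, b, max_dist):
    """Directed interval search: b occurs after a, at most max_dist tokens later."""
    return any(
        0 < j - i <= max_dist
        for i in positions(sentence, a)
        for j in positions(sentence, b)
    )


def search(corpus, predicate):
    """Sentence-based search: return indices of sentences satisfying predicate."""
    return [idx for idx, sentence in enumerate(corpus) if predicate(sentence)]


# A two-sentence toy corpus, already tokenized by whitespace.
corpus = [
    "The children played in the garden".split(),
    "He plays chess with a child".split(),
]

near_hits = search(corpus, lambda s: near(s, "child", "play", 4))
ordered_hits = search(corpus, lambda s: followed_by(s, "child", "play", 4))
```

With this toy corpus, the symmetrical search for "child" and "play" within four tokens matches both sentences, while the directed search matches only the first one, where a form of "child" precedes a form of "play".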
Download the software
You can download the ddc-concordance software and some extensions here:
http://sourceforge.net/projects/ddc-concordance
Are you doing something interesting with ddc-concordance? Big site deployments, interesting use cases? Tell us about it!
Thanks for using our product!