Structuring Unstructured Data Using Controlled Vocabularies

Presenter 1
Johann Schaible
GESIS - Leibniz Institute for the Social Sciences

The Data Documentation Initiative (DDI) is a metadata specification, expressed in XML, for describing data from the social sciences and related fields. DDI metadata supports collecting, processing, analyzing, discovering, distributing, and archiving data. The current version, DDI 3, includes controlled vocabularies, so that data sets can be categorized; categorized data sets are structured and therefore carry additional information. In practice, however, DDI 3 documents still store a lot of unstructured data in uncategorized plain-text fields, especially when they have been converted from DDI 2. Such documents contain less information than they should, but categorizing those plain-text fields manually would be too inefficient and error-prone. In this paper, we present a solution for categorizing the free texts automatically. The solution is based on the Recommind Mindserver, which uses text mining algorithms to categorize text based on a training sample. In this way, the manual task of categorization can be automated, enriching documents in which this metadata is missing.
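
The general idea of categorizing free text from a training sample can be illustrated with a minimal sketch. It does not reproduce the Recommind Mindserver, whose algorithms are not described here; it substitutes a plain multinomial naive Bayes classifier written with the Python standard library. The training texts, the category labels, and the function names `tokenize`, `train`, and `classify` are all assumptions made for illustration and are not taken from the paper or from an official DDI controlled vocabulary.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples):
    """Build a naive Bayes model from (free_text, category) pairs."""
    category_docs = Counter()           # number of training texts per category
    word_counts = defaultdict(Counter)  # word frequencies per category
    vocabulary = set()                  # all words seen during training
    for text, category in samples:
        category_docs[category] += 1
        for word in tokenize(text):
            word_counts[category][word] += 1
            vocabulary.add(word)
    return category_docs, word_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    category_docs, word_counts, vocabulary = model
    total_docs = sum(category_docs.values())
    best_category, best_score = None, -math.inf
    for category, n_docs in category_docs.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[category].values())
        for word in tokenize(text):
            if word in vocabulary:  # ignore words never seen in training
                # Laplace smoothing avoids zero probabilities.
                count = word_counts[category][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training sample: free-text descriptions paired with
# illustrative category labels (assumed, not an official vocabulary).
TRAINING_SAMPLE = [
    ("Face-to-face interview with a standardized questionnaire", "Interview"),
    ("Telephone interview conducted by trained interviewers", "Interview"),
    ("Self-administered questionnaire sent by mail", "SelfAdministeredQuestionnaire"),
    ("Web-based questionnaire filled in by the respondent", "SelfAdministeredQuestionnaire"),
]

model = train(TRAINING_SAMPLE)
predicted = classify("Computer-assisted telephone interview", model)
print(predicted)
```

In a real conversion workflow, the predicted category would be written into the corresponding coded element of the DDI 3 document, and a realistic training sample would need to be far larger than the four texts used here.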
