IASSIST/CSS 1998 Conference Abstracts
Towards Multiple-Media Survey and Census Data
Stephen E. Fienberg, Carnegie Mellon University
Maurice Falk University Professor of Statistics and Social
Science
Sampling and survey methodology have made great strides in the past 25
years and many of the recent advances have been linked to model-based
approaches to survey analysis. But the physical and political
environments in which data are gathered are in enormous ferment and, at
the same time, new forms of data are emerging as alternatives to the
traditional numerical responses that survey methodologists have
dutifully encoded for use in statistical analyses. Survey data sets of
the future might well consist, either through direct collection or forms
of record linkage, of combinations of traditional numbers, text, images,
sound, and even symbolic summaries. New statistical methods will be
needed to deal with such mixed media, and the new data and methods will
raise new issues regarding the design and collection of survey data as
well as their dissemination, including concerns of confidentiality and
disclosure limitation. The time to begin thinking about such issues is
now.
Statistical Disclosure Limitation Methodology: An Introduction to
Current Thinking
Stephen E. Fienberg, Carnegie Mellon University
Maurice Falk University Professor of Statistics and Social
Science
For many years, pledges of confidentiality to respondents in
censuses and surveys were interpreted in an absolute fashion and
agency statisticians created conservative rules for disclosure
avoidance which they believed would prevent disclosure of confidential
information. During the past twenty-five years, the field of
disclosure protection has undergone a "statistical transformation" and
begun to utilize the advances that have occurred within the field of
statistics itself. This talk provides an overview of the statistical
issues that are related to the evolving area of statistical disclosure
limitation methodology.
Confidentiality and Data Access - The Rationale for and Implementation
of Policies for Restricted Access
Vigdis Kvalheim, Assistant Director
Norwegian Social Science Data
Services
A basic requirement for high quality empirical research is access to
high quality data. The potential scope of data available from academic and
official administrative sources is enormous. Nevertheless, effective use
of existing data is constrained by lack of access to these resources,
particularly disaggregated data.
There are various reasons for this; most important in this context are the
legal requirements restricting data dissemination.
Statistical agencies as well as academic data archives have two main
options for protecting the confidentiality of released data: providing
restricted data or providing restricted access. The first option entails
restricting the content of data sets or files to be released. The second
entails imposing conditions on who may have access, for what purpose and
so forth.
This paper argues for treating research as a special case, and thus for
developing models for restricted access as instruments to remove some of
the barriers to data access. The emphasis is on national legislation as a
key barrier to data access, and on infrastructures for data sharing within
the academic as well as the public sector. Keywords: Self-regulation,
partnership and co-operation as long-term strategy for the research
community in the future 'battle' between privacy and access.
A Cooperative Mode of Organization for Social Sciences Computing: A
Case Study
Nancy McDermott and Tom Flory
University of Wisconsin, Madison
As a target, the most workable scale of a computing organization for
the social sciences is hard to hit. Institutional policy on research
computing support may be prescriptive or laissez-faire; it may favor a
centralized utility or every department for itself. At the same time,
technical innovations have from time to time suddenly changed the
terms of the efficiency comparison between distributed and centralized
services.
If social scientists beyond the boundaries of departments perceive a
community of need for a common computing environment -- hardware,
software, networking services, user support -- what kind of
organizational structures will permit a proper degree of attention to
their special needs, achieve a critical funding mass, and even be
flexible in adapting to changing scale requirements? This paper looks
at a solution adopted by the Social Science Computing Cooperative at
the University of Wisconsin-Madison.
Decentralization and the Onslaught of Technology
Tom Phelan
University of California, Los Angeles
Over the last decade, the role of central computing
organizations on campus has come under scrutiny. Due to the rapid
increase of desktop computing capabilities, and in response to a
perception that local computing, data access, and help desk needs were
not being adequately met by central computing units, smaller local
computing groups have proliferated at many American universities. At
the beginning of this evolution, many central groups ignored the
development of these decentralized units, and were content to see the
workload generated by PCs, mini-computers, and data management
offloaded to other groups. As in-demand technology itself becomes
increasingly decentralized, however, the role of centralized units,
and the division of computing resources within the campus, is being
closely examined at many universities. The relationship between
central and non-central units is still evolving and has become
increasingly complex.
A Computer Center Decenter(ral)ized: A Comparative Perspective
on Social Science Computing at Washington University and the
University of North Texas
Jonathan Rapkin, Washington University
Karl Ho, University of North Texas
This paper offers insight into the way social scientists' computing
needs are met through comparing two different models. In our
experience, many of social scientists' specialized needs
revolve around methods for providing statistical support to students
and faculty, as well as the dissemination of data. The two models
contrasted represent centralized and decentralized models of providing
such resources. In both instances, these models were established as
byproducts of the evolution of computing in general at our various
universities.
At the University of North Texas, such services are provided
through a centralized office responsible for providing such services
campus-wide. At Washington University, on the other hand, such
services are provided by several small facilities. Each college is
responsible for its own computing needs, hence there is an emphasis
on decentralized facilities. The key advantages of the centralized
model include improved compatibility due to standardization, improved
access in terms of hours and who can use the facility, improved
manageability in terms of budgeting, policy, technical administration,
etc., and the ability to reach critical mass in terms of providing a
high level of service to all departments, and leveraging volume
discounts on software. The key advantages of the decentralized model
include a sense of ownership among the department(s), ability to meet
discipline-specific needs, and establishing a convenient gathering
place for furthering interaction among graduate students and faculty.
At this point in time we do not advocate either model. Each has
clear advantages and disadvantages which must be taken into account
when planning the best means of meeting the discipline-specific needs
of social scientists within the administrative framework of each
university. We outline the pros and cons of the two models and
explore how such models might be adapted in the future to better serve
the evolving needs of social scientists.
A Proposal for a Virtual Digital Data Library
This talk is a proposal for a Virtual Data Library based on current
low cost Internet technology. Such a data library would provide data
from sources physically distributed across Federal, State and Local
governments. Common access tools for searching, documenting,
tabulating, graphing, and mapping would be provided to any data source
participating in the system.
Cavan Capps has worked with data bases and networks for over 17
years. Throughout his career he has labored on issues related to data
as an economist or as a software engineer. Currently he is the
project manager of the Data FERRET component of the Census DADS
effort. Data FERRET provides intelligent internet access to many
survey record data bases. A few of the surveys include the Current
Population Survey, the Survey of Income and Program Participation, and the
Health and Nutrition Examination Survey.
The role of the Web in the provision of national data and information
services: The MIDAS experience
Julia Chruszcz
University of Manchester Computing
MIDAS is a JISC designated national data centre for the UK higher
education community providing on-line access and support for a range of
large and complex datasets, such as censuses, surveys and time series
databanks. In this context, MIDAS is part of the developing JISC funded
National Distributed Electronic Resource which is seeking to promote and
extend access to electronic information and services to the entire UK
higher education community. The expectations of users have changed
considerably and we have had to rethink how we deliver data and
information to the researcher's desktop. It is not enough to promote
awareness of the data resources and their potential applications in
teaching and research. We also have to convince the users that their time
is being used efficiently, that they can easily identify the data that
they want, extract it and put it into a suitable format for secondary
analysis. For us this means creating appropriate interfaces for the data -
simple enough for a once-off selection and versatile enough for more
sophisticated use. This paper addresses the influence of the Web and the
expectations of its users on the services provided by MIDAS. We shall
describe some of the new interfaces to data and information which will be of
particular interest to social scientists, in both research and teaching.
Julia Chruszcz is Head of National Services at Manchester Computing and is
responsible for the strategic management of the MIDAS national datasets
service and other national computing services provided by Manchester
Computing for the UK higher education community.
Gesine: Integrated retrieval on heterogeneous social science databases
via the World Wide Web
Peter Mutschke, Marcus Schommler, Siegfried Schomisch, Udo Riege, Juergen
Krause
Informationszentrum Sozialwissenschaften
In the age of the World Wide Web the integration of distributed
heterogeneous databases is still an unsolved problem. Moreover, especially
in the field of social science, the global information market leads to an
increasing need for high-value, complex information on social science
research and the structure of certain research fields.
The aim of the project GESINE at the social science information center
(Bonn) is to develop a retrieval system for the World Wide Web allowing
integrated access to several social science databases of the Gesellschaft
Sozialwissenschaftlicher Infrastruktureinrichtungen (GESIS) which offers
German language social science information services to the scientific
community.
The specific goal of this project is finding suitable retrieval methods
and presentation styles for very heterogeneous material, e.g.
bibliographical records on literature and research projects, survey data
and texts. Particularly concerning social science information, there is
still no concept of integration of text and data during information
retrieval. Therefore, one of the major tasks of the project is the
evaluation of modern indexing and ranking methods and, finally, the
implementation of an adequate domain-related retrieval model maintaining
large social science fact databases and text corpora.
The technical basis of the prototype implemented so far is a relational
database (Oracle) which allows via its text retrieval facilities (Context
option) a combined search in structured and unstructured records. By
means of the Oracle WebServer we are able to allow direct access to the
Oracle database through the World Wide Web. The WebServer technology
enables dynamic generation of HTML-documents regarding a certain database
request. The prototype presented offers a differentiated search for social
science literature and project documents via the Internet.
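The combined search described above can be illustrated with a minimal sketch. It is not the GESINE implementation: it assumes SQLite in place of Oracle, a plain `LIKE` match in place of the Context text retrieval option, and a hypothetical `records` table with invented sample rows. It shows only the general pattern of filtering on a structured field and free text in one query and generating HTML from the result.

```python
import html
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical 'records' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        "id INTEGER PRIMARY KEY, doc_type TEXT, year INTEGER, "
        "title TEXT, abstract TEXT)"
    )
    # Invented sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO records (doc_type, year, title, abstract) "
        "VALUES (?, ?, ?, ?)",
        [
            ("literature", 1996, "Survey methods & practice",
             "A review of survey methodology in Germany"),
            ("project", 1997, "Election panel",
             "Ongoing project on survey design and voting"),
            ("literature", 1997, "Archive handbook",
             "Guidance for data archives"),
        ],
    )
    return conn


def combined_search(conn, doc_type, keyword):
    """Filter on a structured field and on free text in a single query."""
    cur = conn.execute(
        "SELECT title, year FROM records "
        "WHERE doc_type = ? AND abstract LIKE ? "
        "ORDER BY year DESC, title",
        (doc_type, f"%{keyword}%"),
    )
    return cur.fetchall()


def render_html(rows):
    """Generate an HTML fragment for one database request."""
    items = "".join(
        f"<li>{html.escape(title)} ({year})</li>" for title, year in rows
    )
    return f"<ul>{items}</ul>"
```

A call such as `render_html(combined_search(build_demo_db(), "literature", "survey"))` would return a short HTML list of the matching literature records.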
On Cultural Lags and Communications Technology
Dana Fisher
University of Wisconsin-Madison
The Internet continues to grow at an exponential rate. With this
growth, organizations are discussing the positive and negative
implications of Internet usage both on a personal level and in the
workplace. After the defeat of the Communications Decency Act,
the issue of regulation of the newest form of communication technology
continues to threaten the diffusion of the technology. This
paper contextualizes some of the main issues of discussion regarding the
Internet. By applying an adjusted version of William
Ogburn's theory of cultural lags (1964), I analyze the diffusion,
regulation, and eventual socialization of communication
technologies. This lag explains some of the most seriously debated
potential implications of Internet diffusion: privacy, community
and democratization. The paper will look at the diffusion of the
telephone, television and finally the Internet in order to frame it as the
newest in a series of communication technologies.
Dana R. Fisher is a sociologist who specializes in both the implications
of communications technology and sustainable development. Presently, she
is editing International Organizations and the Internet: the United
Nations in the Next Century for United Nations University Press. She has
designed and implemented international networks focusing on the
environment and security in Asia while serving as Researcher/Program
Coordinator at the Nautilus Institute for Security and Sustainable
Development. During her tenure at the Institute, she wrote and presented
her research on the Information Age and the growing Global Information
Infrastructure (GII) around the world. Presently, her research looks at
both the concept of sustainability and the implications of new
communication technologies. Prior to her work at Nautilus, she served as
an energy lobbyist and computer system administrator for environmental
NGOs in Washington, DC. She has done extensive research on the Japanese
environmental movement in Japan and the United States. She is presently
in the PhD program in Sociology/Rural Sociology at the University of
Wisconsin-Madison.
Data Protection and Privacy in the United States and Europe
Juri Stratford and Jean Stratford
Government Documents Librarian, Shields Library
and Director of Research Services, Institute of Governmental
Affairs
The rapid expansion in electronic communications and commerce over the
past several years has raised concerns in the United States over
personal privacy in an online environment. These concerns have
captured the attention of the public, the media, and policy-makers,
and there is new interest in the United States in explicit policies
protecting the privacy of electronic transactions and personal
information. In the fall of 1997, the Clinton administration
announced plans to pursue legislation protecting the privacy of
personal medical records. This proposal continues a pattern of
multiple policies, directed at subject-specific information, or at
various levels of government (e.g. both federal and state level
legislation). Our paper will provide an overview of U.S. legislation
and current initiatives and contrast this with the European
approach. In Europe, there is legislation at the national and regional
(supra-national) level which recognizes privacy as a basic human right
and provides a framework for protecting and providing access to
personal data of all types. Special attention will be given to the
policies as they relate to access to personal data for research
purposes.
Data Protection in the United States
Thomas Brown
U. S. National Archives and Records
Administration
The paper outlines the current state of data protection laws in the
United States, i.e. those laws which impose requirements and restrictions
on the collection, maintenance, and use of personally identifiable data by
non-governmental organizations and businesses. The United States has
traditionally taken a laissez faire approach to data protection, imposing
restrictions in only a limited number of circumstances. Advancing
technology, however, may demand that the United States expand the
scope of data protection. In this regard, several initiatives, especially
in the area of protecting personal medical information, are under
consideration. Such initiatives are two-edged swords in that they may
pose potential dangers for secondary data use while not addressing the
dangers to personal privacy.
A Digital Archive for New Jersey Environmental Data
Ronald C. Jantz and Linda Langschied
Rutgers, The State
University
Traditionally, librarians have organized and provided access to print
information sources and provided the necessary user training to
effectively use information tools. As we enter the age of digital
libraries, this mission and service orientation offers new
opportunities and challenges to provide access to information that has
been relatively inaccessible.
This paper will describe a project undertaken by the authors at
Alexander Library, Rutgers University, with grant funds provided by
the New Jersey Department of Environmental Protection, and in
collaboration with Rutgers University's Ecopolicy Center. Our
challenge was to provide a single source of access to the vast amounts of
environmental information, both digital and non-digital,
that are created by government agencies, consultants, and non-profit
organizations. Examples of this information include "fugitive"
or "gray" literature, elusive master's and doctoral theses, and digital
maps with associated data layers that have been created with
GIS tools. The dispersed environmental documents of a state can offer a
wealth of information to its citizens, policy-makers and
institutions, but only if the information can be made more readily
accessible.
This project and the prototype database demonstrate not only the new
roles that librarians are undertaking, but also the new tools that can be
used in digital libraries. Our challenges in this project were to:
Use of standards, advanced computer tools, off-the-shelf software, and
marketing techniques will be discussed as new areas of opportunity for
librarians. Also, the authors will discuss the importance of establishing
partnerships to implement digital library initiatives, to attract funding,
and to assure successful project outcomes through the collaborative
efforts of interested parties across the state.
Ron Jantz is the Data Librarian at Alexander Library, Rutgers
University. Linda Langschied is Information Services Librarian at
Alexander Library.
Project EconData: Dutch Data Service for Economic Data
Albert Bots
Netherlands Institute for Scientific Information
Services
In July 1996 NIWI's Steinmetz Archive started the project EconData to
establish a Dutch data service for economic data. This service
will be integrated with the current activities of the archive. EconData
builds on previous feasibility studies conducted by the
Economic and Social Institute (ESI) in Amsterdam and the Economic
Institute Tilburg (EIT). Both of these studies have been funded
by the Netherlands Organization for Scientific Research (NWO). For
EconData the Steinmetz Archive receives additional funding
from NWO. This grant follows on a recommendation by the Social Science
Council (SWR) of the Royal Netherlands Academy of
Arts and Sciences (KNAW). EconData aims at broadening the scope of the
Steinmetz Archive. New services will be established to
support economic research, including macro-economics, business economics
and economic modeling. In addition to the more
traditional functions of a data archive, EconData puts strong emphasis on
data brokerage. The data service will act as an intermediary
between suppliers of economic data and data users. This will include
suppliers of international data and users of Dutch data abroad.
For this purpose the project plan includes the establishment of an online
register of available data sets, irrespective of whether these
data sets are available from the Steinmetz Archive or from other sources.
EconData will be evaluated in the summer of 1998.
The paper first describes the background of the
project, including the main results of the preliminary studies.
Next, the project plan with the corresponding activities will be
described and the results so far will be presented. Further
remarks will be made about some specific topics, such as the attitude of data
owners towards the registration of their data files and the use
of existing data sets for education and training.
Albert Bots is the project manager of EconData. He is also active
as a lecturer at the Department of Economics of the Free University,
Amsterdam, where he teaches business modeling and information systems.
Albert Bots is an econometrician and is one of the authors of the aforementioned
feasibility study by EIT.
Data Dissemination in an Electronic World: The Essential Role of Metadata
Ernie S. Boyko
Library and Information Centre, Statistics Canada
Statistical offices have long recognized the importance of metadata
as a way of identifying which data and information are available and
as a means of informing users about the methods, sources and concepts
underpinning them. In a time when most information was disseminated
on paper, metadata systems consisted of catalogues, indexes, technical
appendices, and user guides.
The increasing use and popularity of electronic media and systems has
led to an increasing demand for more detailed data in a form that can
be manipulated. Low cost storage and delivery tools have led to large
volumes of data being presented to users. As well, technological
advances make it possible to disseminate more complex data such as
public use microdata files to a broader audience. Finally, innovative
programs such as Canada's Data Liberation Initiative have brought
data, traditionally only available to "expert" users, into mainstream
research and teaching.
All of these changes, taken together, make enhanced metadata an
invaluable component of the data dissemination activity. This
evolving environment has prompted Statistics Canada to initiate a
major metadata project. This paper will outline the aim of the
project, its scope, and approaches being evaluated. In particular, it
will concentrate on innovative approaches to collect and integrate
metadata within Statistics Canada in order to facilitate data use not
only in the traditional expert user communities, but also as
mainstream tools for teaching, research and public access.
The impacts of the electronic explosion have been felt well beyond the
work of statistical agencies. This has prompted metadata projects and
approaches in other domains. This paper will identify some
relationships between the metadata work of statistical agencies and
initiatives in other areas, especially libraries and social science
data archives. The ultimate question in this regard is whether or not
there are emerging global metadata standards for finding, evaluating,
using, and managing information. And, if there are, how does the work
of statistical agencies relate to these emerging standards?
Implementing a Statistical Metadata Repository at the U.S. Census
Bureau
Daniel Gillman, Samuel N. Highsmith, Jr., Martin V.
Appel
US Bureau of the Census
This paper describes the results of continuing research at the U.S.
Bureau of the Census (BOC) into the content, design, population, query,
maintenance, and implementation of a statistical metadata repository and
the tools to use it. The goals of the research are many, but the ultimate
goal is to create a production statistical metadata repository and the
associated tools for the agency.
In support of this goal a multi-dimensional effort has been launched.
The major parts of this effort include the development of detailed models
for describing the content and organization of a statistical metadata
repository; building an agency standard for statistical metadata;
development of tools for the collection, registration, and query of
metadata; and the integration of a repository into other statistical
information systems. This paper will briefly describe the models and the
BOC statistical metadata standard. Collecting the metadata to populate a
repository is not easy. Survey designers and analysts often create
metadata only as an afterthought. When asked about the importance of
metadata, the designers and analysts always say that it is important.
Then, they say they don't have the time or resources to enter it into a
repository. Effective tools will allow them to enter metadata without
appreciable extra effort. Success is achieved when the users of the
repository perceive it as an indispensable part of their work. Metadata
repository tools are divided into several types: population (or
collection), registration, crosswalk, maintenance, and query. This paper
will focus on the population, registration, and crosswalk kinds.
Population or collection tools allow the user to enter metadata into the
repository. They can be batch loading tools for entering many records at
once, or they can be interactive. Each type of tool has the capability of
gathering information common to all objects in the repository, a process
called registration. Registration allows users to view the repository as a
card catalog. Special rules need to be in place for registration to work
properly. Crosswalk tools allow users to view or capture metadata in
several different formats, especially the formats specified in metadata
standards such as GILS, FGDC, DDI, etc.
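The registration and crosswalk ideas in this paragraph can be sketched in a few lines. This is an illustrative sketch only, not the Census Bureau's tools: the class names, the registration rule, and the target formats `format_a` and `format_b` with their field names are all hypothetical placeholders, and do not reproduce actual elements of GILS, FGDC, or DDI.

```python
from dataclasses import dataclass, field


@dataclass
class RegisteredItem:
    """Information assumed common to every object in the repository."""
    identifier: str
    name: str
    steward: str
    attributes: dict = field(default_factory=dict)  # object-specific metadata


class Repository:
    def __init__(self):
        self._items = {}

    def register(self, item):
        # One hypothetical registration rule: identifiers must be unique.
        if item.identifier in self._items:
            raise ValueError(f"duplicate identifier: {item.identifier}")
        self._items[item.identifier] = item

    def load_batch(self, items):
        # Batch population: register many records at once.
        for item in items:
            self.register(item)

    def get(self, identifier):
        return self._items[identifier]

    def catalog(self):
        # "Card catalog" view: only the common registration fields.
        return sorted(
            (i.identifier, i.name, i.steward) for i in self._items.values()
        )


# Hypothetical crosswalks: internal field name -> field name in a target format.
CROSSWALKS = {
    "format_a": {"identifier": "ID", "name": "Title", "steward": "Originator"},
    "format_b": {"identifier": "id", "name": "label", "steward": "owner"},
}


def crosswalk(item, target):
    """Present one registered item using the field names of a target format."""
    mapping = CROSSWALKS[target]
    return {out_name: getattr(item, src) for src, out_name in mapping.items()}
```

Keeping each crosswalk as a plain mapping means that supporting a further format only requires one more table entry, which is one reason a design along these lines might be attractive.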
The tools mentioned above are described in detail in the paper. Also,
the software used to tie the tools, the repository, and other systems
together is described. The complete package is still under development,
but plans exist to move the entire package to a production staff
soon.
Next Generation Tools for Data
Dissemination - The Example of NESSTAR
NESSTAR (Networked Social Science Tools and Resources) is a joint
project between the Norwegian Social Science Data Services (NSD), UK
Data Archive and the Danish Data Archive (DDA). The aim of the project
is to develop a common gateway on the Internet to the data holdings of
several social science data archives in Europe. By means of NESSTAR,
users all over the world will be able to:
The system will include advanced user authentication procedures to prevent
unauthorised use of data.
The NESSTAR system is building upon the emerging documentation
standard from the Data Documentation Initiative and will support the
XML-version of this standard. Tools to convert metadata from existing
standards to the new one will be an integrated part of NESSTAR. The
system is designed as a three-level client server application mainly
developed in Java.
The paper describes the technical and organisational sides of
NESSTAR, and discusses some of the "political" consequences of such a
system for the archive world.
This paper is on the various layers involved in the integration of
computer technologies into existing and newly developed social
sciences classes. Much of the presentation will rely on case studies
at Carleton College during the years of 1994-1998. For example, in the
social sciences I have worked with faculty in Economics, Political
Science, Sociology & Anthropology disciplines as well as cross
disciplinary programs such as Educational Studies and American
Studies. Their pedagogical needs range widely from presentation of
data base management to web page construction to paperless classes and
simple software use on a variety of platforms (Mac, Win 3.x & 95,
VMS).
The paper will contain discussion on both the institution's response
and my role as the Academic Computing Coordinator to meet the
curricular needs of the faculty. As evidence I will use records of our
technical development from the inception of our department (Academic
Computing & Networking Services, FY 1993) as well as interviews with
faculty in my division. Further, I will intersperse an account of my own
developmental process: how I learned to best present new technologies as
well as utilize old ones, and how I came to better understand each
individual faculty member's goals for their classes as well as
the logical limits vs. the pedagogical limits of a proposed teaching
strategy.
Paula Lackie has been the Academic Computing Coordinator for the Social
Sciences at Carleton College since 1993 and has been a social scientist
engaged in technical support since 1987.
Teaching Strategy and Assignment Design: Assessing Quality and
Validity of Information Via the Web
Jean Shackelford and Dot S. Thompson
Bucknell University
In his review of Internet guru Paul Gilster's Digital Literacy,
journalist John Moran observes that "unlike previous media, the Internet
imposes new demands on readers to become their own editors and critics.
Most information now available on the web comes devoid of clues as to
whether it is true and unbiased. Those who master this new form of
literacy will reap huge benefits from the news and background available on
the Internet. Those who do not will remain awash in half-truths, outright
deception, and fraud" (John Moran 1997). While Moran may be overstating
the case, it is clear that, as educators, we need to help students more
critically assess the sites they visit and the information and news they
find on the web. By carefully designing assignments, the problem of how
to better help students assess and think critically about the material
they have accessed at a particular web site may be resolved.
Critics such as philosopher David Rothenberg have pointed out that the
web has reduced the quality of the writing and the originality of the
research papers. Rothenberg reported that his class had "fallen victim to
the latest easy way of writing a paper: doing their research on the
World-Wide Web" (David Rothenberg 1997). Although the web brings particular
problems to students in the ease with which they find appropriate
information, the reality is that students need prompting to evaluate all
information sources. To help students learn to evaluate and to think more
critically about the material they find, a structured three-part assignment
was developed for a first semester seminar. A librarian's perspective
contributed to the planning of the assignment and to the formation of
research skills. The first component required identifying on-line, web, or
gopher sources. At least ten sources with descriptions of the strengths or
weaknesses of each site were required as well as the kind and amount of
web information on the topic. The second component was similar except that
all sources were to be traditional library sources. A concluding summary
comparing the two approaches helped students to recognize the importance
of quality and differences of information available.
The results from this particular structured assignment and potentially
others indicate that students feel comfortable and learn how to distinguish
reliable from unreliable information. Their interest in the topic
stayed high and the level of research was good for first semester
students. We will discuss integrating the instruction of web-based and
traditional library resources and offer a model of assignment design.
Jean Shackelford is a professor of economics and associate editor of
Feminist Economics. The fifth edition of her co-authored book,
Economics: A Tool for Critically Understanding Society, was recently
published by Addison Wesley Longman.
Dot S. Thompson is a reference librarian specializing in economics
and management. She is also Bucknell University's designated
representative for ICPSR.
The History Data Service - Using Technology to Enhance Access
Sheila Anderson, Cressida Chappell and Oscar Struijve
The Data Archive, University of Essex
The Data Archive employs a strategy for using new technologies in an
innovative and user-sensitive way to enhance and
improve access to its resources. Within the Archive, the History Data
Service is developing a user-centered, needs-driven programme to
enhance and increase access to its collection of historical data
materials by an innovative use of the potential of the World Wide Web.
The core aim of this work is to develop a multi-leveled system which
provides web access to as wide a range of material as is possible
within limited resources. We aim to:
We are confident that this strategy will encourage and enhance use
of and experimentation with the collection. This paper will describe
the work the HDS is undertaking in this area, including a discussion
of the selection of the materials for inclusion in this system.
On-Line Technology for Enhanced Secondary Analysis of Public Opinion
Survey Data
Rich Clark
Roper Center for Public Opinion Research
Online services are becoming increasingly important not only for access
to but for secondary analysis of survey research data. I will review
current online public opinion sources and discuss the impact that online
technologies are having on the way secondary analysis of survey research
is currently performed. Next, I will present a model for where online
services can go in the future given the technology that is available
today. I believe that the Internet is currently under-exploited
for its capacity to aid secondary analysis. On that note, I will examine
the potential of making survey data more easily available online to all
potential users. This entails varying the format and depth of data so that
users find sources suitable to their needs. It also entails the use of
desktop technology to store and analyze survey research data and making
that technology, or the applications that are developed through that
technology, available to other users via computer networks, primarily via
the Internet.
Global Access to Data Resources: Where's the Metadata?
Mark A. Carrozza & Steven R. Howe
University of Cincinnati
The University of Cincinnati has recently purchased an HP 330FX Optical
Storage Jukebox to store its social science data collection. The Jukebox,
with 330GB of direct online storage, is connected to a Windows NT file
server that provides both Novell NetWare and Microsoft Networking access
to university researchers through the campus' wide area network. The same
data sets are available through FTP clients and WWW browsers on the
Internet.
While the UC system compares very favorably to almost any other
resources for access to secondary data, it offers a dramatic example of
how software resources to facilitate access to secondary data continue to
lag behind hardware innovations. As storage devices and direct access
methods such as these become more commonplace, data archivists must
contend both with increased variation in the access methods, and the
bibliographic reference material (metadata) available. This paper
addresses these concerns.
The combination of increased storage capacity, the low cost of both
hardware and software, and dramatically improved global access via the
World Wide Web has made the desktop computer the environment of choice for
all but a few social science researchers. Data archives have had to
respond by making data more accessible for the desktop computer users.
Ten years ago data users sat at terminals or desktop computers
connected to mainframes by slow modems and submitted batch jobs that
involved mounting tapes. Five years ago, the same users had purchased PCs
with CD-ROM drives and began to access the data on CD-ROMs. Too often,
however, these CD resources were merely copies of the original mainframe
tapes.
Application software resources for data management and analysis have
also improved. While SPSS and SAS are still widely used, there is a wide
variety of additional application software resources, ranging from
spreadsheets with a far greater range of capabilities than just a few
years ago, to vastly simpler programming languages (e.g., Visual Basic),
to well designed, easy-to-use relational DBMS systems, to utilities such
as DBMSCopy.
Through the decade-long movement from mainframe tape to local CD-ROM to
massive online storage available via global networks there has been
consistently lagging development in the area of bibliographic reference
material (or metadata) for the studies being archived. Foremost in
archivists' minds is the use of metadata both for creating comprehensive
catalogs of data holdings and for creating user-friendly interfaces to
data.
Five years ago the 'state of the art' was seen in examples such as the
U.S. Census Bureau's 'GO' and EXTRACT software, NHIS's SETS, and custom
extract software for such studies as the NLS. Since then there have been
improvements in both the catalogue procedures and the machine-readable
documentation available to the social science researcher, but little has
changed in the availability of metadata generating and packaging programs
that will meet the needs of the social science research community.
Mark A. Carrozza, M.A. is the Data Manager and Network Administrator
for the Institute for Policy Research at the University of Cincinnati and
Director of the UC Southwest Ohio Regional Data Center. Data management
responsibilities at UC include data acquisition and archiving, and
training in the use of secondary data for research and instruction.
Metadata for datasets as Digital Information Objects of Desire: Identifiers as the Linchpin in the Chain.
The main purpose of this paper is to argue the case for ISSN-based
identifiers for social science datasets. As may be implied, this case
has been built on experiences working in a project to achieve
'co-operative action on serials and articles' (CASA: a European Union
Telematics for Libraries project). The argument will be presented
within a schema for metadata standards for inter-operability:
descriptors, identifiers, classifiers, locators, formats and transport
protocols. It will be argued that there are four important 'demand
verbs' in the information economy: discover, locate, request, access.
In order to have cost-effective transition along this chain there must
be agreement on the identity of the object being sought, invoking the
verb 'verify'. The system of identifiers used for journals and other
periodicals will be examined and a proposal made based on this, and
two other schemes: the S.I.C.I. (Z39.56) and the D.O.I. Both of these
latter two schemes are promoted by commercial publishers, and part of
the motivation for this presentation is to ask what is required to
ensure that the digital library being built for our knowledge industry
can protect itself from unwanted consequences from the global
information economy.
Metadata both matters and depends upon metaphor. Throughout our
history as IASSIST, our members have sought, on behalf of users of
research data, to realise dreams of finding aids on 'what exists' and
union catalogues of 'who holds what'. During the first ten years of that
history, in the attempt to reconcile the testing demands of
ill-published product from the research process and almost continuous
changes in computer-dependency, we have drawn on the metaphor and
language of 'bibliographic control', inherited from the library
profession, and variously mixed with insights from the archival
profession and from that of social science. During the most recent
ten years, often through participation of IASSIST members in
activities outside that of social science datasets, we have been
grappling with the meaning of metadata and the demands for
inter-operability within the wider context of the growth of the
Internet and the global information economy. If we regard the serial
as a complex information object, in which the real information object of
desire is the serial article, or even the information object contained
therein, then this may provide the metaphor we require to identify
datasets. This could provide part of the information infrastructure we
require for our 'virtual' union catalogues and for our finding aids.
Interoperability - Just More Jargon or a Whole New World for the Arts
and Humanities?
Sheila Anderson
University of Essex, Data Archive
The establishment of the Arts and Humanities Data Service in the UK in
1996 heralded the introduction of a range of services long taken for
granted among the social science community. Modeled as a distributed
service, the AHDS Executive, along with its five service providers,
provides a full range of data library/data archival services in the areas
of history, performing arts, visual arts, literary and linguistic texts,
and archaeology. Among the services to be offered is an integrated
catalogue which will describe the astonishingly wide range of
electronic resources available to the arts and humanities communities. The
problem facing the AHDS was how to establish a catalogue that could take
into account the diversity of materials and approaches inherent
in the disciplines served by the AHDS and still produce a system that
enabled end-users to easily and simply locate materials of interest. This
paper will describe the steps the AHDS has taken to ensure that its
catalogue has the necessary interoperability whilst not losing sight of
the richness and diversity of the resources it seeks to describe, by an
innovative and practical application of the Dublin Core for the content of
the catalogue records and Z39.50 protocols to drive the technical issues.
Can Library and Data Archive Meet in Active Support of Research in the
Social Sciences? The Case of ILSES
Dr. R. E. de Vries, Researcher, Electronic Services,
Netherlands Institute for Scientific Information Services
(NIWI)
Data material collected for empirical research has traditionally been
computer stored and electronically distributed by data archives and data
libraries, whereas publications from the same research were kept,
referenced and given access to by libraries. As content providers, data
archives could not extend their services with relevant book and journal
collections, cross referencing and lending of printed material. Libraries
could not give access to data related to published research, nor did they
have the means to expand bibliographic references to also point at data as
a machine readable outcome of the research process. A situation where data
and books are separately referenced without consistent cross linking, have
to be searched for in separate catalogues and are given access to by
different authorities and with different facilities, has consequences for
anyone embarking upon new research or in general needing social scientific
information. It is not possible to start with general literature searches
in libraries and easily trace back publications to the empirical research
and collected data that is at the heart of it. Neither can data archive
catalogues (even when expanded with bibliographies) help with book and
article searches starting from particular data collecting efforts.
Properly linking data and publications would need metadata standards that
take such relationships into account and coordinated efforts between
authors (proper citation of data sources or writing such metadata directly
themselves), the library world (referencing with cross linking in new
metadata formats) and the data archives (likewise referencing with cross
linking). Part of those efforts would also have to be a common catalogue
search facility or some form of easy access from one catalogue to
information in the other. World Wide Web techniques for linking electronic
resources on the Internet but also new metadata initiatives that
explicitly hold linking information to related (electronic) resources,
have the potential to finally bring data and book together again for
searching and retrieval.
A few Internet related projects will be mentioned that already
demonstrate first attempts in this direction. ILSES as Integrated Library
and Survey-Data Extraction Service, a system of tools and (Internet)
facilities, will be expanded upon as a current project funded within the
Library Programme of the European Commission and addressing the same goal
of integrating publication and data. ILSES accommodates both content
providers (libraries and data archives) and end-users. Finally the paper
addresses the relative strengths and weaknesses of both ILSES and the
aforementioned other models of achieving some form of integration. An
attempt will be made to look ahead at (near) future scenarios, especially
in the light of recent metadata developments.
Meeting the Needs of Academic Librarians in the Distribution of
Electronic Social Science Data
Providing access to electronic social science data can be a
challenging task. For the past seven years Sociometrics has been
offering electronic collections of data archives on CD-ROM and now on
the web. The focus of our efforts has been to make electronic
social science data as user-friendly as possible. This focus has
produced promising results for distribution of individual data sets
and archives. However, when we assessed the distribution of our
entire data collection, we realized we were missing an important
audience -- academic libraries. Distribution of Electronic Social
Science Data to academic libraries would, we believed, overcome the
high cost involved in distributing to individual users and provide
greater access to quality social science data. While we had been
marketing our entire data collection to academic libraries, we had
received little response. We concluded that perhaps our end-user focus
was not a selling point for academic librarians and academic
collections development specialists. To test this hypothesis, we
developed a "three-pronged approach" to survey academic librarians and
collections developers in the San Francisco Bay Area. Using a
combination of questionnaires, web surveys, and in-person interviews,
we were able to determine many of the concerns of this population.
The primary concerns of academic librarians included:
The results of our research have led Sociometrics to develop a
Librarian Toolkit and specialized packaging for the distribution of
our Electronic Data Library. The Librarian Toolkit is a multimedia
overview of the features of Sociometrics Electronic Data Library,
including details about the studies contained in each archive
(including selection process and data preparation), descriptions of
the uses of the data library by specific groups, an instructional
overview, and full citations of the studies for cataloging purposes.
We also changed our packaging of the data library by replacing all
paper documentation with electronic (pdf) versions, including quick
reference sheets for distribution to library patrons, providing a
cross-archive searchable index of all study abstracts for quick
reference and selections, and developing a special campus-wide
multi-user licensing agreement.
The California Digital Library: Implications for Data Files
Collections
Daniel Tsang, University of California, Irvine
The recent creation of the California Digital Library at the University
of California promises to radically change the delivery of electronic
resources to library users in California. It aims at a "creation of a
single statewide digital collection" to serve the University's information
needs. While the initial focus has been on science and technology, the
library expects to focus on social science resources in the future. This
paper explores the implications of a system-wide electronic library as it
affects data file collections across the various UC campuses. One of the
goals, however, of the CDL is to reach out to the community, in
collaborative efforts, to make electronic resources available to the
public. How will this work in practice? This paper explores that and other
questions and looks at collaborative models elsewhere that could guide the
growth of the CDL as it eventually tackles social science data
collections.
Daniel Tsang has been data files librarian and a social science
bibliographer at University of California, Irvine, since 1986.
Data Liberation, Bridges to Cross
Richard Boily
University of Quebec at Rimouski
In Canada, use of statistical data (microdata files and major
databases) for teaching and research is an important and growing
phenomenon that seems unlikely to lose strength in the near future. This
situation is certainly a major consequence of the Data Liberation
Initiative (DLI), a partnership between Statistics Canada, several federal
departments and Canada's academic community established in 1996. The idea
of providing affordable access to Canadian information resulted from a
cooperative effort among the Humanities and Social Science Federation of
Canada (HSSFC), the Canadian Association of Research Libraries (CARL), the
Canadian Association of Public Data Users (CAPDU), and the Canadian
Association of Small University Libraries (CASUL). Less than two years
after its launch, more than 50 universities have joined the consortium, a
clear indication of a true willingness to make data more available.
This situation illustrates the fact that high costs were an obstacle to
data availability, especially in small universities where the lack of a
minimum number of students makes the costs/benefits ratio of buying data
higher. However, there are still many obstacles to a true liberation of
the use of numerical data. While some Canadian universities have a long
history in data services (Carleton University's Data Centre was
celebrating its 30th anniversary in 1996), such a tradition does not exist
everywhere, especially in small universities.
To maximise use of data files, important efforts at the educational
level must still be made, on one hand at the reference staff level, on the
other hand at the customer level, including the professors themselves.
Use of data implies a good knowledge of extraction and analysis
instruments. How can these tools be made accessible to customers who
have a definite need for data but are not able to manipulate data files?
How can we satisfy different needs for different types of users? How
can data be included in the academic curriculum? How can data librarians
play their educational role and how can this role be balanced with the
responsibility of the professors? Fortunately, interesting answers are
unfolding.
Richard Boily is a librarian at the Universite du Quebec a Rimouski.
His main responsibilities are information access for official
publications, and data information service. For the latter, he is the
Data Liberation Initiative (DLI) official representative for the
university. He is also a member of the Working Group on Data of the
Conference of Rectors and Principals of Quebec Universities.
He has an undergraduate degree in Biology from Laval
University, a Master's degree in Public Policy Analysis, also from Laval,
and a Master's degree in Library and Information Science from the
University of Montreal.
Relational Processes in a Hierarchical World: The WWW as an
Impediment to Information Acquisition.
An increasing volume of materials is created every year only in
digital formats. The range of these documents varies from personal
web pages to resource or user guides. Similarly, the quality of these
"web-published" sources runs the gamut from diatribes written by
single-issue fanatics to the latest, cutting edge research by eminent
scholars. All of these digital publications are arranged in various
trees, featuring often complex hierarchical structures. All too often,
one's ability to find a document depends on knowing the exact route to
follow from the top of the tree down to the document needed.
Librarians and other researchers are well aware of the numerous
"finding guides" and random URLs that are passed around. For a user
who does not know the correct path to follow, the process can be
frustrating as well as futile.
This structure of the World Wide Web, a dense forest of hierarchically
arranged branches and documents, is in direct contrast to the
relational method by which most people acquire knowledge. Libraries
are organized in such a way as to facilitate the associational
accumulation of materials. Books are cataloged based on a classification
system that permits locating books on a particular topic in the same area
as other books and journals carrying the same classification scheme.
Card catalogs and OPACs allow for subject and keyword searches that
collect citations to topically related materials that may fall within
other classification schema.
The various classification schema, card catalogs, and OPACs are
essentially metadatabases. They provide data for leading readers to
other data (books and periodicals). These metadata are created in
highly controlled systems for cataloging and sharing cataloging
information. This cataloging process, however, cannot keep up
with the burgeoning number of web documents. The challenge for
digital collections, including data libraries, is how to provide
metadata that will lead users more quickly to their resources.
This presentation will review some of the recent developments
in metadata practice, or perhaps pre-practice is the better term.
It will then discuss at greater length the emerging Dublin Core
standards and their modifications through the Warwick Framework.
Finally, it will offer a model for using Dublin Core to build
a searchable index of documents.
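
As a rough illustration of what such an index could look like, the
following minimal sketch builds a word-level inverted index over records
described with a few Dublin Core element names (title, creator, subject).
It is not the presenter's model: the record values, the choice of fields,
and the function names build_index and search are all assumptions made
for illustration.

```python
from collections import defaultdict

# Hypothetical records using a few Dublin Core element names.
# All values are invented placeholders, not real catalogue entries.
RECORDS = [
    {"identifier": "doc1", "title": "Guide to Census Microdata",
     "creator": "A. Archivist", "subject": "census; microdata"},
    {"identifier": "doc2", "title": "Survey Documentation Handbook",
     "creator": "B. Librarian", "subject": "survey; documentation"},
]


def build_index(records, fields=("title", "creator", "subject")):
    """Map each lowercased word to the set of record identifiers using it."""
    index = defaultdict(set)
    for rec in records:
        for field in fields:
            text = rec.get(field, "").replace(";", " ").lower()
            for word in text.split():
                index[word.strip(".,")].add(rec["identifier"])
    return index


def search(index, query):
    """Return the identifiers of records matching every query word (AND)."""
    words = query.lower().split()
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


if __name__ == "__main__":
    idx = build_index(RECORDS)
    print(search(idx, "census microdata"))
```

A production index would also need stemming, phrase handling, and a
controlled vocabulary for the subject element; the sketch only shows how
element-level metadata can feed a keyword search.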
Global Access and Local Support to the Processes of European
Integration in Central and Eastern Europe through Global
Networking
The proposed paper will deal with these and various other related issues
of interconnection between global networking and the
processes of European integration.
Dusan Soltes is a Senior Lecturer for European Integration as
well as MIS at the Faculty of Management of the Comenius University of
Bratislava. In addition he has been a long-term UN Expert with numerous
assignments to various developing countries of Asia and Africa and has
been an external advisor to the Deputy Prime Minister on European Affairs
and International Relations and founder of the Department of European
Integration at the Office of Government of the Slovak Republic and its
first director (1995-6).
Searching Commodity Classification Trade Data with Ordinary
Language
An important social science database is the U.S. Census Bureau's U.S.
Imports and Exports numeric datasets on CD-ROM. The
following table shows the amount of U.S. automobile imports for the past
several years:
Concurrent Session 2B: Technology in the
Classroom
Wednesday, 3:30 PM
Concurrent Session 3A: Locating and Linking Diverse
Information Resources
Thursday, 9:00 AM
Please see http://www.niwi.knaw.nl
Concurrent Session 3B: Impact of Technology on
Libraries.
Thursday, 9:00 AM
Concurrent Session 3C: Searching the Web: General
Strategies and Special Topics
Thursday, 9 AM
PASS MTR VEH, SPARK IGN ENG, NOT OV 1,000 CC
| Year | General Imports: Quantity | General Imports: Customs Value | Imports for Consumption: Quantity | Imports for Consumption: Customs Value |
|---|---|---|---|---|
| 1991 | 173,597 | $783,208,626 | 173,097 | $779,772,191 |
| 1992 | 166,951 | $736,087,145 | 171,134 | $738,847,548 |
| 1993 | 200,043 | $904,605,255 | 204,215 | $907,734,708 |
| 1994 | 178,562 | $753,516,749 | 178,562 | $753,516,749 |
Yet if one does a commodity search using the word "automobiles" on the commonly used WWW database (http://govinfo.kerr.orst.edu/impexp.html) one finds no results. Moreover, if one does the search using the word "cars," one obtains the misleading result "Railway or Tramway Stock, etc." A searcher interested in this database must be aware that the general classification heading for this commodity group is "Tractors, Vehicles for Pass, Goods, Special Purposes" and the particular classification for cars is "Passenger Motor Vehicles, Spark Ignition Engine" as above. Other examples of obscure classification are: "Bovine Animals" instead of "Cows" and "Equine" instead of "Horses".
This paper describes a project to map ordinary language queries
into specialized classification schemes such as the International
Harmonized Commodity Classification or the U.S. Standard Industrial
Classification. The project aim is to develop "Entry Vocabulary Modules"
for searching unfamiliar metadata.
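
The mapping idea can be sketched in a few lines. The example below is
only an illustration under stated assumptions: the lookup table is a
hand-built stand-in that uses the headings quoted in this abstract, and
the names ENTRY_VOCABULARY and to_classification_terms are hypothetical.
It does not describe how the project's Entry Vocabulary Modules are
actually constructed.

```python
# Illustrative entry vocabulary: everyday words mapped to the
# classification headings quoted in the abstract. A hypothetical
# stand-in for whatever a real module would contain.
ENTRY_VOCABULARY = {
    "automobiles": "Passenger Motor Vehicles, Spark Ignition Engine",
    "cars": "Passenger Motor Vehicles, Spark Ignition Engine",
    "cows": "Bovine Animals",
    "horses": "Equine",
}


def to_classification_terms(query):
    """Return the classification headings for known words in the query.

    Headings are returned in order of first appearance, without
    duplicates; unknown words are ignored.
    """
    headings = []
    for word in query.lower().split():
        heading = ENTRY_VOCABULARY.get(word.strip(".,?!"))
        if heading and heading not in headings:
            headings.append(heading)
    return headings


if __name__ == "__main__":
    print(to_classification_terms("imports of cars"))
```

The returned headings, rather than the searcher's original words, would
then be submitted to the classified database.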
Sustainable Development Indicators Databank
Students of sustainable development often require access to
comparable time-series environment and development indicators. The
interdisciplinary nature of the topic requires that researchers gather
indicators from multiple statistical compendia published by a variety
of governmental, inter-governmental, and non-governmental
organizations. Unfortunately, the mechanics of this common research
task require an inordinate amount of time and energy.
The first challenge is to identify which compendia contain which
indicators. Library catalogs do not identify the contents of
compendia on a variable-by-variable basis. While good reference works
exist for this purpose, students must make a significant effort to
seek them out and learn how to use them effectively. A student
seeking a handful of indicators will likely find them in three or four
separate compendia.
The second challenge is learning to use the extraction software
that accompanies each compendium. Each compendium comes with its own,
often idiosyncratic software which takes time to master.
The third, and often most difficult, challenge is reformatting the
data extracted from multiple compendia into a common format for
analysis. One compendium may produce a file with a column for each
year, while another produces a file with a column for each country,
while a third produces a file with a column for each indicator. All
three compendia will likely use different coding schemes for country
names. The task of converting these different formats into a common
integrated dataset requires significant programming skills and time.
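
To make the reformatting problem concrete, the sketch below converts the
three layouts just described into one long format with a record per
country, year, and indicator, and then recodes country names. It is a
minimal illustration, not the Databank's method: the function names, the
input shapes (lists of dictionaries, as a CSV reader might yield), and
any sample values passed to them are assumptions.

```python
def from_year_columns(rows, indicator):
    """Layout 1: one row per country, one column per year."""
    return [
        {"country": r["country"], "year": int(k),
         "indicator": indicator, "value": v}
        for r in rows for k, v in r.items() if k != "country"
    ]


def from_country_columns(rows, indicator):
    """Layout 2: one row per year, one column per country."""
    return [
        {"country": k, "year": int(r["year"]),
         "indicator": indicator, "value": v}
        for r in rows for k, v in r.items() if k != "year"
    ]


def from_indicator_columns(rows):
    """Layout 3: one row per country and year, one column per indicator."""
    return [
        {"country": r["country"], "year": int(r["year"]),
         "indicator": k, "value": v}
        for r in rows for k, v in r.items() if k not in ("country", "year")
    ]


def harmonize(records, country_codes):
    """Recode country names using a caller-supplied (hypothetical) mapping.

    Names missing from the mapping are left unchanged.
    """
    return [
        dict(rec, country=country_codes.get(rec["country"], rec["country"]))
        for rec in records
    ]
```

Once every source is in the long format, the records can simply be
concatenated; the harder part in practice is agreeing on the country
coding scheme and on indicator names across compendia.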
The combined result of these three challenges is that many students
simply change research topics to less interdisciplinary topics for
which all of the required data resides in a single compendium.
In order to encourage this type of interdisciplinary research, the
Harvard Environmental Information Center is constructing a Sustainable
Development Indicators Databank that will provide world wide web
access to data from multiple compendia and deliver data files using a
common format and coding structure. Public domain data will be
accessible worldwide. Access to proprietary data will be restricted
to Harvard affiliates.
The initial focus of this effort will be for comparative, national
scale, annual, time-series indicators. A proof-of-concept edition of
the Databank is now available that provides access to the World Bank's
World Development Indicators 1997 on CD-ROM. Additional datasets will
be incorporated this coming summer.
Using HTML to Document a Panel Survey
Data is only as good as its documentation, and nowhere more so than in the
case of a panel survey in which constant and time-dependent changes
to the data contents and structure must be described and explained if
efficient use of the resource is to be made.
The British Household Panel Study (BHPS) is an annual panel survey of
some 10,000 individuals in 5,500 households, collected by the ESRC
Research Centre on Micro-Social Change. Six waves of data have now
been released, which up until wave five were accompanied solely by
paper documentation and/or its wordprocessed document source. All
documentation is now also produced in HTML format, and can be accessed
on the WWW (http://www.irc.essex.ac.uk/bhps/doc), and plans are in
place to adapt it to provide a file-based medium as well.
Although the HTML version of the documentation follows the basic
structure of the printed version, HTML's hyper-text facilities have
been used to the fullest extent to document and permit rapid
navigation through its complexities.
The presentation will describe the BHPS, discuss the general and
specific problems attendant upon its use, and describe the design,
limitations and systems in place to automate production of the HTML
documentation itself, as well as the plans for its future.
The Heart Health in Canada CD-ROM; Data as Program - Using
Standardized Metadata to Link Research, Policy and Action
The presentation discusses the Heart Health in Canada CD-ROM, which has
been created as a research, promotion and policy vehicle for the Canadian
Heart Health Initiative. The CD-ROM uses metadata standards to pull
together data created from ten independent provincial surveys, and to
link the data and codebooks with research reports and policy documents in
a population health framework. The data gathering and dissemination
activities are seen as integral components of the associated health
promotion programs, which are themselves situated within a broader health
determinants and population health context. The latter is achieved by
providing a metabase to facilitate data access and comparative analyses
for 150 key data sets in the population health, social and economic domain
in Canada. The CD-ROM enables students and researchers to drill down to
the underlying data from fact sheets and research reports; to browse,
search and select questions and variables of interest; to obtain extracts
automatically in SPSS, SAS, NSDStat+ or TPL format from data sets that are
licensed locally; to print customized inventories and codebooks; and to
build custom libraries of questions and data extracts of relevance to
local research interests and mandates.
The Dutch Data Documentation Initiative
Since September 1997 the Netherlands Historical Data Archive (NHDA)
and the Dutch social science data archive, Steinmetz Archive (STAR),
have been merged in the Data Archives department of a new institute, the
Netherlands Institute for Scientific Information Services (NIWI). The
first collective project to be carried out is the Dutch Data
Documentation Initiative. The central aim of this project is to
update and integrate data archiving procedures and documentation
standards of NHDA and STAR in the following sense:
It is the intention to arrive at a situation in which all data
documentation activities (data registration, data acquisition, data
description) are carried out by both data archives using, in principle,
the same system. This does not mean that the documentation
standards and procedures of NHDA and STAR will be 100% identical. The
preservation of electronic information for future use entails the need
to preserve the context, structure and contents of the data. Both
archives share methodological and technological problems, but the ways
in which they are solved also show variations, because context,
contents and structure of the electronic files that are processed by
the two types of data archives are different. In the paper the first
results of the project will be presented, including the way the study
description scheme has been modified compared with the DDI.
Marion Wittenberg is a sociologist and works at NIWI, Netherlands
Institute for Scientific Information Services. She is responsible for
the acquisition of data and for the documentation standards of the
Steinmetz Archive and she participates in the Dutch Data Documentation
Initiative project.
A New System for Web-based Documentation, Analysis, and Distribution
of Survey Data
For the past two years, the Computer-assisted Survey Methods Program at
the University of California in Berkeley has been developing software for
the documentation, analysis, and distribution of survey data on the World
Wide Web. The currently available procedures include the following:
Since the documentation for the data files can be made very accessible,
and since the data analysis procedures are very simple to use,
this type of data archive is ideal for many applications. It provides a
means of introducing students to data analysis without them
having to spend a great deal of time on technical details. It is also a
way of improving public access to policy and public opinion
data, and it provides statistics on demand for users of data libraries. To
facilitate user access in various countries, the interface can be
set up in any language readily displayed by a Web browser. Some current
applications can be viewed at the following URL:
http://csa.berkeley.edu:7502
Thomas Piazza is the Manager of Statistical Services at the Survey
Research Center and the Computer-assisted Survey Methods
Program at the University of California in Berkeley. He has been involved
with the design, analysis, and documentation of surveys
for more than 20 years.
Social Science Data Analysis over the Internet: Design and
Development Issues
Design considerations, remote analysis issues, and progress on
Sociometrics' "Multivariate Interactive Data Analysis System" (MIDAS)
will be presented. The goal of the service is to provide broad access to
interactive data analysis, via the Internet, of over 150 health and
social science databases containing over 150,000 variables from seven
national data archives. The service will include search & retrieval
programming and variable-level and study-level links to over 31,000
pages of supporting documentation, such as original instruments and
User's Guides in Portable Document Format. A custom interface will allow
users to easily interact with the system through any popular html
browser. Online data analytic procedures will include weighted and
unweighted frequencies, percentiles, and measures of dispersion and
central tendency, as well as two-way tables with measures of
association. Users will be able to define case subsets and filters for
analysis. Output can then be downloaded or printed. Users will have the
ability to download entire datasets and documentation in SAS and SPSS
compatible formats, as well as the ability to define variable subsets or
case subsets for user-customized dataset downloads.
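As an illustration only, the weighted and unweighted frequencies and
two-way tables mentioned above can be sketched in a few lines of Python.
This is not the MIDAS implementation; the function names, variables,
and sample responses are hypothetical.

```python
from collections import defaultdict

def weighted_frequencies(values, weights=None):
    """Return {category: (count, percent)}; unweighted when weights is None."""
    if weights is None:
        weights = [1.0] * len(values)
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    grand_total = sum(totals.values())
    return {cat: (tot, 100.0 * tot / grand_total) for cat, tot in totals.items()}

def two_way_table(rows, cols, weights=None):
    """Return {(row_category, col_category): weighted count}."""
    if weights is None:
        weights = [1.0] * len(rows)
    cells = defaultdict(float)
    for r, c, w in zip(rows, cols, weights):
        cells[(r, c)] += w
    return dict(cells)

# Hypothetical survey responses with case weights (made up for illustration).
sex = ["F", "M", "F", "F"]
smoker = ["yes", "no", "no", "yes"]
weight = [1.5, 1.0, 0.5, 1.0]

freq = weighted_frequencies(sex, weight)
table = two_way_table(sex, smoker, weight)
```

Passing no weights gives the unweighted counts; a case subset or filter
amounts to selecting the records before calling either function.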
Eric L. Lang, Ph.D. is a Principal Research Scientist and Director
of the Research Support Group at Sociometrics Corporation in Los Altos,
California. He directed the development of "Socionet", Sociometrics'
commercial social science WWW server. His Ph.D. is in Social Psychology
(Univ. of Michigan) and his interests include data archive development
and services, the Internet, and social science methodology.
Delivering Data to Undergraduate Classes
Introductory undergraduate courses in statistical methods and
survey analysis often fail to instruct students in the variety of data
available and the skills required to locate, obtain and incorporate
existing data into research projects. Typically such classes use a
packaged set of data such as a single year cross-section from the
General Social Survey (GSS). Working with Faculty and Teaching
Assistants, the Geospatial and Statistical Data Center has developed a
suite of Web-based data extraction and analysis tools that provide
access to a variety of public, commercial and locally produced data
sets. This talk will explore several of these and will focus on the
instructional and programming requirements of such services.
Establishing a Data Resource Centre
The following paper will outline the process of establishing a Data
Resource Centre (DRC) at the University of Guelph. Issues such as
targeted audience, teaching needs, research needs, levels of service,
staffing, hardware, software, security and delivery tools will be
discussed.
Prior to the fall of 1996 Guelph was in a situation similar to many
other research/teaching institutions. There were no formal procedures
in place for acquiring, distributing and analyzing data in an
electronic format. It was the responsibility of individual faculty,
researchers and students to develop the necessary skills. There was
limited statistical support and there was overlap in acquiring data
resources, as well as a duplication of efforts with respect to the use
of electronic information. In the fall of 1996 a pilot project was
started at the University of Guelph to consolidate the delivery of
electronic data resources. The project was a joint venture between
Computing and Communication Services and the Library. Staff were
seconded from both service providers and centralized facilities were
established. After a very successful pilot, the DRC became a full
service facility in the spring of 1997. Discussions are well under
way to develop this service into a seamless, shared resource between
the University of Guelph, University of Waterloo and Wilfrid Laurier
University.
The paper looks at the motivation behind the DRC, how information on
demand was gathered, and who the target audience is. Certain goals and
objectives were set and the paper looks at how these goals are being
achieved and some of the obstacles encountered. A large portion of
the efforts in the DRC is centered on the development of a web
retrieval system. A Perl script has been developed to interface with
SAS, which allows an enormous variety of data to be easily mounted,
distributed and analysed on-line. In the 14 months since the first
iteration of the script over 200 surveys have been mounted. The paper
will also look at how the service is being integrated into the library
and how staff are being trained to use the interface, as well as
prepare data to be mounted on the web.
For more information please refer to http://drc.uoguelph.ca.
Bo Wandschneider has been responsible for the DRC project at the
University of Guelph since its implementation in December 1996. Prior
to that he was Computing Coordinator in the Department of Economics at
the University of Guelph (1986 - 1996). His educational and research
background is in the area of applied economics. For more information
see: http://www.uoguelph.ca/~bo
From Scissors to Pentiums: Where We have been and Where We are Going in
Qualitative Data Analysis
Wendy Wright
Department of Social Sciences Computing, UCLA
Wendy Wright will give a brief orientation to the field of qualitative
data analysis discussing historical developments relevant to
the field, types of qualitative data analysis software available, benefits
and pitfalls of using qualitative data analysis software, and
why it is important for Social Sciences Computing departments at
universities to recognize the growing field of qualitative data
analysis.
Wendy Wright is Manager of Planning and Development for Social
Sciences Computing at UCLA. In addition, she is completing
a doctorate in Medical Anthropology at UCLA. Her dissertation is on
cervical cancer in Mexican-American women.
(wright@ssc.ucla.edu)
Qualitative Data Analysis Software as a Tool Rather than Dictator of
Process
Raymond Maietta
Indiana University
This paper will focus on the importance of an analyst's personal
research style. This style should dictate when and why a software
package's features contribute to an examination of qualitative data. Many
qualitative data analysis programs are inherently flexible, but not
intuitively flexible. Often novice users allow program features (rather
than their own research goals) to guide their analyses. As a consultant, I
often see how persistent misunderstandings of the logic of NUD*IST
software lead to misguided use of the program. For example, NUD*IST's
omnipresent node explorer and power search functions tempt users to data
reduction. Alternatively, I suggest strategies for NUD*IST instruction
that emphasize flexible, fluid interaction with qualitative data.
Ray Maietta received his Ph.D. from the State University of New York at Stony
Brook in 1996. His dissertation, "Lost in the Shuffle: In Search of
Wayward Friendship," was a qualitative analysis of friendship in a
southwest US suburb. Currently, he is an NIMH postdoctoral fellow in the
department of sociology at Indiana University. His research interests are
in interpersonal relationships and the sociology of culture, and he is a
trainer and consultant in the use of NUD*IST.
The Use of Computer-Assisted Software Programs in Analyzing Qualitative
Data: Methodological Implications
Sharlene Hesse-Biber
Boston College
Sharlene Hesse-Biber will discuss the methodological controversies
surrounding the use of computer software programs to analyze qualitative
data. She examines several issues: I: The issue of art versus technology;
II: Blurring the line between Quantitative and Qualitative Data; III:
Issues of Validity and Reliability. She discusses recent cutting edge
computer software which analyzes multi-media data including images, video
and audio discs and tapes and addresses the methodological issues involved
in analyzing multi-media data.
Sharlene Hesse-Biber is professor of sociology at Boston College.
She has published widely in the field of computers and qualitative data
analysis. Her most recent co-authored articles, "Users' Experiences with
Qualitative Data Analysis Software," and "New Developments in Video
Ethnography and Visual Sociology--Analyzing Multimedia Data
Qualitatively," appeared in Social Science Computer Review. Dr.
Hesse-Biber is co-developer of the computer software program,
HyperRESEARCH, which analyzes qualitative text and multi-media qualitative
data.
(http://www.reasearchware.com)
Issues in Principled Choice of Qualitative Data Analysis Software
Eben Weitzman
University of Massachusetts, Boston
This paper will address two issues: A principled approach to choosing a
software package; and some comments on the state and direction of the
field. These issues are intertwined. Choice should be based, not on an
abstract notion of which is the "best" or "most powerful" program, but on
a careful matching of program abilities, requirements, and constraints to
your individual dataset, analytic needs, and personal aptitudes and style.
A number of observations will be offered on the current, emerging, and
hoped-for state of the field with respect to its support for the varieties
of such needs among researchers.
Eben A. Weitzman received his Ph.D. in
social and organizational psychology from Columbia University and is
currently Assistant Professor, Graduate Programs in Dispute Resolution,
University of Massachusetts Boston, and Research Associate at the
International Center for Cooperation and Conflict Resolution. His
interests are in organizational development, cross-cultural conflict,
conflict resolution, intergroup relations, and qualitative research
methods, and he is the senior author of Computer Programs for Qualitative
Data Analysis (Sage, 1995), with the late Matthew B. Miles.
(weitzmane@umbsky.cc.umb.edu)
Stacy Horn founded ECHO in 1989 as a virtual salon of New York City,
similar in many respects to THE WELL in California, but quite distinct
in its own organizational culture. The genesis of ECHO and the ensuing
problems of running a virtual salon and managing arising conflicts
between its members are described in Stacy Horn's recently published
book 'Cyberville'. Taking issue with a common assertion, Stacy Horn
maintains that people's true characters and personalities take
precedence over attempts at role-playing and at assuming a fake
identity. Consequently, cyberspace will not dramatically alter the
essence of human interactions; rather, it just adds another channel of
communication, simply increasing the frequency of human interaction as
we have known it all along.
This panel addresses the current Federal policies and practices for
the public archiving and use of data generated as a result of
extramural research programs. The approaches of several agencies vary
in 1) the extent to which policies are established practice or in
development, 2) whether archiving is an expected product of research awards,
3) the scope of research data included, 4) the expected schedule of
archiving, 5) procedures for supporting data archives, 6) programs
encouraging secondary data analysis, and 7) the mechanisms for
reviewing and evaluating agency policies.
CASWEB: A Web-based interface to UK Census area statistics
James Harris
University of Manchester
This paper will address the digital dissemination of census data in
the UK. A number of important weaknesses in the 1991 model of data
access are identified and solutions explored in the context of the
CASWEB experimental Web-based interface to Census area
statistics. This project, which is funded under the Economic and
Social Research Council 2001 Census Programme, is being carried out in
close consultation with the UK Census Offices and the academic census
user community. The system comprises a large relational database
accessed via an intuitive Web interface consisting of both text based
menus and desktop mapping functionality embedded within the user’s
web browser. The interface allows the user to dynamically select,
subset, crosstabulate and interrogate census counts and associated
metadata. The map-based front-end places spatially-referenced census
data within its geographical context and incorporates a number of
spatial data resources including the census boundaries and digital map
data.
The system has been implemented across a range of development
environments and various platform/software combinations have been
evaluated during the course of the project. The Web interface uses a
combination of HTML forms, server-side scripting and proprietary
server software to pass SQL queries to the database via CGI and the
Web server API. The implications and advantages of employing advanced
methods for user interaction and data retrieval will be discussed and
the presentation will include a live on-line demonstration of the
interface.
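A minimal sketch of the general pattern described above, in which a Web
form submission is turned into an SQL query against a relational
database, might look as follows in Python. It is not the CASWEB code:
the table name, column names, form fields, and sample counts are all
hypothetical, and an in-memory SQLite database stands in for the real
server software.

```python
import sqlite3
from urllib.parse import parse_qs

# Columns a user may request; anything else in the form is ignored.
# (Hypothetical schema, invented for this illustration.)
ALLOWED_COLUMNS = {"area", "total_persons", "households"}

def build_query(query_string):
    """Turn a form submission into (sql, params) using bound parameters."""
    form = parse_qs(query_string)
    columns = [c for c in form.get("col", []) if c in ALLOWED_COLUMNS]
    if not columns:
        raise ValueError("no valid columns requested")
    sql = "SELECT " + ", ".join(columns) + " FROM census_counts"
    params = []
    if "area" in form:
        # The user-supplied value is bound as a parameter, never pasted into SQL.
        sql += " WHERE area = ?"
        params.append(form["area"][0])
    return sql, params

# Small made-up table standing in for the census database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE census_counts (area TEXT, total_persons INTEGER, households INTEGER)"
)
conn.executemany(
    "INSERT INTO census_counts VALUES (?, ?, ?)",
    [("A01", 1200, 480), ("A02", 950, 400)],
)

sql, params = build_query("col=area&col=total_persons&area=A02")
rows = conn.execute(sql, params).fetchall()
```

Restricting column names to a fixed list and binding values as
parameters keeps user input out of the SQL text itself.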
Taking Web-Based Data Services to the Classroom and Beyond: The ISLAND
Model
Brian Kroeker, University of British Columbia
Some data extraction programs have suffered from two important factors
so far, these being:
The data user is not a single type of person with a single level of
skill, nor are all users equal in terms of training and experience, nor in
what they wish to do with datasets.
The same is also true of anyone who maintains a data extraction system
for others. In most libraries there is never enough staff time, so a
data extraction system must take this into account: it should be
uncomplicated yet powerful, and flexible yet require relatively few
highly technical computer skills to update quickly and extend with
new features.
This paper explores the ISLAND data extraction system at UBC, and
identifies actions taken in the creation of this system to overcome
some of these difficulties.
Brian Kroeker has been a Programmer/Analyst at the University of
British Columbia for nine years. His major goal over the past few years
has been adapting data files for access on the World Wide Web with a focus
towards effective interface design.
Preservation of Electronic Records: The Roper Center Experience,
1946-1998
Marc Maynard
Roper Center for Public Opinion Research,
University of Connecticut
During the past several decades, data archives have faced a multitude
of obstacles in carving their role as integral parts of the research
community. These challenges include such things as balancing the
needs and expectations of users with the realities of collection condition
and variety of data formats; developing criteria for data acquisition;
keeping up with new media formats; and training and retaining staff with
the unique amalgam of skills and talents needed by data librarians.
Challenges of this nature have been overcome, with varying levels of
success, by data libraries throughout the world.
The advent of the Internet and the current focus of new technological
developments on networking and data access have added new
challenges to the operation and, potentially, the existence of data
archives. The argument for the continued existence of physical data
archives is undergoing scrutiny in these days of "virtual archives."
When any scholar, research institute, commercial firm or interest group
can host a "virtual archive" on the World Wide Web, what are the
incentives and motivations for maintaining social science data archive
facilities? What are the incentives for data producers to archive their
materials at established libraries? What are the incentives for
researchers to utilize archive services, when (at least some) data can
be found elsewhere?
The Roper Center has been and continues to be faced with the challenges
mentioned above. While these issues are new in the context of the
Internet, this paper seeks to re-examine them by looking at the early
years of the Roper Center and the development and growth of both its data
collections (regarding substantive and technical issues) and its mission.
The commercial nature and age of its collections make the Center a
unique enterprise and one from which much can be learned.
Marc Maynard is Assistant Director for Technical Services at The
Roper Center for Public Opinion Research, University of Connecticut.
Archives Choice between Museum and Data Library -- The Danish National
Archives Adaption to the Information Technology Progress in Denmark
Lars Kristian Larsen and Lise Qwist Nielsen
Danish National Archives
The rapid development of technology use in the Danish
central administration has forced the Danish National Archives to
adapt to this technology progress, and to make a choice between becoming a
museum of paper archives or to develop into a modern archive, handling any
type of archives that Danish authorities are using now or will use in the
future. The Danish National Archives chose the Information Technology
Strategy, and became by this choice an active component in the Danish
information technology progress. The Danish government has presented a
solid platform for adaption to the challenge of IT modernization.
The first element is "Information Society 2000", the policy paper that was
to be implemented by a new Ministry of Science. The policy paper has become a
guide for many central and municipal authorities using modern information
technology as a tool in modernizing the administration. In this process
many authorities are trying to implement and prepare the use of electronic
communication and archiving. The second is a 1996/97 revision of the archival
legislation, which gave the Danish National Archives the authority and
responsibility to prepare the future handling of electronic archives. With
this paper it is our ambition to present the adaption strategy of The
Danish National Archives to information technology in an archival
perspective by:
Lise Qwist Nielsen (Master of Library and Information Science) is
working as an Archivist in the Danish National Archives, dealing with
appraisal and selection of electronic archives from the public sector.
Lars Kristian Larsen (M.A. of Political Science) is working as an Archivist in
the IT department section of the Danish National Archives, dealing with standard
and method development.
Email as Record: Challenges to Traditional Archival and Records
Management of Electronic Records
Mark Conrad, Center for Electronic Records
US National Archives & Records Administration
The National Archives has preserved and made available selected
electronic records of the U.S. Federal Government for over two
decades. Most of the records that have been accessioned have been
statistical data sets or files from simple database applications.
Today, however, agencies in the U.S. Government are using computers to
produce a greater variety of increasingly complex electronic records.
These new records pose serious challenges to traditional archival and
records management practices. In this presentation I want to look at
records created by e-mail systems to illustrate some of these
challenges.
The National Archives has initiated multiple projects to select,
preserve, and provide access to several million messages from e-mail
systems in Federal agencies. This presentation will report on the
challenges we have encountered in carrying out these projects. Some of
the challenges these records pose are: the sheer volume of records to
be processed; the difficulty in sorting the wheat from the chaff; the
complexities of redacting restricted information; the difficulty in
identifying and migrating all the component parts of e-mail messages
as systems become obsolete; the challenge of helping researchers find
relevant records from a corpus of several million messages.
Mark Conrad is an archivist currently working in the Life Cycle
Management Division of the National Archives and Records
Administration (U.S.). He has been working with archival electronic
records for the past seven years.
Applying Parallel Processing to Social Science Data: Pushing the
Limits
Albert F. Anderson & Paul H. Anderson
Public Data Queries, Inc.
Computing and information system technologies have had a dramatic
impact on the management and analysis of social science data. Massive
census and survey data sets that just a few years ago could be handled
only in large, centralized mainframe environments are now routinely
analyzed on desktop workstations and PCs. One consequence of the increased
computing power and storage capabilities has been the capability to handle
increasingly large data sets. However, as the data sets available to
researchers and policy makers have moved from kilobytes through megabytes
to gigabytes and, soon, terabytes, of data, the demands for processing
power have stayed ahead of the capabilities of technology.
Many of the data management and analysis tasks that face social
scientists are inherently parallel. Typically, the same processing steps
are applied to each of perhaps millions of data records. As a consequence,
traditional handling of large data sets has been constrained as much by
the input capabilities of the available computing systems as their
processing power. Parallelizing the input/output (I/O) data stream along
with data processing tasks has been demonstrated to dramatically decrease
the processing time required to handle data sets ranging to hundreds of
millions of records. Thus, parallel processing is evolving as the next
technology to be exploited to the advantage of social scientists.
The authors have more than five years' experience in applying parallel
computing systems to the management and analysis of social science data.
Over that time, processing speeds have increased from a few megahertz
(MHz) to hundreds of MHz, hard disk storage from megabytes (MB) to
gigabytes (GB), disk access from kilobytes per second (KBPS) to megabytes
per second (MBPS), and random access memories from kilobytes to megabytes
and now gigabytes. More power is available on the desktop today than was
available in the largest mainframe systems of a few years ago. The
consequence has been that tasks that once took days can now be
accomplished in seconds.
The authors are developing a commercial system, PDQ-Explore, capable of
providing interactive access to data sets as large as full national
censuses. The effort has involved the design and implementation of a
system highly optimized for handling social science data and analytic
tasks. It has required balancing performance across the various subsystems
(storage, I/O, processing, memory, and inter-processor communications) of
the parallel systems. This paper presents a summary review of past
progress, outlines the strategies used to achieve maximum performance
for specific tasks, and focuses on the current challenges in applying
parallel processing power to social science applications. These challenges
vary from conceptually simple but challenging procedures such as
determining median household income by state, race, and family structure
from national census microdata to more complex resampling techniques and
iterative fitting of models to large data sets.
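To illustrate why a task such as a median by state is conceptually
simple yet awkward to parallelize, the following Python sketch splits
the records into chunks, applies the same grouping step to each chunk
concurrently, and then merges the partial results. It is not
PDQ-Explore; the function names and the sample records are hypothetical.
A thread pool is used only to show the partition-and-merge structure;
real CPU-bound work would call for separate processes or machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from statistics import median

def group_chunk(chunk):
    """Apply the same step to every record in one chunk: bucket income by state."""
    buckets = defaultdict(list)
    for state, income in chunk:
        buckets[state].append(income)
    return buckets

def median_income_by_state(records, n_chunks=4):
    """Partition the records, group each chunk concurrently, then merge."""
    chunks = [records[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(group_chunk, chunks))
    # A median cannot be combined from per-chunk medians, so the raw
    # values for each state are merged before the final computation.
    merged = defaultdict(list)
    for partial in partials:
        for state, incomes in partial.items():
            merged[state].extend(incomes)
    return {state: median(incomes) for state, incomes in merged.items()}

# Made-up (state, household income) records for illustration.
records = [("MI", 30000), ("MI", 50000), ("MI", 40000),
           ("IA", 20000), ("IA", 60000)]
result = median_income_by_state(records)
```

Sums and counts can be merged from per-chunk totals alone, whereas the
median needs the underlying values from every chunk, which is one reason
such a simple-looking statistic remains demanding at census scale.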
Support for this work comes in major part from Small Business
Innovation Research (SBIR) and Technology Transfer Research (STTR) grants
from the National Institute of Child Health and Human Development (NICHD)
and the National Institute on Aging. For information about PDQ, Inc, and
PDQ-Explore, see: http://www.pdq.com .
Albert F. Anderson is currently the Director of Research for Public
Data Queries, Inc. (PDQ), a family-owned company in Ann Arbor, Michigan.
He has a Ph.D. in sociology from Iowa State University. He retired from
the University of Michigan in 1996 following 25 years as the co-head
of the data processing section at the Population Studies Center. Paul H.
Anderson is currently a Vice President and Director
of Technological Development at PDQ. He has a master's degree in computer
engineering from the University of Michigan and has ten years of
experience designing and implementing academic, research, and commercial
computing applications on single and multiple processor platforms.
Dynamic Exploratory Data Analysis: The Users Requirements
The growth of computing power has, generally, not led to
development in statistical techniques for the analysis of large and
complex datasets but to a downsizing of computer power needed. For
example, a quick scan of most applied socio-economic journals shows
that most of the quantitative techniques utilised have varied little
in the computing power required from those used thirty years ago.
In the current situation there is much opportunity for new
non-parametric techniques to be developed which will allow for the
graphical and spatial analysis of complex and large datasets through
interactive tools.
Such analysis, however, places new requirements on data librarians,
both in providing access to datasets not necessarily held locally and
in local support, in terms of understanding the limitations of the
datasets and the provision of software which allows such analysis.
This paper considers, through practical examples drawn from
experience of trying to develop such techniques for UK and RoI data,
the challenges this approach to data analysis raises for IASSIST/SSCA
members.
Derek Bond is a Senior Lecturer in the Ulster Business School
and Director of the Northern Ireland Regional Research Laboratory.
Concurrent Session 4A: Applying Metadata Standards
Thursday, 10:30 AM
Concurrent Session 4B: Designing and Delivering Data on the Web, Part One
Thursday, 10:30 AM
Concurrent Session 4C: Getting Quality in Qualitative Data Analysis
Thursday, 10:30 AM
Lunch Speaker: People are People -- Even online where no one can see them
Thursday, 12:00 Noon
Plenary Panel: Archiving Data from Government Supported Research:
Policies, Practices, and Possibilities
Friday, 9:00 AM
Concurrent Session 5A: Designing and Delivering Data on the Web, Part Two
Friday, 10:30 AM
Concurrent Session 5C: Analysis of Large Data Sets
Friday, 10:30 AM