Digital Repositories

International Workshop on Social Science Data Archives Held in Taiwan, Sponsored by IASSIST

The International Workshop on Social Science Data Archives, sponsored by IASSIST, was held on September 15 in Conference Room II of the Research Center for Humanities and Social Sciences (RCHSS), Academia Sinica, Taipei, Taiwan. The invited speakers included Prof. Dr. Christof Wolf from GESIS – Leibniz Institute for the Social Sciences, Dr. Yukio Maeda and Dr. Kaoru Sato from the Social Science Japan Data Archive (SSJDA), University of Tokyo, and Dr. Won-ho Park and Dr. Seokho Kim from the Korea Social Science Data Archive (KOSSDA), Seoul National University.

The finalized workshop agenda is listed below. We also had Dr. Ruoh-Rong Yu introduce the Survey Research Data Archive (SRDA) of Taiwan. The presentations covered the data curation, preservation, and dissemination services provided by each data archive.

09:00–09:30  Registration

09:30–09:40  Opening Remarks
Dr. Ching-Ching Chang, Chair Professor, Department of Advertising, National Chengchi University, TAIWAN

Morning Session (Session Chair: Dr. Ching-Ching Chang)

09:40–10:20  Curating, Preserving, and Disseminating Social Science Micro Data at Social Science Japan Data Archive
Dr. Yukio Maeda, Professor, Institute of Social Science, University of Tokyo, JAPAN

10:20–11:00  Introduction to Korea Social Science Data Archive
Dr. Won-ho Park, Associate Professor, Department of Political Science and International Relations, Seoul National University, KOREA
Dr. Seokho Kim, Associate Professor, Department of Sociology, Seoul National University, KOREA
Dr. In Chol Shin, Senior Researcher, Korea Social Science Data Archive, Seoul National University Asia Center, KOREA

11:00–11:20  Tea Break

11:20–12:00  Introduction to Survey Research Data Archive of Taiwan
Dr. Ruoh-Rong Yu, Research Fellow and Executive Director, Center for Survey Research, Research Center for Humanities and Social Sciences, Academia Sinica, TAIWAN

12:00–14:00  Lunch

Afternoon Session (Session Chair: Dr. Chyi-In Wu, Research Fellow, Institute of Sociology, Academia Sinica)

14:00–15:00  Services for Survey Data: The GESIS Perspective
Dr. Christof Wolf, President, GESIS – Leibniz Institute for the Social Sciences, GERMANY

15:00–15:20  Closing Remarks

Registration for the workshop opened on May 1, 2017. The registration fee was NT$200, which covered printed conference materials, lunch, and light refreshments. Sixty-nine researchers attended the workshop. Most attendees were local scholars, while others came from Thailand, Turkey, and other countries.

In the opening remarks, Dr. Chang stressed the importance of data archives and briefly introduced the speakers of the morning session.

The first speaker, Dr. Maeda, introduced the development and current practice of SSJDA. He also introduced several other data centers in Japan, including the Leviathan Data Bank, the Rikkyo University Data Archive, and the Research Centre for Information and Statistics of Social Science at Hitotsubashi University.

SSJDA started in 1998, and its deposits now amount to 2,018 datasets. Its main collections include the Japanese General Social Surveys, Japanese Life Course Panel Surveys, Japanese Election Studies, National Family Research of Japan, Working Persons Survey, and Elementary School Students Survey. Researchers affiliated with academic institutions, as well as graduate students, can access SSJDA datasets for academic purposes. Applicants must sign an agreement (pledge) and obtain permission from the PI in advance. Under the supervision of professors, undergraduate students are allowed to access certain data for writing papers; such usage is classified as educational use rather than research use. Some datasets are for research use only and are not available for educational use.

SSJDA also offers several seminars on data usage and a one-week seminar on quantitative analysis every year. In addition, SSJDA built a desktop application for managing metadata based on the DDI Lifecycle model, named Easy DDI Organizer (EDO). EDO can be used to edit metadata, import metadata and variable information from statistical software, and export documents. It is a useful tool for researchers, data users, and data archives; however, it is currently available only in Japanese.
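EDO's appeal is that DDI metadata is machine-actionable. As a rough illustration of the idea (a minimal sketch using a schematic DDI-Codebook 2.5 fragment with made-up variable names; EDO itself targets the richer DDI Lifecycle model), variable-level metadata in DDI XML can be read with a few lines of Python:

```python
# Minimal sketch: reading variable names and labels from a DDI-Codebook-style
# XML fragment. Schematic only, not EDO output (EDO targets DDI Lifecycle),
# but it shows why standardized metadata is machine-actionable.
import xml.etree.ElementTree as ET

NS = {"ddi": "ddi:codebook:2_5"}  # DDI-Codebook 2.5 namespace

sample = """<codeBook xmlns="ddi:codebook:2_5">
  <dataDscr>
    <var name="age"><labl>Age of respondent in years</labl></var>
    <var name="vote"><labl>Party voted for in the last election</labl></var>
  </dataDscr>
</codeBook>"""

root = ET.fromstring(sample)
for var in root.findall(".//ddi:var", NS):
    labl = var.find("ddi:labl", NS)
    print(var.get("name"), "->", labl.text if labl is not None else "(no label)")
```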

The second speaker was Dr. Park from KOSSDA, Korea's leading data archive, which has expertise in the collection and dissemination of research materials and promotes them through various academic events and methodology education programs. Founded in 1983 as a non-profit social science library, KOSSDA began to collect survey data in 2003 and moved to the Seoul National University Asia Center in 2015.

KOSSDA collects survey data, statistical tables, qualitative interviews and narrative history data, documents, observation records, and other kinds of data produced by research institutes and individuals. KOSSDA also builds digital databases and provides access to the data. Its main collections include the Korean General Social Survey, the ISSP Annual Topical Module Survey, and the Gallup Omnibus Survey. KOSSDA has translated 250 survey datasets, including their questionnaires and codebooks, into English.

KOSSDA is now rebuilding its website to enhance its data search function and improve its web design. KOSSDA also offers methodology training programs, data fairs, and a research paper competition every year.

After a 20-minute tea break, the presentation on SRDA began. The speaker, Dr. Yu, is the Executive Director of the Center for Survey Research at Academia Sinica. SRDA was established in 1994 and now has eleven full-time staff members, including two IT staff. The data archived by SRDA include survey data, census data, and in-house value-added data.

SRDA curates academic survey data such as the Taiwan Social Change Survey, Panel Study of Family Dynamics, Taiwan Social Image Survey, Taiwan Youth Project, Taiwan Education Panel Survey, and Taiwan's Election and Democratization Study. In addition, SRDA curates government survey data, including the Manpower Survey, Manpower Utilization Survey, Women's Marriage, Fertility and Employment Survey, Survey of Family Income and Expenditure, Digital Opportunity Survey for Individuals and Households, and Survey on Workers' Living and Employment Conditions. The number of datasets disseminated by SRDA exceeds 2,800, of which 315 have English versions.

SRDA operates a membership scheme. Academia Sinica members are researchers at Academia Sinica; regular members are faculty, researchers, students, or research assistants at colleges or research institutes. There are currently 2,302 members. Members can access most of the archived data by downloading directly from the SRDA website.

SRDA members can also apply for data with restricted access; restricted datasets can be used via on-site or remote access. All services provided by SRDA are currently free of charge.

SRDA offers workshops, webinars, and on-campus lectures to promote data usage. In addition, SRDA maintains several social media channels, including a Facebook fan page, a YouTube channel, and the SRDA blog.

SRDA has for years been building a bibliography of publications based on its data. Since 2016, SRDA has registered DOIs via da|ra. One task in progress is a data integration platform for Taiwan Social Change Survey data across survey years. Other main tasks include enlarging data storage, broadening membership, remodeling the website, developing data management plans, and constructing an evaluation scheme for data disclosure risk.

Dr. Chyi-In Wu chaired the afternoon session. The presenter, Dr. Wolf, introduced the development and current progress of GESIS. Compared with the data archives of Asian countries, GESIS's budget and personnel are very large. GESIS was founded in 1960, and its data archive for the social sciences is one of GESIS's five research departments, with about 70 staff members organized into seven teams.

GESIS currently holds about 6,000 datasets, which mainly focus on migration, elections, values and attitudes, and social behavior. The ISSP, CSES, EVS, and ALLBUS are some of the well-known social science survey programs in its collections. PIs can easily upload datasets through the Datorium system, a self-deposit service for sharing data.

Dr. Wolf stressed the importance of DOIs (Digital Object Identifiers) and introduced da|ra, the DOI registration service built by GESIS. da|ra has 576,297 registered DOI names and 88 data providers worldwide, including ICPSR and SRDA. In addition to hosting da|ra, GESIS is devoted to developing international standards for data documentation and data archiving, and to providing training and consulting services to researchers.
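A side benefit of DOI registration is that anyone can verify programmatically where a dataset's DOI resolves. The sketch below uses the doi.org handle-proxy REST API as I understand it; the DOI shown is a made-up placeholder, not a real da|ra name:

```python
# Sketch: checking where a dataset DOI resolves via the doi.org handle proxy.
# The DOI below is a hypothetical placeholder, not a registered da|ra DOI.
import requests

doi = "10.1234/example-dataset"  # hypothetical DOI
resp = requests.get(f"https://doi.org/api/handles/{doi}", timeout=30)
info = resp.json()

# responseCode 1 means the handle exists; 100 means it was not found.
if info.get("responseCode") == 1:
    urls = [v["data"]["value"] for v in info["values"] if v["type"] == "URL"]
    print("DOI resolves to:", urls)
else:
    print("DOI not found:", doi)
```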

In the presentation, Dr. Wolf also talked about GESIS's secure data center, which enables researchers to access sensitive, weakly anonymized data. It is a locked room without internet access; users must sign contracts in advance, and all inputs and outputs are checked for disclosure risk. In the future, the secure data center will establish a remote access system that can provide secure access to data curated in CESSDA.

A business meeting was held the next day (September 16). Besides the guests from GESIS, KOSSDA, and SSJDA, participants included researchers at the Center for Survey Research and all SRDA staff. The agenda was as follows.

Development of Consortium of European Social Science Data Archives (CESSDA) – Christof Wolf (GESIS)

Connections among SSJDA, KOSSDA and SRDA in Recent Years – Ruoh-Rong Yu (SRDA)

Possible Future Collaboration among Data Archives – All Participants

There have been frequent connections among KOSSDA, SSJDA, and SRDA in recent years. Conferences and/or workshops were hosted in rotation in 2008, 2012, 2014, 2015, 2016, and 2017.

In 2016, KOSSDA organized an international conference with invited guests from SSJDA at the University of Tokyo (Japan), CNSDA at Renmin University (China), and SRDA at Academia Sinica (Taiwan). In this conference, a consensus was reached to develop a regional association of data archives in Asian countries, namely the Networks of Asian Social Science Data Archive (NASSDA).

The main purpose of this year's business meeting was to discuss possible future collaboration among data archives in Asian countries. The brief conclusions are listed below:

  1. To build a joint data catalogue for the archives involved.
  2. To construct web links and brief introductions among the archives.
  3. To have a contact person for each data archive for future cooperation.

NASSDA members will hold annual workshops or conferences on a rotating basis. Further collaboration will be discussed in the near future.

IQ 40:1 Now Available!

Our World and all the Local Worlds

Welcome to the first issue of Volume 40 of the IASSIST Quarterly (IQ 40:1, 2016). We present four papers in this issue. The first paper presents data from our very own world, extracted from papers published in the IQ through four decades. What is published in the IQ is often limited in geographical scope, and in this issue the other three papers present investigations and project research carried out at New York University, Purdue University, and the Federal Reserve System. However, the subject scope of the papers and the methods employed bring great diversity. And although the papers are local in origin, they all have a strong focus on generalization in order to spread the information and experience.


We proudly present the paper that received the 'best paper award' at the 2015 IASSIST conference. Great thanks are expressed to all the reviewers who took part in the evaluation! In the paper 'Social Science Data Archives: A Historical Social Network Analysis', the authors Kristin R. Eschenfelder (University of Wisconsin-Madison), Morgaine Gilchrist Scott, Kalpana Shankar, and Greg Downey report on inter-organizational influence and collaboration among social science data archives, drawing on data from articles published in the IASSIST Quarterly from 1976 to 2014. The paper demonstrates social network analysis (SNA) using a web of 'nodes' (people/authors/institutions) and 'links' (relationships between nodes). Several types of relationships are identified: influencing, collaborating, funding, and international. The dynamics are shown in detail by employing five-year sections. I noticed that from a reluctant start the number of relationships has grown significantly, and archives have continuously grown better at bringing in 'influence' from other 'nodes'. The paper contributes to the history of social science data archives and the shaping of a research discipline.


The paper 'Understanding Academic Patrons' Data Needs through Virtual Reference Transcripts: Preliminary Findings from New York University Libraries' is authored by Margaret Smith and Jill Conte, who are both librarians at New York University, and Samantha Guss, a librarian at the University of Richmond who worked at New York University from 2009 to 2014. The goal of their paper is 'to contribute to the growing body of knowledge about how information needs are conceptualized and articulated, and how this knowledge can be used to improve data reference in an academic library setting'. This is carried out by analysis of chat transcripts of requests for census data at NYU. There is high demand for the virtual services of the NYU Libraries, with as many as 15,000 chat transactions annually. There has not been much qualitative research into users' data needs, but here the authors exemplify the iterative nature of grounded theory, with data collection and analysis processes inextricably entwined, and use a range of software tools such as FileLocator Pro, TextCrawler, and Dedoose. Three years of chat reference transcripts were filtered down to 147 transcripts related to United States and international census data. The unique data provide several insights, shown in the paper. However, the authors are also aware of the limitations of the method, as it did not capture whether the patron or librarian considered the interaction successful. The conclusion is that there is a need for additional librarian training and improved research guides.


The third paper is also from a university. Amy Barton, Paul J. Bracke, and Ann Marie Clark, all from Purdue University, collaborated on the paper 'Digitization, Data Curation, and Human Rights Documents: Case Study of a Library Researcher-Practitioner Collaboration'. The project concerns the digitization of Urgent Action Bulletins of Amnesty International from 1974 to 2007. The political science research centered on changes in transnational human rights advocacy and legal instrumentation, while the Libraries' research related to data management, metadata, the data lifecycle, etcetera. The specific research collaboration model developed was also generalized for future practitioner-librarian collaboration projects. The project is part of a recent trend in which academic libraries improve engagement and combine activities among libraries, users, and institutions. The project attempts to integrate two different lifecycle models, thus serving both research and curatorial goals, where the central question is: 'can digitization processes be designed in a manner that feeds directly into analytical workflows of social science researchers, while still meeting the needs of the archive or library concerned with long-term stewardship of the digitized content?'. The project builds on data from the Urgent Action Bulletins produced by Amnesty International, which indicate how human rights concerns changed over time and what threats existed in different countries at different periods, and it combines library standards for digitization and digital collections with researcher-driven metadata and coding strategies. Data creation started with scanning and the creation of optical character recognition (OCR) versions of full-text PDFs for text recognition and modeling in NVivo software. The project did succeed in developing shared standards. However, a fundamental challenge was experienced in the grant-driven timelines for both library and researcher. It seems to me that the expectation of parallel work was the challenge to the project. Things take time.


In the fourth paper we enter the case of the Federal Reserve System. San Cannon and Deng Pan, working at the Federal Reserve Banks of Kansas City and Chicago, created a pilot infrastructure and workflow to support making the publication of research data a regular part of the research lifecycle. This is reported in the paper 'First Forays into Research Data Dissemination: A Tale from the Kansas City Fed'. More than 750 researchers across the system produce about 1,000 journal articles, working papers, etcetera each year. The need for data to support the research has been recognized, and the institution is setting up a repository and defining a workflow to support data preservation and future dissemination. In early 2015 the internal Center for the Advancement of Research and Data in Economics (CADRE) was established with a mission to support, enhance, and advance data- or computationally-intensive research, and preservation and dissemination were identified as important support functions for CADRE. The paper presents details of and questions about the design, such as the types of collections and the kind and size of data files, and demonstrates the influence of testers and curators. The pilot also had to decide on the metadata fields to be used when data are submitted to the system. The complete setup, including the incorporated fields, was enhanced through pilot testing and user feedback. The pilot is now being expanded to other Federal Reserve Banks.


Papers for the IASSIST Quarterly are always very welcome. We welcome input from IASSIST conferences or other conferences and workshops, from local presentations, or papers especially written for the IQ. When you are preparing a presentation, give a thought to turning your one-time presentation into a lasting contribution. We permit authors 'deep links' into the IQ as well as deposition of the paper in your local repository. Chairing a conference session with the purpose of aggregating and integrating papers for a special issue of the IQ is also much appreciated, as the information reaches many more people than the session participants and will be readily available on the IASSIST website at http://www.iassistdata.org.


Authors are very welcome to take a look at the instructions and layout: http://iassistdata.org/iq/instructions-authors.

Authors can also contact me via e-mail: kbr@sam.sdu.dk. Should you be interested in compiling a special issue for the IQ as guest editor(s), I will also be delighted to hear from you.


Karsten Boye Rasmussen
June 2016
Editor

Looking Back/Moving Forward - Reflections on the First Ten Years of Open Repositories

The Open Repositories conference celebrated its first decade with four full days of exciting workshops, keynotes, sessions, 24/7 talks, and development track and repository interest group sessions in Indianapolis, USA. All the fun took place in the second week of June. The OR2015 conference was themed "Looking Back/Moving Forward: Open Repositories at the Crossroads", and it brought over 400 repository developers and managers, librarians and library IT professionals, service providers, and other experts to hot and humid Indy.

As with IDCC earlier this year, IASSIST was officially a supporter of OR2015. In my opinion, it was a worthy investment given the topics covered, the depth and quality of the presentations, and the attendee profile. Plus I got to do what I love: talk about IASSIST and invite people to attend or present at our own conference.

While there may not be striking overlap between the IASSIST and OR conferences, I think there are sound reasons to keep building linkages between the two. IASSISTers could certainly provide beneficial insight on various RDM questions, and also, for instance, on researchers' needs, scholarly communication, reusing repository content, research data resources and access, and data archiving and preservation challenges. We could take advantage of the passion and dedication the repository community shows in making repositories and their building blocks perfect. It's quite clear that there is a lot more to be achieved when repository developers and users meet and address problems and opportunities with creativity and commitment.


While IASSIST 2015 had a plenary speaker from Facebook, OR had keynote speakers from the Mozilla Science Lab and Google Scholar. Mozilla's Kaitlin Thaney delivered a very interesting opening keynote via Skype (that is what you resort to when thunderstorms prevent your keynote speaker from arriving!) on how to leverage the power of the web for research. A distributed and collaborative approach to research, public sharing and transparency, new models of discovery, the freedom to innovate and prototype, and peer-to-peer professional development were among the powers of web-enabled open science.
 
Anurag Acharya from Google gave a stimulating talk on pitfalls and best practices in indexing repositories. His points were primarily aimed at repository managers fine-tuning their repository platforms to be as easily harvestable as possible. However, many of his remarks are worth taking into account when building data portals or data-rich web services. On the other hand, it can be asked whether it is our job (as repository or data managers) to make things easy for Google Scholar, or whether we have other obligations that put our needs and our users first. Often, though, the two do not conflict. More notable from my point of view was Acharya's statement that Google Scholar does not index research outputs other than articles (data, appendixes, abstracts, code…) from repositories. But should it not? His answer was that it would be lovely, but it cannot be done efficiently because these resources are not comprehensive enough, and it would not be possible, for example, to properly and accurately link users to actual datasets from the index. I'd like to think this is something for the IASSIST community to contemplate.

Open Researcher and Contributor ID (ORCID) had a very strong presence at OR2015. ORCID provides an open persistent identifier that distinguishes a researcher from every other researcher, and through its APIs that ID can be connected to organisational and inter-organisational research information systems, helping to associate researchers with their research activities. In addition to a workshop on the ORCID APIs, there were many presentations about ORCID integrations. It seems that ORCID is getting close to a critical mass of users and members, allowing it to take big leaps in developing its services. However, it still remains to be seen how widely it will be adopted. For research data archiving purposes, a persistent identifier offers obvious advantages, as researchers are known to move from one organisation to another, work cross-nationally, and collaborate across disciplines.
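For a flavour of what an ORCID integration involves, here is a minimal sketch against ORCID's public API (the version path and response fields are assumptions to check against ORCID's current documentation; writes go through the member API with OAuth instead):

```python
# Sketch: fetching a researcher's public ORCID record as JSON. Version path
# and field names should be verified against current ORCID documentation.
import requests

orcid_id = "0000-0002-1825-0097"  # ORCID's well-known test identifier
resp = requests.get(
    f"https://pub.orcid.org/v3.0/{orcid_id}/record",
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()
record = resp.json()

# The personal name lives under person/name in the record structure.
name = record["person"]["name"]
print(name["given-names"]["value"], name["family-name"]["value"])
```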

Many presentations at least partly addressed familiar but ever-challenging research data service questions: handling deposits, providing data services for the researcher community, overcoming ethical, legal, or institutional barriers, and providing and managing a trustworthy digital service with somewhat limited resources. Check, for example, Andrew Gordon's terrific presentation on Databrary, a research-centered repository for video data. Metadata harmonisation, ontologies, an emphasis on high-quality metadata, and ensuring the repurposing of metadata were common topics as well, alongside a focus on complying with standards, both metadata and technical.

I see a good opportunity and considerable common ground for shared learning here, for example for DDI and other metadata experts to work with repository developers, and for IASSIST's data librarians and archivists to provide training and take part in projects that concentrate on repository development in libraries or archives.

Keynotes and a number of other sessions were live-streamed and recorded for later viewing. Videos of the keynotes and some other talks, and most presentation slides, are already available; the rest of the videos will follow in the coming weeks.

A decade against decay: the 10th International Digital Curation Conference

The International Digital Curation Conference (IDCC) is now ten years old. On the evidence of its most recent conference, it is in rude health and growing fast.

IDCC marks the first time IASSIST decided to formally support another organisation's conference. I think it was a wise investment given the quality of the plenaries, presentations, posters, and discussions.

The DCC has already made available a number of blog posts covering the substance of the sessions, including an excellent summary by IASSIST web editor Robin Rice. Presentations and posters are already available, and video from the plenary sessions will soon be online.

Instead, I will use this opportunity to pick up on hanging issues and suggestions for future conferences.

One was the apportionment of responsibility. Ultimately, researchers are responsible for the management of their data, but they can only do so if supporting infrastructure is in place to help them. So who is responsible for providing that: funders or institutions? This theme emerged in the context of the UK's Engineering and Physical Sciences Research Council, which will soon enforce expectations identifying the institution as responsible for supporting good Research Data Management.

Related to that was a discussion on the role of libraries in this decade. Are they relevant? Can they change to meet new challenges? Having started out as a researcher, become a data archivist, and now working as a librarian, I wouldn't be here if libraries weren't meeting these challenges. There's a "hush" of IASSIST members also ready to take issue with the suggestion that libraries aren't relevant or engaged with data; in fact, they did so at our last conference.

Melissa Terras (UCL) did a fantastic job presenting [PDF] work in the digital humanities that is innovative in not only preserving but rescuing objects – and all done on small-change research budgets. I hope a future IDCC finds space for a social sciences person to present on the issues we face in preservation and reuse. Clifford Lynch (CNI) touched on the problems of data reuse and human subjects, which remained one of the few glancing references to a significant problem, and one IASSIST members are addressing. Indeed, thanks must go to a former president of this association, Peter Burnhill (Edinburgh), who mentioned IASSIST and how it relates to the IDCC audience on more than one occasion.

Finally, if you were stimulated by IDCC's talk of data, reuse, and preservation, then don't forget our own conference in Minneapolis later this year.

Feedback on Data Storage

I posted the following question to the listserv:

"I'm in the early days of exploring what I and our library can do for our faculty and grad students. In my case I'm particularity interested in the social sciences.

It seems there are three main choices:

1. ICPSR(or other domain-specific site)

2. Dataverse with my own school's branding

3. Local, campus-funded storage through an Institutional Repository or something else that can handle larger amounts of data.


Our university is kind of in the vast middle as far as flagship state universities go in budgets and research activity.

What are the pros and cons of these archiving choices? What would best suit a non-wealthy institution? Which requires more training and expertise?"

From the very informative feedback I received from my IASSIST colleagues, I concluded that it is best to stay open to all kinds of possibilities. I was probably naïve in my initial hope that there would be one solution on which I could train my energies. However, that is not the case. Different solutions may be best depending on several factors, including the data in question, local staff skills, and library budgets.

There were many voices that supported the domain-specific repository idea represented by ICPSR. Researchers can get exposure to colleagues in their areas of expertise, and there is no need to reinvent the wheel if the expertise and longevity that ICPSR can provide are out there. In addition, ICPSR is launching "openICPSR," a new open-access repository for researchers and institutions that need to comply with federal requirements to make data publicly available. Data deposited in openICPSR will be discoverable in the ICPSR catalog but not restricted to ICPSR members: anyone will be able to download them. ICPSR staff will edit the metadata appearing in the catalog, and depositors can commission full curation of their collections (e.g., full codebooks, variable-level metadata for searching) by ICPSR staff. In addition to accepting individual projects, openICPSR will also offer packages to meet institutional needs. They are planning at least two options: 1) a multiple-deposit option whereby an entity can purchase several project deposits (fees will be discounted for member institutions), and 2) a branded repository page that will list datasets under an institution's own logo and color scheme.

Many others outlined the Dataverse picture. If you can get a good match between what your campus needs and what Dataverse can provide, it can be a crucial part of an overall solution. Dataverse has ease of entry through a self-service deposit structure, not to mention that the price is right (free)! Many institutions are starting with pilot projects in order to assess the labor impact on the library. A few librarians noted that issues of long-term storage, sustainability, and metadata uniformity can arise with Dataverse.

Some respondents hastened to add that Dataverse will be offering improved services. Dataverse is extending support for additional metadata standards in various scientific domains, including biomedical ontologies and astronomy, and is updating to DDI Codebook 2.5 (with support for DDI Lifecycle to come). It is also extending search, data exploration, and analysis for tabular datasets (with histograms, cross-tabs, enhanced descriptive stats, and model selection), as well as the Data/Metadata API, the data deposit API, and rich ingest for additional data types.
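As a small illustration of the API side, the sketch below queries a Dataverse installation's native search API for public datasets (the server URL is a placeholder and the parameters reflect the API guide as I recall it, so treat them as assumptions):

```python
# Sketch: searching a Dataverse installation for datasets via its native
# search API. Public content needs no API token; deposits would require one.
import requests

SERVER = "https://demo.dataverse.org"  # placeholder installation

resp = requests.get(
    f"{SERVER}/api/search",
    params={"q": "census", "type": "dataset", "per_page": 5},
    timeout=30,
)
resp.raise_for_status()

for item in resp.json()["data"]["items"]:
    # Each hit carries a human-readable name and a global identifier (DOI).
    print(item.get("name"), "|", item.get("global_id"))
```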

Local solutions, including formal Institutional Repositories (IRs) and other storage services offered through a variety of campus resources, did not emerge as a popular topic in the posts I received. One librarian commented on the personnel and money that may be needed for IRs to deliver strong service for larger deposits.

Steve McGinty

Social Sciences Librarian

University of Massachusetts - Amherst

White Paper Urges New Approaches to Assure Access to Scientific Data

Press release posted on behalf of Mark Thompson-Kolar, ICPSR.

12/12/2013 (Ann Arbor, MI): More than two dozen data repositories serving the social, natural, and physical sciences today released a white paper recommending new approaches to funding the sharing and preservation of scientific data. The document emphasizes the need for sustainable funding of domain repositories: data archives with ties to specific scientific communities.

“Sustaining Domain Repositories for Digital Data: A White Paper” is an outcome of a meeting convened June 24-25, 2013, in Ann Arbor. The meeting, organized by the Inter-university Consortium for Political and Social Research (ICPSR) and supported by the Alfred P. Sloan Foundation, was attended by representatives of 22 data repositories from a wide spectrum of scientific disciplines.

Domain repositories accelerate intellectual discovery by facilitating data reuse and reproducibility. They leverage in-depth subject knowledge as well as expertise in data curation to make data accessible and meaningful to specific scientific communities. However, domain repositories face an uncertain financial future in the United States, as funding remains unpredictable and inadequate. Unlike its European competitors, which support data archiving as necessary scientific infrastructure, the US does not assure the long-term viability of data archives.

“This white paper aims to start a conversation with funding agencies about how secure and sustainable funding can be provided for domain repositories,” said ICPSR Director George Alter. “We’re suggesting ways that modifications in US funding agencies’ policies can help domain repositories to achieve their mission.”

Five recommendations are offered to encourage data stewardship and support sustainable repositories:

  • Commit to sustaining institutions that assure the long-term preservation and viability of research data
  • Promote cooperation among funding agencies, universities, domain repositories, journals, and other stakeholders
  • Support the human and organizational infrastructure for data stewardship as well as the hardware
  • Establish review criteria appropriate for data repositories
  • Incentivize Principal Investigators (PIs) to archive data

While a single funding model may not fit all disciplines, new approaches are urgently needed, the paper says.

“What’s really remarkable about this effort—the meeting and the resulting white paper—has been the consensus across disciplines from astronomy to archaeology to proteomics,” Alter said. “More than two dozen domain repositories from so many disciplines are saying the same thing: Data sharing can produce more science, but data stewards must know the needs of their scientific communities.”

This white paper is a must-read for anyone who wants to understand scientific domain repositories and their critical role in the advancement of science. It can be downloaded at http://datacommunity.icpsr.umich.edu


The Inter-university Consortium for Political and Social Research (ICPSR), based in Ann Arbor, MI, is the largest archive of behavioral and social science research data in the world. It advances research by acquiring, curating, preserving, and distributing original research data. www.icpsr.umich.edu

The Alfred P. Sloan Foundation is a philanthropic, not-for-profit grantmaking institution based in New York City. Established in 1934, the Foundation makes grants in support of original research and education in science, technology, engineering, mathematics, and economic performance. www.sloan.org

###

re3data.org and OpenAIRE sign MoU during Open Access Week; new re3data.org features

Last month, OpenAIRE (Open Access Infrastructure for Research in Europe) and re3data.org signed a Memorandum of Understanding to “work jointly to facilitate research data registration, discovery, access and re-use” in support of open science. OpenAIRE is an open access infrastructure that works to track and measure research output (originally designed to monitor EU funding activities). re3data.org is an online registry of research data repositories.

re3data.org and OpenAIRE will exchange metadata in order for OpenAIRE to “integrate data repositories indexed in the re3data.org registry and in turn return information about usage statistics for datasets and inferred links between data and publications.”
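This kind of machine-to-machine exchange is possible because re3data.org exposes its registry through a public API. A minimal sketch follows (the endpoint path and XML element names are my assumptions about the v1 API; verify against re3data's documentation):

```python
# Sketch: listing entries from the re3data.org registry API, which returns
# XML. Endpoint and element names are assumptions about the public v1 API.
import requests
import xml.etree.ElementTree as ET

resp = requests.get("https://www.re3data.org/api/v1/repositories", timeout=60)
resp.raise_for_status()

root = ET.fromstring(resp.content)
# Print the first ten repositories; each entry carries an id and a name.
for repo in root.findall(".//repository")[:10]:
    print(repo.findtext("id"), repo.findtext("name"))
```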

For more information, see the OpenAIRE press release on the MoU.

In addition, re3data.org is now mentioned in the deposition policy of Nature's Scientific Data, which encourages the registration of repositories with the service, and re3data.org has also begun a collaboration with BioSharing.

re3data.org has made other recent enhancements as well. Users can now browse re3data.org repositories by:

  1. subject
  2. content type
  3. country

Furthermore, a redesigned repository record now groups information into the categories of general, institutions, terms, and standards. Many more repositories have been added in the past few months, so check it out!

The Role of Data Repositories in Reproducible Research

Cross posted from ISPS Lux et Data Blog

These questions were on my mind as I was preparing to present a poster at the Open Repositories 2013 conference in Charlottetown, PEI earlier this month. The annual conference brings the digital repositories community together with stakeholders, such as researchers, librarians, publishers and others to address issues pertaining to “the entire lifecycle of information.” The conference theme this year, “Use, Reuse, Reproduce,” could not have been more relevant to the ISPS Data Archive. Two plenary sessions bookended the conference, both discussing the credibility crisis in science. In the opening session, Victoria Stodden set the stage with her talk about the central role of algorithms and code in the reproducibility and credibility of science. In the closing session, Jean-Claude Guédon made a compelling case that open repositories are vital to restoring quality in science.

My poster, titled 'The Repository as Data (Re)User: Hand Curating for Replication,' illustrated the various data quality checks we undertake at the ISPS Data Archive. The ISPS Data Archive is a small archive, for a small and specialized community of researchers, containing mostly small data. We made a key decision early on to make it a "replication archive," by which we mean a repository that holds data and code for the purpose of replicating and verifying published results.

The poster presents the ISPS Data Archive's answer to the questions of who is responsible for the quality of data and what that means: we think that repositories do have a responsibility to examine the data and code we receive for deposit before making the files public, and that this data review involves verifying and replicating the original research outputs. In practice, this means running the code against the data to validate published results. These steps in effect expand the role of the repository and integrate it more closely into the research process, with implications for resources, expertise, and relationships, which I will explain here.

First, a word about what data repositories usually do, the special obligations reproducibility imposes, and who is fulfilling them now. This ties in with a discussion of data quality, data review, and the role of repositories.

Data Curation and Data Quality

A well-curated data repository is more than a place to put data. The Digital Curation Centre (DCC) explains that data curation means ensuring data are accessible to designated users for first-time use and reuse. This involves a set of curatorial practices – maintaining, preserving and adding value to digital research data throughout its lifecycle – which reduces threats to the long-term research value of the data, minimizes the risk of its obsolescence, and enables sharing and further research. An example of a standard-setting curation process is that of the Inter-university Consortium for Political and Social Research (ICPSR). This process involves organizing, describing, cleaning, enhancing, and preserving data for public use, and includes format conversions, reviewing the data for confidentiality issues, creating documentation and metadata records, and assigning digital object identifiers. Similar data curation activities take place at many data repositories and archives.

These activities are understood as essential for ensuring and enhancing data quality. Dryad, for example, states that its curatorial team "works to enforce quality control on existing content." But there are many ways to assess the quality of data. One criterion is verity: whether the data reflect actual facts, responses, observations or events. This is often assessed by the existence and completeness of metadata. The UK's Economic and Social Research Council (ESRC), for example, requests documentation of "the calibration of instruments, the collection of duplicate samples, data entry methods, data entry validation techniques, methods of transcription." Another way to assess data quality is by its degree of openness. Shannon Bohle recently listed no fewer than eight different standards for assessing the quality of open data on this dimension. Others argue that data quality consists of a mix of technical and content criteria that all need to be taken into account. Wang & Strong's 1996 article claims that "high-quality data should be intrinsically good, contextually appropriate for the task, clearly represented, and accessible to the data consumer." More recently, Kevin Ashley observed that quality standards may be at odds with each other: for example, some users may prize the completeness of the data while others prize their timeliness. These standards can go a long way toward ensuring that data are accurate, complete, and timely, and that they are delivered in a way that maximizes their use and reuse.

Yet these procedures are "rather formal and do not guarantee the validity of the content of the dataset" (Doorn et al.). Leaving aside the question of whether they are always adhered to, these quality standards are insufficient when viewed through the lens of "really reproducible research." Reproducible science requires that data and code be made available alongside the results, to allow regeneration of the published results. For a replication archive such as the ISPS Data Archive, the reproducibility standard is imperative.

Data Review

The imperative to provide data and code, however, only achieves the potential for verification of published results; it remains unclear how actual replication occurs. That is where a comprehensive definition of the concept of "data review" can be useful: at ISPS, we understand data review to mean taking that extra step – examining the data and code received for deposit, and verifying and replicating the original research outputs.

In a recent talk, Christine Borgman pointed out that most repositories and archives follow the letter, not the spirit, of the law. They take steps to share data, but they do not review the data. “Who certifies the data? Gives it some sort of imprimatur?” she asks. This theme resonated at Open Repositories. Stodden asked: “Who, if anyone, checks replication pre-publication?” Chuck Humphrey lamented the lack of an adequate data curation toolkit and best practices regarding the extent of data processing prior to ingest. And Guédon argued that repositories have a key role to play in bringing quality to the foreground in the management of science.

Stodden's call for the provision of data and code underlying publications echoes Gary King's 1995 definition of the "replication standard" as the provision of "sufficient information… with which to understand, evaluate, and build upon a prior work if a third party could replicate the results without any additional information from the author." Both call on the scientific community to take up replication for the good of science as a matter of course in their scientific work; however, both are vague as to how this can be accomplished. Stodden suggested at Open Repositories that this activity is community-dependent, often done by students or by other researchers continuing a project, and that community norms can be adjusted by rewarding high-integrity, verifiable research. King, on the other hand, argues that "the replication standard does not actually require anyone to replicate the results of an article or book. It only requires sufficient information to be provided – in the article or book or in some other publicly accessible form – so that the results could in principle be replicated" (emphasis added). Yet, if we care about data quality, reproducibility, and credibility, it seems to me that this is exactly the kind of review in which we should be engaging.

A quick survey of various stakeholders in the research data lifecycle reveals that data review of this sort is not widely practiced:

  • Researchers, on the whole, do not do replication tests as part of their own work, or even as part of the peer review process. In the future, there may be incentives for researchers to do so, and post-publication crowd-sourced peer review in the mold of Wikipedia, as promoted by Edward Curry, may prove to be a successful model.
  • Academic institutions, and their libraries, are increasingly involved in the data management process, but are not involved in replication as a matter of course (note some calls for libraries to take a more active role in this regard).
  • Large or general data repositories like Dryad, FigShare, Dataverse, and ICPSR provide useful guidelines and support varying degrees of file inspection, and make it significantly easier to include materials alongside the data, but they do not replicate analyses for the purpose of validating published results. Efforts to encourage compliance with (some of) these standards (e.g., the Data Seal of Approval) typically regard researchers as responsible for data quality, and generally leave repositories to self-regulate.
  • Innovative services, such as RunMyCode, offer a dissemination platform for the necessary pieces required to submit the research to scrutiny by fellow scientists, allowing researchers, editors, and referees to “replicate scientific results and to demonstrate their robustness.” RunMyCode is an excellent facilitator for people who wish to have their data and code validated; but it relies on crowd sourcing, and does not provide the service per se.
  • Some argue that scholarly journals should take an active role in data review, but this view is controversial. A document produced by the British Library recently recommended that "publishers should provide simple and, where appropriate, discipline-specific data review (technical and scientific) checklists as basic guidance for reviewers." In some disciplines, reviewers do check the data. The F1000 group identifies the "complexity of the relationship between the data/article peer review conducted by our journal and the varying levels of data curation conducted by different data repositories." The group provides detailed guidelines for authors on what is expected of them and ensures that everything is submitted and all checklists are completed; it is not clear, however, whether they themselves review the data to make sure it replicates results. Alan Dafoe, a political scientist at Yale, calls for better replication practices in political science. He places responsibility on authors to provide quality replication files, but also suggests that journals encourage high standards for replication files and conduct a "replication audit" that would "evaluate the replicability and robustness of a random subset of publications from the journal."

The ISPS Data Archive and Reproducible Research

This brings us to the ISPS Data Archive. As a small, on-the-ground, specialized data repository, we are dedicated to serious data review. All data and code – as well as all accompanying files – that are made public via the Archive are closely reviewed and adhere to standards of quality that include verity, openness, and replication. In practice, this means we have developed curatorial practices that include assessing whether the files underlying a published (or soon-to-be-published) article, as provided by the researchers, actually reproduce the published results.
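Mechanically, the core of such a check can be quite simple. The sketch below is illustrative only, not ISPS's actual pipeline; the directory layout, script name, and numeric tolerance are hypothetical:

```python
# Illustrative replication check: re-run deposited analysis code, then
# compare its output to the published table. All paths are hypothetical.
import subprocess
import pandas as pd

# 1. Re-run the author's analysis exactly as deposited.
subprocess.run(["python", "deposit/analysis.py"], check=True)

# 2. Load the regenerated table and the table reported in the article.
produced = pd.read_csv("deposit/output/table1.csv")
published = pd.read_csv("published/table1.csv")

# 3. Require agreement within a small numeric tolerance, since exact
#    equality is too strict across platforms and library versions.
pd.testing.assert_frame_equal(produced, published, check_exact=False, rtol=1e-4)
print("Table 1 replicates within tolerance.")
```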

Such review requires significant investment in staffing, relationships, and resources. The ISPS Data Archive staff have data management and archival skills, as well as domain and statistical expertise. We invest in relationships with researchers and learn about their research interests and methods to facilitate communication and trust. All this requires the right combination of domain, technical, and interpersonal skills, as well as more time, which translates into higher costs.

How do we justify this investment? Broadly speaking, we believe that stewardship of data in the context of “really reproducible research” dictates this type of data review. More specifically, we think this approach provides better quality, better science, and better service.

  • Better quality. By reviewing all data and code files and validating the published results, the ISPS Data Archive essentially certifies that all its research outputs are held to a high standard. Users are assured that code and data underlying publications are valid, accessible, and usable.
  • Better science. Organizing data around publications advances science because it helps root out error. "Without access to the data and computer code that underlie scientific discoveries, published findings are all but impossible to verify" (Stodden et al.). Joining the publication to the data and code combats the disaggregation of information in science associated with open access to data and to publications on the Web. In effect, the data review process is a first-order data reuse case: the use of research data for a research activity or purpose other than that for which it was intended. This makes the Archive an active partner in the scientific process, as it performs a sort of "internal validity" check on the data and analysis (i.e., do these data and this code actually produce these results?).

    It's important to note that the ISPS Data Archive is not reviewing or assessing the quality of the research itself. It is not engaged in questions such as: was this the right analysis for this research question? Are there better data? Did the researchers correctly interpret the results? We consider this aspect of data review an "external validity" check, and one the Archive staff is not in a position to make. This we leave to the scientific community and to peer review. Our focus is on verifying the results by replicating the analysis and on making the data and code usable and useful.

  • Better service. The ISPS Data Archive provides high level, boutique service to our researchers. We can think of a continuum of data curation that progresses from a basic level where data are accepted “as is” for the purpose of storage and discovery, to a higher level of curation which includes processing for preservation, improved usability, and compliance, to an even higher level of curation which also undertakes the verification of published results.

This model may not be applicable in other contexts. A larger lab, a greater volume of research, or simply more data will require greater resources and may prove this level of curation untenable. Further, the reproducibility imperative does not neatly apply to more generalized data, or to data not tied to publications. Such data would be handled somewhat differently, possibly with less labor-intensive processes. ISPS will need to consider accommodating such scenarios and the trade-offs that a more flexible approach no doubt involves.

For those of us who care about research data sharing and preservation, the recent interest in the idea of a “data review” is a very good sign. We are a long way from having all the policies, technologies, and long-term models figured out. But a conversation about reviewing the data we put in repositories is a sign of maturity in the scholarly community – a recognition that simply sharing data is necessary, but not sufficient, when held up to the standards of reproducible research.

OR2013: Open Repositories Confront Research Data

Open Repositories 2013 was hosted by the University of Prince Edward Island from July 8-12. A strong research data stream ran throughout this conference, which was attended by over 300 participants from around the globe.  To my delight, many IASSISTers were in attendance, including the current IASSIST President and four Past-Presidents!  Rarely do such sightings happen outside an IASSIST conference.

This was my first Open Repositories conference, and after the cool reception that research data received at the SPARC IR meetings in Baltimore a few years ago, I was unsure how data would be treated. I was pleasantly surprised by the enthusiastic interest of this community in research data. It helped that many IASSISTers were present, but the interest in research data extended beyond our community. This conference truly found an appropriate intersection between the communities of social science data and open repositories.

Thanks go to Robin Rice (IASSIST), Angus Whyte (DCC), and Kathleen Shearer (COAR) for organizing a workshop entitled "Institutional Repositories Dealing with Data: What a difference a 'D' makes!" Michael Witt, Courtney Matthews, and I joined these three organizers to address a range of issues that research data pose for those operating repositories. Registration for this workshop was capped at 40 because of our desire to host six discussion tables of approximately seven participants each. The workshop was fully subscribed, and Kathleen counted over 50 participants prior to the coffee break. That number clearly expresses the wider interest in research data at OR2013.

Our workshop helped set the stage for other sessions during the week.  For example, we talked about environmental drivers popularizing interest in research data, including topics around academic integrity.  Regarding this specific issue, we noted that the focus is typically directed toward specific publication-related datasets and the access needed to support the reproducibility of published research findings.  Both the opening and closing plenary speakers addressed aspects of academic integrity and the role of repositories in supporting the reproducibility of research findings.  Victoria Stodden, the opening plenary speaker, presented a compelling and articulate case for access to both the data and computer code upon which published findings are based.  She calls herself a computational scientist and defends the need to preserve computer code as well as data to facilitate the reproducibility of scientific findings.  Jean-Claude Guédon, the closing plenary speaker, bracketed this discussion on academic integrity.  He spoke about scholarly publishing and how the commercial drive toward indicators of excellence has resulted in cheating.  He likened some academics to Lance Armstrong, cheating to become number one.  He feels that quality rather than excellence is a better indicator of scientific success.

Between these two stimulating plenary speakers, there were a number of sessions during which research data were discussed. I was particularly interested in a panel of six entitled "Research Data and Repositories," especially because the speakers were from the repository community rather than the data community. They each took turns responding to questions about what their repositories do now regarding research data and what they see happening in the future. In a nutshell, their answers tended to describe the desire to make better connections between the publications in their repositories and the data underpinning the findings in those articles. They also spoke about the need to support more stages of the research lifecycle, which often involves aspects of the data lifecycle within research. There were also statements that reinforced the need for our (IASSIST's) continued interaction with the repository community. The use of readme files in the absence of standards-based metadata, and other practices where our data community has moved the best-practice yardstick well beyond, demonstrate the need for our communities to continue in dialogue.

Chuck Humphrey

In search of: Best practice for code repositories?

I was asked by a colleague about organized efforts within the economics community to develop or support repositories of code for research. Her experience was with the astrophysics world, which apparently has several, and she was wondering what could be learned from another academic community. So I asked a non-random sample of technical economists with whom I work, then expanded the question to cover all of the social sciences and posed it to the IASSIST community.

In a nutshell, the answer seems to be "nope, nothing organized across the profession" – even with the profession very broadly defined. The general consensus, for both the economics world and the more general social science community, was that there is some chaos mixed with a little schizophrenia. I was told there are instances of such repositories, but they were described to me as "isolated attempts," such as this one by Volker Wieland: http://www.macromodelbase.com/. Some folks mentioned repositories that are package- or language-based, such as R modules or SAS code from the SAS-L list or online at sascommunity.org.

Many people pointed out that more repositories are being associated with journals, so that authors can (or are required to) submit their data and code when submitting a paper for publication. Several responses touched on this issue of replication, which is the impetus for most journal requirements, including one that pointed out a "replication archive" at Yale (http://isps.yale.edu/research/data). I was also pointed to an interesting paper that questions whether such archives promote replicable research (http://www.pages.drexel.edu/~bdm25/cje.pdf), but that's a discussion for another post.

By far the most common reference I received was to the repositories associated with RePEc (Research Papers in Economics), which offers a broad range of services to the economic research community. There you'll find the IDEAS site (http://ideas.repec.org/) and the QM&RBC site, with code for Dynamic General Equilibrium models (http://dge.repec.org/), both run by the St. Louis Fed.

I also heard from support folks who had tried to build code repositories for their departments and were disappointed by the lack of enthusiasm for the project. The general consensus is that economists would love to leverage other people's code but don't want to give away their proprietary models. They should know there is no such thing as a free lunch!

I did hear that project-specific repositories were found to be useful, but I think of those as collaboration tools rather than dissemination platforms. That said, one economist did end his email to me with the following plea: "lots of authors provide code on their websites, but there is no authoritative host. Will you start one please?"

/san/
