Data Access & Open Data

Issues around access to data

The Challenge of Rescuing Data: Lessons and Thoughts

A version of this post originally appeared on the NYU Data Dispatch blog.

Data rescue efforts began in January 2017, and over the past few months many institutions hosted hack-a-thon style events to scrape data and develop strategies for preservation. The Environmental Data & Governance Initiative (EDGI) developed a data rescue toolkit, which apportioned the challenge of saving data by distinct federal agency. 

We've had a number of conversations at NYU and with other members of the library community about the implications of preserving federal data and providing access to it. The efforts, while important, call attention to a problem of organization that is very large in scope and likely cannot be solved in full by libraries.


Thus far, the divide-and-conquer model has postulated that individual institutions can "claim" a specific federal agency, do a deep dive to root around its websites, download data, and then mark the agency off a list as "preserved." The process raises many questions, for libraries and for the data refuge movement. What does it mean to "claim" a federal agency? How can one institution reasonably develop a "chain of custody" for an agency's comprehensive collection of data (and how do we define chain of custody)?

How do we avoid duplicated labor? Overlap is inevitable and isn't necessarily a bad thing, but given the scope of the challenge, it would be ideal to distribute efforts so as to benefit from the hard work of metadata remediation that all of us will inevitably do.

These questions suggest even more questions about communication. How do we know when a given institution has preserved federal data, and at what point do we feel ready as a community to acknowledge that preservation has sufficiently taken place? Further, do we expect institutions to communicate that a piece of data has been published, and if so, by what means? What does preservation mean, especially in an environment where data is changing frequently, and what is the standard for discovery? Is it sufficient for one person or institution to download a file and save it? And when an institution claims that it has “rescued” data from a government agency, what commitment does it have to keep up with data refreshes on a regular basis?

An example of an attempt to engage with these issues is Stanford University’s recent decision to preserve the Housing and Urban Development spatial datasets, since they were directly attacked by Republican lawmakers. Early in the Spring 2017 semester, Stanford downloaded all of HUD's spatial data, created metadata records for them, and loaded them into their spatial discovery environment (EarthWorks).

A HUD dataset preserved in Stanford's Spatial Data Repository and digital collections

We can see from the timestamp on their metadata record that the files were added on March 24, 2017. Stanford's collection process is very robust and implies an impressive level of curation and preservation. As colleagues, we know that by adding a file, Stanford has committed to preserving it in its institutional repository, presenting the original FGDC or ISO 19139 metadata records, and publishing its newly created records to OpenGeoMetadata, a consortium of shared geospatial metadata records. Furthermore, we know that all records are discoverable at the layer level, a granularity of description and access that many other sources, including Data.gov, do not provide.

However, if I had not had conversations with colleagues who work at Stanford, I wouldn't have realized they had preserved the files at all and likely would have tried to make records for NYU's Spatial Data Repository. Even knowing the files exist, it's difficult for me to tell that they were in fact saved as part of the Data Refuge effort. Furthermore, Stanford has made no public claim or long-term "chain of custody" agreement for the HUD data, simply because no standards for doing so currently exist.

Maybe it wouldn't be the worst thing for NYU to add these files to our repository, but it seems unnecessary, given the magnitude of federal data to be preserved. However, some redundancy is a part of the goals that Data Refuge imagines:

Data collected as part of the #DataRefuge initiative will be stored in multiple, trusted locations to help ensure continued accessibility. [...] DataRefuge acknowledges -- and in fact draws attention to -- the fact that there are no guarantees of perfectly safe information. But there are ways that we can create safe and trustworthy copies. DataRefuge is thus also a project to develop the best methods, practices, and protocols to do so.

Each institution has specific curatorial needs and responsibilities, which inform choices about how it provides access to materials in its collections, and these practices seldom align with the data management and publishing practices of those who work within federal agencies. There has to be some flexibility between community efforts to preserve data and the curation practices of individual institutions.

"That's Where the Librarians Come In"

NYU imagines a model that dovetails with the Data Refuge effort, in which individual institutions build upon their own strengths and existing infrastructure. We took as a directive some advice that Kimberly Eke at Penn circulated, including this sample protocol. We quickly began to realize that no approach is perfect, but we wanted to develop a pilot process for collecting data and bringing it into our permanent geospatial data holdings. The remainder of this post is a narrative of that experience, intended to demonstrate some of the choices we made, the assumptions we started with, and the strategies we deployed to preserve federal data. Our goal is to preserve a small subset of data in a way that benefits our users and also meets the standards of the Data Refuge movement.

We began by collecting the entirety of the publicly accessible metadata from Data.gov, using the underlying CKAN data catalog API. This provided us with approximately 150,000 metadata records, stored as individual JSON files. Anyone who has worked with Data.gov metadata knows that it's messy and inconsistent, but it is also a good starting place for developing better records. Furthermore, Data.gov serves as an effective registry or checklist (this global metadata vault could be another starting place); it's not the only source of government data, nor is it necessarily authoritative. However, it is a good point of departure: a relatively centralized list of items that exist, in a form we can work with.
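
To make that harvesting step concrete, here is a minimal sketch of how one might page through a CKAN catalog's package_search endpoint and save each record as an individual JSON file. The endpoint URL, page size, and output layout are assumptions for illustration, not a record of our actual scripts.

```python
import json
import pathlib

import requests

# Hypothetical harvest sketch: page through the Data.gov CKAN catalog API
# (package_search) and write each dataset record to its own JSON file.
API_URL = "https://catalog.data.gov/api/3/action/package_search"  # assumed endpoint
OUT_DIR = pathlib.Path("datagov_records")
OUT_DIR.mkdir(exist_ok=True)

PAGE_SIZE = 1000  # assumed; CKAN instances cap the number of rows per request
start = 0
while True:
    resp = requests.get(API_URL, params={"rows": PAGE_SIZE, "start": start}, timeout=60)
    resp.raise_for_status()
    results = resp.json()["result"]["results"]
    if not results:
        break
    for record in results:
        # Use the CKAN dataset id as the filename.
        (OUT_DIR / f"{record['id']}.json").write_text(json.dumps(record, indent=2))
    start += PAGE_SIZE
```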

Since NYU Libraries already has a robust spatial data infrastructure and has established workflows for accessioning GIS data, we began by reducing the set of Data.gov records to those which are likely to represent spatial data. We did this by searching only for files that meet the following conditions:

  • Record contains at least one download resource with a 'format' field that contains any of {'shapefile', 'geojson', 'kml', 'kmz'}
  • Record contains at least one resource with a 'url' field that contains any of {'shapefile', 'geojson', 'kml'}, or that contains 'original' followed by '.zip'

That search yielded 6,353 records that are extremely likely to contain geospatial data. We then exported that subset of records and transformed it into a CSV.
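
As a rough illustration of this filtering step, the sketch below applies the two conditions above to the harvested JSON records and writes the likely-spatial candidates to a CSV. The record fields referenced ('resources', 'format', 'url', 'organization') follow CKAN's usual structure, but the exact logic and output columns are assumptions rather than our production code.

```python
import csv
import json
import pathlib

SPATIAL_FORMATS = ("shapefile", "geojson", "kml", "kmz")

def looks_spatial(record: dict) -> bool:
    """Return True if any resource's format or URL suggests geospatial data."""
    for res in record.get("resources", []):
        fmt = (res.get("format") or "").lower()
        url = (res.get("url") or "").lower()
        if any(f in fmt for f in SPATIAL_FORMATS):
            return True
        if any(f in url for f in ("shapefile", "geojson", "kml")):
            return True
        if "original" in url and url.endswith(".zip"):
            return True
    return False

rows = []
for path in pathlib.Path("datagov_records").glob("*.json"):
    record = json.loads(path.read_text())
    if looks_spatial(record):
        rows.append({
            "id": record.get("id", ""),
            "title": record.get("title", ""),
            "organization": (record.get("organization") or {}).get("title", ""),
        })

with open("spatial_candidates.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "title", "organization"])
    writer.writeheader()
    writer.writerows(rows)
```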

The next step was to filter the records down and look for meaningful patterns. We first filtered out all records that were not from federal sources, grouped the rest by agency, and started exploring them. Ultimately, we decided to rescue data from the Department of Agriculture, Forest Service. This agency seems to be a good test case for a number of the challenges that we've identified. We isolated 136 records and organized them here (click to view spreadsheet). However, we quickly realized that a sizable chunk of the records had somehow become inactive or defunct after we had downloaded them (shaded in pink), perhaps because they had been superseded by another record. For example, this record is probably meant to represent the same data as this record. We can't know for sure, which means we immediately had to decide what to do with potential gaps. We forged ahead with the records that were "live" in Data.gov.

About Metadata Cleaning

There are some limitations to the metadata in Data.gov that required our team to make a series of subjective decisions:

  1. Not everything in Data.gov points to an actual dataset. Often, records can point to other portals or clearinghouses of data that are not represented within Data.gov. We ultimately decided to omit these records from our data rescue effort, even if they point to a webpage, API, or geoservice that does contain some kind of data.
  2. The approach to establishing order on Data.gov is inconsistent. Most crucially for us, there is not a one-to-one correlation between a record and an individual layer of geospatial data. This happens frequently on federal sites. For instance, the record for the U.S. Forest Service Aerial Fire Retardant Hydrographic Avoidance Areas: Aquatic actually contains eight distinct shapefile layers that correspond to the different regions of coverage. NYU’s collection practice dictates that each of these layers be represented by a distinct record, but in the Data.gov catalog, they are condensed into a single record. 
  3. Not all data providers publish records for data on Data.gov consistently. Many agencies point to some element of their data that exists, but when you leave the Data.gov catalog environment and go to the source URL listed in the resources section of the record, you’ll find even more data. We had to make decisions about whether or not (and how) we would include this kind of data.
  4. It's very common for a Data.gov metadata record to remain intact while the data it represents changes. The Forest Service is a good example of this, as files are frequently refreshed and maintained within the USDA Forestry geodata clearinghouse. In these last two cases, we did not make any effort to track down the other sets of data that the Data.gov metadata records gesture toward (at least not at this time).

Relatedly, we did not make attempts to provide original records for different formats of what appeared to be the same data. In the case of the Forest Service, many of the records contained both a shapefile and a geodatabase, as well as other original metadata files. Our general approach was to save the shapefile and publish it in our collection environment, then bundle up all other "data objects" associated with a discrete Data.gov record and include them in the preservation environment of our Spatial Data Repository.
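
A rough sketch of that bundling step is below: the zipped shapefile is copied to the discovery environment, while every other data object associated with the record is packed into a single archive for the preservation environment. The directory layout and the filename convention for shapefiles are hypothetical, not a description of our actual repository structure.

```python
import pathlib
import shutil
import zipfile

def bundle_record(record_dir: str, discovery_dir: str, preservation_dir: str) -> None:
    """Split one Data.gov record's files into discovery and preservation copies."""
    src = pathlib.Path(record_dir)
    discovery = pathlib.Path(discovery_dir)
    preservation = pathlib.Path(preservation_dir)
    discovery.mkdir(parents=True, exist_ok=True)
    preservation.mkdir(parents=True, exist_ok=True)

    # The zipped shapefile goes to the discovery environment as-is.
    shapefiles = list(src.glob("*_shapefile.zip"))  # naming convention is hypothetical
    for shp in shapefiles:
        shutil.copy2(shp, discovery / shp.name)

    # Everything else (geodatabase, original metadata, HTML, etc.) is bundled
    # into a single archive destined for the preservation environment.
    bundle_path = preservation / f"{src.name}_objects.zip"
    with zipfile.ZipFile(bundle_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        for item in src.iterdir():
            if item.is_file() and item not in shapefiles:
                bundle.write(item, arcname=item.name)
```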

Finally, we realized that the quality of the metadata itself varies widely. We found that it's a good starting place for creating discovery metadata, even if we agree that a Data.gov record is an arbitrary way to describe a single piece of data. However, we had to clean the Data.gov records to adhere to the GeoBlacklight standard and to our own internal cataloging practices. Some of the revisions to the metadata are small and reflect choices that we make at NYU (these are highlighted in red). For instance, the titles were changed to reflect the date-title-area convention that we already use. Other fields (like Publisher) are authority controlled and were easy to change, while others, like format and provenance, were easy to add. For those unfamiliar with the GeoBlacklight standard, refer to the project schema pages and related documentation. Many of the metadata enhancements are system requirements for items to be discovered within our Spatial Data Repository. Subjects presented more of a problem, as these are drawn from an informal tagging system on Data.gov. We used an elaborate find-and-replace process to remediate these subjects into LCSH authority headings, which connect the items we collect to our larger library discovery environment.
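
The subject remediation is essentially a crosswalk: free-form Data.gov tags are looked up in a hand-maintained mapping and replaced with controlled LCSH headings, with unmapped tags dropped or queued for review. The sketch below shows the shape of that step; the mapping entries are purely illustrative examples, not NYU's actual crosswalk.

```python
# Illustrative crosswalk from free-form Data.gov tags to LCSH headings.
# These example mappings are hypothetical.
TAG_TO_LCSH = {
    "hydrography": "Hydrography",
    "wildfire": "Wildfires",
    "forest": "Forests and forestry",
}

def remediate_subjects(datagov_tags: list[str]) -> list[str]:
    """Return de-duplicated, sorted LCSH headings for the tags we can map."""
    headings = {
        TAG_TO_LCSH[tag.lower()]
        for tag in datagov_tags
        if tag.lower() in TAG_TO_LCSH
    }
    return sorted(headings)

print(remediate_subjects(["Forest", "hydrography", "misc"]))
# -> ['Forests and forestry', 'Hydrography']
```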

The most significant changes are in the descriptions. We preserved the essence of each original Data.gov description, but we cleaned up the prose a little and added a way to trace the item we are preserving back to its original representation in Data.gov. In the aforementioned cases, in which a single Data.gov record contains more than one shapefile, we generated entirely new records and referenced them back to the original Data.gov UUID.

Future Directions: Publishing Checksums

A serious impediment to embarking on a distributed repository / data-rescue project is that libraries currently cannot represent precisely and accurately which datasets, or components of datasets, have been preserved. Libraries need to know whether data objects have been preserved and where they reside. To return to the earlier example, how is New York University to know that a particular government dataset has already been "rescued" and is being preserved (whether or not via a publicly accessible repository interface)?

Moreover, even if there is a venue for institutions to discuss which government datasets fall within their collection priorities (e.g. "New York University cares about federal forestry data, and therefore will be responsible for the stewardship of that data"), it's not clear that there is a good strategy for representing the myriad ways in which the data might exist in its "rescued" form. Perhaps the institution that elects to preserve a dataset wants to make a few curatorial decisions in order to better contextualize the data with the rest of the institution's offerings (as we did with the Forest Service data). These types of decisions are not abnormal in the context of library accessioning.

The problem comes when an institution's data processing practices, which are often idiosyncratic and filled with "local" decisions to a certain degree, start to inhibit the ability of others to identify a copy of a dataset as a copy. There is a potential tension between preservation (keeping the original file structure, naming conventions, and even the level of dissemination of government data products) and discovery, where libraries often make decisions about the most useful way for users to find relevant data, decisions that conflict with those exhibited in the source files.

For the purposes of mitigating the problem sketched above, we propose a data store that can be drawn upon by all members of the library / data-rescue community, whereby arbitrary or locally specific mappings and organizational decisions can be related back to the original checksums of individual, atomic files. File checksums would be unique identifiers in such a datastore, and given a checksum, the service would display "claims" about the institutions that hold the corresponding file and the context in which that file is accessible.
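
A minimal sketch of what such a claims store might look like follows. The record fields and the in-memory dictionary are assumptions meant only to show the shape of the idea: a checksum keys one or more institutional claims about where a preserved copy lives.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    institution: str   # e.g., "New York University"
    package_id: str    # the local (possibly restructured) package that holds the file
    access_url: str    # where the preserved copy can be found, if publicly accessible

# Checksum -> list of claims. In practice this would be a shared service, not a dict.
claims_store: dict[str, list[Claim]] = {}

def register_claim(checksum: str, claim: Claim) -> None:
    """Record that an institution holds a file with this checksum."""
    claims_store.setdefault(checksum, []).append(claim)

def who_holds(checksum: str) -> list[Claim]:
    """Given an atomic file's checksum, return every claim registered for it."""
    return claims_store.get(checksum, [])
```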

Consider this as an example:

  • New York University, as part of an intentional data rescue effort, decides to focus on collecting and preserving data from the U.S. Forest Service.
  • The documents and data from Forest Service are accessible through many venues:
    • They (or some subset) are linked to from a Data.gov record
    • They (or some subset) are linked to directly from the FSGeodata Clearinghouse
    • They are available directly from a geoservices or FTP endpoint maintained by the Forest Service (such as here).
  • NYU wants a way to grab all of the documents from the Forest Service that it is aware of and make those documents available in an online repository. The question is: if NYU has made organizational and curatorial decisions about the presentation of the rescued documents, how can it represent (to others) that the files in the repository are indeed preserved copies of other datasets? If, for instance, Purdue University comes along and wants to verify that everything on the Forest Service's site is preserved somewhere, it now becomes more difficult to do so, particularly since those documents never possessed a canonical or authoritative ID in the first place, and could even have been downloaded originally from various source URLs.

Imagine instead that as NYU accessions documents, restructuring them and adding metadata, it not only creates checksum manifests (similar to, if not identical to, the ones created by default by BagIt) but also deposits those manifests to a centralized data store, in such a form that the data store could relate essential information:

The file with checksum 8a53c3c191cd27e3472b3e717e3c2d7d979084b74ace0d1e86042b11b56f2797 appears as a component of the document instituton_a_9876... held by New York University.

Assuming all checksums are computed at the lowest possible level on files rescued from federal agencies (i.e., always unzip archives, or otherwise get to an atomic file, before computing a checksum), such a service could use archival manifest data as a way to signal to other institutions whether a file has been preserved, regardless of whether it exists as a smaller component of a different intellectual entity, and it could even communicate additional data about where to find these preserved copies. In the example of the dataset mentioned above, the original Data.gov record represents 8 distinct resources, including a shapefile, a geodatabase, an XML metadata document, an HTML file that links to an API, and more. For the sake of preservation, we could package all of these items, generate checksums for each, and then take the further step of contributing our manifest to this hypothetical datastore. Then, as other institutions look to save other data objects, they could search against this datastore and find not merely checksums of items at the package level, but at the package-component level, allowing them to evaluate which portion or percentage of the data has been preserved.
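
The sketch below shows the kind of manifest generation this would require: expand any zip archives so checksums are computed over atomic files, then hash each file with SHA-256 (the 64-character digest in the example above is consistent with SHA-256). Paths and naming are illustrative assumptions.

```python
import hashlib
import pathlib
import zipfile

def sha256_of(path: pathlib.Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def manifest_for_package(package_dir: str) -> dict[str, str]:
    """Unzip archives in a rescued package, then checksum every atomic file."""
    root = pathlib.Path(package_dir)
    # Get down to atomic files: expand each zip into a sibling directory first.
    for zpath in list(root.rglob("*.zip")):
        with zipfile.ZipFile(zpath) as z:
            z.extractall(zpath.with_suffix(""))
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix != ".zip"
    }
```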

A system such as the one sketched above could efficiently communicate preservation priorities to a community of practice, and could even serve a library's more general collection-development priorities. Other work in this field, particularly that regarding IPFS, could tie in nicely. Unlike IPFS, however, this approach would provide a way to identify content that exists within file archives, and it would not necessitate any new infrastructure for hosting material. All it would require is for an institution to contribute checksum manifests and a small amount of accompanying metadata to a central datastore.

Principles

Even though our rescue of the Forest Service data is still in process, we have learned a lot about the challenges associated with this project. We’re very interested in learning about how other institutions are handling the process of rescuing federal data and look forward to more discussions at the event in Washington D.C. on May 8.

IASSIST Quarterly (IQ) volume 40-2 is now on the website: Revolution in the air

Welcome to the second issue of Volume 40 of the IASSIST Quarterly (IQ 40:2, 2016). We present three papers in this issue.

http://iassistdata.org/iq/issue/40/2

First, there are two papers on the Data Documentation Initiative that have their own special introduction. I want to express my respect and gratitude to Joachim Wackerow (GESIS - Leibniz Institute for the Social Sciences). Joachim (Achim) and Mary Vardigan (University of Michigan) have several times and for many years communicated to and advised the readers of the IASSIST Quarterly on the continuing development of the DDI. The metadata of data is central for the use and reuse of data, and we have come a long way through the efforts of many people.    

The IASSIST 2016 conference in Bergen was a great success - I am told. I was not able to attend but heard that the conference again was 'the best ever'. I was also told that among the many interesting talks and inputs at the conference, Matthew Woollard's keynote speech on 'Data Revolution' was high on the list. Good to have well-informed informers! Matthew Woollard is Director of the UK Data Archive at the University of Essex. Here in the IASSIST Quarterly we bring you a transcript of his talk. Woollard starts his talk on the data revolution with the possibility of bringing users access to data, rather than bringing data to users. The data is in the 'cloud' - in the air - 'Revolution in the air' to quote a Nobel laureate. We are not yet in the post-revolutionary phase and many issues still need to be addressed. Woollard argues that several data skills are in demand, like an understanding of data management and of the many ethical issues. Although he is not enthusiastic about the term 'Big Data', Woollard naturally addresses the concept, as these days we cannot talk about data - and surely not about a data revolution - without talking about Big Data. I fully support his view that we should proceed with caution, so that we are not simply replacing surveys where we 'ask more from fewer' with big data that give us 'less from more'. The revolution gives us new possibilities, and we will see more complex forms of research that will challenge data skills and demand solutions at data service institutions.

Papers for the IASSIST Quarterly are always very welcome. We welcome input from IASSIST conferences or other conferences and workshops, from local presentations or papers especially written for the IQ. When you are preparing a presentation, give a thought to turning your one-time presentation into a lasting contribution. We permit authors 'deep links' into the IQ as well as deposition of the paper in your local repository. Chairing a conference session with the purpose of aggregating and integrating papers for a special issue IQ is also much appreciated as the information reaches many more people than the session participants, and will be readily available on the IASSIST website at http://www.iassistdata.org

Authors are very welcome to take a look at the instructions and layout:

http://iassistdata.org/iq/instructions-authors

Authors can also contact me via e-mail: kbr@sam.sdu.dk. Should you be interested in compiling a special issue for the IQ as guest editor(s) I will also be delighted to hear from you.

Karsten Boye Rasmussen   
Editor, IASSIST Quarterly

IASSIST's Statement in Response to President’s Executive Order on Visas and Immigration


February 13, 2017

Statement of the International Association for Social Science Information Services and Technology (IASSIST at http://iassistdata.org) in response to President Trump's January 27 Executive Order on Visas and Immigration, titled "PROTECTING THE NATION FROM FOREIGN TERRORIST ENTRY INTO THE UNITED STATES".

The recent executive order on visas and immigration issued on January 27th by US President Trump is of grave concern to IASSIST as an organization. IASSIST, the International Association for Social Science Information Services and Technology, is an international organization of professionals working in and with information technology, libraries, data services and research & higher education to support open science, advocate for responsible data management and use, build a broader community surrounding research data, and encourage the development of data professionals. Our membership is international, and we greatly value the ability to travel and meet to share knowledge at locations around the world. Our international fellows program and other initiatives are specifically designed to encourage participation from underrepresented regions, including the Muslim-majority countries targeted by the executive order.

While recognizing the authority of the United States over its borders, there are several aspects of this order that are troubling, viz.:

  1. Its sudden and chaotic implementation has led to severe uncertainty over whether rules and practices for entering the United States will be subject to rapid and arbitrary change.
  2. It has led to the detention of lawful permanent residents of the United States, the revocation of visas previously granted under proper vetting procedures, the perception of potential discrimination on the basis of religion, and the humanitarian crisis caused by ceasing to accept refugees.
  3. It introduces several restrictive elements into the domain of visas and immigration, such as the statement that those entering the US, including temporary visitors, must "support the Constitution".

For these reasons, the order generates a hostile climate for the open, collaborative scientific work of our organization, both for non-US persons seeking to work and collaborate with Americans, and for Americans traveling and working outside of the US to collaborate who may face retributive actions from other states. Our membership has legitimate concerns about whether travel to the US is possible under such conditions. The order also may have long-term repercussions that damage the reputation of the US as a location that is open to visitors and immigrants, supporting the open exchange of ideas, and protected under the rule of law from arbitrary changes impacting human freedom. In response, IASSIST will continue to speak out in favor of our organization's goals, and against such threats to international collaboration in research and data sharing.

Our May 2017 annual conference will be held in Lawrence, Kansas. Arrangements were begun long before the Executive Order on Visas and Immigration, and it is impossible to change the venue at this date. IASSIST stands in solidarity with its members and encourages them to attend the conference and participate in the international exchange of ideas that is the purpose of our association. We hope that no member will be denied entry into the US due to the administration's recent actions. IASSIST will assist its membership with visa issues and other concerns emanating from this order. We also reaffirm that we are committed to an environment free from discrimination, harassment, and retaliation, at the annual conference and all IASSIST activities.

 Tuomas J. Alaterä, President
 Jen Green, Vice-President
 Ryan Womack, Secretary
 Thomas Lindsay, Treasurer

International Association for Social Science Information Services and Technology (IASSIST)

IQ 40:1 Now Available!

Our World and all the Local Worlds

Welcome to the first issue of Volume 40 of the IASSIST Quarterly (IQ 40:1, 2016). We present four papers in this issue. The first paper presents data from our very own world, extracted from papers published in the IQ through four decades. What is published in the IQ is often limited in geographical scope, and in this issue the other three papers present investigations and project research carried out at New York University, Purdue University, and the Federal Reserve System. However, the subject scope of the papers and the methods employed bring great diversity. And although the papers are local in origin, they all have a strong focus on generalization in order to spread the information and experience.


We proudly present the paper that received the 'best paper award' at the IASSIST conference 2015. Great thanks are expressed to all the reviewers who took part in the evaluation! In the paper 'Social Science Data Archives: A Historical Social Network Analysis' the authors Kristin R. Eschenfelder (University of Wisconsin-Madison), Morgaine Gilchrist Scott, Kalpana Shankar, and Greg Downey report on inter-organizational influence and collaboration among social science data archives, using data from articles published in the IASSIST Quarterly from 1976 to 2014. The paper demonstrates social network analysis (SNA) using a web of 'nodes' (people/authors/institutions) and 'links' (relationships between nodes). Several types of relationships are identified: influencing, collaborating, funding, and international. The dynamics are shown in detail by employing five-year sections. I noticed that from a reluctant start the number of relationships has grown significantly, and archives have continuously grown better at bringing in 'influence' from other 'nodes'. The paper contributes to the history of social science data archives and the shaping of a research discipline.


The paper 'Understanding Academic Patrons' Data Needs through Virtual Reference Transcripts: Preliminary Findings from New York University Libraries' is authored by Margaret Smith and Jill Conte, who are both librarians at New York University, and Samantha Guss, a librarian at the University of Richmond who worked at New York University from 2009 to 2014. The goal of their paper is 'to contribute to the growing body of knowledge about how information needs are conceptualized and articulated, and how this knowledge can be used to improve data reference in an academic library setting'. This is carried out by analysis of chat transcripts of requests for census data at NYU. There is a high demand for the virtual services of the NYU Libraries, with as many as 15,000 chat transactions annually. There has not been much qualitative research into users' data needs, but here the authors exemplify the iterative nature of grounded theory, with data collection and analysis processes inextricably entwined, using a range of software tools such as FileLocator Pro, TextCrawler, and Dedoose. Three years of chat reference transcripts were filtered down to 147 transcripts related to United States and international census data. This unique data provides several insights, shown in the paper. However, the authors are also aware of the limitations of the method, as it did not capture whether the patron or librarian considered the interaction successful. The conclusion is that there is a need for additional librarian training and improved research guides.


The third paper is also from a university. Amy Barton, Paul J. Bracke, and Ann Marie Clark, all from Purdue University, collaborated on the paper 'Digitization, Data Curation, and Human Rights Documents: Case Study of a Library Researcher-Practitioner Collaboration'. The project concerns the digitization of Urgent Action Bulletins of Amnesty International from 1974 to 2007. The political science research centered on changes in transnational human rights advocacy and legal instrumentation, while the Libraries' research related to data management, metadata, the data lifecycle, etcetera. The specific research collaboration model developed was also generalized for future practitioner-librarian collaboration projects. The project is part of a recent tendency for academic libraries to deepen engagement and combine activities between libraries, users, and institutions. The project attempts to integrate two different lifecycle models, thus serving both research and curatorial goals, where the central question is: 'can digitization processes be designed in a manner that feeds directly into analytical workflows of social science researchers, while still meeting the needs of the archive or library concerned with long-term stewardship of the digitized content?'. The project builds on data from the Urgent Action Bulletins produced by Amnesty International as an indication of how human rights concerns changed over time, and of the threats in different countries at different periods, while combining library standards for digitization and digital collections with researcher-driven metadata and coding strategies. Data creation started with scanning and the creation of optical character recognition (OCR) full-text PDFs for text recognition and modeling in NVivo software. The project did succeed in developing shared standards. However, a fundamental challenge was experienced in the grant-driven timelines for both library and researcher. It seems to me that the expectation of parallel work was the challenge to the project. Things take time.


In the fourth paper we enter the case of the Federal Reserve System. San Cannon and Deng Pan, working at the Federal Reserve Banks of Kansas City and Chicago, created a pilot for infrastructure and workflow support to make the publication of research data a regular part of the research lifecycle. This is reported in the paper 'First Forays into Research Data Dissemination: A Tale from the Kansas City Fed'. More than 750 researchers across the system produce about 1,000 journal articles, working papers, etcetera each year. The need for data to support the research has been recognized, and the institution is setting up a repository and defining a workflow to support data preservation and future dissemination. In early 2015 the internal Center for the Advancement of Research and Data in Economics (CADRE) was established with a mission to support, enhance, and advance data- or computationally intensive research, and preservation and dissemination were identified as important support functions for CADRE. The paper presents details of and questions in the design, such as types of collections and the kind and size of data files, and demonstrates the influence of testers and curators. The pilot also had to decide on the metadata fields to be used when data is submitted to the system. The complete setup, including the incorporated fields, was enhanced through pilot testing and user feedback. The pilot is now being expanded to other Federal Reserve Banks.


Papers for the IASSIST Quarterly are always very welcome. We welcome input from IASSIST conferences or other conferences and workshops, from local presentations or papers especially written for the IQ. When you are preparing a presentation, give a thought to turning your one-time presentation into a lasting contribution. We permit authors 'deep links' into the IQ as well as deposition of the paper in your local repository. Chairing a conference session with the purpose of aggregating and integrating papers for a special issue IQ is also much appreciated as the information reaches many more people than the session participants, and will be readily available on the IASSIST website at http://www.iassistdata.org.


Authors are very welcome to take a look at the instructions and layout: http://iassistdata.org/iq/instructions-authors.

Authors can also contact me via e-mail: kbr@sam.sdu.dk. Should you be interested in compiling a special issue for the IQ as guest editor(s) I will also be delighted to hear from you.


Karsten Boye Rasmussen
June 2016
Editor

IASSIST 2016 Program At-A-Glance, Part 2: Data infrastructure, data processing and research data management

 

Here's another list of highlights from IASSIST2016, which focuses on the data revolution. For previous highlights, see here.

Infrastructure

  • For those of you with an interest in technical infrastructure, the University of Applied Sciences HTW Chur will showcase an early prototype, MMRepo (1 June, 3F), whose function is to store qualitative and quantitative data in one big data repository.
  • The UK Data Service will present the following panel "The CESSDA Technical Framework - what is it and why is it needed?", which elaborates how the CESSDA Research Infrastructure should have modern data curation techniques rooted in sophisticated IT capabilities at its core, in order to better serve its community.

  • If you have been wondering about the various operational components and the associated technology counterparts involved with running a data science repository, then the presentation by ICPSR is for you. Participants in that panel will leave with an understanding of how the Archonnex Architecture at ICPSR is strengthening the data services offered to new researchers and much more.

Data processing

Be sure to check out the aforementioned infrastructure offerings if you’re interested in data processing, but also check out a half-day workshop on 31 May, “Text Processing with Regular Expressions,” presented by Harrison Dekker, UC Berkeley, that will help you learn regular expression syntax and how to use it in R, Python, and on the command line. The workshop will be example-driven.

Data visualisation

If you are comfortable working with quantitative data and are familiar with the R tool for statistical computing and want to learn how to create a variety of visualisations, then the workshop by the University of Minnesota on 31 May is for you. It will introduce the logic behind ggplot2 and give participants hands-on experience creating data visualizations with this package. This session will also introduce participants to related tools for creating interactive graphics from this syntax.

Programming

  • If you're interested in programming, there's a full-day Intro to Python for Data Wrangling workshop on 31 May, led by Tim Dennis, UC San Diego, that will provide tools to use scientific notebooks in the cloud, write basic Python programs, integrate disparate CSV files, and more.

  • In addition, the aforementioned Regular Expressions workshop on 31 May will offer in-workshop opportunities to work with real data and perform representative data cleaning and validation operations in multiple languages.

Research data management

  • To get a behind-the-scenes look at data management and see how an organization such as the Odum Institute manages its archiving workflows, head to "Automating Archive Policy Enforcement using Dataverse and iRODS" on 31 May, with presenters from the UNC Odum Institute, UNC Chapel Hill. Participants will see machine-actionable rules in practice and be introduced to an environment where written policies can be expressed in ways that let an archive automate their enforcement.

  • Another good half-day workshop, targeted at people tasked with teaching good research data management practices to researchers, is "Teaching Research Data Management Skills Using Resources and Scenarios Based on Real Data," 31 May, with presenters from ICPSR, the UK Data Archive and FORS. The organisers of this workshop will showcase recent examples of how they have developed teaching resources for hands-on training, and will talk about successes and failures in this regard.

Tools

If you’re just looking to add more resources to your data revolution toolbox, whether it’s metadata, teaching, data management, open and restricted access, or documentation, here’s a quick list of highlights:

  • At Creating GeoBlacklight Metadata: Leveraging Open Source Tools to Facilitate Metadata Genesis (31 May), presenters from New York University will provide hands-on experience in creating GeoBlacklight geospatial metadata, including demos on how to capture, export, and store GeoBlacklight metadata.

  • DDI Tools Demo (1 June). The Data Documentation Initiative (DDI) is an international standard for describing statistical and social science data.

  • DDI tools: No Tools, No Standard (3 June), where participants will be introduced to the work of the DDI Developers Community and get an overview of tools available from the community.

Open-access

As mandates for better accessibility of data affect more researchers, dive into the conversation with these IASSIST offerings:

Metadata

Don't miss IASSIST 2016's offerings on metadata, the data about data that makes finding and working with data easier. There are many offerings; a quick list of highlights is below:

  • Creating GeoBlacklight Metadata: Leveraging Open Source Tools to Facilitate Metadata Genesis (Half-day workshop, 31 May), with presenters from New York University

  • At Posters and Snacks on 2 June, Building A Metadata Portfolio For Cessda, with presenters from the Finnish Social Science Data Archive; GESIS – Leibniz-Institute for the Social Sciences; and UK Data Service

Spread the word on Twitter using #IASSIST16. 


A story by Dory Knight-Ingram (ICPSR)

Interested in the “data revolution” and what it means for research? Here’s why you should attend IASSIST2016

 

Part 1: Data sharing, new data sources and data protection

IASSIST is an international organisation of information technology and data services professionals which aims to provide support to research and teaching in the social sciences. It has over 300 members ranging from data archive staff and librarians to statistical agencies, government departments and non-profit organisations.

The theme of this year's conference is "Embracing the 'data revolution': opportunities and challenges for research", and this is the 42nd annual conference. IASSIST2016 will take place in Bergen, Norway, from 31 May to 3 June, hosted by NSD - Norwegian Centre for Research Data.

Here is a first snapshot of what is there and why it is important.

Data sharing

If you have ever wondered whether data sharing is to the advantage of researchers, there will be a session led by Utrecht University Library exploring the matter. The first results of a survey which explores personal beliefs, intention and behaviour regarding the sharing of data will also be presented by GESIS. The relationship between data sharing and data citation, relatively overlooked until now, will then be addressed by the Australian Data Archive.

If you are interested in how a data journal could incentivise replications in economics, you should think about attending a session by ZBW Leibniz Information Centre for Economics which will present some studies describing the outcome of replication attempts and discuss the meaning of failed replications in economics.

GESIS will then look into improving research data sharing by addressing different scholarly target groups such as individual researchers, academic institutions, or scientific journals, all of which place diverse demands on a data sharing tool. They will focus on the tools offered by GESIS as well as a joint tool, "SowiDataNet", offered together with the Social Science Centre Berlin, the German Institute for Economic Research, and the German National Library of Economics.

The UKDA and UKDS will present a paper which seeks to explore the role that case studies of research can play in regard to effective data sharing, reuse and impact.

The Data Archive in Finland (FSD) will also be presented as a case study of an archive that is broadening its services to the health sciences and humanities, disciplines in which data sharing practices have not yet been established.

If you’d like to know more about data accessibility, which is being required by journals and mandated by government funders, join a diverse group of open data experts as IASSIST dives into open data dialogue that includes presentations on Open Data and Citizen Empowerment and 101 Cool Things to do with Open Data as part of the “Opening up on open data workshop.” Presenters will be from archives from across the globe.

New data sources

A talk entitled “Data science: The future of social science?” by UKDA will introduce its conceptual and technical work in developing a big data platform for social science and outline preliminary findings from work using energy data.

If you have been wondering about the role of social media data in the academic environment, the session by the University of California will include an overview of the social media data landscape and the Crimson Hexagon product.

The three Vs of big data, volume, variety and velocity, are being explored in the “Hybrid Data Lake” being built by UKDA using the Universal Decimal Classification platform and expanding “topics” search while using big data management. Find out more about it as well as possible future applications.

Data protection

If you follow data protection issues, the panel on "Data protection: legal and ethical reviews" is for you, starting off with a presentation of the Administrative Data Research Network's (ADRN) Citizen's Panel, which looks at public concerns about research using administrative data, the content of which is both personal and confidential. The ADRN was set up as part of the UK Government's Big Data initiative as a UK-wide partnership between universities, government bodies, national statistics authorities and the wider research community.

The next ADRN presentation within this session will outline their application process and the role of the Approvals Panel in relation to ethical review. The aim is “to expand the discussion towards a broader reflection on the ethical dilemmas that administrative data pose”, as well as present some steps taken to address these difficulties.

NSD will then present the new EU General Data Protection Regulation (GDPR), recently adopted at EU level, and explain how it will affect data collection, data use, data preservation and data sharing. If you have been wondering how the regulation will influence the possibilities for processing personal data for research purposes, or how personal data are defined, what conditions apply to an informed consent, or in which cases it is legal and ethical to conduct research without the consent of the data subjects, this presentation is for you.

The big picture

Wednesday 1 June will kick-off with a plenary entitled “Data for decision-makers: Old practice - new challenges” by Gudmund Hernes, the current president of the International Social Science Council and Norway’s former Minister of Education and Research 1990-95, and Minister of Health 1995-97.

The third day of the conference (2 June) will begin with a plenary, "Embracing the 'Data Revolution': Opportunities and Challenges for Research, or What you need to know about the data landscape to keep up to date", by Matthew Woollard, Director of the UK Data Archive at the University of Essex and Director of the UK Data Service.

If you want to know more about the three European projects under the framework of the Horizon 2020 programme of the European Commission that CESSDA is involved in - one on big data (Big Data Europe - Empowering Communities with Data Technologies), another on strengthening and widening the European infrastructure for social science data archives (CESSDA SaW), and a third on synergies for Europe's Research Infrastructures in the Social Sciences (SERISS) - this panel is for you.

"Don't Hate the Player, Hate the Game": Strategies for Discussing and Communicating Data Services” considers how libraries might strategically reconsider communications about data services.

Keep an eye on this blog for more news in the run-up to IASSIST2016.

Find out more on the IASSIST2016 website.

Spread the word on Twitter using #IASSIST16.

We are looking forward to seeing you in Bergen! 


A story by Eleanor Smith (CESSDA)

Latest Issue of IQ Available! Data Documentation Initiative - Results, Tools, and Further Initiatives

Welcome to the third issue of Volume 39 of the IASSIST Quarterly (IQ 39:3, 2015). This special issue is guest edited by Joachim Wackerow of GESIS – Leibniz Institute for the Social Sciences in Germany and Mary Vardigan of ICPSR at the University of Michigan, USA. That sentence is a direct plagiarism from the editor's notes of the recent double issue (IQ 38:4 & 39:1). We are very grateful for all the work Mary and Achim have carried out and are developing further in the continuing story of the Data Documentation Initiative (DDI), and for their efforts in presenting the work here in the IASSIST Quarterly.

As in the recent double issue on DDI this special issue also presents results, tools, and further initiatives. The DDI started 20 years ago and much has been accomplished. However, creative people are still refining and improving it, as well as developing new areas for the use of DDI.

Mary Vardigan and Joachim Wackerow give on the next page an overview of the content of DDI papers in this issue.

Let me then applaud the two guest editors and also the many authors who made this possible:

  • Alerk Amin, RAND Corporation, www.rand.org, USA
  • Ingo Barkow, Associate Professor for Data Management at the University for Applied Sciences Eastern Switzerland (HTW Chur), Switzerland
  • Stefan Kramer, American University, Washington, DC, USA
  • David Schiller, Research Data Centre (FDZ) of the German Federal Employment Agency (BA) at the Institute for Employment Research (IAB)
  • Jeremy Williams, Cornell Institute for Social and Economic Research, USA
  • Larry Hoyle, senior scientist at the Institute for Policy & Social Research at the University of Kansas, USA
  • Joachim Wackerow, metadata expert at GESIS - Leibniz Institute for the Social Sciences, Germany
  • William Poynter, UCL Institute of Education, London, UK
  • Jennifer Spiegel, UCL Institute of Education, London, UK
  • Jay Greenfield, health informatics architect working with data standards, USA
  • Sam Hume, vice president of SHARE Technology and Services at CDISC, USA
  • Sanda Ionescu, user support for data and documentation, ICPSR, USA
  • Jeremy Iverson, co-founder and partner at Colectica, USA
  • John Kunze, systems architect at the California Digital Library, USA
  • Barry Radler, researcher at the University of Wisconsin Institute on Aging, USA
  • Wendy Thomas, director of the Data Access Core in the Minnesota Population Center (MPC) at the University of Minnesota, USA
  • Mary Vardigan, archivist at the Inter-university Consortium for Political and Social Research (ICPSR), USA
  • Stuart Weibel, worked in OCLC Research, USA
  • Michael Witt, associate professor of Library Science at Purdue University, USA.

I hope you will enjoy their work in this issue, and I am certain that the contact authors will enjoy hearing from you about new potential results, tools, and initiatives.

Articles for the IASSIST Quarterly are always very welcome. They can be papers from IASSIST conferences or other conferences and workshops, from local presentations or papers especially written for the IQ. When you are preparing a presentation, give a thought to turning your one-time presentation into a lasting contribution to continuing development. As an author you are permitted 'deep links' where you link directly to your paper published in the IQ. Chairing a conference session with the purpose of aggregating and integrating papers for a special issue IQ is also much appreciated as the information reaches many more people than the session participants, and will be readily available on the IASSIST website at http://www.iassistdata.org.

Authors are very welcome to take a look at the instructions and layout: http://iassistdata.org/iq/instructions-authors. Authors can also contact me via e-mail: kbr@sam.sdu.dk.

Should you be interested in compiling a special issue for the IQ as guest editor(s) I will also be delighted to hear from you.

Karsten Boye Rasmussen
September 2015
Editor

New Perspectives on DDI

This issue features four papers that look at leveraging the structured metadata provided by DDI in different ways. The first, "Design Considerations for DDI-Based Data Systems," aims to help decision-makers by highlighting the approach of using relational databases for data storage, in contrast to representing DDI in its native XML format. The second paper, "DDI as a Common Format for Export and Import for Statistical Packages," describes an experiment using the program Stat/Transfer to move datasets among five popular packages with DDI Lifecycle as an intermediary format. The paper "Protocol Development for Large-Scale Metadata Archiving Using DDI Lifecycle" discusses the use of a DDI profile to document CLOSER (Cohorts and Longitudinal Studies Enhancement Resources, www.closer.ac.uk), which brings together nine of the UK's longitudinal cohort studies by producing a metadata discovery platform (MDP). And finally, "DDI and Enhanced Data Citation" reports on efforts to extend data citation information in DDI to include a larger set of elements and a taxonomy for the role of research contributors.

Mary Vardigan - vardigan@umich.edu
Joachim Wackerow - Joachim.Wackerow@gesis.org

IQ double issue 38(4)/39(1) is up, and so is vol 39(2)!

Hi folks! A lovely gift for your reading pleasure over the holidays: we present two, yes, TWO issues of the IASSIST Quarterly. The first is the double issue, 38(4)/39(1), with guest editors Joachim Wackerow of GESIS – Leibniz Institute for the Social Sciences in Germany and Mary Vardigan of ICPSR at the University of Michigan, USA. This issue focuses on the Data Documentation Initiative (DDI) and how it makes meta-analysis possible. The second issue is 39(2), and is all about data: avoiding statistical disclosure, using data, and improving digital preservation. Although we usually post the full text of the Editor's Notes in the blog post, it seems lengthy to do that for both issues. You will find them, though, on the web site: the Editor's Notes for the double issue, and the Editor's Notes for issue 39(2).

Michele Hayslett, for the IQ Publications Committee

Looking Back/Moving Forward - Reflections on the First Ten Years of Open Repositories

The Open Repositories conference celebrated its first decade with four full days of exciting workshops, keynotes, sessions, 24/7 talks, and development track and repository interest group sessions in Indianapolis, USA. All the fun took place in the second week of June. The OR2015 conference was themed "Looking Back/Moving Forward: Open Repositories at the Crossroads" and it brought over 400 repository developers and managers, librarians and library IT professionals, service providers and other experts to hot and humid Indy.

Like with IDCC earlier this year, IASSIST was officially a supporter of OR2015. In my opinion, it was a worthy investment given the topics covered, depth and quality of presentations, and attendee profile. Plus I got to do what I love - talk about IASSIST and invite people to attend or present in our own conference.

While there may not be striking overlap between the IASSIST and OR conferences, I think there are sound reasons to keep building linkages between the two. IASSISTers could certainly provide beneficial insight on various RDM questions and also, for instance, on researchers' needs, scholarly communication, reusing repository content, research data resources and access, or data archiving and preservation challenges. We could take advantage of the passion and dedication the repository community shows in making repositories and their building blocks perfect. It's quite clear that there is a lot more to be achieved when repository developers and users meet and address problems and opportunities with creativity and commitment.

 

While IASSIST2015 had a plenary speaker from Facebook, OR had keynote speakers from Mozilla Science Lab and Google Scholar. Mozilla's Kaitlin Thaney skyped a very interesting opening keynote (that is what you resort to when thunderstorms prevent your keynote speaker from arriving!) on how to leverage the power of the web for research. A distributed and collaborative approach to research, public sharing and transparency, new models of discovery, the freedom to innovate and prototype, and peer-to-peer professional development were among the powers of web-enabled open science.
 
Anurag Acharya from Google gave a stimulating talk on pitfalls and best practices in indexing repositories. His points were primarily aimed at repository managers fine-tuning their repository platforms to be as easily harvestable as possible. However, many of his remarks are worth taking into account when building data portals or data-rich web services. On the other hand, it can be asked whether it is our job (as repository or data managers) to make things easy for Google Scholar, or whether we have other obligations that put our needs and our users first. Often these two are not in conflict, though. What is more notable from my point of view was Acharya's statement that Google Scholar does not index research outputs other than articles (data, appendixes, abstracts, code…) from the repositories. But should it not? His answer was that it would be lovely, but it cannot be done efficiently because these resources are not comprehensive enough, and it would not be possible, for example, to properly and accurately link users to actual datasets from the index. I'd like to think this is something for the IASSIST community to contemplate.

Open Researcher and Contributor ID (ORCID) had a very strong presence in OR2015. ORCID provides an open persistent identifier that distinguishes a researcher from every other researcher, and through their API interfaces that ID can be connected to organisational and inter-organisational research information systems, helping to associate researchers and their research activities. In addition to a workshop on ORCID APIs there were many presentations about ORCID integrations. It seems that ORCID is getting close to reaching a critical mass of users and members, allowing it to take big leaps in developing its services. However, it still remains to be seen how widely it will be adopted. For research data archiving purposes having a persistent identifier provides obvious advantages as researchers are known to move from one organisation to another, work cross-nationally, and collaborate across disciplines.

Many presentations at least partly addressed familiar but ever challenging research data service questions on deposits, providing data services for the researcher community and overcoming ethical, legal or institutional barriers, or providing and managing a trustworthy digital service with somewhat limited resources. Check for example Andrew Gordon's terrific presentation on Databrary, a research-centered repository for video data. Metadata harmonisation, ontologies, putting emphasis on high quality metadata and ensuring repurposing of metadata were among the common topics as well, alongside a focus on complying with standards - both metadata and technical.

I see a good opportunity and considerable common ground for shared learning here, for example for DDI and other metadata experts to work with repository developers and IASSIST's data librarians and archivists to provide training and take part in projects that concentrate on repository development in libraries or archives.

Keynotes and a number of other sessions were live-streamed and recorded for later viewing. Videos of the keynotes and some other talks, along with most presentation slides, are already available; the rest of the videos will be available in the coming weeks.

A decade against decay: the 10th International Digital Curation Conference

The International Digital Curation Conference (IDCC) is now ten years old. On the evidence of its most recent conference, it is in rude health and growing fast.

IDCC is the first conference of another organisation that IASSIST decided to formally support. I think it was a wise investment given the quality of plenaries, presentations, posters, and discussions.

The DCC has already made available a number of blogs covering the substance of the sessions, including an excellent summary by IASSIST web editor Robin Rice. Presentations and posters are already available, and video from the plenary sessions will soon be online.

Instead, I will use this opportunity to pick up on outstanding issues and suggestions for future conferences.

One was apportionment of responsibility. Ultimately, researchers are responsible for management of their data, but they can only do so if supporting infrastructure is in place to help them. So, who is responsible for providing that: funders or institutions? This theme emerged in the context of the UK’s Engineering and Physical Sciences Research Council who will soon enforce expectations identifying the institution as responsible for supporting good Research Data Management.

Related to that was a discussion on the role of libraries in this decade. Are they relevant? Can they change to meet new challenges? Starting out as a researcher who became a data archivist and is now a librarian, I wouldn’t be here if libraries weren’t meeting these challenges. There’s a “hush” of IASSIST members also ready to take issue with the suggestions libraries aren’t relevant or not engaged with data, in fact they did so at our last conference.

Melissa Terras (UCL) did a fantastic job presenting [PDF] work in the digital humanities that is innovative in not only preserving but rescuing objects - and all done on small-change research budgets. I hope a future IDCC finds space for a social sciences person to present on the issues we face in preservation and reuse. Clifford Lynch (CNI) touched on the problems of data reuse and human subjects, which remained one of the few glancing references to a significant problem, and one IASSIST members are addressing. Indeed, thanks must go to a former president of this association, Peter Burnhill (Edinburgh), who mentioned IASSIST and how it relates to the IDCC audience on more than one occasion.

Finally, if you were stimulated by IDCC’s talk of data, reuse, and preservation then don’t forget our own conference in Minneapolis later this year.
