The Challenge of Rescuing Data: Lessons and Thoughts

A version of this post originally appeared on the NYU Data Dispatch blog.

Data rescue efforts began in January 2017, and over the past few months many institutions hosted hack-a-thon style events to scrape data and develop strategies for preservation. The Environmental Data & Governance Initiative (EDGI) developed a data rescue toolkit, which apportioned the challenge of saving data by distinct federal agency. 

We've had a number of conversations at NYU and with other members of the library community about the implications of preserving federal data and providing access to it. The efforts, while important, call attention to a problem of organization that is very large in scope and likely cannot be solved in full by libraries.


Thus far, the divide-and-conquer model has postulated that individual institutions can "claim" a specific federal agency, do a deep dive to root around its websites, download data, and then mark the agency off a list as "preserved." The process raises many questions, for libraries and for the data refuge movement. What does it mean to "claim" a federal agency? How can one institution reasonably develop a "chain of custody" for an agency's comprehensive collection of data (and how do we define chain of custody)?

How do we avoid duplicated labor? Overlap is inevitable and isn't necessarily a bad thing, but given the scope of the challenge, it would be ideal to distribute efforts so as to benefit from the hard work of metadata remediation that all of us will inevitably do.

These questions suggest even more questions about communication. How do we know when a given institution has preserved federal data, and at what point do we feel ready as a community to acknowledge that preservation has sufficiently taken place? Further, do we expect institutions to communicate that a piece of data has been published, and if so, by what means? What does preservation mean, especially in an environment where data is changing frequently, and what is the standard for discovery? Is it sufficient for one person or institution to download a file and save it? And when an institution claims that it has “rescued” data from a government agency, what commitment does it have to keep up with data refreshes on a regular basis?

An example of an attempt to engage with these issues is Stanford University’s recent decision to preserve the Housing and Urban Development spatial datasets, since they were directly attacked by Republican lawmakers. Early in the Spring 2017 semester, Stanford downloaded all of HUD's spatial data, created metadata records for them, and loaded them into their spatial discovery environment (EarthWorks).

[Image: A HUD dataset preserved in Stanford's Spatial Data Repository and digital collections]

We can see from the timestamp on their metadata record that the files were added on March 24, 2017. Stanford's collection process is very robust and implies an impressive level of curation and preservation. As colleagues, we know that by adding a file, Stanford has committed to preserving it in its institutional repository, presenting the original FGDC or ISO 19139 metadata records, and publishing its newly created records to OpenGeoMetadata, a consortium of shared geospatial metadata records. Furthermore, we know that all records are discoverable at the layer level, which affords a granularity in description and access that is often not present in other sources, including Data.gov.

However, if I had not had conversations with colleagues who work at Stanford, I wouldn't have realized they had preserved the files at all and likely would've tried to make records for NYU's Spatial Data Repository. Even though the records exist, it's difficult for me to know whether these files were in fact saved as part of the Data Refuge effort. Furthermore, Stanford has made no public claim or long-term "chain of custody" agreement for the HUD data, simply because no standards for doing so currently exist.

Maybe it wouldn't be the worst thing for NYU to add these files to our repository, but it seems unnecessary, given the magnitude of federal data to be preserved. However, some redundancy is a part of the goals that Data Refuge imagines:

Data collected as part of the #DataRefuge initiative will be stored in multiple, trusted locations to help ensure continued accessibility. [...] DataRefuge acknowledges--and in fact draws attention to--the fact that there are no guarantees of perfectly safe information. But there are ways that we can create safe and trustworthy copies. DataRefuge is thus also a project to develop the best methods, practices, and protocols to do so.

Each institution has specific curatorial needs and responsibilities, which imply choices about providing access to materials in library collections. These practices seldom align with the data management and publishing practices of those who work with federal agencies. There has to be some flexibility between community efforts to preserve data and individual institutions' respective curation practices.

"That's Where the Librarians Come In"

NYU imagines a model that dovetails with the Data Refuge effort, in which individual institutions build upon their own strengths and existing infrastructure. We took as a directive some advice that Kimberly Eke at Penn circulated, including this sample protocol. We quickly began to realize that no approach is perfect, but we wanted to develop a pilot process for collecting data and bringing it into our permanent geospatial data holdings. The remainder of this post is a narrative of that experience, meant to demonstrate some of the choices we made, the assumptions we started with, and the strategies we deployed to preserve federal data. Our goal is to preserve a small subset of data in a way that benefits our users and also meets the standards of the Data Refuge movement.

We began by collecting the entirety of the publicly accessible metadata from Data.gov, using the underlying CKAN data catalog API. This provided us with approximately 150,000 metadata records, stored as individual JSON files. Anyone who has worked with Data.gov metadata knows that it's messy and inconsistent, but it is also a good starting place for developing better records. Furthermore, Data.gov serves as an effective registry or checklist (this global metadata vault could be another starting place); it's not the only source of government data, nor is it necessarily authoritative. However, it is a good point of departure: a relatively centralized list of items that exist in a form we can work with.
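As a rough illustration, this harvest can be scripted against the standard CKAN v3 action API that backs Data.gov. The sketch below is a minimal version of that step; the endpoint is real, but the output directory, page size, and error handling are our own assumptions, and a crawl of all ~150,000 records may need to work around server-side paging limits.

```python
# Minimal sketch: page through the CKAN package_search endpoint behind
# Data.gov and save each metadata record as an individual JSON file.
import json
import pathlib

import requests

API = "https://catalog.data.gov/api/3/action/package_search"
OUT = pathlib.Path("datagov_records")  # assumed output directory
OUT.mkdir(exist_ok=True)

start, page_size = 0, 1000  # 1,000 rows is a common CKAN page-size cap
while True:
    resp = requests.get(API, params={"rows": page_size, "start": start}, timeout=60)
    resp.raise_for_status()
    result = resp.json()["result"]
    for record in result["results"]:
        # CKAN record IDs are UUIDs, so they make safe file names.
        (OUT / f"{record['id']}.json").write_text(json.dumps(record, indent=2))
    start += page_size
    if start >= result["count"]:
        break
```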

Since NYU Libraries already has a robust spatial data infrastructure and has established workflows for accessioning GIS data, we began by reducing the set of Data.gov records to those which are likely to represent spatial data. We did this by searching only for files that meet the following conditions:

  • Record contains at least one download resource with a 'format' field that contains any of {'shapefile', 'geojson', 'kml', 'kmz'}
  • Record contains at least one resource with a 'url' field that contains any of {'shapefile', 'geojson', 'kml', ['original' followed by '.zip']}

That search generated 6,353 records that are extremely likely to contain geospatial data. We then transformed this subset of records into a CSV for review.
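A sketch of that filter, under the assumption that the harvested records follow the usual CKAN layout (a 'resources' list whose entries carry 'format' and 'url' fields), might look like the following; the directory name is carried over from the harvesting sketch above.

```python
# Filter harvested Data.gov records down to those likely to contain
# geospatial data, per the two conditions described above, then flatten
# the matches into a CSV for manual review.
import csv
import json
import pathlib
import re

FORMAT_HINTS = ("shapefile", "geojson", "kml", "kmz")
URL_HINTS = ("shapefile", "geojson", "kml")

def looks_spatial(record: dict) -> bool:
    for res in record.get("resources", []):
        fmt = (res.get("format") or "").lower()
        url = (res.get("url") or "").lower()
        if any(hint in fmt for hint in FORMAT_HINTS):
            return True
        if any(hint in url for hint in URL_HINTS) or re.search(r"original.*\.zip", url):
            return True
    return False

spatial = []
for path in pathlib.Path("datagov_records").glob("*.json"):
    record = json.loads(path.read_text())
    if looks_spatial(record):
        spatial.append(record)

with open("spatial_candidates.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "title", "organization"])
    for r in spatial:
        org = (r.get("organization") or {}).get("title", "")
        writer.writerow([r["id"], r.get("title", ""), org])
```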

The next step was to filter the set down and look for meaningful patterns. We first filtered out all records that were not from federal sources, grouped the remainder by agency, and started exploring them. Ultimately, we decided to rescue data from the Department of Agriculture, Forest Service. This agency seemed to be a good test case for a number of the challenges we had identified. We isolated 136 records and organized them here (click to view spreadsheet). However, we quickly realized that a sizable chunk of the records had somehow become inactive or defunct after we downloaded them (shaded in pink), perhaps because they had been superseded by another record. For example, this record is probably meant to represent the same data as this record. We can't know for sure, which means we immediately had to decide what to do with potential gaps. We forged ahead with the records that were still "live" in Data.gov.

About Metadata Cleaning

There are some limitations to the metadata in Data.gov that required our team to make a series of subjective decisions:

  1. Not everything in Data.gov points to an actual dataset. Often, records can point to other portals or clearinghouses of data that are not represented within Data.gov. We ultimately decided to omit these records from our data rescue effort, even if they point to a webpage, API, or geoservice that does contain some kind of data.
  2. The approach to establishing order on Data.gov is inconsistent. Most crucially for us, there is not a one-to-one correlation between a record and an individual layer of geospatial data. This happens frequently on federal sites. For instance, the record for the U.S. Forest Service Aerial Fire Retardant Hydrographic Avoidance Areas: Aquatic actually contains eight distinct shapefile layers that correspond to the different regions of coverage. NYU’s collection practice dictates that each of these layers be represented by a distinct record, but in the Data.gov catalog, they are condensed into a single record. 
  3. Not all data providers publish records for data on Data.gov consistently. Many agencies point to some element of their data that exists, but when you leave the Data.gov catalog environment and go to the source URL listed in the resources section of the record, you’ll find even more data. We had to make decisions about whether or not (and how) we would include this kind of data.
  4. It’s very common that single Data.gov metadata records remain intact, but the data that they represent changes. The Forest Service is a good example of this, as files are frequently refreshed and maintained within the USDA Forestry geodata clearinghouse. We did not make any effort in either of these cases to track down other sets of data that the Data.gov metadata records gesture toward (at least not at this time).

Relatedly, we did not make attempts to provide original records for different formats of what appeared to be the same data. In the case of the Forest Service, many of the records contained both a shapefile and a geodatabase, as well as other original metadata files. Our general approach was to save the shapefile and publish it in our collection environment, then bundle up all other "data objects" associated with a discrete Data.gov record and include them in the preservation environment of our Spatial Data Repository.

Finally, we realized that the quality of the metadata itself varies widely. It is a good starting place for creating discovery metadata, even if we agree that a Data.gov record is an arbitrary way to describe a single piece of data. However, we had to clean the Data.gov records to adhere to the GeoBlacklight standard and our own internal cataloging practices. Some of the revisions to the metadata are small and reflect choices that we make at NYU (these are highlighted in red). For instance, the titles were changed to reflect a date-title-area convention that we already use. Other fields (like Publisher) are authority controlled and were easy to change, while others, like format and provenance, were easy to add. For those unfamiliar with the GeoBlacklight standard, refer to the project schema pages and related documentation. Many of the metadata enhancements are system requirements for items to be discovered within our Spatial Data Repository. Subjects presented more of a problem, as these are drawn from an informal tagging system on Data.gov. We used an elaborate find-and-replace process to remediate these subjects into the LCSH Authority, which connects the items we collect to our larger library discovery environment.
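In spirit, the subject remediation is a crosswalk from free-form tags to controlled headings. The sketch below shows the idea; the mapping entries are invented examples, since the real crosswalk was assembled by hand during cleaning.

```python
# Illustrative find-and-replace remediation of informal Data.gov tags into
# LCSH headings. The mapping below is hypothetical; the actual table was
# built manually while reviewing the Forest Service records.
TAG_TO_LCSH = {
    "hydrography": "Hydrography",
    "wildfire": "Wildfires",
    "forests": "Forests and forestry",
}

def remediate_subjects(tags: list[str]) -> list[str]:
    """Map Data.gov tags onto LCSH headings, dropping anything unmapped."""
    headings = []
    for tag in tags:
        heading = TAG_TO_LCSH.get(tag.strip().lower())
        if heading and heading not in headings:
            headings.append(heading)
    return headings

print(remediate_subjects(["Hydrography", "Forests", "boundaries"]))
# -> ['Hydrography', 'Forests and forestry']
```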

The most significant changes are in the descriptions. We preserved the essence of each original Data.gov description, but we cleaned up the prose a little and added a way to trace the item we are preserving back to its original representation in Data.gov. In the aforementioned instances in which a single Data.gov record contains more than one shapefile, we generated an entirely new record for each layer and referenced it back to the original Data.gov UUID.

Future Directions: Publishing Checksums

Libraries' inability to represent precisely and accurately which datasets, or components of datasets, have been preserved is a serious impediment to embarking on a distributed repository / data-rescue project. Further, libraries need to know whether data objects have been preserved and where they reside. To return to the earlier example, how is New York University to know that a particular government dataset has already been "rescued" and is being preserved (whether via a publicly accessible repository interface or not)?

Moreover, even if there is a venue for institutions to discuss which government datasets fall within their collection priorities (e.g. "New York University cares about federal forestry data, and therefore will be responsible for the stewardship of that data"), it's not clear that there is a good strategy for representing the myriad ways in which the data might exist in its "rescued" form. Perhaps the institution that elects to preserve a dataset wants to make a few curatorial decisions in order to better contextualize the data with the rest of the institution's offerings (as we did with the Forest Service data). These types of decisions are not abnormal in the context of library accessioning.

The problem comes when the data processing practices of an institution, which are often idiosyncratic and filled with "local" decisions to some degree, start to inhibit the ability of individuals to identify a copy of a dataset as a copy. There is a potential tension between preservation (preserving the original file structure, naming conventions, and even level of dissemination of government data products) and discovery, where libraries often make decisions about the most useful way for users to find relevant data, decisions that may conflict with those exhibited in the source files.

To mitigate the problem sketched above, we propose a data store that all members of the library / data-rescue community can draw upon, in which arbitrary or locally specific mappings and organizational decisions can be related back to the original checksums of individual, atomic files. File checksums would be unique identifiers in such a data store; given a checksum, the service would display "claims" about which institutions hold the corresponding file and the context in which that file is accessible.
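No such service or schema exists yet, so the following is purely illustrative of what a claim lookup might return; every field name here is hypothetical, and the checksum is the one used in the example statement further below.

```python
# Hypothetical shape of the proposed data store: each atomic-file checksum
# maps to the "claims" institutions have made about holding that file.
claims = {
    "8a53c3c191cd27e3472b3e717e3c2d7d979084b74ace0d1e86042b11b56f2797": [
        {
            "institution": "New York University",
            "package": "nyu_example_package",  # hypothetical package identifier
            "access_url": "https://example.org/catalog/some-record",
            "publicly_accessible": True,
        }
    ],
}

def who_holds(checksum: str) -> list[dict]:
    """Return every claim the data store knows about for a given checksum."""
    return claims.get(checksum, [])
```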

Consider this as an example:

  • New York University, as part of an intentional data rescue effort, decides to focus on collecting and preserving data from the U.S. Forest Service.
  • The documents and data from Forest Service are accessible through many venues:
    • They (or some subset) are linked to from a Data.gov record
    • They (or some subset) are linked to directly from the FSGeodata Clearinghouse
    • They are available directly from a geoservices or FTP endpoint maintained by the Forest Service (such as here).
  • NYU wants a way to grab all of the documents from the Forest Service that it is aware of and make those documents available in an online repository. The question is, if NYU has made organizational and curatorial decisions about the presentation of the rescued documents, how can it be represented (to others) that the files in the repository are indeed preserved copies of other datasets? If, for instance, Purdue University comes along and wants to verify that everything on the Forest Service's site is preserved somewhere, it now becomes more difficult to do so, particularly since those documents never possessed a canonical or authoritative ID in the first place and could even have been downloaded originally from various source URLs.

Imagine instead that as NYU accessions documents (restructuring them and adding metadata), it not only creates checksum manifests (similar to, if not identical to, the ones created by default by BagIt) but also deposits those manifests in a centralized data store, in a form that lets the data store relate essential information:

The file with checksum 8a53c3c191cd27e3472b3e717e3c2d7d979084b74ace0d1e86042b11b56f2797 appears as a component of the document instituton_a_9876... held by New York University.

Assuming all checksums are computed at the lowest possible level on files rescued from federal agencies (i.e., always unzip archives, or otherwise get to an atomic file, before computing a checksum), such a service could use archival manifest data to signal to other institutions that a file has been preserved, regardless of whether it exists as a smaller component of a different intellectual entity, and it could even communicate additional data about where to find these preserved copies. In the example of the dataset mentioned above, the original Data.gov record represents eight distinct resources, including a shapefile, a geodatabase, an XML metadata document, an HTML file that links to an API, and more. For the sake of preservation, we could package all of these items, generate checksums for each, and then take the further step of contributing our manifest to this hypothetical data store. Then, as other institutions look to save other data objects, they could search against this data store and find checksums not merely at the package level but at the package-component level, allowing them to evaluate which portion or percentage of the data has been preserved.
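Generating such a manifest is straightforward; the sketch below hashes every atomic file in one rescued package, unzipping archives first, in the spirit of a BagIt manifest. The package directory name is a stand-in, and the deposit step is left as a comment since the centralized data store is hypothetical.

```python
# Sketch: compute SHA-256 checksums at the atomic-file level for one
# rescued package, unpacking zip archives so we hash components rather
# than containers. (Nested archives would need a second pass.)
import hashlib
import pathlib
import zipfile

def sha256(path: pathlib.Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def manifest(package_dir: pathlib.Path) -> dict[str, str]:
    """Return {relative_path: checksum} for every atomic file in a package."""
    for archive in list(package_dir.rglob("*.zip")):
        zipfile.ZipFile(archive).extractall(archive.with_suffix(""))
    return {
        str(p.relative_to(package_dir)): sha256(p)
        for p in package_dir.rglob("*")
        if p.is_file() and p.suffix != ".zip"
    }

for rel_path, checksum in manifest(pathlib.Path("usfs_fire_retardant")).items():
    print(checksum, rel_path)
    # A final step would deposit each (checksum, package, institution)
    # triple into the hypothetical centralized data store described above.
```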

A system such as the one sketched above could efficiently communicate preservation priorities to a community of practice, and could even serve the more general collection-development priorities of a library. Other work in this field, particularly that regarding IPFS, could tie in nicely; but unlike IPFS, this approach would provide a way to identify content that exists within file archives and would not necessitate any new infrastructure for hosting material. All it would require is for an institution to contribute checksum manifests and a small amount of accompanying metadata to a central data store.

Principles

Even though our rescue of the Forest Service data is still in process, we have learned a lot about the challenges associated with this project. We’re very interested in learning about how other institutions are handling the process of rescuing federal data and look forward to more discussions at the event in Washington D.C. on May 8.

New to IASSIST or Willing to Mentor Someone New? (IASSIST 2017 Conference)


We are excited to have new members in IASSIST. IASSIST is a home for data services professionals across many disciplines: librarians, data archivists, open data proponents, data support staff, etc. For some, it is an organization where you don’t have to explain what you do because our members already understand. We get metadata, data support, data access issues, database challenges, the challenge of replication and so much more! Although we are a long-established organization, new members are the lifeblood of IASSIST!
Networking is a great benefit of attending the IASSIST conference, but the week quickly goes by and it can be daunting to join a lively group like this. To get the most out of your membership, we encourage everyone to join the IASSIST mentorship program. Please sign up by Friday, May 12 using Google Forms. Conference contact assignments for IASSIST will be emailed by the end of the day Tuesday, May 16. If for any reason the email link does not open for you, go to
https://docs.google.com/forms/d/e/1FAIpQLScyW2B9m8o5-6Z0D0FPdsTcqkOpUDk_w_u4WwuyIpDWgjAQJQ/viewform?c=0&w=1

If you have any questions, please contact Bobray Bordelon (bordelon@princeton.edu). Thank you for participating & see you in Lawrence, Kansas!


Bobray Bordelon
IASSIST 2017 Mentor Program Coordinator
Economics & Finance Librarian/Data Services Librarian
Princeton University Library
bordelon@princeton.edu

#IDCC17: Notes from the International Digital Curation Conference 2017

For the third time, IASSIST sponsored the International Digital Curation Conference, this time enabling three students, one each from Switzerland, Korea, and Canada, to attend the conference, which titled itself "Upstream, Downstream: embedding digital curation workflows for data science, scholarship and society".

Data science was a strong theme of the three keynote presentations, in particular how curation and data management are active, integrated, ongoing parts of analysis rather than a passive epilogue to research.

Maria Wolters talked about how missing data can provide research insights by analysing patterns of absence, and how, counter-intuitively, the concept of managed forgetting (asking whether data is important to preserve and relevant at the moment) can improve the quality of datasets and help us better manage and find data. Alice Daish showed her work as a data scientist at the British Museum, with the goal of enabling data-informed decision-making. This involved identifying data "silos" and "wrangling" data into exportable formats, along with zealous use and promotion of R, but also thinking about the way data is communicated to management. Chris Williams demonstrated how the Alan Turing Institute handles data mining. He reports that about 80 percent of data mining work involves understanding and preparing data, ranging from understanding formats and running descriptives to look for outliers and anomalies, to cleaning untidy and inconsistent metadata and coding. The aim is to automate as much of this as possible with the Automatic Statistician project.

In a session on data policies, University of Toronto's Dylanne Dearborn and Leanne Trimble showed how libraries can use creative thinking to match publication patterns against journal data policies when providing support. Fieke Schoots outlined the approach at Leiden, which includes a requirement that PhD candidates state the location of their research data before their defence can take place, and a twenty-year retention period for Data Management Plans. Switching to journals, Ian Hrynaszkiewicz talked about the work Springer Nature has done to standardise journal data policies into one of four types, allied with support for authors and editors on policy identification and implementation.

Ruth Geraghty dealt with the ethical challenges of retro-fitting a data set for sharing. She introduced the Children's Research Network for Ireland and Northern Ireland. This involved attempting to obtain consent from participants for sharing, but also work on anonymising the data to enable sharing. Although a problematic and resource-intensive endeavour, the result is not only a reusable data set but also informed guidance for other projects on archiving and sharing. Niamh Moore has long experience of archiving her research and focused on another legacy archive, the Clayoquot Lives oral history project. Niamh is using Omeka as a sharing platform because it gives the researcher control over how the data can be presented for reuse. For example, Omeka has the capacity for creating exhibits to showcase themes.

Community is important in both curation and management. Marta Teperek and Rosie Higman introduced work at Cambridge on collaborative communities and data champions. Finding that a top-down compliance approach was not working, Cambridge moved to a bottom-up engagement style, bringing researchers into decision-making on policies and support. Data champions are a new approach to seeding advocates and trainers around the university as local contact points, based on a community-of-practice model. The rewards of this approach are potentially rich, but the cost of setting it up and managing it is high, and the behaviour of the community is not always controllable. Two presentations on community/citizen science from Andrea Copeland and Peter Darch also hit on the theme of controlling groups in curating data. The Galaxy Zoo project found there were lessons to learn about the behaviour of volunteers, particularly the negative impact of a "league table" credit system on retaining contributors, and how volunteers expected only to contribute classifications were in some cases doing data science work by noticing unusual objects.

A topic of relevance to social-science-focused curation is sensitive data. Debra Hiom introduced the University of Bristol's method of providing safe access to sensitive data. Once again, it's resource intensive, requiring a committee to classify data into levels of access and process reviews to ensure applications are genuine. However, the result is that data that cannot be open can still be shared responsibly. Sebastian Karcher from the Qualitative Data Archive spoke about managing sensitive data in the cloud, a task further complicated by the lack of a federal data protection law in the United States. Elizabeth Hull (Dryad) presented on developing an ethical framework for curating social media data. A common perception is that social media posts are fair use if made public. However, from an ethical perspective, posters may not understand that their "data" is being collected for research purposes, and users need to know that use of @ or # on Twitter means they are inviting involvement and sharing in wider discussions. Hull offered a "STEP" approach as a way to deal with social media data, balancing the benefit of preservation and sharing against the risk of harm and reasonable consent from research subjects.

IASSIST Quarterly (IQ) volume 40-2 is now on the website: Revolution in the air

Welcome to the second issue of Volume 40 of the IASSIST Quarterly (IQ 40:2, 2016). We present three papers in this issue.

http://iassistdata.org/iq/issue/40/2

First, there are two papers on the Data Documentation Initiative that have their own special introduction. I want to express my respect and gratitude to Joachim Wackerow (GESIS - Leibniz Institute for the Social Sciences). Joachim (Achim) and Mary Vardigan (University of Michigan) have several times and for many years communicated to and advised the readers of the IASSIST Quarterly on the continuing development of the DDI. The metadata of data is central for the use and reuse of data, and we have come a long way through the efforts of many people.    

The IASSIST 2016 conference in Bergen was a great success - I am told. I was not able to attend but heard that the conference again was 'the best ever'. I was also told that among the many interesting talks and inputs at the conference, Matthew Woollard's keynote speech on 'Data Revolution' was high on the list. Good to have well-informed informers! Matthew Woollard is Director of the UK Data Archive at the University of Essex. Here in the IASSIST Quarterly we bring you a transcript of his talk. Woollard starts his talk on the data revolution with the possibility of bringing users access to data, rather than bringing data to users. The data is in the 'cloud' - in the air - 'Revolution in the air' to quote a Nobel laureate. We are not yet in the post-revolutionary phase, and many issues still need to be addressed. Woollard argues that several data skills are in demand, like an understanding of data management and of the many ethical issues. Although he is not enthusiastic about the term 'Big Data', Woollard naturally addresses the concept, as these days we cannot talk about data - and surely not about data revolution - without talking about Big Data. I fully support his view that we should proceed with caution, so that we are not simply replacing surveys where we 'ask more from fewer' with big data that give us 'less from more'. The revolution gives us new possibilities, and we will see more complex forms of research that will challenge data skills and demand solutions at data service institutions.

Papers for the IASSIST Quarterly are always very welcome. We welcome input from IASSIST conferences or other conferences and workshops, from local presentations or papers especially written for the IQ. When you are preparing a presentation, give a thought to turning your one-time presentation into a lasting contribution. We permit authors 'deep links' into the IQ as well as deposition of the paper in your local repository. Chairing a conference session with the purpose of aggregating and integrating papers for a special issue IQ is also much appreciated as the information reaches many more people than the session participants, and will be readily available on the IASSIST website at http://www.iassistdata.org

Authors are very welcome to take a look at the instructions and layout:

http://iassistdata.org/iq/instructions-authors

Authors can also contact me via e-mail: kbr@sam.sdu.dk. Should you be interested in compiling a special issue for the IQ as guest editor(s) I will also be delighted to hear from you.

Karsten Boye Rasmussen   
Editor, IASSIST Quarterly

IASSIST's Statement in Response to President’s Executive Order on Visas and Immigration


February 13, 2017

Statement of the International Association for Social Science Information Services and Technology (IASSIST at http://iassistdata.org) in response to President Trump's January 27 Executive Order on Visas and Immigration, titled "PROTECTING THE NATION FROM FOREIGN TERRORIST ENTRY INTO THE UNITED STATES".

The recent executive order on visas and immigration issued on January 27th by US President Trump is of grave concern to IASSIST as an organization. IASSIST, the International Association for Social Science Information Services and Technology, is an international organization of professionals working in and with information technology, libraries, data services and research & higher education to support open science, advocate for responsible data management and use, build a broader community surrounding research data, and encourage the development of data professionals. Our membership is international, and we greatly value the ability to travel and meet to share knowledge at locations around the world. Our international fellows program and other initiatives are specifically designed to encourage participation from underrepresented regions, including the Muslim-majority countries targeted by the executive order.

While recognizing the authority of the United States over its borders, there are several aspects of this order that are troubling, viz.:

  1. Its sudden and chaotic implementation has led to severe uncertainty over whether rules and practices for entering the United States will be subject to rapid and arbitrary change.
  2. It has led to the detention of lawful permanent residents of the United States, the revocation of visas previously granted under proper vetting procedures, the perception of potential discrimination on the basis of religion, and the humanitarian crisis caused by ceasing to accept refugees.
  3. It introduces several restrictive elements into the domain of visas and immigration, such as the statement that those entering the US, including temporary visitors, must "support the Constitution".

For these reasons, the order generates a hostile climate for the open, collaborative scientific work of our organization, both for non-US persons seeking to work and collaborate with Americans, and for Americans traveling and working outside of the US to collaborate who may face retributive actions from other states. Our membership has legitimate concerns about whether travel to the US is possible under such conditions. The order also may have long-term repercussions that damage the reputation of the US as a location that is open to visitors and immigrants, supporting the open exchange of ideas, and protected under the rule of law from arbitrary changes impacting human freedom. In response, IASSIST will continue to speak out in favor of our organization's goals, and against such threats to international collaboration in research and data sharing.

Our May 2017 annual conference will be held in Lawrence, Kansas. Arrangements were begun long before the Executive Order on Visas and Immigration, and it is impossible to change the venue at this date. IASSIST stands in solidarity with its members and encourages them to attend the conference and participate in the international exchange of ideas that is the purpose of our association. We hope that no member will be denied entry into the US due to the administration's recent actions. IASSIST will assist its membership with visa issues and other concerns emanating from this order. We also reaffirm that we are committed to an environment free from discrimination, harassment, and retaliation, at the annual conference and all IASSIST activities.

 Tuomas J. Alaterä, President
 Jen Green, Vice-President
 Ryan Womack, Secretary
 Thomas Lindsay, Treasurer

International Association for Social Science Information Service and Technology (IASSIST)

IASSIST Call for Event Sponsorship Proposals

The IASSIST Liaison and Organizational Sponsorship Task Force is seeking proposals for sponsorship of regional or local events during calendar year 2017. The goal of these sponsorships is to support local networks of data professionals and data-related activities across the globe, in order to support IASSISTers' activities throughout the year and increase awareness of the value of IASSIST membership.

Events should be a gathering of data professionals from multiple institutions and may vary in size and scope: workshops, symposia, conferences, etc. These may be established events or new endeavors. We are particularly looking to sponsor regional or local events that will attract data professionals who would benefit from IASSIST membership but may not always be able to travel to attend IASSIST conferences. Preference will be given to events from geographic areas outside of traditional IASSIST conference locations (North America and Western Europe) and from underrepresented membership areas such as Latin/South America, Africa, Asia/Pacific, and Eastern Europe.

Requests for sponsorships may be monetary, and may also include a request for mentorship assistance by matching the event planning committee with an experienced IASSIST member with relevant expertise (e.g., conference planning, subject/content, geographic familiarity).

Accepted events will be required to designate an active IASSIST member as the liaison. Generally, this would be an IASSIST member who will be attending the event and who, although not required to, may be on the planning committee or otherwise contribute to the event. The liaison will be responsible for helping coordinate logistics related to the sponsorship, ensuring that the sponsorship is recognized at the event, and contributing a post about the event to the IASSIST iBlog.

Proposals should include:

  • Name of the event and event details (date, location, any other pertinent information)
  • Organizing or hosting institution
  • Description of event and how it relates to IASSIST goals and communities
  • Specific request for sponsorship: amount of money and/or mentorship assistance
  • Description of how the sponsorship will be used
  • Name and contact information of person submitting proposal and designated event liaison to IASSIST (if different)

Proposals are due on Friday, January 13, 2017 via the Application Form. Notification of sponsorship awards will be made by Friday, February 3, 2017. The number and monetary extent of awarded sponsorships will depend on the number and quality of applications received. Individual sponsorship requests may range from $0 USD (request for mentorship only) to $2,000 USD.

Please direct questions to Hailey Mooney, IASSIST Membership Chair (haileym@umich.edu).

IASSIST sponsors IFLA 2016 Knowledge Management conference

IASSIST proudly sponsored a full-day conference about knowledge management (KM) on August 12, 2016 at the University of Cincinnati in Cincinnati, Ohio, USA. The theme of the conference was Sharing Practices and Actions for Making Best Use of Organizational Knowledge in Libraries. The conference took place as part of the International Federation of Library Associations' (IFLA) annual conference, held this year in Columbus, Ohio, USA.

The KM conference featured two keynote speakers: Valerie Forrestal, author of the 2015 book Knowledge Management in Libraries, and Jay Liebowitz, whose most recent book Successes and Failures of Knowledge Management was published just this year.

In addition to the keynotes, we had six scholarly presentations from information professionals on a variety of KM topics. Five of the accepted papers are available in full text. We also had speakers and audience members from outside the United States, including Canada, China, and Iran.

The entire IFLA Knowledge Management Section thanks IASSIST for their sponsorship of the conference. In the future, we hope that our section can work collaboratively with IASSIST in the shared interest of information, knowledge, and data topics worldwide.

I hope to see many of you at IASSIST 2017 in Lawrence, Kansas!

Spencer Acadia, IFLA KM 2016 Program Chair and Standing Committee Member, acadias1@gmail.com

IASSIST 2017 Call for Proposals Now Open!

We are delighted to announce the call for proposals for the IASSIST 2017 Conference.

IASSIST 2017 CALL FOR PROPOSALS

Data in the Middle: The common language of research

The 43rd annual conference of the International Association for Social Science Information Services and Technology (IASSIST) will be held in Lawrence, Kansas from May 23-26, 2017. #iassist17

Many issues around data (sources, strategies, and tools) are similar across disciplines. While IASSIST has its roots in social science data, it has also welcomed discussions over the years of other disciplines' issues as they relate to data, data management, and support of users. So again this year, in line with this tradition, we are arranging a conference that will benefit those who support researchers across all disciplines: social sciences, health and natural sciences, and humanities. Please join the international data community in Lawrence, KS, "in the middle" of the U.S., for insights and discussion on how data in all disciplines are found, shared, used, and managed. Join us and draw inspiration from this diverse gathering! 

We welcome submissions for papers, presentations, panels, posters, and pecha kuchas.

The full Call for Proposals, along with the link to the submission form, can be accessed on the conference website here: 

http://www.iassist17.dept.ku.edu/proposals/

Questions can be directed to the Program Chairs, Samantha Guss and Michele Hayslett, at iassist2017@gmail.com.

 

Pre-conference Workshops

We are also accepting submissions for Pre-conference Workshops under a separate Call for Workshops, which can be accessed here: 

http://www.iassist17.dept.ku.edu/proposals/workshops/

Questions about workshops may be sent to the Workshop Coordinators, Jenny Muilenburg (jmuil@uw.edu) and Andy Rutkowski (arutkowski@library.ucla.edu).

 

Deadline for all submissions: 21 November 2016.

Notification of acceptance: February 2017.

Notes from the second Jisc Research Data Network event

Jisc held their second Research Data Network event in Cambridge. I went along to take notes.

Danny Kingsley gave an overview of why data sharing is important, which was useful as introduction for those new to this, and a refresher of first principles to the more experienced.

The day then moved into parallel sessions on aspects of the network's activity.

The Research Data Shared Service is an initiative to help institutions with RDM infrastructure. Jisc research suggests the priority for universities is addressing the digital preservation gap. Consequently, Jisc are looking at providing data repository and long-term preservation services, as well as considering how a service could integrate with existing CRIS systems and repositories. This will take place in a "University of Jisc" that provides a testing environment using research data.

Jisc are developing templates and guidance for publishers on creating a research data policy, which publishers can then adapt to their journals. They are working with Springer Nature, who are trying to fit their 3,000 journals into one of four types of data policy, ranging from encouraged to mandatory sharing and availability criteria.

Cambridge's Research Data support service provided insight into engaging researchers in research data management. Their initial compliance message was not working, so they switched to a positive benefits message. This is underpinned by "adequate provisions": online information, consultancies, reviews of data management plans, and training sessions. They also invest resources in advocacy and outreach, including a "democratic" approach that involves researchers in shaping the service and policies.

Jisc are developing a "core" metadata profile for research data. The profile is based on focus group testing and integration with existing standards. The aim is to encourage better quality metadata submissions from researchers, with "gold, silver, and bronze" thresholds.

The final session introduced Jisc's template business case for RDM support. This is intended to allow institutions to adapt a structured case for supporting RDM services that can be presented to university management. The case covers the economic benefits of data sharing and preservation, along with institutional and researcher benefits, with a focus on numbers. My particular favourite: UK universities hold an estimated 450 petabytes of research data. The case will be available this autumn.

Should you have further interest in their activities, Jisc have a Research Data Network website and presentations from the day are also available.

IQ 40:1 Now Available!

Our World and all the Local Worlds
Welcome to the first issue of Volume 40 of the IASSIST Quarterly (IQ 40:1, 2016). We present four papers in this issue. The first paper presents data from our very own world, extracted from papers published in the IQ through four decades. What is published in the IQ is often limited in geographical scope, and in this issue the other three papers present investigations and project research carried out at New York University, Purdue University, and the Federal Reserve System. However, the subject scope of the papers and the methods employed bring great diversity. And although the papers are local in origin, they all have a strong focus on generalization in order to spread the information and experience.


We proudly present the paper that received the 'best paper award' at the IASSIST conference 2015. Great thanks are expressed to all the reviewers who took part in the evaluation! In the paper 'Social Science Data Archives: A Historical Social Network Analysis', the authors Kristin R. Eschenfelder (University of Wisconsin-Madison), Morgaine Gilchrist Scott, Kalpana Shankar, and Greg Downey report on inter-organizational influence and collaboration among social science data archives, drawing on data from articles published in the IASSIST Quarterly from 1976 to 2014. The paper demonstrates social network analysis (SNA) using a web of 'nodes' (people/authors/institutions) and 'links' (relationships between nodes). Several types of relationships are identified: influencing, collaborating, funding, and international. The dynamics are shown in detail by employing five-year sections. I noticed that from a reluctant start the number of relationships has grown significantly, and archives have continuously grown better at bringing in 'influence' from other 'nodes'. The paper contributes to the history of social science data archives and the shaping of a research discipline.


The paper 'Understanding Academic Patrons' Data Needs through Virtual Reference Transcripts: Preliminary Findings from New York University Libraries' is authored by Margaret Smith and Jill Conte, who are both librarians at New York University, and Samantha Guss, a librarian at the University of Richmond who worked at New York University from 2009 to 2014. The goal of their paper is 'to contribute to the growing body of knowledge about how information needs are conceptualized and articulated, and how this knowledge can be used to improve data reference in an academic library setting'. This is carried out through analysis of chat transcripts of requests for census data at NYU. There is a high demand for the virtual services of the NYU Libraries, with as many as 15,000 annual chat transactions. There has not been much qualitative research on users' data needs; here the authors exemplify the iterative nature of grounded theory, with data collection and analysis processes inextricably entwined, using a range of software tools like FileLocator Pro, TextCrawler, and Dedoose. Three years of chat reference transcripts were filtered down to 147 transcripts related to United States and international census data. The unique data provides several insights, shown in the paper. However, the authors are also aware of the limitations of the method, as it did not capture whether the patron or librarian considered the interaction successful. The conclusion is that there is a need for additional librarian training and improved research guides.


The third paper is also from a university. Amy Barton, Paul J. Bracke, and Ann Marie Clark, all from Purdue University, collaborated on the paper 'Digitization, Data Curation, and Human Rights Documents: Case Study of a Library Researcher-Practitioner Collaboration'. The project concerns the digitization of Urgent Action Bulletins of Amnesty International from 1974 to 2007. The political science research centered on changes in transnational human rights advocacy and legal instrumentation, while the Libraries' research related to data management, metadata, the data lifecycle, etcetera. The specific research collaboration model developed was also generalized for future practitioner-librarian collaboration projects. The project is part of a recent tendency for academic libraries to improve engagement and combine activities between libraries, users, and institutions. The project attempts to integrate two different lifecycle models, thus serving both research and curatorial goals, where the central question is: 'can digitization processes be designed in a manner that feeds directly into analytical workflows of social science researchers, while still meeting the needs of the archive or library concerned with long-term stewardship of the digitized content?'. The project builds on data from Urgent Action Bulletins produced by Amnesty International, which indicate how human rights concerns changed over time and the threats in different countries at different periods, and combines library standards for digitization and digital collections with researcher-driven metadata and coding strategies. The data creation started with the scanning and creation of optical character recognized (OCR) full-text PDFs for text recognition and modeling in NVivo software. The project did succeed in developing shared standards. However, a fundamental challenge was experienced in the grant-driven timelines for both library and researcher. It seems to me that the expectation of parallel work was the challenge to the project. Things take time.


In the fourth paper we enter the case of the Federal Reserve System. San Cannon and Deng Pan, working at the Federal Reserve Banks in Kansas City and Chicago, created a pilot infrastructure and workflow to support making the publication of research data a regular part of the research lifecycle. This is reported in the paper 'First Forays into Research Data Dissemination: A Tale from the Kansas City Fed'. More than 750 researchers across the system produce about 1,000 journal articles, working papers, etcetera each year. The need for data to support the research has been recognized, and the institution is setting up a repository and defining a workflow to support data preservation and future dissemination. In early 2015 the internal Center for the Advancement of Research and Data in Economics (CADRE) was established with a mission to support, enhance, and advance data- or computationally-intensive research; preservation and dissemination were identified as important support functions for CADRE. The paper presents details of and questions about the design, such as the types of collections and the kind and size of data files, and demonstrates the influence of testers and curators. The pilot also had to decide on the metadata fields to be used when data is submitted to the system. The complete setup, including the incorporated fields, was enhanced through pilot testing and user feedback. The pilot is now being expanded to other Federal Reserve Banks.


Papers for the IASSIST Quarterly are always very welcome. We welcome input from IASSIST conferences or other conferences and workshops, from local presentations or papers especially written for the IQ. When you are preparing a presentation, give a thought to turning your one-time presentation into a lasting contribution. We permit authors 'deep links' into the IQ as well as deposition of the paper in your local repository. Chairing a conference session with the purpose of aggregating and integrating papers for a special issue IQ is also much appreciated as the information reaches many more people than the session participants, and will be readily available on the IASSIST website at http://www.iassistdata.org.


Authors are very welcome to take a look at the instructions and layout: http://iassistdata.org/iq/instructions-authors.

Authors can also contact me via e-mail: kbr@sam.sdu.dk. Should you be interested in compiling a special issue for the IQ as guest editor(s) I will also be delighted to hear from you.


Karsten Boye Rasmussen
June 2016
Editor
