
OR2013: Open Repositories Confront Research Data

Open Repositories 2013 was hosted by the University of Prince Edward Island from July 8-12. A strong research data stream ran throughout this conference, which was attended by over 300 participants from around the globe.  To my delight, many IASSISTers were in attendance, including the current IASSIST President and four Past-Presidents!  Rarely do such sightings happen outside an IASSIST conference.

This was my first Open Repositories conference and after the cool reception that research data received at the SPARC IR meetings in Baltimore a few years ago, I was unsure how data would be treated at this conference.  I was pleasantly surprised by the enthusiastic interest of this community toward research data.  It helped that there were many IASSISTers present but the interest in research data was beyond that of just our community.  This conference truly found an appropriate intersection between the communities of social science data and open repositories. 

Thanks go to Robin Rice (IASSIST), Angus Whyte (DCC), and Kathleen Shearer (COAR) for organizing a workshop entitled, “Institutional Repositories Dealing with Data: What a difference a ‘D’ makes!”  Michael Witt, Courtney Matthews, and I joined these three organizers to address a range of issues that research data pose for those operating repositories.  The registration for this workshop was capped at 40 because of our desire to host six discussion tables of approximately seven participants each.  The workshop was fully subscribed and Kathleen counted over 50 participants prior to the coffee break.  The number clearly expresses the wider interest in research data at OR2013.

Our workshop helped set the stage for other sessions during the week.  For example, we talked about environmental drivers popularizing interest in research data, including topics around academic integrity.  Regarding this specific issue, we noted that the focus is typically directed toward specific publication-related datasets and the access needed to support the reproducibility of published research findings.  Both the opening and closing plenary speakers addressed aspects of academic integrity and the role of repositories in supporting the reproducibility of research findings.  Victoria Stodden, the opening plenary speaker, presented a compelling and articulate case for access to both the data and computer code upon which published findings are based.  She calls herself a computational scientist and defends the need to preserve computer code as well as data to facilitate the reproducibility of scientific findings.  Jean-Claude Guédon, the closing plenary speaker, bracketed this discussion on academic integrity.  He spoke about scholarly publishing and how the commercial drive toward indicators of excellence has resulted in cheating.  He likened some academics to Lance Armstrong, cheating to become number one.  He feels that quality rather than excellence is a better indicator of scientific success.

Between these two stimulating plenary speakers, there were a number of sessions during which research data were discussed.  I was particularly interested in a panel of six entitled, “Research Data and Repositories,” especially because the speakers were from the repository community instead of the data community.  They each took turns responding to questions about what their repositories do now regarding research data and what they see happening in the future.  In a nutshell, their answers tended to describe the desire to make better connections between the publications in their repositories and the data underpinning the findings in those articles.  They also spoke about the need to support more stages of the research lifecycle, which often involves aspects of the data lifecycle within research.  There were also statements that reinforced the need for our (IASSIST’s) continued interaction with the repository community.  Practices such as relying on readme files in the absence of standards-based metadata, an area where our data community has moved the best-practice yardstick well beyond, demonstrate the need for our communities to remain in dialogue.

Chuck Humphrey

Introducing the IASSIST Data Visualization Interest Group (DVIG!)

Hello fellow IASSISTers,

With the 2013 conference fast approaching, we thought it very fitting to introduce you all to the newly created IASSIST Data Visualization Interest Group. Formed over the winter and spring of 2013, this group brings together over 46 IASSIST members from across the world (literally across the world! Check out the map of our locations), all of whom are interested in data visualization. We hope to share a range of skills and information around tools and best-practice visualization, and to discuss innovative representations of data, statistics, and information. Here is just a glimpse of our group’s tools exposure.

As research becomes more interdisciplinary and data and information are more readily used and reused, core literacies surrounding the use and understandability of data are required. Data visualization offers a means to make sense of data through visual representation and to communicate ideas and information effectively. It is also quickly becoming a well-developed field, not only in terms of technology (the development of tools for analyzing and visualizing data) but also as an established field of study and research discipline. As data and information professionals, we need to stay abreast of the latest technologies, disciplines, methods and techniques used for research in this data-intensive and changing research landscape. Data visualization, with its many branches and techniques, seeks to present data, information, and statistics in new ways, ways that researchers are harnessing with high-powered (and sometimes not so high-powered) computers to analyze data. From conventional ways to visualize and graph data, such as tables, histograms, pie charts, and bar and line graphs, to the often more complex network relationship diagrams, cluster and burst analysis, and text analysis charts, we see data visualization techniques at play more than ever.
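
As a minimal illustration of the conventional end of that spectrum, the sketch below draws a histogram and a bar chart in Python with matplotlib. The library choice and the made-up figures are assumptions for illustration only, not tools or data endorsed by the group.

    # A minimal sketch: two conventional chart types with matplotlib.
    # All values below are invented purely for illustration.
    import matplotlib.pyplot as plt

    hours = [2, 5, 7, 3, 8, 12, 6, 4, 9, 5, 7, 10]       # made-up survey responses
    categories = ["Spatial", "Temporal", "Network", "Text"]
    counts = [14, 22, 9, 5]                               # made-up counts by type

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

    ax1.hist(hours, bins=5, edgecolor="black")            # a conventional histogram
    ax1.set_xlabel("Hours per week")
    ax1.set_ylabel("Respondents")

    ax2.bar(categories, counts)                           # a conventional bar chart
    ax2.set_ylabel("Count")

    fig.tight_layout()
    plt.show()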

This group has set a core mission and charge to focus on promoting a greater understanding of data visualization: its creation, uses, and importance in research across disciplines.  Particular areas of focus include, but are not limited to, the following:

  • Enable opportunities for IASSIST members to learn and enhance their skills in this growing field;
  • Support a culture of best practice for data visualization techniques: creation, use, and curation;
  • Discuss the relevant tools (programs, web tools, and software) for all kinds of data visualizations (spatial, temporal, categorical, multivariate, graphing, networks, animation, etc.);
  • Provide input and feedback on data visualization tools;
  • Capture examples of data visualization to emulate and to avoid;
  • Explore opportunities for service development in libraries;
  • Be aware of, and communicate to others, the needs of researchers in this field;
  • Use data visualization to allow pre-analysis browsing of data content in repositories;
  • Connect with communities of metadata developers and users (e.g., the DDI Alliance) to gain a better understanding of how metadata can enable better visualization, and how, in turn, visualization needs might drive the development of metadata standards;
  • And more!

Please join me in welcoming this new interest group; we hope to share and learn from you all at the upcoming conference! We are always seeking input and new ideas, so please get in touch with us at iassist-dataviz@lists.carleton.edu (either I or another member can add you to the group).

All the best, and Happy Easter!

Amber Leahey

Some reflections on research data confidentiality, privacy, and curation

Limor Peer

Maintaining research subjects’ confidentiality is an essential feature of the scientific research enterprise. It also presents special challenges to the data curation process. Does the effort to open access to research data complicate these challenges?

A few reasons why I think it does: More data are discoverable and could be used to re-identify previously de-identified datasets; systems are increasingly interoperable, potentially bridging what may have been insular academic data with other data and information sources; growing pressure to open data may weaken some of the safeguards previously put in place; and some data are inherently identifiable.
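
To make the first of these concrete, here is a toy sketch, in Python with pandas, of how a “de-identified” research file can be re-identified by linking it to another discoverable dataset on shared quasi-identifiers. The tables and column names are invented assumptions, chosen only to illustrate the linkage risk.

    # Toy illustration of linkage-based re-identification (all data invented).
    import pandas as pd

    # A "de-identified" research extract: no names, but quasi-identifiers remain.
    survey = pd.DataFrame({
        "zip": ["02138", "10027"],
        "birth_year": [1970, 1985],
        "sex": ["F", "M"],
        "sensitive_response": ["diagnosis A", "diagnosis B"],
    })

    # A separate, publicly discoverable list (e.g., a directory) with identities.
    directory = pd.DataFrame({
        "name": ["Jane Roe", "John Doe"],
        "zip": ["02138", "10027"],
        "birth_year": [1970, 1985],
        "sex": ["F", "M"],
    })

    # Joining on the shared quasi-identifiers re-attaches names to responses.
    reidentified = survey.merge(directory, on=["zip", "birth_year", "sex"])
    print(reidentified[["name", "sensitive_response"]])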

But these challenges should not diminish the scientific community’s firm commitment to both principles. It is possible, and desirable, for openness and privacy to co-exist. It will not be simple to do, and here’s what we need to keep in mind:

First, let’s be clear about semantics. Open data and public data are not the same thing. As Melanie Chernoff observed, “All open data is publicly available. But not all publicly available data is open.” This distinction is important because what our community means by open (standards, format) may not be what policy-makers and the public at large mean (public access). Chernoff rightly points out that “whether data should be made publicly available is where privacy concerns come into play. Once it has been determined that government data should be made public, then it should be done so in an open format.” So, yes, we want as much data as possible to be public, but we most definitely want data to be open.

Another term that could be clarified is usefulness. In the academic context, we often think of data re-use by other scholars, in the service of advancing science. But what if the individuals from whom the data were collected are the ones who want to make use of it? It’s entirely conceivable that the people formerly known as “research subjects” begin demanding access to, and control over, their own personal data as they become more accustomed to that in other contexts. This will require some fresh ideas about regulation and some rethinking of the concept of informed consent (see, for example, the work of John Wilbanks, NIH, and the National Cancer Institute on this front). The academic community is going to have to confront this issue.

Precisely because terms are confusing and often vaguely defined, we should use them carefully. It’s tempting to pit one term against the other, e.g., usefulness vs. privacy, but it may not be productive. The tension between privacy and openness or transparency does not mean that we have to choose one over the other. As Felix Wu says, “there is nothing inherently contradictory about hiding one piece of information while revealing another, so long as the information we want to hide is different from the information we want to disclose.” The complex reality is that we have to weigh them carefully and make context-based decisions.

I think the IASSIST community is in a position to lead on this front, as it is intimately familiar with issues of disclosure risk. Just last spring, the 2012 IASSIST conference included a panel on confidentiality, privacy and security. IASSIST has a special interest group on Human Subjects Review Committees and Privacy and Confidentiality in Research. Various IASSIST members have been involved with heroic efforts to create solutions (e.g., via the DDI Alliance, UKDA and ICPSR protocols) and educate about the issue (e.g., ICPSR webinar, ICPSR summer course, and MANTRA module). A recent panel at the International Data Curation Conference in Amsterdam showcased IASSIST members’ strategies for dealing with this issue (see my reflections about the panel).

It might be the case that STEM is leading the push for open data, but these disciplines are increasingly confronted with problems of re-identification, while the private sector is increasingly being scrutinized for its practices (see this on “data hops”). The social (and, of course, medical) sciences have a well-developed regulatory framework around the issue of research ethics that many of us have been steeped in. Government agencies have their own approaches and standards (see the recent report by the U.S. Government Accountability Office). IASSIST can provide a bridge; we have the opportunity to help define the conversation and offer some solutions.

In search of: Best practice for code repositories?

I was asked by a colleague about organized efforts within the economics community to develop or support repositories of code for research.  Her experience was with the astrophysics world which apparently has several and she was wondering what could be learned from another academic community.  So I asked a non-random sample of technical economists with whom I work, and then expanded the question to cover all of social sciences and posed the question to the IASSIST community. 

In a nutshell, the answer seems to be “nope, nothing organized across the profession”, even with the profession very broadly defined.  The general consensus for both the economics world and the more general social science community was that there was some chaos mixed with a little schizophrenia. I was told there are instances of such repositories, but they were described to me as “isolated attempts”, such as this one by Volker Wieland:  http://www.macromodelbase.com/.  Some folks mentioned repositories that were package or language based, such as R modules or SAS code from the SAS-L list or online at sascommunity.org.

Many people pointed out that there are more repositories being associated with journals so that authors can (or are required to) submit their data and code when submitting a paper for publication. Several responses touched on this issue of replication, which is the impetus for most journal requirements, including one that pointed out a “replication archive” at Yale (http://isps.yale.edu/research/data).  I was also pointed to an interesting paper that questions whether such archives promote replicable research (http://www.pages.drexel.edu/~bdm25/cje.pdf), but that’s a discussion for another post.

By far, the most common reference I received was for the repositories associated with RePEc (Research Papers in Economics), which offers a broad range of services to the economic research community.  There you’ll find the IDEAS site (http://ideas.repec.org/) and the QM&RBC site with code for Dynamic General Equilibrium models (http://dge.repec.org/), both run by the St. Louis Fed.

I also heard from support folks who had tried to build a code repository for their departments and were disappointed by the lack of enthusiasm for the project. The general consensus is that economists would love to leverage other people’s code but don’t want to give away their proprietary models.  They should know there is no such thing as a free lunch! 

I did hear that project-specific repositories were found to be useful, but I think of those as collaboration tools rather than dissemination platforms.  That said, one economist did end his email to me with the following plea:  “lots of authors provide code on their websites, but there is no authoritative host. Will you start one please?”

/san/

IASSIST Quarterly (2011: Fall)

Sharing data and building information

With this issue (volume 35-3, 2011) of the IASSIST Quarterly (IQ) we return to the regular format of a collection of articles not within the same specialist subject area as we have seen in recent special issues of IQ. Naturally the three articles presented here are related to the IQ subject area in general, as in: assisting research with data, acquiring data from research, and making good use of the user community. This last topic could also be spelled “involvement”. The hope is that these articles will carry involvement to the IASSIST community, so that the gained knowledge can be shared and practised widely.


“Mind the gap” is a caveat to passengers on the London Underground. The authors of this article are Susan Noble, Celia Russell and Richard Wiseman, all affiliated with ESDS-International, hosted by Mimas at the University of Manchester in the UK. The ESDS, standing for “Economic and Social Data Service”, is extending its reach beyond the UK. In the article “Mind the Gap: Global Data Sharing” the authors look into how today’s research on the important topics of climate change, economic crises, migration and health requires cross-national data sharing. Clearly these topics are international (e.g. the weather or air pollution does not stop at national borders), but the article discusses how existing barriers prevent global data sharing. The paper is based on a presentation in a session on “Sharing data: High Rewards, Formidable Barriers” at the IASSIST 2009 conference. It demonstrates how even international data produced by intergovernmental organizations like the International Monetary Fund, the International Energy Agency, the OECD, the United Nations and the World Bank are often only available with an expensive subscription, presented in complex and incomprehensible tables, or accessible only through special interfaces; such barriers make international use of the data difficult. Because of missing metadata standards it is difficult to evaluate the quality of a dataset and to search for and locate the data resources required. The paper highlights the development of e-learning materials that can raise awareness and ease access to international data. In this case the example is e-learning for the “United Nations Millennium Development Goals”.


The second paper is also related to the sharing of data with an introduction to the international level. “The Research-Data-Centre in Research-Data-Centre Approach: A First Step Towards Decentralised International Data Sharing” is written by Stefan Bender and Jörg Heining from the Institute for Employment Research (IAB) in Nuremberg, Germany. In order to preserve the confidentiality of single entities, access to complete datasets is often restricted to monitored on-site analysis. Although off-site access is facilitated in other countries, Germany has relied on on-site security. However, an opportunity has been presented where Research Data Centre sites are placed at Statistical Offices around Germany, and also at a Michigan centre for demography. The article contains historical information on approaches and developments in other countries and has a special focus on the German solution. The project will gain experience in the complex balance between confidentiality and analysis, and the differences between national laws.


The paper by Stuart Macdonald from EDINA in Scotland originated as a poster session at the IASSIST 2010 conference. The name of the paper is “AddressingHistory: a Web2.0 community engagement tool and API”. The community consists of members within and outside academia, as local history groups and genealogists are using the software to enhance and combine data from historical Scottish Post Office Directories with large-scale historical maps. The background and technical issues are presented in the paper, which also looks into issues and perspectives of user generated content. The “crowdsourcing” tool did successfully generate engagement and there are plans for further development, such as upload and attachment of photos of people, buildings, and landmarks to enrich the collection.

Articles for the IQ are always very welcome. They can be papers from IASSIST conferences or other conferences and workshops, from local presentations, or papers especially written for the IQ. If you don’t have anything to offer right now, then please prepare yourself for the next IASSIST conference and start planning for participation in a session there. Chairing a conference session with the purpose of aggregating and integrating papers for a special issue of the IQ is much appreciated, as the information in the form of an IQ issue reaches many more people than the session participants and will be readily available on the IASSIST website at http://www.iassistdata.org.

Authors are very welcome to take a look at the instructions and layout:
http://iassistdata.org/iq/instructions-authors


Authors can also contact me via e-mail: kbr@sam.sdu.dk. Should you be interested in compiling a special issue for the IQ as guest editor(s) I will also be delighted to hear from you.

 

Karsten Boye Rasmussen

December 2011

86 helpful tools for the data professional PLUS 45 bonus tools

I have been working on this (mostly) annotated collection of tools and articles that I believe would be of help to both the data dabbler and professional. If you are a data scientist, data analyst or data dummy, chances are there is something in here for you. I included a list of tools, such as programming languages and web-based utilities, data mining resources, some prominent organizations in the field, repositories where you can play with data, events you may want to attend and important articles you should take a look at.

The second segment (BONUS!) of the list includes a number of art and design resources that infographic designers might like, including color palette generators and image searches. There are also some invisible web resources (for when you're looking for something data-related on Google and not finding it) and metadata resources so you can appropriately curate your data. This is in no way a complete list, so please contact me here with any suggestions!

Data Tools

  1. Google Refine - A power tool for working with messy data (formerly Freebase Gridworks)
  2. The Overview Project - Overview is an open-source tool to help journalists find stories in large amounts of data, by cleaning, visualizing and interactively exploring large document and data sets. Whether from government transparency initiatives, leaks or Freedom of Information requests, journalists are drowning in more documents than they can ever hope to read.
  3. Refine, reuse and request data | ScraperWiki - ScraperWiki is an online tool to make acquiring useful data simpler and more collaborative. Anyone can write a screen scraper using the online editor. In the free version, the code and data are shared with the world. Because it's a wiki, other programmers can contribute to and improve the code.
  4. Data Curation Profiles - This website is an environment where academic librarians of all kinds, special librarians at research facilities, archivists involved in the preservation of digital data, and those who support digital repositories can find help, support and camaraderie in exploring avenues to learn more about working with research data and the use of the Data Curation Profiles Tool.
  5. Google Chart Tools - Google Chart Tools provide a perfect way to visualize data on your website. From simple line charts to complex hierarchical tree maps, the chart galley provides a large number of well-designed chart types. Populating your data is easy using the provided client- and server-side tools.
  6. 22 free tools for data visualization and analysis
  7. The R Journal - The R Journal is the refereed journal of the R project for statistical computing. It features short to medium length articles covering topics that might be of interest to users or developers of R.
  8. CS 229: Machine Learning - A widely referenced course by Professor Andrew Ng, CS 229: Machine Learning provides a broad introduction to machine learning and statistical pattern recognition. Topics include supervised learning, unsupervised learning, learning theory, reinforcement learning and adaptive control. Recent applications of machine learning, such as to robotic control, data mining, autonomous navigation, bioinformatics, speech recognition, and text and web data processing are also discussed.
  9. Google Research Publication: BigTable - Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
  10. Scientific Data Management - An introduction.
  11. Natural Language Toolkit - Open source Python modules, linguistic data and documentation for research and development in natural language processing and text analytics, with distributions for Windows, Mac OSX and Linux.
  12. Beautiful Soup - Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. (A minimal usage sketch appears after this list.)
  13. Mondrian: Pentaho Analysis - Pentaho Open source analysis OLAP server written in Java. Enabling interactive analysis of very large datasets stored in SQL databases without writing SQL.
  14. The Comprehensive R Archive Network - R is `GNU S', a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc. Please consult the R project homepage for further information. CRAN is a network of ftp and web servers around the world that store identical, up-to-date, versions of code and documentation for R. Please use the CRAN mirror nearest to you to minimize network load.
  15. DataStax - Software, support, and training for Apache Cassandra.
  16. Machine Learning Demos
  17. Visual.ly - Infographics & Visualizations. Create, Share, Explore
  18. Google Fusion Tables - Google Fusion Tables is a modern data management and publishing web application that makes it easy to host, manage, collaborate on, visualize, and publish data tables online.
  19. Tableau Software - Fast Analytics and Rapid-fire Business Intelligence from Tableau Software.
  20. WaveMaker - WaveMaker is a rapid application development environment for building, maintaining and modernizing business-critical Web 2.0 applications.
  21. Visualization: Annotated Time Line - Google Chart Tools - Google Code An interactive time series line chart with optional annotations. The chart is rendered within the browser using Flash.
  22. Visualization: Motion Chart - Google Chart Tools - Google Code A dynamic chart to explore several indicators over time. The chart is rendered within the browser using Flash.
  23. PhotoStats Create gorgeous infographics about your iPhone photos, with Photostats.
  24. Ionz Ionz will help you craft an infographic about yourself.
  25. chart builder Powerful tools for creating a variety of charts for online display.
  26. Creately Online diagramming and design.
  27. Pixlr Editor A powerful online photo editor.
  28. Google Public Data Explorer The Google Public Data Explorer makes large datasets easy to explore, visualize and communicate. As the charts and maps animate over time, the changes in the world become easier to understand. You don't have to be a data expert to navigate between different views, make your own comparisons, and share your findings.
  29. Fathom Fathom Information Design helps clients understand and express complex data through information graphics, interactive tools, and software for installations, the web, and mobile devices. Led by Ben Fry. Enough said!
  30. healthymagination | GE Data Visualization Visualizations that advance the conversation about issues that shape our lives, and so we encourage visitors to download, post and share these visualizations.
  31. ggplot2 ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.
  32. Protovis Protovis composes custom views of data with simple marks such as bars and dots. Unlike low-level graphics libraries that quickly become tedious for visualization, Protovis defines marks through dynamic properties that encode data, allowing inheritance, scales and layouts to simplify construction. Protovis is free and open-source, provided under the BSD License. It uses JavaScript and SVG for web-native visualizations; no plugin required (though you will need a modern web browser)! Although programming experience is helpful, Protovis is mostly declarative and designed to be learned by example.
  33. d3.js D3.js is a small, free JavaScript library for manipulating documents based on data.
  34. MATLAB - The Language Of Technical Computing MATLAB® is a high-level language and interactive environment that enables you to perform computationally intensive tasks faster than with traditional programming languages such as C, C++, and Fortran.
  35. OpenGL - The Industry Standard for High Performance Graphics OpenGL.org is a vendor-independent and organization-independent web site that acts as one-stop hub for developers and consumers for all OpenGL news and development resources. It has a very large and continually expanding developer and end-user community that is very active and vested in the continued growth of OpenGL.
  36. Google Correlate Google Correlate finds search patterns which correspond with real-world trends.
  37. Revolution Analytics - Commercial Software & Support for the R Statistics Language Revolution Analytics delivers advanced analytics software at half the cost of existing solutions. By building on open source R—the world’s most powerful statistics software—with innovations in big data analysis, integration and user experience, Revolution Analytics meets the demands and requirements of modern data-driven businesses.
  38. 22 Useful Online Chart & Graph Generators
  39. The Best Tools for Visualization Visualization is a technique to graphically represent sets of data. When data is large or abstract, visualization can help make the data easier to read or understand. There are visualization tools for search, music, networks, online communities, and almost anything else you can think of. Whether you want a desktop application or a web-based tool, there are many specific tools available on the web that let you visualize all kinds of data.
  40. Visual Understanding Environment The Visual Understanding Environment (VUE) is an Open Source project based at Tufts University. The VUE project is focused on creating flexible tools for managing and integrating digital resources in support of teaching, learning and research. VUE provides a flexible visual environment for structuring, presenting, and sharing digital information.
  41. Bime - Cloud Business Intelligence | Analytics & Dashboards Bime is a revolutionary approach to data analysis and dashboarding. It allows you to analyze your data through interactive data visualizations and create stunning dashboards from the Web.
  42. Data Science Toolkit A collection of data tools and open APIs curated by our own Pete Warden. You can use it to extract text from a document, learn the political leanings of a particular neighborhood, find all the names of people mentioned in a text and more.
  43. BuzzData BuzzData lets you share your data in a smarter, easier way. Instead of juggling versions and overwriting files, use BuzzData and enjoy a social network designed for data.
  44. SAP - SAP Crystal Solutions: Simple, Affordable, and Open BI Tools for Everyday Use
  45. Project Voldemort
  46. ggplot. had.co.nz
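
Since several items above (ScraperWiki, Beautiful Soup) revolve around screen scraping, here is a minimal sketch of that pattern in Python with Beautiful Soup. The bs4 import name and the toy HTML snippet are assumptions for illustration; check the library's own documentation for current usage.

    # Minimal screen-scraping sketch with Beautiful Soup (toy HTML, no network).
    from bs4 import BeautifulSoup

    html = """
    <html><body>
      <h1>Data releases</h1>
      <ul>
        <li><a href="/data/2011-survey.csv">2011 survey</a></li>
        <li><a href="/data/2010-survey.csv">2010 survey</a></li>
      </ul>
    </body></html>
    """

    soup = BeautifulSoup(html, "html.parser")

    # Pull out every link's text and target, the core of most scraping tasks.
    for link in soup.find_all("a"):
        print(link.get_text(), "->", link.get("href"))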

Data Mining

  1. Weka - Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is open source software issued under the GNU General Public License. (A minimal sketch of this kind of classification task appears after this list.)
  2. PSPP - PSPP is a program for statistical analysis of sampled data. It is a Free replacement for the proprietary program SPSS, and appears very similar to it with a few exceptions. The most important of these exceptions are that there are no “time bombs”; your copy of PSPP will not “expire” or deliberately stop working in the future. Neither are there any artificial limits on the number of cases or variables which you can use. There are no additional packages to purchase in order to get “advanced” functions; all functionality that PSPP currently supports is in the core package. PSPP can perform descriptive statistics, T-tests, linear regression and non-parametric tests. Its backend is designed to perform its analyses as fast as possible, regardless of the size of the input data. You can use PSPP with its graphical interface or the more traditional syntax commands.
  3. Rapid-I - Rapid-I provides software, solutions, and services in the fields of predictive analytics, data mining, and text mining. The company concentrates on automatic intelligent analyses on a large-scale base, i.e. for large amounts of structured data like database systems and unstructured data like texts. The open-source data mining specialist Rapid-I enables other companies to use leading-edge technologies for data mining and business intelligence. The discovery and leverage of unused business intelligence from existing data enables better informed decisions and allows for process optimization. The main product of Rapid-I, the data analysis solution RapidMiner, is the world-leading open-source system for knowledge discovery and data mining. It is available as a stand-alone application for data analysis and as a data mining engine which can be integrated into your own products. By now, thousands of applications of RapidMiner in more than 30 countries give their users a competitive edge. Among the users are well-known companies such as Ford, Honda, Nokia, Miele, Philips, IBM, HP, Cisco, Merrill Lynch, BNP Paribas, Bank of America, mobilkom austria, Akzo Nobel, Aureus Pharma, PharmaDM, Cyprotex, Celera, Revere, LexisNexis, Mitre and many medium-sized businesses benefitting from the open-source business model of Rapid-I.
  4. R Project - R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity. One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control. R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
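
As a minimal sketch of the kind of supervised classification task the tools above support, the following uses Python's scikit-learn, a library not listed in this collection and chosen here only as a compact illustration, on its bundled iris sample dataset.

    # Minimal classification sketch with scikit-learn (illustration only).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Load a small, well-known sample dataset.
    X, y = load_iris(return_X_y=True)

    # Hold out a test set so accuracy is measured on unseen cases.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # Fit a decision tree, the sort of learner the tools above also provide.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)

    print("Held-out accuracy:", model.score(X_test, y_test))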

Organizations

  1. Data.gov
  2. SDM group at LBNL
  3. Open Archives Initiative
  4. Code for America | A New Kind of Public Service
  5. The # DataViz Daily
  6. Institute for Advanced Analytics | North Carolina State University | Professor Michael Rappa · MSA Curriculum
  7. BuzzData | Blog, 25 great links for data-lovin' journalists
  8. MetaOptimize - Home - Machine learning, natural language processing, predictive analytics, business intelligence, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization
  9. had.co.nz
  10. Measuring Measures - Measuring Measures

Repositories

  1. Repositories | DataCite
  2. Data | The World Bank
  3. Infochimps Data Marketplace + Commons: Download Sell or Share Databases, statistics, datasets for free | Infochimps
  4. Factual Home - Factual
  5. Flowing Media: Your Data Has Something To Say
  6. Chartsbin
  7. Public Data Explorer
  8. StatPlanet
  9. ManyEyes
  10. 25+ more ways to bring data into R

Events

  1. Welcome | Visweek 2011
  2. O'Reilly Strata: O'Reilly Conferences
  3. IBM Information On Demand 2011 and Business Analytics Forum
  4. Data Scientist Summit 2011
  5. IBM Virtual Performance 2011
  6. Wolfram Data Summit 2011—Conference on Data Repositories and Ideas
  7. Big Data Analytics: Mobile, Social and Web

Articles

  1. Data Science: a literature review | (R news & tutorials)
  2. What is "Data Science" Anyway?
  3. Hal Varian on how the Web challenges managers - McKinsey Quarterly - Strategy - Innovation
  4. The Three Sexy Skills of Data Geeks « Dataspora
  5. Rise of the Data Scientist
  6. dataists » A Taxonomy of Data Science
  7. The Data Science Venn Diagram « Zero Intelligence Agents
  8. Revolutions: Growth in data-related jobs
  9. Building data startups: Fast, big, and focused - O'Reilly Radar

BONUS! Art Design

  1. Periodic Table of Typefaces
  2. Color Scheme Designer 3
  3. Color Palette Generator Generate A Color Palette For Any Image
  4. COLOURlovers
  5. Colorbrewer: Color Advice for Maps

Image Searches

  1. American Memory from the Library of Congress The home page for the American Memory Historical Collections from the Library of Congress. American Memory provides free access to historical images, maps, sound recordings, and motion pictures that document the American experience. American Memory offers primary source materials that chronicle historical events, people, places, and ideas that continue to shape America.
  2. Galaxy of Images | Smithsonian Institution Libraries
  3. Flickr Search
  4. 50 Websites For Free Vector Images Download
  5. Design weblog for designers, bloggers and tech users. Covering useful tools, tutorials, tips and inspirational photos.
  6. Images Google Images. The most comprehensive image search on the web.
  7. Trade Literature - a set on Flickr
  8. Compfight / A Flickr Search Tool
  9. morgueFile free photos for creatives by creatives
  10. stock.xchng - the leading free stock photography site
  11. The Ultimate Collection Of Free Vector Packs - Smashing Magazine
  12. How to Create Animated GIFs Using Photoshop CS3 - wikiHow
  13. IAN Symbol Libraries (Free Vector Symbols and Icons) - Integration and Application Network
  14. Usability.gov
  15. best icons
  16. Iconspedia
  17. IconFinder
  18. IconSeeker

Invisible Web

  1. 10 Search Engines to Explore the Invisible Web Like the header says...
  2. Scirus - for scientific information The most comprehensive scientific research tool on the web. With over 410 million scientific items indexed at last count, it allows researchers to search for not only journal content but also scientists' homepages, courseware, pre-print server material, patents and institutional repository and website information.
  3. TechXtra: Engineering, Mathematics, and Computing TechXtra is a free service which can help you find articles, books, the best websites, the latest industry news, job announcements, technical reports, technical data, full text eprints, the latest research, thesis & dissertations, teaching and learning resources and more, in engineering, mathematics and computing.
  4. Welcome to INFOMINE: Scholarly Internet Resource Collections INFOMINE is a virtual library of Internet resources relevant to faculty, students, and research staff at the university level. It contains useful Internet resources such as databases, electronic journals, electronic books, bulletin boards, mailing lists, online library card catalogs, articles, directories of researchers, and many other types of information.
  5. The WWW Virtual Library The WWW Virtual Library (VL) is the oldest catalogue of the Web, started by Tim Berners-Lee, the creator of HTML and of the Web itself, in 1991 at CERN in Geneva. Unlike commercial catalogues, it is run by a loose confederation of volunteers, who compile pages of key links for particular areas in which they are expert; even though it isn't the biggest index of the Web, the VL pages are widely recognised as being amongst the highest-quality guides to particular sections of the Web.
  6. Intute Intute is a free online service that helps you to find web resources for your studies and research. With millions of resources available on the Internet, it can be difficult to find useful material. We have reviewed and evaluated thousands of resources to help you choose key websites in your subject. The Virtual Training Suite can also help you develop your Internet research skills through tutorials written by lecturers and librarians from universities across the UK.
  7. CompletePlanet - Discover over 70,000+ databases and specially search engines There are hundreds of thousands of databases that contain Deep Web content. CompletePlanet is the front door to these Deep Web databases on the Web and to the thousands of regular search engines — it is the first step in trying to find highly topical information. By tracing through CompletePlanet's subject structure or searching Deep Web sites, you can go to various topic areas, such as energy or agriculture or food or medicine, and find rich content sites not accessible using conventional search engines. BrightPlanet initially developed the CompletePlanet compilation to identify and tap into many hundreds and thousands of search sources simultaneously to automatically deliver high-quality content to its corporate and enterprise customers. It then decided to make CompletePlanet available as a public service to the Internet search public.
  8. Infoplease: Encyclopedia, Almanac, Atlas, Biographies, Dictionary, Thesaurus. Information Please has been providing authoritative answers to all kinds of factual questions since 1938—first as a popular radio quiz show, then starting in 1947 as an annual almanac, and since 1998 on the Internet at www.infoplease.com. Many things have changed since 1938, but not our dedication to providing reliable information, in a way that engages and entertains.
  9. DeepPeep: discover the hidden web DeepPeep is a search engine specialized in Web forms. The current beta version currently tracks 45,000 forms across 7 domains. DeepPeep helps you discover the entry points to content in Deep Web (aka Hidden Web) sites, including online databases and Web services. Advanced search allows you to perform more specific queries. Besides specifying keywords, you can also search for specific form element labels, i.e., the description of the form attributes.
  10. IncyWincy: The Invisible Web Search Engine IncyWincy is a showcase of Net Research Server (NRS) 5.0, a software product that provides a complete search portal solution, developed by LoopIP LLC. LoopIP licenses the NRS engine and provides consulting expertise in building search solutions.

Metadata

  1. Metadata Object Description Schema: MODS (Library of Congress) and Outline of elements and attributes in MODS version 3.4 - This document contains a listing of elements and their related attributes in MODS Version 3.4 with values or value sources where applicable. It is an "outline" of the schema. Items highlighted in red indicate changes made to MODS in Version 3.4. All top-level elements and all attributes are optional, but you must have at least one element. Subelements are optional, although in some cases you may not have empty containers. Attributes are not in a mandated sequence and not repeatable (per XML rules). "Ordered" below means the subelements must occur in the order given. Elements are repeatable unless otherwise noted. "Authority" attributes are either followed by codes for authority lists (e.g., iso639-2b) or "see" references that link to documents that contain codes for identifying authority lists. For additional information about any MODS elements (version 3.4 elements will be added soon), please see the MODS User Guidelines.
  2. wiki.dbpedia.org : About DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data. We hope this will make it easier for the amazing amount of information in Wikipedia to be used in new and interesting ways, and that it might inspire new mechanisms for navigating, linking and improving the encyclopaedia itself.
  3. Semantic Web - W3C In addition to the classic “Web of documents” W3C is helping to build a technology stack to support a “Web of data,” the sort of data you find in databases. The ultimate goal of the Web of data is to enable computers to do more useful work and to develop systems that can support trusted interactions over the network. The term “Semantic Web” refers to W3C’s vision of the Web of linked data. Semantic Web technologies enable people to create data stores on the Web, build vocabularies, and write rules for handling data. Linked data are empowered by technologies such as RDF, SPARQL, OWL, and SKOS.
  4. RDA: Resource Description & Access | www.rdatoolkit.org Designed for the digital world and an expanding universe of metadata users, RDA: Resource Description and Access is the new, unified cataloging standard. The online RDA Toolkit subscription is the most effective way to interact with the new standard. More on RDA.
  5. Cataloging Cultural Objects Cataloging Cultural Objects: A Guide to Describing Cultural Works and Their Images (CCO) is a manual for describing, documenting, and cataloging cultural works and their visual surrogates. The primary focus of CCO is art and architecture, including but not limited to paintings, sculpture, prints, manuscripts, photographs, built works, installations, and other visual media. CCO also covers many other types of cultural works, including archaeological sites, artifacts, and functional objects from the realm of material culture.
  6. Library of Congress Authorities (Search for Name, Subject, Title and Name/Title) Using Library of Congress Authorities, you can browse and view authority headings for Subject, Name, Title and Name/Title combinations; and download authority records in MARC format for use in a local library system. This service is offered free of charge.
  7. Search Tools and Databases (Getty Research Institute) Use these search tools to access library materials, specialized databases, and other digital resources.
  8. Art & Architecture Thesaurus (Getty Research Institute) Learn about the purpose, scope and structure of the AAT. The AAT is an evolving vocabulary, growing and changing thanks to contributions from Getty projects and other institutions. Find out more about the AAT's contributors.
  9. Getty Thesaurus of Geographic Names (Getty Research Institute) Learn about the purpose, scope and structure of the TGN. The TGN is an evolving vocabulary, growing and changing thanks to contributions from Getty projects and other institutions. Find out more about the TGN's contributors.
  10. DCMI Metadata Terms
  11. The Digital Object Identifier System
  12. The Federal Geographic Data Committee — Federal Geographic Data Committee

10 significant visualisation developments

Interesting collection of visualization developments:

 

10 significant visualisation developments: January to June 2011 Visualising Data (July 7th, 2011)

IASSIST Quarterly (IQ) volume 34-2 now on the web

The new issue of the IASSIST Quarterly is now available on the web. This is volume 34 (number 2, 2010).

 http://iassistdata.org/iq/issue/34/2

The layout has changed. We hope you’ll enjoy the new style presented. It seems a more modern format and is better suited to the PDF presentation on the web. Walter Piovesan – our publication officer – had a biking accident. To show that nothing is so bad that it is not good for something, Walter used his recovery time to redesign the IQ. Furthermore, Walter is the person in charge of the upcoming 2011 IASSIST conference, so he is a busy guy. And I’m happy to say that Walter should be fit for the conference.

This issue of the IQ features the following papers:

Rein Murakas and Andu Rämmer from the Estonian Social Science Data Archive (ESSDA) at the University of Tartu describe in their paper "Social Science Data Archiving and Needs of the Public Sector: the Case of Estonia" how the archive had a historical background in the empirical research of the Soviet Union.

From the historical background we move to web 2.0 in a paper by Angela Hariche, Estelle Loiseau and Philippa Lysaght on "Wikiprogress and Wikigender: a way forward for online collaboration". The authors are working at the OECD and the paper's statement is that "collaborative platforms such as wikis along with advances in data visualisation are a way forward for the collection, analysis and dissemination of data across countries and societies".

The third paper addresses an issue of central importance for most data archives. The question concerns balancing data confidentiality and the legitimate requirements of data users. This is a key problem of the Secure Data Service (SDS) at the UK Data Archive, University of Essex. The paper "Secure Data Service: an improved access to disclosive data" by Reza Afkhami, Melanie Wright, and Mus Ahmet shows how the SDS will allow researchers remote access to secure servers at the UK Data Archive.

The last article has the title "A user-driven and flexible procedure for data linking". The authors are Cees van der Eijk and Eliyahu V. Sapir from the Methods and Data Institute at the University of Nottingham. The data linking relates to research combining several different datasets. The implementation is developed for the PIREDEU project in comparative electoral research. The authors are combining traditional survey data with data from party manifestos and state-level data.

Articles for the IQ are always very welcome. They can be papers from IASSIST or other conferences, from local presentations or papers directly written for the IQ.

Notice that chairing a conference session with the purpose of aggregating and integrating papers for a special issue of the IQ is much appreciated, as the information reaches many more people than the session participants and will be readily available on the IASSIST website.

Authors are very welcome to take a look at the description for layout and sending papers to the IQ:

http://iassistdata.org/iq/instructions-authors

Authors can also contact me via e-mail: kbr @ sam.sdu.dk. Should you be interested in compiling a special issue for the IQ as guest editor or editors, I will also be delighted to hear from you.

Karsten Boye Rasmussen, editor

Wrangle, Refine, and Represent (Data Visualization Tools from the CAR Conference)

I wanted to share a blog post from our local Data and GIS blog that may be of interest to the IASSIST community.  Each of the tools varies in its focus and applicability for data work, but they might be helpful for various data tasks focused on cleaning and representing data online.

http://blogs.library.duke.edu/data/2011/03/14/wrangle-refine-and-represent/
