The ICE-D database project
The point of this posting is to describe a database project that I have been working on for the past few months.
There has been a lot of talk over the past ten years or so about building a “community database” for cosmogenic-nuclide data. The motivation is pretty obvious. There exists an enormous inventory of cosmogenic-nuclide measurements that are potentially useful for synoptic studies of paleoclimate, ice sheet change, and Earth surface processes. However, because of continual research into production rate systematics, how we calculate exposure ages, erosion rates, etc. changes all the time. So if you want to compare exposure ages from papers published at different times, you have to go back and find all the raw observations in all the papers and use them to recalculate the exposure ages according to a consistent calibration data set, scaling scheme, and age calculation method. This is a big pain in the neck and, in my view more importantly, it leads to unreviewable papers. There is simply no way to verify in detail the large and complex exposure-age compilation spreadsheets in papers that apply compilations of this sort to address large-scale questions.
It would be oh-so-much easier if you could have (i) a single database that stored only raw observations and was generally believed to be fairly accurate, coupled with (ii) some kind of software to dynamically calculate exposure ages with whatever the currently accepted calibration data set and scaling scheme is. This is not a particularly unique problem, or a unique solution. Everyone pretty much knows what is needed here.
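To make the idea concrete, here is a minimal sketch of "store raw observations, compute ages on demand." It is not the code behind the online calculator. It assumes the simplest possible case (Be-10, no erosion, no inheritance, constant production, no elevation or latitude scaling), and the sample names, concentrations, and production rates are all made up for illustration.

```python
import math

# Be-10 decay constant (1/yr), from a half-life of about 1.387 Myr.
BE10_DECAY = math.log(2) / 1.387e6


def exposure_age(conc, prod_rate, decay=BE10_DECAY):
    """Simple exposure age in years.

    Solves N = (P / lambda) * (1 - exp(-lambda * t)) for t, i.e. assumes
    no erosion, no inheritance, and a constant production rate.

    conc      -- measured nuclide concentration (atoms/g)
    prod_rate -- local surface production rate (atoms/g/yr)
    """
    x = conc * decay / prod_rate
    if x >= 1.0:
        raise ValueError("concentration is at or above saturation")
    return -math.log1p(-x) / decay


# Hypothetical raw observations: (sample name, Be-10 atoms/g).
# Only these are stored; ages are never stored.
raw = [("SAMPLE-A", 1.0e5), ("SAMPLE-B", 2.5e5)]


def recalculate(samples, prod_rate):
    """Recompute every age from the raw observations."""
    return {name: exposure_age(conc, prod_rate) for name, conc in samples}


# Illustrative production rates only, standing in for an "old" and a
# "new" calibration. Real values depend on the site and scaling scheme.
old = recalculate(raw, prod_rate=5.0)
new = recalculate(raw, prod_rate=4.0)

for name, _ in raw:
    print(f"{name}: {old[name]:.0f} yr (old) -> {new[name]:.0f} yr (new)")
```

The raw list never changes. When the calibration changes, only the `prod_rate` argument does, and every age in the compilation updates together.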
The existing online exposure age calculator that has been in use since 2008 sort of provides this capability: one can maintain a spreadsheet of data extracted from old papers and paste it all into the online calculator, thus generating a compilation of ages or erosion rates that are computed in a consistent way. It is important that this capability exists, and it is much better than nothing. A lot of projects and papers that have collated large amounts of cosmogenic-nuclide data to address large-scale questions have only been possible because of this aspect of the online calculator. But this is silly, right? Why should everyone wade through the supplementary data for the same old papers and then maintain their own separate, mutually sort-of-inconsistent spreadsheets of the same thing? Although the idea that this is a valuable exercise for the student is not completely without merit, in a general sense it is an inefficient use of resources that maximizes both the number of opportunities for error and the amount of work needed every time new data or production rate calibrations appear.
So for those of you who are interested in applying the vast number of existing cosmogenic-nuclide measurements to address synoptic questions in paleoclimate and Earth surface processes, the vision is pretty clear. It is obvious what is needed. There has been plenty of discussion of what and how. However, basically nothing has happened. The reasons for this are beyond the scope of the current post, but are fairly routine in the overall area of scientific data discovery and management, and are well documented in both Earth science and other fields.
The point of the current post is that even though I’ve certainly made my contribution (or non-contribution) to past inaction, I am tired of waiting for something to happen. Thus, I have built exactly this sort of database for cosmogenic-nuclide exposure-age data for Antarctica. Antarctica is an interesting case because, first, cosmogenic-nuclide exposure-age data from ice-free parts of Antarctica are by far the most extensive data set we have for reconstructing past ice thickness change in Antarctica and thus Antarctica’s contribution to past sea level change, so it is really, legitimately, important to be able to look at these data together at continental scale. Second, Antarctic exposure-age data have been collected over more than two decades, spanning innumerable changes in how we calculate production rates and exposure ages, so they are a great example of an intercomparison mess. And third, there are not that many exposure-age measurements for Antarctica (two or three thousand measurements in total), so the data set is manageable in scale.
The project is located here:
In addition, Brad Herried of the Polar Geospatial Center at the University of Minnesota has built a geographic interface at this address:
For those of you who (i) are personally involved in collecting this type of data in Antarctica, and/or (ii) have access to the PGC high-resolution imagery, the map interface is just awesome. It’s fascinating and immersive to be able to look at exactly where all the samples collected in previous work are throughout the entire continent. On the other hand, it suddenly makes the continent feel a lot smaller.
A few important features of the project:
1. Functionally, the data are stored in a MySQL database running on the Google Cloud SQL service. The web pages are served by Python code, running on Google App Engine, that extracts data from the database and interacts programmatically with the web service API for the online exposure age calculator. This all employs commonly used software tools, and none of it is rocket science.
2. The data that are in there fall into two main categories.
One, the majority of published cosmogenic-nuclide data for Antarctica are in there and indexed according to publication. I’m still working on entering some published data, in particular older data from the Dry Valleys area that are not as well documented in some ways as most newer data. So there are some gaps there that are being gradually filled. In general, data are not very complete for the Dry Valleys, but they are fairly complete — at least as regards published data — from elsewhere.
Two, every single bit of exposure-age data I have ever collected in Antarctica, published or unpublished, is there. I should qualify that by saying that there are some data collected in collaborative projects, where I didn’t collect the samples and I was mainly providing noble gas measurements, that are not there. Yet. But I am working on that. The point is that there are tons of unpublished data in there. I feel great about this — because it is no fun to feel guilty about sitting on a huge hoard of unpublished data that were collected at public expense. Also notable is that the samples collected by me are attached to extremely comprehensive background data, including a lot (= thousands) of photos of samples and sample sites. Here’s an example. Basically, you now have all the information you need to tell me that I collected the wrong sample in the field.
3. The data that are not there fall into two categories:
One, older papers that I have not yet gotten around to extracting data from, as noted above.
Two, several large hoards of unpublished exposure-age data collected by researchers who are actively working in Antarctica. Some of the most important data collected in recent years, which should be both published and represented here, are not. Those responsible know who they are.
4. Everything that is in there that I did not collect personally is, to the best of my knowledge, an accurate representation of what was published in source papers. But I am sure there are many publication and transcription errors that exceed the scope of my knowledge. One goal of this project that is not yet implemented is an editing interface, so that those who know about errors in existing data and/or new data can contribute to improving the overall product. Really what this should be is not a means of data archiving but something more like a structured data wiki. It’s not there yet, but that’s the goal.
If you do have specific knowledge of data that aren’t there or data that are there but are incorrect, let me know.
5. I’m still working on the interface code and I will be for the foreseeable future. Possibly for years. The web interface may, and probably will, be broken at any time. It’s a mess.
6. It’s a bit slow. Working on that. Be patient.
7. This is not a manifestation of any larger project. There were no meetings. Except for Brad Herried at PGC who built the map interface, no one else was involved to any significant extent, although several people helped by reminding me about papers/studies that I forgot or never knew about. Except for the resources contributed by PGC (which is funded by the NSF Antarctic and Arctic research programs) to build and run the map interface, there is no specific funding source. In fact, I am paying Google $10/month for SQL database hosting, so if you think this project is worthwhile you are welcome to contribute. In addition, this means that if you send me an email with the sentence “Greg, I think you should add a feature that does…something,” then the next sentence of the email should read “I am really good at writing Python code and I would very much like to help you make that happen.”
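For anyone wondering what the storage side described in item 1 above might look like, here is a rough sketch. It is not the actual ICE-D schema or code. It uses Python's built-in `sqlite3` as a stand-in for MySQL on Cloud SQL so that it runs anywhere, and the table name, column names, sample rows, and text layout are all hypothetical. In particular, the text layout is not the real input format of the online calculator.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database on Cloud SQL.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: raw observations only. There is no age column,
# because ages are computed on demand and never stored.
conn.execute(
    """CREATE TABLE samples (
           name TEXT PRIMARY KEY,
           publication TEXT,
           lat REAL,
           lon REAL,
           elev_m REAL,
           nuclide TEXT,
           conc_atoms_g REAL,
           conc_err_atoms_g REAL
       )"""
)

# Made-up rows for illustration; these are not real measurements.
rows = [
    ("SAMPLE-A", "Hypothetical et al. 2010", -77.5, 161.0, 800.0, "Be-10", 1.0e5, 3.0e3),
    ("SAMPLE-B", "Hypothetical et al. 2010", -77.6, 161.2, 950.0, "Be-10", 2.5e5, 6.0e3),
    ("SAMPLE-C", "Imaginary and Fictional 2014", -85.0, -175.0, 1500.0, "Be-10", 4.0e5, 9.0e3),
]
conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)


def observations_for(conn, publication):
    """Return the raw observations indexed under one publication."""
    cur = conn.execute(
        "SELECT name, lat, lon, elev_m, nuclide, conc_atoms_g, conc_err_atoms_g "
        "FROM samples WHERE publication = ? ORDER BY name",
        (publication,),
    )
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]


def to_calculator_text(obs):
    """Format observations as one whitespace-separated line per sample.

    Hypothetical layout, chosen only to show the idea of handing raw
    observations to an external age calculation service.
    """
    return "\n".join(
        f"{o['name']} {o['lat']} {o['lon']} {o['elev_m']} "
        f"{o['nuclide']} {o['conc_atoms_g']:.0f} {o['conc_err_atoms_g']:.0f}"
        for o in obs
    )


obs = observations_for(conn, "Hypothetical et al. 2010")
print(to_calculator_text(obs))
```

The same query function could feed a web page or a script doing a synoptic analysis, because both start from the same raw observations.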
Finally, to summarize what’s been accomplished or not accomplished:
1. This project solves the problem of making a dynamic cosmogenic-nuclide exposure-age database. In principle the idea of dynamically calculating exposure ages from stored raw observations isn’t complicated, but no one has done it. Now someone has. It’s not nearly as smooth as, for example, the Neotoma database, but it works. Progress.
2. This shows what is possible and what we should have done years ago. There is no obstacle to doing this with basin-scale erosion-rate data, alpine glacial moraine data, or anything else. With modern cloud computing services, it is easy to do this cheaply, efficiently, and scalably. It really is.
3. I haven’t discussed this in any detail above, but it is possible to interact with the database programmatically to do synoptic analyses. In fact, that’s the whole point. Thus, as the database itself and the age calculation methods evolve and improve, the synoptic analyses that stem from the database can automatically evolve as well. This is important. More later on this.
4. The geographic interface is awesome. Awesome.
5. I am not there yet in terms of making this more of a data wiki than a data archive. That’s an important idea, and it’s what is going to make this actually a useful tool, but it’s not yet implemented at all.