Another database project — for production rate calibration data
In a previous post I described the ICE-D: Antarctica project to collate exposure-age data from Antarctica. This post is about something similar that I am working on with Pierre-Henri Blard and his colleagues at CRPG. Basically, we’ve now done the same thing with production rate calibration data, which is more useful and more important for several reasons that I will now enumerate in too much detail. If you can’t wait that long, here is the link: http://calibration.ice-d.org
First, some review of what is going on here. To compute exposure ages from cosmogenic-nuclide concentrations, we need to know the production rate of the nuclide in question. We determine this in two steps. We start with some sort of “scaling scheme,” which is just a model of how production rates vary with location, elevation, and time. Then we fit that scaling scheme to a “production rate calibration data set,” which is a set of measurements of cosmogenic-nuclide concentrations in rock surfaces whose exposure ages we already know from some sort of independent evidence. In other words, we measure the average production rate over the exposure period at sites whose exposure age we already know, use the scaling scheme to combine data from multiple locations, and then apply the result to sites whose exposure age we would like to measure.
The other thing that falls out of this process is the ability to evaluate how well a scaling scheme works. If we can do a good job of fitting the scaling scheme to a set of real calibration data from different locations, elevations, and ages, then we conclude that the scaling scheme is doing a good job. Consult a paper by Brian Borchers and others to see how this works. Basically, this is how we decide which scaling scheme we should be using if we want to compute exposure ages in the most accurate way possible.
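To make the fit-then-evaluate idea concrete, here is a minimal MATLAB sketch. It assumes the simplest possible case: a time-independent scaling factor at each site, no erosion, no inheritance, and a single free parameter (a reference production rate). Real scaling schemes are more involved than this, and every number below, including the site scaling factors and ages, is invented for illustration. The only physical constant is the Be-10 decay constant, computed from a 1.387 Myr half-life.

```matlab
% All numbers below are invented for illustration; none are real calibration data.
lambda = log(2) / 1.387e6;         % Be-10 decay constant (1/yr), from a 1.387 Myr half-life
S      = [1.0 2.5 5.0];            % hypothetical scaling factors at three sites
t      = [10000 15000 12000];      % independently known exposure ages (yr)

% Concentration produced per unit reference production rate at each site,
% assuming constant production, no erosion, and no inheritance.
x = S .* (1 - exp(-lambda .* t)) ./ lambda;

% Synthetic "measurements": a true rate of 4 atoms/g/yr with a few percent of scatter
N     = 4.0 .* x .* [1.02 0.97 1.01];   % atoms/g
sigma = 0.03 .* N;                      % assumed 3% measurement uncertainties

% Weighted least-squares estimate of the reference production rate
w     = 1 ./ sigma.^2;
P_fit = sum(w .* N .* x) / sum(w .* x.^2);

% Reduced chi-squared: values near 1 mean the misfit is consistent with the uncertainties
chi2_red = sum(w .* (N - P_fit .* x).^2) / (numel(N) - 1);

fprintf('P_fit = %.3f atoms/g/yr, reduced chi-squared = %.2f\n', P_fit, chi2_red);
```

The fitted rate is what you would then apply at sites of unknown age, and the misfit statistic is one rough way to compare how well different scaling schemes reproduce the same calibration data.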
Obviously, production-rate calibration data are quite central to this whole process. It is also important that this whole issue just became a lot more complicated. Expect this to become more blog fodder in the future, but the summary is that the recently completed CRONUS-Earth project, as well as some other efforts, has resulted in a proliferation of scaling methods, online exposure age calculation schemes, and calibration data. In addition, the rate at which new production-rate calibration data are generated is reasonably high (a handful of papers describing new calibration data appear each year), which means that the total data set is now rather larger than what was included in the main calibration-and-evaluation project that was done a couple of years ago as part of CRONUS and is described in the paper by Borchers and others. And it’s growing.
So this creates a couple of problems.
One is just a data assimilation problem. What we really want is for the production rate calibration process to be, basically, self-updating: when new calibration data are generated, they are incorporated into our current best estimate of production rates. We, in large part meaning me, have done a terrible job of this in the past few years; there really weren’t any systematic updates of how well scaling models fit the entire set of calibration data between the 2008 paper by myself and others and the 2015 Borchers paper. That’s embarrassing, because it was clear very shortly after the 2008 paper was published that the calibration data set in that paper isn’t very good. The data assimilation problem isn’t particularly computationally difficult; we know how to do this. However, there are some obstacles, the first being that the calibration data need to be all in one place so that software can easily access the current data set, whatever it may be, in a fast and low-hassle fashion.
The other is the issue of how to decide which of the various scaling and age-calculation options to use, which of course is bound up with the data assimilation problem because the correct answer will likely change as new data appear. A paper by Fred Phillips and others, which summarizes some of the where-do-we-go-next discussions at the end of the CRONUS project, envisions an ongoing international committee that will decide what the best way to compute exposure ages is and prescribe this method to folks who want to do exposure-age calculations. The idea seems to be that this committee would evaluate various proposed scaling schemes and calibration data and determine which were acceptable, which were not, and which one is best. I’m not a co-author of this paper, and I think this is a bad idea. It creates disincentives for folks who are not on the committee to engage with trying to make the overall field of exposure-dating better, and incentives for people on the committee (who will, of course, be selected because they are responsible for the present state of the art, whatever it is) to maintain the status quo. In my view, this approach would impede progress. As an Economist subscriber who lives near Silicon Valley, of course, I think this is a software problem and not a governance problem: if you create software tools to make it easy to evaluate calibration data and scaling schemes, then people can figure things out for themselves without any help from a committee, and the best-performing calculation methods will float to the top because they are the best-performing. Of course, I don’t have this software yet. But regardless, the first thing that needs to happen here is to make it possible for this software, or anything else, to easily get the entire set of existing calibration data, whatever it is and whenever you want it.
So that’s the initial problem that a calibration data database needs to solve: putting all the calibration data we know about in one place and delivering it to anyone or anything who wants to use it. At present, this problem is not solved. What we have currently is the usual terrible situation in geochronology where various researchers are all maintaining their own mutually inconsistent Excel spreadsheets that each include some fraction of the existing calibration data, are variably up-to-date, and contain different errors and omissions. This situation, of course, maximizes confusion, redundant work, potential for error, and general hassle; and minimizes accuracy, repeatability, and transparency. We can do better. Specifically, the ICE-D calibration data project aims to replace the inconsistent-spreadsheet situation with an online database of all known production rate calibration data that has the following properties: (1) it is generally believed to be reasonably complete and accurate; (2) it only exists in one place so that there are not multiple inconsistent versions; and (3) it can easily be ingested into any software that wants to use these data to do calculations, whether elaborate online exposure age calculator frameworks or odd bits of MATLAB code running on one’s local machine.
So here are the details:
What. The ICE-D: Production Rate Calibration database project.
Where. It’s here: http://calibration.ice-d.org.
Who. Pierre-Henri Blard of CRPG/CNRS and I are at present collaborating on putting it together. If you are involved with collecting production rate calibration data, or you think some data are missing or wrong, you should help too. Helping does require some knowledge of relational databases and MySQL. Give one of us a call.
Disclaimer. This project is not part of the CRONUS-Earth project.
What, in more detail. As with the ICE-D: Antarctica database, data live in a MySQL database hosted on the Google Cloud SQL service. There is a front end running in Python on Google App Engine. At present, the front end provides various browser interfaces (by location, publication, etc.) to look at data associated with individual samples or sites. It looks very similar to the Antarctica one. Some interesting features are as follows:
Nonprescriptiveness. The database is organized such that samples can be grouped into “calibration data sets.” For example, the database contains many beryllium-10 measurements from calibration sites that were not included in the CRONUS calibration exercise described in the Borchers and others paper. However, it’s possible to access just the samples that were used in that study as a distinct “calibration data set.” The idea here is to make it possible to replicate a previous calibration study using exactly the data that were used in that study, even as the overall data set grows. It’s also to make the project non-prescriptive: the database should contain all data that can plausibly be described as decent calibration data, but you should be able to decide which data you want to use.
No more spreadsheets. What we want here is for any software to be able to ingest the most up-to-date calibration data from anywhere. This is the fun part. For example, say I am running MATLAB on my laptop and I want to ingest some calibration data to do some kind of calculations. I can get the current up-to-date data for the “Primary Be-10 calibration data set” described in the Borchers et al. paper noted above by using the ‘urlread’ function of MATLAB as follows:
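% Page for calibration data set 4
urls = 'http://calibration.ice-d.org/cds/4';
s = urlread(urls);
% HTML comment markers that bracket the v3-format text in the page
l1 = '<!-- begin v3 --><pre>';
l2 = '</pre><!-- end v3 -->';
% Keep only the text between the two markers
ss = s((strfind(s,l1)+length(l1)):(strfind(s,l2)-1));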
What you just did was to read the entire calibration data set (102 measurements) in online exposure age calculator v3 input format into a string variable. The following then parses it into a useful data structure:
in = validate_v3_input(ss);
That uses the ‘validate_v3_input’ function from the version 3 online exposure age calculator, which isn’t in really great shape yet so it’s not posted anywhere, but if you want a copy let me know. The point is, you don’t need the spreadsheet any more. You just need an internet connection and five lines of code.
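If you are going to lean on this in real code, it is probably worth guarding against the page layout changing. The following is a slightly more defensive sketch of the same marker-based extraction, not anything the site itself provides. It runs offline against a made-up page string so that it works without a network connection; the placeholder payload is not real calibration data, and the variable names and error identifiers are my own.

```matlab
% Made-up page standing in for the real response; the payload line is a placeholder.
html = ['<html><body><!-- begin v3 --><pre>' ...
        'placeholder sample line' ...
        '</pre><!-- end v3 --></body></html>'];
% With a network connection, one could instead use:
% html = urlread('http://calibration.ice-d.org/cds/4');

startMarker = '<!-- begin v3 --><pre>';
endMarker   = '</pre><!-- end v3 -->';

% Locate the start marker and fail loudly if it is missing
iStart = strfind(html, startMarker);
if isempty(iStart)
    error('calib:noStart', 'Start marker not found. The page layout may have changed.');
end
first = iStart(1) + length(startMarker);

% Use the first end marker that comes after the start marker
iEnd = strfind(html, endMarker);
iEnd = iEnd(iEnd >= first);
if isempty(iEnd)
    error('calib:noEnd', 'End marker not found after the start marker.');
end

% Text between the markers, ready to hand to a parser
v3text = html(first:iEnd(1) - 1);
disp(v3text);
```

The difference from the five-line version is only that a missing or repeated marker produces a clear error rather than an indexing failure or a silently wrong string.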
Repository of associated knowledge. This is a neat feature that I am pretty sure no one other than me will use. Lots of people have been involved in collecting production rate calibration data over the years and all of them have odd bits of knowledge about sites and samples that aren’t part of the very short list of numerical data (elevation, nuclide concentration, etc.) associated with each sample. Still, this harder-to-quantify descriptive information might be useful and it would be great to have it in the same place as the numerical data. Thus, the browser interface has a ‘discussion’ feature. If you log in with a Google ID (sorry about that, but it’s the minimum level of authentication needed for decent security practice) you can say whatever you want about sites and samples. If you know that a Be-10 standardization recorded in the database is wrong, say so, so someone knows to fix it. If you were concerned about the amount of moss on the rock surface, say so. If you know something unpublished about the radiocarbon age constraints, you can add that to the permanent record. Wouldn’t it be useful if all this sort of info were together with the numerical data? Absolutely. The database also has provision for site and sample photos, which are included for projects I was personally involved with, because I have photos from those. However, enabling photo upload from the general population is a bigger coding project and that hasn’t happened yet.
Where is it going? A couple of improvements and applications are really going to happen. First, Pierre-Henri and his colleagues at CRPG are working on a MATLAB-based online exposure age calculator that will use this database to update calibration data as needed. This is in testing and it is happening. Second, the database is reasonably complete right now for Be-10, Al-26, and He-3 calibration data, but nothing else. In-situ-produced carbon-14 in quartz will happen — that is just a matter of data entry. Including chlorine-36 data is part of the plan, but is more difficult simply because a lot more data need to be recorded. That’s more speculative, but there’s been a bit of progress. Other items on the wish list include making it look less like a static database and more like Wikipedia, so more people can contribute. That’s certainly feasible but a lot more programming work; we’ll need help for that. Finally, we’ll need means to use these data in whatever computational environments people are using for exposure-age calculations. The above example shows how it works for MATLAB, but that could be a lot smoother and, obviously, it would be useful to do the same thing for stuff like Python, R, and, yes, Excel if you really must.
Summary. You don’t need to keep the incomprehensible spreadsheet updated any more. Done with spreadsheets.