This page documents previous version releases of the databank. These versions are also archived in the databank ftp area.
Below are summarized the four beta releases that preceded the official version 1 release of holdings. These are presented in the order in which they were released (oldest first).
Summary of first version beta release
The first version beta release consists of a recommended version of the merge of over 40 source decks of data submitted to the databank to date. These sources range from a single station record to compilations of tens of thousands of stations. Most consist of 'raw' data (although unknown processing may have occurred). Where sources have been quality controlled or homogenized, this is explicitly documented in the flags added in Stage 2.
Before proceeding to the merge process, some prioritization of the sources is required. The recommended merge version prioritization is summarized below. This places GHCN-D raw, which is the de facto stage 3 daily deck, as top priority. This ensures vertical coherency between daily and monthly holdings, allowing a degree of data archeology to finer temporal scales. Remaining decks are ordered based upon a combination of their provenance, whether the data have been quality controlled or adjusted, and station record length. Priority is given to records of better provenance (where stage 0 exists), that are 'raw' and that are longer. Some nuancing of the ordering has been performed based upon expert judgement. The source decks in the order presented to the recommended merge are recorded below:
The merge is performed pairwise between the master deck and each candidate deck until no data decks remain to be considered.
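The prioritization and the pairwise pass over the decks can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Deck` fields, the sort key and the `naive_merge_pair` placeholder are hypothetical, the deck names are invented, and a mechanical sort cannot reproduce the expert-judgement adjustments applied to the real ordering.

```python
from dataclasses import dataclass


@dataclass
class Deck:
    # All fields are hypothetical summaries of a source deck.
    name: str
    has_stage0: bool       # better provenance: a stage 0 form exists
    is_raw: bool           # not quality controlled or adjusted
    record_years: int      # typical record length in years
    stations: tuple = ()


def priority_key(deck):
    # GHCN-D raw first, then better provenance, then raw data, then longer records.
    return (deck.name != "ghcnd", not deck.has_stage0, not deck.is_raw, -deck.record_years)


def merge_all(decks, merge_pair):
    # The top-priority deck seeds the master; every other deck is merged in turn.
    ordered = sorted(decks, key=priority_key)
    master = list(ordered[0].stations)
    for candidate in ordered[1:]:
        master = merge_pair(master, candidate)
    return ordered, master


def naive_merge_pair(master, candidate):
    # Placeholder for the real station-by-station comparison: match on identifier only.
    return master + [s for s in candidate.stations if s not in master]


decks = [
    Deck("deck_b", has_stage0=False, is_raw=True, record_years=120, stations=("B1", "A1")),
    Deck("ghcnd", has_stage0=True, is_raw=True, record_years=60, stations=("A1", "A2")),
    Deck("deck_c", has_stage0=True, is_raw=False, record_years=90, stations=("C1",)),
]
ordered, master = merge_all(decks, naive_merge_pair)
print([d.name for d in ordered])
print(master)
```

Passing the pairwise step in as a function keeps the ordering logic separate from the station comparison, which is where the real algorithm does most of its work.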
Between the master deck and each source deck the merge is a two-step process. First, geographical metadata (location, height and station name) are compared for each candidate station against all master stations. Stations which are sufficiently similar are flagged for further consideration. If no stations are sufficiently similar, the candidate station is deemed unique and added to the master database. If stations were flagged as potential metadata matches, then further consideration is given to data similarity over any overlap periods. Here two decisions are possible: it is the same station and a merge is performed; or it is unique and added to the master holdings as a new station. At all points a third option exists to withhold the candidate station. This could arise from any of: metadata conflicts; data redundancy; or a lack of sufficient confidence to assign the station as either unique or a merge. Effectively this automated process tries to replicate what a human analyst would do, which is not feasible by hand given the several million pairwise comparisons required.
The process is first run on max / min records because artefacts may introduce different errors into these two elements. Averages are then calculated and all sources with average temperatures are presented again for a second run through the merge algorithm. Particularly early in the record, for some stations solely monthly average values (calculated in an unknown manner) are available.
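The decision for a single candidate station might be sketched as below. The similarity scores, the three thresholds and the rule for choosing the best match are all assumptions made for illustration; the released code derives its decisions from probability calculations and look-up tables that are not reproduced here.

```python
def decide(metadata_scores, data_scores, meta_threshold=0.5,
           merge_threshold=0.5, unique_threshold=0.1):
    """Return (decision, master_id) for one candidate station.

    metadata_scores: {master_id: similarity in [0, 1]} from location, height and name.
    data_scores: {master_id: similarity in [0, 1]} over the overlap period.
    All thresholds are hypothetical values chosen for this sketch.
    """
    # Step 1: flag master stations whose metadata are sufficiently similar.
    flagged = [m for m, score in metadata_scores.items() if score >= meta_threshold]
    if not flagged:
        return ("unique", None)

    # Step 2: compare data over the overlap for the flagged stations only.
    scored = {m: data_scores[m] for m in flagged if m in data_scores}
    if not scored:
        # Assumption: without an overlap there is too little confidence to decide.
        return ("withhold", None)

    best = max(scored, key=scored.get)
    if scored[best] >= merge_threshold:
        return ("merge", best)
    if scored[best] <= unique_threshold:
        return ("unique", None)
    return ("withhold", None)


print(decide({"M1": 0.2}, {}))
print(decide({"M1": 0.9, "M2": 0.7}, {"M1": 0.3, "M2": 0.8}))
print(decide({"M1": 0.9}, {"M1": 0.3}))
```

The middle band between the two data thresholds is what produces the withhold outcome: the metadata suggest a match, but the data neither confirm nor rule it out.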
The beta of the first version release consists of just over 39,000 stations. Some graphical summaries are presented below, comparing to the GHCNv3 product currently used as the basis for the land surface temperature products released by NOAA NCDC and NASA GISS.
The merged product is being released as a recommended product and a number of variants. These variants allow analysts to understand the sensitivity of their results to reasonable choices as to how to do the merge.
The code to convert the stage 1 data decks to a common stage 2 format and to create the merged stage 3 holdings is freely available. This is unsupported code made freely available without restriction. Compilers are made available where this is necessary.
The databank follows the version control protocols set out for GHCNv3. Namely:
The formal designation is glsd.x.y.z[optionally -betan].yyyymmdd
This versioning format facilitates documentation and communication of updates and modifications that occur as a normal part of the envisaged life-cycle of the databank. The frequency of regular updates to the databank has yet to be determined.
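A designation of this form can be split into its parts with a short parser. This is a minimal sketch assuming that x, y and z are non-negative integers, that n in -betan is an integer, and that yyyymmdd is a calendar date; the example string is invented and does not refer to an actual release.

```python
import re
from datetime import datetime

# glsd.x.y.z[-betan].yyyymmdd, with the beta tag optional.
PATTERN = re.compile(r"^glsd\.(\d+)\.(\d+)\.(\d+)(?:-beta(\d+))?\.(\d{8})$")


def parse_designation(text):
    match = PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a valid designation: {text!r}")
    x, y, z, beta, stamp = match.groups()
    return {
        "x": int(x),
        "y": int(y),
        "z": int(z),
        "beta": int(beta) if beta is not None else None,
        # strptime raises ValueError for impossible dates such as 20131301.
        "date": datetime.strptime(stamp, "%Y%m%d").date(),
    }


print(parse_designation("glsd.1.0.0-beta4.20130601"))
```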
Summary of second version beta release and changes
The second version beta release changes are described in full in the pdf file here. In brief, the following scientific changes were made:
1. Following a suggestion by climate blogger Nick Stokes, we incorporated a station start date similarity criterion into the metadata probability calculation.
2. We removed / replaced / updated a number of sources. The largest impact comes from replacing GSOD with HadISD. We found numerous issues with geolocation and data units within GSOD, which led us to remove it. Both GSOD and HadISD are derived from the same underlying ISD database maintained at NCDC.
3. We updated the data comparison look-up table critical values to make it more likely to merge a record and less likely to declare it unique.
The same 6 images as above are reproduced again below for this beta version.
Summary of third beta release and changes
1. A blacklist of candidate stations was generated, either to fix known errors in their metadata/data or to withhold the stations completely. This is a required input file for the code to run and is provided with this beta release.
2. Some minor code changes were applied, including withholding stations when the metadata probability was near perfect but the data comparisons were so poor that the station became unique (when it should have merged). In addition, odd characters were removed from the station name before the Jaccard Index was run.
3. The format of stage 3 data was changed so that it is consistent with all stage 2 data. In addition, all data provenance flags have been ported over in order to be open and transparent.
4. Algorithm output is included with each variant result, in order to provide information about each candidate station and how the algorithm made its decision to merge / unique / withhold.
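The station name comparison mentioned in item 2 can be illustrated with a small Jaccard Index example. The cleaning rule (keep ASCII letters, digits and spaces) and the use of character bigrams as the compared sets are assumptions made for this sketch; the released code may define odd characters and the compared sets differently.

```python
import re


def clean_name(name):
    # Drop anything that is not an ASCII letter, digit or space, then normalize case and spacing.
    kept = re.sub(r"[^A-Za-z0-9 ]+", "", name).upper()
    return " ".join(kept.split())


def bigrams(text):
    # Set of adjacent character pairs; empty for strings shorter than two characters.
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(name_a, name_b):
    # Jaccard Index: size of the intersection divided by size of the union.
    a = bigrams(clean_name(name_a))
    b = bigrams(clean_name(name_b))
    if not a and not b:
        return 0.0  # assumption: no usable characters means no evidence of a match
    return len(a & b) / len(a | b)


print(jaccard("ST. JOHN'S", "st johns"))
print(jaccard("NIGHT", "NACHT"))
```

Without the cleaning step, punctuation such as the full stop and apostrophe in the first example would create spurious bigrams and lower the similarity of two names that refer to the same place.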
The third beta version release consists of just over 31,500 stations. Some graphical summaries are presented below, comparing to the GHCNv3 product currently used as the basis for the land surface temperature products released by NOAA NCDC and NASA GISS.
Summary of fourth beta release and changes
There was a final fourth beta release. The associated readme file is available at ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/monthly/stage3/updates_to_beta4.pdf. The fourth beta release involves changes to the metadata format (see the README) and the provision of files in netcdf format. We have made every effort to make these netcdf files compliant with the CF conventions. This release involved minimal additional changes to the station list and so graphical updates are not given here for brevity. The main innovation was in the formatting of the data and process metadata.
Summary of intermediate release
From October 2013 until June 2014 a further intermediate release, which was close to the final version 1 release, was available. The graphical summaries are reproduced below: