Databank structure and availability
The databank is currently under construction, with version 1 monthly temperature holdings under beta release (third beta version) and a full release of the first version scheduled upon paper acceptance (the paper was submitted 4/16/13). The databank is available from ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/. The merged monthly databank product is available from ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/monthly/stage3/ and there is a readme describing the merge method here. Any data hosted on the databank server are available without restriction, but please be cautioned that until a full version release some changes to format are possible. Data are to be made available in four stages:
Analysts wishing to create products from the databank should run their algorithms on at least the recommended merge, in both the most recent version and the frozen version 1.0.0 (when released). This frozen version will form the basis for the analogs created as part of the benchmarking and assessment.
The third beta release is summarized below. Earlier releases can be found at the previous versions page.
Summary of first version beta release
The third version beta release consists of a recommended version of the merge of over 40 source decks of data submitted to the databank to date. These sources range from a single station record to compilations of tens of thousands of stations. Most consist of 'raw' data (although unknown processing may have occurred). Where sources have been quality controlled or homogenized, this is explicitly documented in the flags added in Stage 2. Before proceeding to the merge process, some prioritization of the sources is required. The recommended merge version prioritization is summarized below. This places GHCN-D raw, which is the de facto stage 3 daily deck, as top priority. This ensures vertical coherency between daily and monthly holdings, allowing a degree of data archeology to finer temporal scales. The remaining decks are ordered based upon a combination of their provenance, whether the data have been quality controlled or adjusted, and record length. Priority is given to sources of better provenance (where stage 0 exists), that are 'raw' and that have longer records. Some nuancing of the ordering has been performed based upon expert judgement.
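The ordering rule described above can be illustrated with a short sketch. This is not the databank's own code: the Deck fields, the deck names and the use of a single mean record length per deck are assumptions made for illustration, and the expert-judgement adjustments mentioned above are not represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deck:
    """Hypothetical summary of one source deck (all field names are assumptions)."""
    name: str
    has_stage0: bool          # provenance: an original (stage 0) form exists
    is_raw: bool              # not quality controlled or homogenized
    mean_record_years: float  # stand-in for "longer records"


def prioritize(decks, top_priority="ghcnd-raw"):
    """Return decks in merge order: the named top deck first, then by
    provenance, then raw before adjusted, then longer records first."""
    def key(deck):
        return (
            deck.name != top_priority,  # False sorts before True
            not deck.has_stage0,
            not deck.is_raw,
            -deck.mean_record_years,
        )
    return sorted(decks, key=key)


if __name__ == "__main__":
    decks = [
        Deck("small-qc", True, False, 80.0),
        Deck("ghcnd-raw", True, True, 40.0),
        Deck("no-stage0", False, True, 120.0),
        Deck("long-raw", True, True, 100.0),
    ]
    for deck in prioritize(decks):
        print(deck.name)
```

A tuple sort key keeps the criteria in a strict order of precedence; the real prioritization may weigh them differently.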
The merge is performed pairwise between the master deck and each candidate deck until no data decks remain to be considered.
Between the master deck and each source deck the merge is a two-step process. First, geographical metadata (location, height and station name) are compared for each candidate station to all master stations. Stations which are sufficiently similar are flagged for further consideration. If no stations are sufficiently similar, the candidate station is deemed unique and added to the master database. If stations were flagged as potential metadata matches, then further consideration is given to data similarity over any overlap periods. Here two decisions are possible: it is the same station and a merge is performed; or it is unique and added to the master holdings as a new station. At all points a third option exists to withhold the candidate station. This could arise from any of: metadata conflicts; data redundancy; or a lack of sufficient confidence to assign the station as either unique or a merge. Effectively this automated process is trying to replicate what a human analyst would do, but which no analyst could do in practice given the several million pairwise comparisons required. The process is first run on max / min records because artefacts may introduce different errors into these two elements. Averages are then calculated and all sources with average temperatures are presented again for a second run through of the merge algorithm. Particularly early in the record, only monthly average values (calculated in an unknown manner) are available for some stations.
The third beta version release consists of just over 31,500 stations. Some graphical summaries are presented below, comparing it to the GHCNv3 product currently used as the basis for the land surface temperature products released by NOAA NCDC and NASA GISS.
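The three possible outcomes for a candidate station can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the released merge code: the similarity scores are taken as already computed values between 0 and 1, the three thresholds are hypothetical, and withholding a flagged candidate that has no overlap data is an assumption rather than documented behaviour.

```python
from enum import Enum


class Decision(Enum):
    MERGE = "merge"
    UNIQUE = "unique"
    WITHHOLD = "withhold"


def classify_candidate(metadata_scores, data_scores,
                       meta_min=0.5, merge_min=0.8, unique_max=0.2):
    """Decide what to do with one candidate station.

    metadata_scores: master station id -> metadata similarity (0..1)
    data_scores:     master station id -> data similarity over the overlap (0..1)
    All thresholds are hypothetical values chosen for illustration.
    Returns (decision, matched master id or None).
    """
    # Step 1: flag master stations whose metadata are sufficiently similar.
    flagged = [sid for sid, score in metadata_scores.items() if score >= meta_min]
    if not flagged:
        return Decision.UNIQUE, None

    # Step 2: compare data over overlap periods for the flagged stations.
    scored = [(data_scores[sid], sid) for sid in flagged if sid in data_scores]
    if not scored:
        # Assumption: no overlap means too little confidence to decide.
        return Decision.WITHHOLD, None

    best_score, best_id = max(scored)
    if best_score >= merge_min:
        return Decision.MERGE, best_id
    if best_score <= unique_max:
        return Decision.UNIQUE, None
    return Decision.WITHHOLD, None


if __name__ == "__main__":
    print(classify_candidate({"A": 0.9, "B": 0.7}, {"A": 0.95, "B": 0.3}))
```

Keeping an explicit band between the two data thresholds is what produces the withheld category: a candidate that is neither clearly the same station nor clearly different is set aside rather than forced into either group.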
The merged product is being released as a recommended product and a number of variants. These variants allow analysts to understand the sensitivity of their results to reasonable choices as to how to do the merge.
The code to convert the stage 1 data decks to a common stage 2 format and to create the merged stage 3 holdings is freely available. This is unsupported code made freely available without restriction. Compilers are made available where this is necessary.
The databank follows the version control protocols set out for GHCNv3. Namely:
The formal designation is glsd.x.y.z[optionally -betan].yyyymmdd
This versioning format facilitates documentation and communication of updates and modifications that occur as a normal part of the envisaged life cycle of the databank. The frequency of regular updates to the databank has yet to be determined.
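For analysts who need to record which release they used, a designation of the form glsd.x.y.z[-betan].yyyymmdd can be split into its parts with a regular expression. The sketch below is illustrative only: the example string is hypothetical, the fields are named x, y and z as in the designation itself, and it assumes that each part is written as plain decimal digits.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional


class Designation(NamedTuple):
    x: int
    y: int
    z: int
    beta: Optional[int]  # None when there is no -betan suffix
    stamp: date          # the trailing yyyymmdd component


_PATTERN = re.compile(r"^glsd\.(\d+)\.(\d+)\.(\d+)(?:-beta(\d+))?\.(\d{8})$")


def parse_designation(text: str) -> Designation:
    """Parse 'glsd.x.y.z[-betan].yyyymmdd'; raise ValueError if malformed."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a databank designation: {text!r}")
    x, y, z, beta, stamp = match.groups()
    return Designation(
        int(x), int(y), int(z),
        int(beta) if beta is not None else None,
        datetime.strptime(stamp, "%Y%m%d").date(),
    )


if __name__ == "__main__":
    # Hypothetical designation, used only to exercise the parser.
    print(parse_designation("glsd.1.0.0-beta3.20130416"))
```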
Summary of second version beta release and changes
The second version beta release changes are described in full in the pdf file here. In brief, the following scientific changes were made:
1. Following a suggestion by climate blogger Nick Stokes, we incorporated a station start date similarity criterion into the metadata probability calculation.
2. We removed / replaced / updated a number of sources. The largest impact comes from replacing GSOD with HadISD. We found numerous issues with geolocation and data units within GSOD which led us to remove it. Both GSOD and HadISD are derived from the same underlying ISD database maintained at NCDC.
3. We updated the data comparison look-up table critical values to make it more likely that a record is merged and less likely that it is deemed unique.
Summary of third version beta release and changes
The third beta release primarily addresses blacklisting issues associated with incorrectly specified location information (including stations located over ocean areas). These required the construction of a manual list of exceptions as input to the merge program. In addition, some further formatting options for the merged set were included.
How to submit data to the databank
Have data or a lead to data? Please email email@example.com
Want help in approaching data holders? Please feel free to use the data request cover letter.
Want to know what data is required and how to submit it? Please follow the data submission guidelines.
Databank effort publicity
A poster for presentation at the World Climate Research Program's Open Science Conference outlines the state of databank development as of October 2011.
A presentation was given at the 9th International Temperature Symposium while the databank was still under development, so some aspects have been superseded.
A poster describing the beta release was presented at the GFCS User Conference that preceded the WMO Congress Extraordinary Session in October 2012 (original here is in higher resolution but third party hosted)
Other data sources
Marine surface data, necessary to characterize truly global changes, are available through http://icoads.noaa.gov/.
Databank Working Group Membership (as at 5/25/12)
Jay Lawrimore: NOAA National Climatic Data Center
WMO Region I
Albert Mhanda: ACMAD, Niger
WMO Region II
Vyacheslav Razuvaev: Russian Research Institute of Hydrometeorological Information
Kenji Kamiguchi: Japan Meteorological Agency
WMO Region III
Matilde Rusticucci: Univ. of Buenos Aires, Argentina
Madeleine Renom: Universidad de la Republica, Montevideo, Uruguay
Waldenio Gambi Almeida: Instituto Nacional de Pesquisas Espaciais, Centro de Previsão de Tempo e Estudos Climáticos, Brazil
WMO Region IV
Matthew Menne: NOAA National Climatic Data Center
Steve Worley: National Center for Atmospheric Research
John Christy: University of Alabama Huntsville
WMO Region V
Meghan Flannery: Australian Bureau of Meteorology
WMO Region VI
Albert Klein-Tank: KNMI
Jeremy Tandy: UK Met Office
David Lister: CRU, UEA, UK
Peter Thorne: CICS-NC / NOAA NCDC
Working Group Documentation
Terms of reference (adopted 6/8/11, expire 6/8/13)