Databank



Databank structure and availability

The databank is currently being finalized for a version 1 release of monthly temperature holdings, following acceptance of the methods paper. The current version (beta4) is available from ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/. The merged monthly databank product is available from ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/monthly/stage3/ and a readme describing the merge method is also provided. Any data hosted on the databank server are made available without restriction. Data are, where possible, available in the following four stages:
  • Stage 0: Raw data image, or information as to hard copy location
  • Stage 1: Data in the native format in which they were provided
  • Stage 2: Data converted into a common format and with provenance and version control information appended
  • Stage 3: Merged collation of stage 2 data including merging / mingling of records for stations in multiple stage 2 decks.
Not all station records will have data back to stage 0 but all stations will contain provenance information to the extent that it can be uniquely ascertained. No information is withheld. We are indebted to many individuals and organizations, not least members of the working group and associated task teams, for helping to collate available records and submit them to the databank.

Analysts wishing to create products from the databank should run their algorithms on at least the recommended merge, in both the most recent version and the frozen version 1.0.0 (which will be maintained when superseded). This frozen version will form the basis for the analogs created as part of the benchmarking and assessment.

Summary information on the upcoming release is given below. Earlier beta releases can be found at the previous versions page.

Summary of the databank


Sources

The first release will consist of a recommended version of the merge of over 50 source decks of data submitted to the databank to date. These sources range from a single station record to compilations of tens of thousands of stations. Most consist of 'raw' data (although unknown processing may have occurred, so it is safer to consider these 'basic' data holdings). Where we know sources have been quality controlled or homogenized, this is explicitly documented in the flags added in Stage 2. Before proceeding to the merge process, the sources must be prioritized. The recommended merge places GHCN-D raw, which is the de facto stage 3 daily deck, as top priority. This ensures vertical coherency between daily and monthly holdings, allowing a degree of data archeology to finer temporal scales. Remaining decks are ordered based upon a combination of their provenance, whether the data have been quality controlled or adjusted, and station record length. Priority is given to sources of better provenance (where stage 0 exists), that are 'raw', and that have longer records. Some nuancing of the ordering has been performed based upon expert judgement.
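The prioritization described above can be sketched as a sort over source decks. This is a minimal illustration only: the `SourceDeck` fields, the deck names and the strict lexicographic ordering are assumptions made here, and the expert-judgement adjustments applied to the real ordering are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDeck:
    # All field names are hypothetical, chosen for this sketch.
    name: str
    has_stage0: bool          # provenance traces back to an image / hard copy
    is_raw: bool              # not known to be quality controlled or adjusted
    mean_record_years: float  # typical station record length in the deck


def priority_key(deck, top_priority="ghcnd-raw"):
    """Sort key: smaller tuples sort first, so False beats True."""
    return (
        deck.name != top_priority,   # the designated daily deck always leads
        not deck.has_stage0,         # better provenance first
        not deck.is_raw,             # raw data before QC'd / adjusted data
        -deck.mean_record_years,     # longer records first
    )


def prioritize(decks, top_priority="ghcnd-raw"):
    """Return the decks in the order they would be offered to the merge."""
    return sorted(decks, key=lambda d: priority_key(d, top_priority))
```

Treating the criteria as a strict hierarchy is the simplest choice; the text only says that a combination of the criteria is used, so a weighted score would be an equally plausible reading.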

Merge process

The merge is performed pairwise between the master deck and each candidate deck until no data decks remain to be considered.

Between the master deck and each source deck the merge is a two-step process. First, geographical metadata (location, height, station name and station record start date) are compared for each candidate station against all master stations. Stations which are sufficiently similar are flagged for further consideration. If no master station is sufficiently similar, the candidate station is deemed unique and added to the master database. If stations were flagged as potential metadata matches, then data similarity over any overlap periods is considered. Here two decisions are possible: it is the same station and a merge is performed; or it is unique and added to the master holdings as a new station. At all points a third option exists to withhold the candidate station. This could arise from any of: metadata conflicts; data redundancy; or a lack of sufficient confidence to assign the station as either unique or a merge. Effectively this automated process tries to replicate what a human analyst would do, which is not feasible by hand given the several million pairwise comparisons required. The process is first run on max / min records because artefacts may introduce different errors into these two elements. Averages are then calculated and all sources with average temperatures are presented again for a second run through the merge algorithm. Particularly early in the record, only monthly average values (calculated in an unknown manner) are available for some stations.
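The two-step decision for a single candidate station can be sketched as follows. This is not the released merge code: the `Station` class, the similarity score, and every scale and threshold (25 km, 100 m, 10 years, 0.7, 0.5, 2.0, 12 months) are invented for illustration, and treating a too-short overlap as a reason to withhold is likewise an assumption.

```python
import math
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Station:
    name: str
    lat: float
    lon: float
    height_m: float
    start_year: int
    # Monthly values keyed by (year, month); units assumed to be deg C.
    data: dict = field(default_factory=dict)


def distance_km(a, b):
    """Great-circle distance between two stations (haversine formula)."""
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dlon = math.radians(b.lon - a.lon)
    h = (math.sin((p2 - p1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(h)))


def metadata_similarity(a, b):
    """Illustrative 0..1 score from location, height, name and start date."""
    parts = [
        math.exp(-distance_km(a, b) / 25.0),
        math.exp(-abs(a.height_m - b.height_m) / 100.0),
        SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio(),
        math.exp(-abs(a.start_year - b.start_year) / 10.0),
    ]
    return sum(parts) / len(parts)


def mean_abs_difference(a, b, min_overlap=12):
    """Mean absolute difference over common months, or None if too few."""
    common = set(a.data) & set(b.data)
    if len(common) < min_overlap:
        return None
    return sum(abs(a.data[k] - b.data[k]) for k in common) / len(common)


def decide(candidate, master, meta_min=0.7, same_max=0.5, unique_min=2.0):
    """Return ('merge', station), ('unique', None) or ('withhold', None)."""
    # Step 1: flag master stations whose metadata are sufficiently similar.
    flagged = [m for m in master if metadata_similarity(candidate, m) >= meta_min]
    if not flagged:
        return "unique", None
    # Step 2: compare data over the overlap with the best metadata match.
    best = max(flagged, key=lambda m: metadata_similarity(candidate, m))
    diff = mean_abs_difference(candidate, best)
    if diff is None:
        return "withhold", None   # not enough overlap to be confident
    if diff <= same_max:
        return "merge", best
    if diff >= unique_min:
        return "unique", None
    return "withhold", None       # ambiguous: neither clearly same nor unique
```

Running `decide` for every candidate station in a source deck, and repeating for each deck in priority order, gives the pairwise structure described above.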

Merge results

The first release will consist of just under 32,000 stations. Some graphical summaries are presented below, comparing the databank to the GHCNMv3 product currently used as the basis for the land surface temperature products released by NOAA NCDC and NASA GISS.

Stations plotted by length (longer records overplot shorter)
Comparison of station count to GHCNMv3
Percentage of possible 5 degree gridboxes containing land that are sampled, compared to GHCNMv3
 
Change in station count: US vs. Rest of World
 
Station count by length of record

Global anomalies of the basic data in GHCNMv3 and the first databank. Neither has been homogenized and therefore neither constitutes a climate data record.


The merged product is being released as a recommended product and a number of variants. These variants allow analysts to understand the sensitivity of their results to reasonable choices as to how to do the merge.

Code

The code to convert the stage 1 data decks to a common stage 2 format and to create the merged stage 3 holdings is freely available. This is unsupported code made freely available without restriction. Compilers are made available where this is necessary.

Version control

The databank follows the version control protocols set out for GHCNv3. Namely:

The formal designation is glsd.x.y.z[optionally -betan].yyyymmdd

where

  • x = major upgrades of unspecified nature and always accompanied by a peer reviewed manuscript
  • y = substantial modifications to the databank, including a new set of stations or substantive changes to merge algorithms. Accompanied by a technical note published on the ftp site.
  • z = minor revisions to both data and processing software that are tracked in "status and errata".
  • yyyy = year in which the update to the databank occurred.
  • mm = month in which the update to the databank occurred.
  • dd = day on which the update to the databank occurred.

This versioning format facilitates documentation and communication of updates and modifications that occur as a normal part of the envisaged life-cycle of the databank. The frequency of regular updates to the databank has yet to be formally determined.
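As an illustration, a designation of this form can be split into its parts with a regular expression. The `DatabankVersion` type, the `parse_version` helper and the example labels are hypothetical and used only for this sketch; the pattern assumes the optional beta tag is written exactly as `-beta` followed by digits.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional


class DatabankVersion(NamedTuple):
    major: int            # x
    minor: int            # y
    revision: int         # z
    beta: Optional[int]   # n in "-betan", or None for a full release
    updated: date         # yyyymmdd


_PATTERN = re.compile(r"^glsd\.(\d+)\.(\d+)\.(\d+)(?:-beta(\d+))?\.(\d{8})$")


def parse_version(label: str) -> DatabankVersion:
    """Parse a label such as 'glsd.1.0.0-beta4.20130614' (made-up example)."""
    match = _PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a databank version label: {label!r}")
    x, y, z, beta, ymd = match.groups()
    return DatabankVersion(
        int(x),
        int(y),
        int(z),
        int(beta) if beta is not None else None,
        datetime.strptime(ymd, "%Y%m%d").date(),  # rejects impossible dates
    )
```

Because the result is a tuple ordered as (x, y, z, ...), comparing the first three fields of two parsed labels orders releases by major, minor and revision number.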


How to submit data to the databank

Have data or a lead to data? Please email data.submission@surfacetemperatures.org

Want help in approaching data holders? Please feel free to use the data request cover letter.

Want to know what data is required and how to submit it? Please follow the data submission guidelines.

Databank effort publicity

A poster for presentation at the World Climate Research Program's Open Science Conference outlines the state of databank development as of October 2011.

A presentation was given at the 9th International Temperature Symposium while the databank was still under development, so some aspects have been superseded.

A poster describing the beta release was presented at the GFCS User Conference that preceded the WMO Congress Extraordinary Session in October 2012 (the original is in higher resolution but third party hosted).

Other data sources

Marine surface data, necessary to characterize truly global changes, are available through http://icoads.noaa.gov/

Databank Working Group Membership (as at 1/2/14)


Chair:
    Jay Lawrimore: NOAA National Climatic Data Center
WMO Region I:
    Albert Mhanda: ACMAD, Niger
WMO Region II:
    Vyacheslav Razuvaev: Russian Research Institute of Hydrometeorological Information
    Kenji Kamiguchi: Japan Meteorological Agency
    Vlad Shaimardanov: Russian Research Institute of Hydrometeorological Information
WMO Region III:
    Matilde Rusticucci: Univ. of Buenos Aires, Argentina
    Madeleine Renom: Universidad de la Republica, Montevideo, Uruguay
    Waldenio Gambi Almeida: Instituto Nacional de Pesquisas Espaciais, Centro de Previsão de Tempo e Estudos Climáticos, Brazil
WMO Region IV:
    Matthew Menne: NOAA National Climatic Data Center
    Byron Gleason: NOAA National Climatic Data Center
    Jared Rennie: CICS NC, North Carolina State University
    Steve Worley: National Center for Atmospheric Research
    John Christy: University of Alabama Huntsville
WMO Region V:
    Meghan Flannery: Australian Bureau of Meteorology
WMO Region VI:
    Albert Klein-Tank: KNMI
    Jeremy Tandy: UK Met Office
    David Lister: CRU, UEA, UK
    Colin Morice: UK Met Office
Ex-officio Member:
    Peter Thorne: NERSC, Bergen, Norway


Working Group Documentation

Terms of reference (previous terms of reference)

Terms of reference for databank merge task team
