Databank



Databank structure and availability

The databank is currently under construction, with the version 1 monthly temperature holdings under beta release (third beta version) and a full release of the first version scheduled upon paper acceptance (the paper was submitted 4/16/13). The databank is available from ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/. The merged monthly databank product is available from ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/monthly/stage3/ and there is a readme describing the merge method here. Any data hosted on the databank server are available without restriction, but please be cautioned that some changes to format are possible until a full version release. Data are to be made available in four stages:
  • Stage 0: Raw data image or information as to hard copy location
  • Stage 1: Data in native format provided
  • Stage 2: Data converted into a common format and with provenance and version control information appended
  • Stage 3: Merged collation of stage 2 data including merging / mingling of records for stations in multiple stage 2 decks.
Not all station records will have data back to stage 0 but all stations will contain provenance information to the extent that it can be uniquely ascertained. We are indebted to many individuals and organizations, not least members of the working group and associated task teams, for helping to collate available records and submit them to the databank.

Analysts wishing to create products from the databank should run their algorithms on at least the recommended merge, in both the most recent version and the frozen version 1.0.0 (when released). The frozen version will form the basis for the analogs created as part of the benchmarking and assessment.

The third beta release is summarized below. Earlier releases can be found at the previous versions page.

Summary of first version beta release


Sources

The third version beta release consists of a recommended version of the merge of over 40 source decks of data submitted to the databank to date. These sources range from a single station record to compilations of tens of thousands of stations. Most consist of 'raw' data (although unknown processing may have occurred). Where sources have been quality controlled or homogenized, this is explicitly documented in the flags added in Stage 2. Before proceeding to the merge process, some prioritization of the sources is required. The recommended merge version prioritization is summarized below. It places GHCN-D raw, which is the de facto stage 3 daily deck, as top priority. This ensures vertical coherency between daily and monthly holdings, allowing a degree of data archeology to finer temporal scales. Remaining decks are ordered based upon a combination of their provenance, whether the data have been quality controlled or adjusted, and station record length. Priority is given to sources of better provenance (where stage 0 exists), that are 'raw' and that have longer records. Some nuancing of the ordering has been performed based upon expert judgement.
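The actual deck ordering is not reproduced in the sketch that follows. Purely as an illustration of the stated criteria, the ordering could be expressed as a sort key. Everything here is an assumption made for the example: the SourceDeck fields, the deck names, the use of a median record length, and the strict lexicographic ordering of the criteria. The databank describes a "combination" of criteria adjusted by expert judgement, which a simple sort like this does not capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDeck:
    # Hypothetical summary of one source deck; not the databank's own schema.
    name: str
    has_stage0: bool            # provenance traceable to a stage 0 image or hard copy
    is_raw: bool                # not quality controlled or homogenized
    median_record_years: float  # assumed measure of "station length"


def priority_key(deck, top_priority="ghcnd"):
    # Tuples compare element by element and False sorts before True,
    # so each "not ..." term pushes the preferred decks to the front.
    return (
        deck.name != top_priority,   # the daily deck always comes first
        not deck.has_stage0,         # better provenance next
        not deck.is_raw,             # then raw before adjusted data
        -deck.median_record_years,   # then longer records
    )


def prioritize(decks, top_priority="ghcnd"):
    """Return the decks in merge order, highest priority first."""
    return sorted(decks, key=lambda d: priority_key(d, top_priority))


decks = [
    SourceDeck("deck_a", has_stage0=False, is_raw=True, median_record_years=80),
    SourceDeck("ghcnd", has_stage0=False, is_raw=True, median_record_years=30),
    SourceDeck("deck_b", has_stage0=True, is_raw=False, median_record_years=40),
    SourceDeck("deck_c", has_stage0=True, is_raw=True, median_record_years=20),
    SourceDeck("deck_d", has_stage0=True, is_raw=True, median_record_years=60),
]
ordered_names = [d.name for d in prioritize(decks)]
```

A manual override list applied after the sort would be one way to represent the expert nuancing, but that is again an assumption rather than a description of the released code.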


Merge process

The merge is performed pairwise between the master deck and each candidate deck until no data decks remain to be considered.

Between the master deck and each source deck the merge is a two-step process. First, geographical metadata (location, height and station name) are compared for each candidate station to all master stations. Stations which are sufficiently similar are flagged for further consideration. If no stations are sufficiently similar, the candidate station is deemed unique and added to the master database. If stations were flagged as potential metadata matches, then further consideration is given to data similarity over any overlap periods. Here two decisions are possible: it is the same station and a merge is performed; or it is unique and added to the master holdings as a new station. At all points a third option exists to withhold the candidate station. This could arise from any of: metadata conflicts; data redundancy; or a lack of sufficient confidence to assign the station as either unique or a merge. Effectively this automated process is trying to replicate what a human analyst would do but cannot, given the several million pairwise comparisons required. The process is first run on max / min records because artefacts may introduce different errors into these two elements. Averages are then calculated and all sources with average temperatures are presented again for a second run through of the merge algorithm. Particularly early in the record, solely monthly average values (calculated in an unknown manner) are available for some stations.
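The decision flow described above can be sketched as follows. This is a simplified illustration, not the released merge code: the similarity scores, every threshold value, the Station fields and the rule that master values win on overlap are assumptions chosen for the example. The actual algorithm's metadata and data comparison metrics are described in the readme and are not reproduced here.

```python
import math
from dataclasses import dataclass
from difflib import SequenceMatcher
from enum import Enum


class Decision(Enum):
    MERGE = "merge"
    UNIQUE = "unique"
    WITHHOLD = "withhold"


@dataclass
class Station:
    name: str
    lat: float
    lon: float
    elev_m: float
    data: dict  # (year, month) -> monthly temperature in deg C


def distance_km(a, b):
    # Great-circle (haversine) distance between two stations.
    p1, p2 = math.radians(a.lat), math.radians(b.lat)
    dlon = math.radians(b.lon - a.lon)
    h = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def metadata_similarity(a, b):
    # Step 1: crude 0..1 score from location, height and name (assumed scales).
    dist_score = math.exp(-distance_km(a, b) / 10.0)
    height_score = math.exp(-abs(a.elev_m - b.elev_m) / 100.0)
    name_score = SequenceMatcher(None, a.name.lower(), b.name.lower()).ratio()
    return (dist_score + height_score + name_score) / 3.0


def decide(candidate, master, meta_min=0.8, min_overlap=12, same_max=0.5, unique_min=2.0):
    """Return (Decision, matched master station or None). Thresholds are hypothetical."""
    matches = [(metadata_similarity(candidate, m), m) for m in master]
    matches = [(s, m) for s, m in matches if s >= meta_min]
    if not matches:
        return Decision.UNIQUE, None          # no metadata match at all
    _, best = max(matches, key=lambda sm: sm[0])
    # Step 2: compare the data over the overlap period.
    overlap = candidate.data.keys() & best.data.keys()
    if len(overlap) < min_overlap:
        return Decision.WITHHOLD, best        # too little evidence either way
    mad = sum(abs(candidate.data[k] - best.data[k]) for k in overlap) / len(overlap)
    if mad <= same_max:
        return Decision.MERGE, best
    if mad >= unique_min:
        return Decision.UNIQUE, None
    return Decision.WITHHOLD, best


def merge_deck(master, deck):
    """Merge one candidate deck into the master list; return withheld stations."""
    withheld = []
    for cand in deck:
        decision, match = decide(cand, master)
        if decision is Decision.UNIQUE:
            master.append(cand)
        elif decision is Decision.MERGE:
            for key, value in cand.data.items():
                match.data.setdefault(key, value)  # assumed: master values win
        else:
            withheld.append(cand)
    return withheld
```

Calling merge_deck once per source deck, in priority order, corresponds to the pairwise loop between the master deck and each candidate deck. Running it first on max / min records and then again on averages would require two such passes.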

Merge results

The third beta version release consists of just over 31,500 stations. Some graphical summaries are presented below, comparing it to the GHCNv3 product currently used as the basis for the land surface temperature products released by NOAA NCDC and NASA GISS.

Stations plotted by length (longer records overplot shorter)
Comparison of station count to GHCNMv3
Percentage of possible 5 degree gridboxes that contain land sampled compared to GHCNMv3
 
Change in station count: US vs. Rest of World
 
Station count by length of record


The merged product is being released as a recommended product and a number of variants. These variants allow analysts to understand the sensitivity of their results to reasonable choices as to how to do the merge.

Code

The code to convert the stage 1 data decks to a common stage 2 format and to create the merged stage 3 holdings is freely available without restriction, but it is unsupported. Compilers are made available where this is necessary.

Version control

The databank follows the version control protocols set out for GHCNv3. Namely:

The formal designation is glsd.x.y.z[optionally -betan].yyyymmdd

where

  • x = major upgrades of unspecified nature and always accompanied by a peer reviewed manuscript
  • y = substantial modifications to the databank, including a new set of stations or substantive changes to merge algorithms. Accompanied by a technical note published on the ftp site.
  • z = minor revisions to both data and processing software that are tracked in "status and errata".
  • yyyy = year in which the update to the databank occurred.
  • mm = month in which the update to the databank occurred.
  • dd = day on which the update to the databank occurred.

This versioning format facilitates documentation and communication of updates and modifications that occur as a normal part of the envisaged life-cycle of the databank. The frequency of regular updates to the databank has yet to be determined.
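As an illustration of the designation, the following sketch parses a version label into its components. The exact spelling of the optional beta suffix (assumed here to be "-beta" followed by a number, e.g. "-beta3") and the example label are assumptions; the parse_version helper is hypothetical and is not part of the databank code.

```python
import re
from datetime import date

# glsd.x.y.z[optionally -betan].yyyymmdd
VERSION_RE = re.compile(
    r"^glsd\.(?P<x>\d+)\.(?P<y>\d+)\.(?P<z>\d+)"
    r"(?:-beta(?P<beta>\d+))?"          # assumed rendering of the beta suffix
    r"\.(?P<yyyy>\d{4})(?P<mm>\d{2})(?P<dd>\d{2})$"
)


def parse_version(label):
    """Split a databank version label into its numeric parts and update date."""
    m = VERSION_RE.match(label)
    if m is None:
        raise ValueError(f"not a recognised databank version label: {label!r}")
    return {
        "major": int(m["x"]),
        "substantial": int(m["y"]),
        "minor": int(m["z"]),
        "beta": int(m["beta"]) if m["beta"] is not None else None,
        # date() raises ValueError for impossible dates such as month 13.
        "updated": date(int(m["yyyy"]), int(m["mm"]), int(m["dd"])),
    }


# Illustrative label only; the date is made up for the example.
info = parse_version("glsd.1.0.0-beta3.20130101")
```

Keeping the components numeric makes it straightforward to compare two releases, for example by sorting on (major, substantial, minor, updated).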


Summary of second version beta release and changes

The second version beta release changes are described in full in the pdf file here. In brief, the following scientific changes were made:

1. Following a suggestion by climate blogger Nick Stokes, we incorporated a station start date similarity criterion into the metadata probability calculation.

2. We removed / replaced / updated a number of sources. The largest impact comes from replacing GSOD with HadISD. We found numerous issues with geolocation and data units within GSOD, which led us to remove it. Both GSOD and HadISD are derived from the same underlying ISD database maintained at NCDC.

3. We updated the critical values in the data comparison look-up table to make the algorithm more likely to merge a record and less likely to declare it unique.

Summary of third version beta release and changes

The third beta release primarily addresses blacklisting issues associated with incorrectly specified location information (including stations over ocean areas), which required the construction of a manual list of exceptions as input to the merge program. Some additional formatting options for the merged set were also included.



How to submit data to the databank

Have data or a lead to data? Please email data.submission@surfacetemperatures.org

Want help in approaching data holders? Please feel free to use the data request cover letter.

Want to know what data is required and how to submit it? Please follow the data submission guidelines.

Databank effort publicity

A poster for presentation at the World Climate Research Program's Open Science Conference outlines the state of databank development as of October 2011.

A presentation was given at the 9th International Temperature Symposium while the databank was still under development, so some aspects have been superseded.

A poster describing the beta release was presented at the GFCS User Conference that preceded the WMO Congress Extraordinary Session in October 2012 (the original here is in higher resolution but is third party hosted).

Other data sources

Marine surface data, necessary to characterize truly global changes, are available through http://icoads.noaa.gov/

Databank Working Group Membership (as at 5/25/12)


Chair:
    Jay Lawrimore: NOAA National Climatic Data Center
WMO Region I:
    Albert Mhanda: ACMAD, Niger
WMO Region II:
    Vyacheslav Razuvaev: Russian Research Institute of Hydrometeorological Information
    Kenji Kamiguchi: Japan Meteorological Agency
WMO Region III:
    Matilde Rusticucci: Univ. of Buenos Aires, Argentina
    Madeleine Renom: Universidad de la Republica, Montevideo, Uruguay
    Waldenio Gambi Almeida: Instituto Nacional de Pesquisas Espaciais, Centro de Previsão de Tempo e Estudos Climáticos, Brazil
WMO Region IV:
    Matthew Menne: NOAA National Climatic Data Center
    Steve Worley: National Center for Atmospheric Research
    John Christy: University of Alabama Huntsville
WMO Region V:
    Meghan Flannery: Australian Bureau of Meteorology
WMO Region VI:
    Albert Klein-Tank: KNMI
    Jeremy Tandy: UK Met Office
    David Lister: CRU, UEA, UK
Ex-officio Member:
    Peter Thorne: CICS-NC / NOAA NCDC


Working Group Documentation

Terms of reference (adopted 6/8/11, expire 6/8/13)
