Other features

Multi-file support

Multiple files can also be aggregated over a given dimension (or the record dimension). In this example, 3 sea surface temperature fields from the 1992-01-01 to 1992-01-03 are aggregated using the OPeNDAP service from PODAAC.

using NCDatasets, Printf, Dates

function url(dt)
  doy = @sprintf("%03d",Dates.dayofyear(dt))
  y = @sprintf("%04d",Dates.year(dt))
  yyyymmdd = Dates.format(dt,"yyyymmdd")
  return "https://podaac-opendap.jpl.nasa.gov:443/opendap/allData/ghrsst/data/GDS2/L4/GLOB/CMC/CMC0.2deg/v2/$y/$doy/$(yyyymmdd)120000-CMC-L4_GHRSST-SSTfnd-CMC0.2deg-GLOB-v02.0-fv02.0.nc"
end

ds = NCDataset(url.(DateTime(1992,1,1):Dates.Day(1):DateTime(1992,1,3)),aggdim = "time");
SST2 = ds["analysed_sst"][:,:,:];
close(ds)

If there is a network or server issue, you will see an error message like NetCDF: I/O failure.

CF Standard Names

The CF Conventions do not define how the different NetCDF variables are named, but the meaning of a variable is defined by the standard_name attribute.

using NCDatasets, DataStructures
ds = NCDataset(tempname(),"c")

nclon = defVar(ds,"lon", 1:10, ("lon",),attrib = OrderedDict(
    "standard_name"             => "longitude",
))
nclat = defVar(ds,"lat", 1:11, ("lat",),attrib = OrderedDict(
    "standard_name"             => "latitude",
))
ncvar = defVar(ds,"bat", zeros(10,11), ("lon", "lat"), attrib = OrderedDict(
    "standard_name"             => "height",
))

ncbat = ds[CF"height"]
# the same as
# ncbat = varbyattrib(ds,standard_name = "height")[1]

name(ncbat)
# output
"bat"

If there are multiple variables with the standard_name equal to height, an error is returned because it is ambiguous which variable should be accessed.

All variables whose dimensions are also dimensions of ncbat are considered as related and can also be accessed by sub-setting ncbat with their variable names of CF Standard name:

nclon_of_bat = ncbat[CF"longitude"]
# same as
# nclon_of_bat = ncbat["lon"]
name(nclon_of_bat)
# output
"lon"

The previous call to ncbat[CF"longitude"] would also worked if there are multiple variables with a standard name longitude defined in a dataset as long as they have different dimension names (which is commonly the case for model output on staggered grid such as Regional Ocean Modeling System).

Views

In Julia, a view of an array is a subset of an array but whose elements still point to the original parent array. If one modifies an element of a view, the corresponding element in the parent array is modified too:

A = zeros(4,4)
subset = @view A[2:3,2:4]
# or
# subset = view(A,2:3,2:4)

subset[1,1] = 2
A[2,2]
# output
2.0

Views do not use copy of the array. The parent array and the indices of the view are obtained via the function parent and parentindices.

parent(subset) == A
# true, as both arrays are the same

parentindices(subset)
# output
(2:3, 2:4)

In NCDatasets, variables can also be sliced as a view:

using NCDatasets, DataStructures
ds = NCDataset(tempname(),"c")

nclon = defVar(ds,"lon", 1:10, ("lon",))
nclat = defVar(ds,"lat", 1:11, ("lat",))
ncvar = defVar(ds,"bat", zeros(10,11), ("lon", "lat"), attrib = OrderedDict(
    "standard_name"             => "height",
))

ncvar_subset = @view ncvar[2:4,2:3]
# or
# ncvar_subset = view(ncvar,2:4,2:3)

ncvar_subset[1,1] = 2
# ncvar[2,2] is now 2

ncvar_subset.attrib["standard_name"]

# output
"height"

This is useful for example when even the sliced array is too large to be loaded in RAM or when all attributes need to be preserved for the sliced array.

The variables lon and lat are related to bat because all dimensions of the variables lon and lat are also dimensions of bat (which is commonly the case for coordinate variables). Such related variables can be retrieved by indexing the NetCDF variables with the name of the corresponding variable:

lon_subset = ncvar_subset["lon"]
lon_subset[:] == [2, 3, 4]
# output
true

A view of a NetCDF variable also implements the function parent and parentindices with the same meaning as for julia Arrays.

A whole dataset can also be sliced using a view(ds, dim1=range1, dim2=range2...). For example:

ds_subset = view(ds, lon = 2:3, lat = 2:4)
# or
# ds_subset = @view ds[lon = 2:3, lat = 2:4]
ds_subset.dim["lon"]

# output
2

Such sliced datasets can for example be saved into a new NetCDF file using write:

write("slice.nc",ds_subset)

Any dimension not mentioned in the @view call is not sliced. While @view produces a slice based on indices, the NCDatasets.@select macro produces a slice (of an NetCDF variable or dataset) based on the values of other related variables (typically coordinates).

Data selection based on values

CommonDataModel.@selectMacro
vsubset = CommonDataModel.@select(v,expression)
dssubset = CommonDataModel.@select(ds,expression)

Return a subset of the variable v (or dataset ds) satisfying the condition expression as a view. The condition has the following form:

condition₁ && condition₂ && condition₃ ... conditionₙ

Every condition should involve a single 1D variable (typically a coordinate variable, referred as coord below). If v is a variable, the related 1D variable should have a shared dimension with the variable v. All local variables need to have a $ prefix (see examples below). This macro is experimental and subjected to change.

Every condition can either perform:

  • a nearest match: coord ≈ target_coord (for type \approx followed by the TAB-key). Only the data corresponding to the index closest to target_coord is loaded.

  • a nearest match with tolerance: coord ≈ target_coord ± tolerance. As before, but if the difference between the closest value in coord and target_coord is larger (in absolute value) than tolerance, an empty array is returned.

  • a condition operating on scalar values. For example, a condition equal to 10 <= lon <= 20 loads all data with the longitude between 10 and 20 or abs(lat) > 60 loads all variables with a latitude north of 60° N and south of 60° S (assuming that the has the 1D variables lon and lat for longitude and latitude).

Only the data which satisfies all conditions is loaded. All conditions must be chained with an && (logical and). They should not contain additional parenthesis or other logical operators such as || (logical or).

To convert the view into a regular array one can use collect, Array or regular indexing. As in julia, views of scalars are wrapped into a zero dimensional arrays which can be dereferenced by using []. Modifying a view will modify the underlying file (if the file is opened as writable, otherwise an error is issued).

As for any view, one can use parentindices(vsubset) to get the indices matching a select query.

Examples

Create a sample file with random data:

using NCDatasets, Dates
using CommonDataModel: @select
# or
# using NCDatasets: @select

fname = "sample_file.nc"
lon = -180:180
lat = -90:90
time = DateTime(2000,1,1):Day(1):DateTime(2000,1,3)
SST = randn(length(lon),length(lat),length(time))

ds = NCDataset(fname,"c")
defVar(ds,"lon",lon,("lon",));
defVar(ds,"lat",lat,("lat",));
defVar(ds,"time",time,("time",));
defVar(ds,"SST",SST,("lon","lat","time"));


# load by bounding box
v = @select(ds["SST"],30 <= lon <= 60 && 40 <= lat <= 80)

# substitute a local variable in condition using $
lonr = (30,60) # longitude range
latr = (40,80) # latitude range

v = @select(ds["SST"],$lonr[1] <= lon <= $lonr[2] && $latr[1] <= lat <= $latr[2])

# You can also select based on `ClosedInterval`s from `IntervalSets.jl`.
# Both 30..60 and 60 ± 20 construct `ClosedInterval`s, see their documentation for details.

lon_interval = 30..60
lat_interval = 60 ± 20
v = @select(ds["SST"], lon ∈ $lon_interval && lat ∈ $lat_interval)

# get the indices matching the select query
(lon_indices,lat_indices,time_indices) = parentindices(v)

# get longitude matchting the select query
v_lon = v["lon"]

# find the nearest time instance
v = @select(ds["SST"],time ≈ DateTime(2000,1,4))

# find the nearest time instance but not earlier or later than 2 hours
# an empty array is returned if no time instance is present

v = @select(ds["SST"],time ≈ DateTime(2000,1,3,1) ± Hour(2))

close(ds)

Any 1D variable with the same dimension name can be used in @select. For example, if we have a time series of temperature and salinity, the temperature values can also be selected based on salinity:

using NCDatasets, Dates
using CommonDataModel: @select
fname = "sample_series.nc"
# create a sample time series
time = DateTime(2000,1,1):Day(1):DateTime(2009,12,31)
salinity = randn(length(time)) .+ 35
temperature = randn(length(time))

NCDataset(fname,"c") do ds
    defVar(ds,"time",time,("time",));
    defVar(ds,"salinity",salinity,("time",));
    defVar(ds,"temperature",temperature,("time",));
end

ds = NCDataset(fname)

# load all temperature data from January where the salinity is larger than 35.
v = @select(ds["temperature"],Dates.month(time) == 1 && salinity >= 35)

# this is equivalent to:
v2 = ds["temperature"][findall(Dates.month.(time) .== 1 .&& salinity .>= 35)]

v == v2
# returns true
close(ds)
Note

For optimal performance, one should try to load contiguous data ranges, in particular when the data is loaded over HTTP/OPeNDAP.

source

Fill values and missing values

In the NetCDF CF conventions there are the attributes _FillValue (single scalar) and missing_value (single scalar or possibly a vector with multiple missing values). Value in the netCDF file matching, these attributes are replaced by the Missing type in Julia per default. However, for some applications, it is more convenient to use another special value like the special floating point number NaN.

For example, this is a netCDF file where the variable var contains a missing which is automatically replaced by the fill value 9999.

using NCDatasets
data = [1. 2. 3.; missing 20. 30.]
ds = NCDataset("example.nc","c")
defVar(ds,"var",data,("lon","lat"),fillvalue = 9999.)

The raw data as stored in the NetCDF file is available using the property .var:

ds["var"].var[:,:]
# 2×3 Matrix{Float64}:
#     1.0   2.0   3.0
#  9999.0  20.0  30.0

Per default, the fill value is replaced by missing when indexing ds["var"]:

ds["var"][:,:]
# 2×3 Matrix{Union{Missing, Float64}}:
# 1.0        2.0   3.0
#  missing  20.0  30.0

The function nomissing allows to replace all missing value with a different value like NaN:

var_nan = nomissing(ds["var"][:,:],NaN)
# 2×3 Matrix{Float64}:
#   1.0   2.0   3.0
#  NaN    20.0  30.0
close(ds)

Such substitution can also be made more automatic using the experimental parameter maskingvalue that can be user per variable:

ds = NCDataset("example.nc","r")
ncvar_nan = cfvariable(ds,"var",maskingvalue = NaN)
ncvar_nan[:,:]
# 2×3 Matrix{Float64}:
#   1.0   2.0   3.0
# NaN    20.0  30.0
close(ds)

Or per data-set:

ds = NCDataset("example.nc","r", maskingvalue = NaN)
ds["var"][:,:]
# 2×3 Matrix{Float64}:
#   1.0   2.0   3.0
# NaN    20.0  30.0
close(ds)

Note, choosing the maskingvalue affects the element type of the NetCDF variable using julia type promotion rules, in particular note that following vector:

[1, NaN]
# 2-element Vector{Float64}:
#    1.0
#  NaN

is a vector with the element type Float64 and not Union{Float64,Int}. All integers are thus promoted to floating point number as NaN is a Float64. Since NaN is considered as a Float64 in Julia, we have also a promotion to Float64 in such cases:

[1f0, NaN]
# 2-element Vector{Float64}:
#   1.0
# NaN

where 1f0 is the Float32 number 1. Consider to use NaN32 to avoid this promotion (which is automatically converted to 64-bit NaN for a Float64 array):

using NCDatasets
data32 = [1f0 2f0 3f0; missing 20f0 30f0]
data64 = [1. 2. 3.; missing 20. 30.]
ds = NCDataset("example_float32_64.nc","c")
defVar(ds,"var32",data32,("lon","lat"),fillvalue = 9999f0)
defVar(ds,"var64",data64,("lon","lat"),fillvalue = 9999.)
close(ds)

ds = NCDataset("example_float32_64.nc","r", maskingvalue = NaN32)
ds["var32"][:,:]
# 2×3 Matrix{Float32}:
#   1.0   2.0   3.0
# NaN    20.0  30.0

ds["var64"][:,:]
# 2×3 Matrix{Float64}:
#   1.0   2.0   3.0
# NaN    20.0  30.0

Promoting an integer to a floating point number can lead to loss of precision. These are the smallest integers that cannot be represented as 32 and 64-bit floating numbers:

Float32(16_777_217) == 16_777_217 # false
Float64(9_007_199_254_740_993) == 9_007_199_254_740_993 # false

NaN should not be used for an array of dates, character or strings as it will result in an array with the element type Any following julia's promotion rules. The use of missing as fill value, is thus preferable in the general case.

Experimental functions

CommonDataModel.ancillaryvariablesFunction
ncvar = CommonDataModel.ancillaryvariables(ncv::CFVariable,modifier)

Return the first ancillary variables from the NetCDF (or other format) variable ncv with the standard name modifier modifier. It can be used for example to access related variable like status flags.

source
Base.filterFunction
data = CommonDataModel.filter(ncv, indices...; accepted_status_flags = nothing)

Load and filter observations by replacing all variables without an acepted status flag to missing. It is used the attribute ancillary_variables to identify the status flag.

# da["data"] is 2D matrix
good_data = NCDatasets.filter(ds["data"],:,:, accepted_status_flags = ["good_data","probably_good_data"])
source