Other features
Multi-file support
Multiple files can also be aggregated over a given dimension (or the record dimension). In this example, 3 sea surface temperature fields from the 1992-01-01 to 1992-01-03 are aggregated using the OPeNDAP service from PODAAC.
using NCDatasets, Printf, Dates
function url(dt)
doy = @sprintf("%03d",Dates.dayofyear(dt))
y = @sprintf("%04d",Dates.year(dt))
yyyymmdd = Dates.format(dt,"yyyymmdd")
return "https://podaac-opendap.jpl.nasa.gov:443/opendap/allData/ghrsst/data/GDS2/L4/GLOB/CMC/CMC0.2deg/v2/$y/$doy/$(yyyymmdd)120000-CMC-L4_GHRSST-SSTfnd-CMC0.2deg-GLOB-v02.0-fv02.0.nc"
end
ds = NCDataset(url.(DateTime(1992,1,1):Dates.Day(1):DateTime(1992,1,3)),aggdim = "time");
SST2 = ds["analysed_sst"][:,:,:];
close(ds)
If there is a network or server issue, you will see an error message like NetCDF: I/O failure
.
CF Standard Names
The CF Conventions do not define how the different NetCDF variables are named, but the meaning of a variable is defined by the standard_name
attribute.
using NCDatasets, DataStructures
ds = NCDataset(tempname(),"c")
nclon = defVar(ds,"lon", 1:10, ("lon",),attrib = OrderedDict(
"standard_name" => "longitude",
))
nclat = defVar(ds,"lat", 1:11, ("lat",),attrib = OrderedDict(
"standard_name" => "latitude",
))
ncvar = defVar(ds,"bat", zeros(10,11), ("lon", "lat"), attrib = OrderedDict(
"standard_name" => "height",
))
ncbat = ds[CF"height"]
# the same as
# ncbat = varbyattrib(ds,standard_name = "height")[1]
name(ncbat)
# output
"bat"
If there are multiple variables with the standard_name
equal to height
, an error is returned because it is ambiguous which variable should be accessed.
All variables whose dimensions are also dimensions of ncbat
are considered as related and can also be accessed by sub-setting ncbat
with their variable names of CF Standard name:
nclon_of_bat = ncbat[CF"longitude"]
# same as
# nclon_of_bat = ncbat["lon"]
name(nclon_of_bat)
# output
"lon"
The previous call to ncbat[CF"longitude"]
would also worked if there are multiple variables with a standard name longitude
defined in a dataset as long as they have different dimension names (which is commonly the case for model output on staggered grid such as Regional Ocean Modeling System).
Views
In Julia, a view of an array is a subset of an array but whose elements still point to the original parent array. If one modifies an element of a view, the corresponding element in the parent array is modified too:
A = zeros(4,4)
subset = @view A[2:3,2:4]
# or
# subset = view(A,2:3,2:4)
subset[1,1] = 2
A[2,2]
# output
2.0
Views do not use copy of the array. The parent array and the indices of the view are obtained via the function parent
and parentindices
.
parent(subset) == A
# true, as both arrays are the same
parentindices(subset)
# output
(2:3, 2:4)
In NCDatasets, variables can also be sliced as a view:
using NCDatasets, DataStructures
ds = NCDataset(tempname(),"c")
nclon = defVar(ds,"lon", 1:10, ("lon",))
nclat = defVar(ds,"lat", 1:11, ("lat",))
ncvar = defVar(ds,"bat", zeros(10,11), ("lon", "lat"), attrib = OrderedDict(
"standard_name" => "height",
))
ncvar_subset = @view ncvar[2:4,2:3]
# or
# ncvar_subset = view(ncvar,2:4,2:3)
ncvar_subset[1,1] = 2
# ncvar[2,2] is now 2
ncvar_subset.attrib["standard_name"]
# output
"height"
This is useful for example when even the sliced array is too large to be loaded in RAM or when all attributes need to be preserved for the sliced array.
The variables lon
and lat
are related to bat
because all dimensions of the variables lon
and lat
are also dimensions of bat
(which is commonly the case for coordinate variables). Such related variables can be retrieved by indexing the NetCDF variables with the name of the corresponding variable:
lon_subset = ncvar_subset["lon"]
lon_subset[:] == [2, 3, 4]
# output
true
A view of a NetCDF variable also implements the function parent
and parentindices
with the same meaning as for julia Array
s.
A whole dataset can also be sliced using a view(ds, dim1=range1, dim2=range2...)
. For example:
ds_subset = view(ds, lon = 2:3, lat = 2:4)
# or
# ds_subset = @view ds[lon = 2:3, lat = 2:4]
ds_subset.dim["lon"]
# output
2
Such sliced datasets can for example be saved into a new NetCDF file using write
:
write("slice.nc",ds_subset)
Any dimension not mentioned in the @view
call is not sliced. While @view
produces a slice based on indices, the NCDatasets.@select
macro produces a slice (of an NetCDF variable or dataset) based on the values of other related variables (typically coordinates).
Data selection based on values
CommonDataModel.@select
— Macrovsubset = CommonDataModel.@select(v,expression)
dssubset = CommonDataModel.@select(ds,expression)
Return a subset of the variable v
(or dataset ds
) satisfying the condition expression
as a view. The condition has the following form:
condition₁ && condition₂ && condition₃ ... conditionₙ
Every condition should involve a single 1D variable (typically a coordinate variable, referred as coord
below). If v
is a variable, the related 1D variable should have a shared dimension with the variable v
. All local variables need to have a $
prefix (see examples below). This macro is experimental and subjected to change.
Every condition can either perform:
a nearest match:
coord ≈ target_coord
(for≈
type\approx
followed by the TAB-key). Only the data corresponding to the index closest totarget_coord
is loaded.a nearest match with tolerance:
coord ≈ target_coord ± tolerance
. As before, but if the difference between the closest value incoord
andtarget_coord
is larger (in absolute value) thantolerance
, an empty array is returned.a condition operating on scalar values. For example, a
condition
equal to10 <= lon <= 20
loads all data with the longitude between 10 and 20 orabs(lat) > 60
loads all variables with a latitude north of 60° N and south of 60° S (assuming that the has the 1D variableslon
andlat
for longitude and latitude).
Only the data which satisfies all conditions is loaded. All conditions must be chained with an &&
(logical and). They should not contain additional parenthesis or other logical operators such as ||
(logical or).
To convert the view into a regular array one can use collect
, Array
or regular indexing. As in julia, views of scalars are wrapped into a zero dimensional arrays which can be dereferenced by using []
. Modifying a view will modify the underlying file (if the file is opened as writable, otherwise an error is issued).
As for any view, one can use parentindices(vsubset)
to get the indices matching a select query.
Examples
Create a sample file with random data:
using NCDatasets, Dates
using CommonDataModel: @select
# or
# using NCDatasets: @select
fname = "sample_file.nc"
lon = -180:180
lat = -90:90
time = DateTime(2000,1,1):Day(1):DateTime(2000,1,3)
SST = randn(length(lon),length(lat),length(time))
ds = NCDataset(fname,"c")
defVar(ds,"lon",lon,("lon",));
defVar(ds,"lat",lat,("lat",));
defVar(ds,"time",time,("time",));
defVar(ds,"SST",SST,("lon","lat","time"));
# load by bounding box
v = @select(ds["SST"],30 <= lon <= 60 && 40 <= lat <= 80)
# substitute a local variable in condition using $
lonr = (30,60) # longitude range
latr = (40,80) # latitude range
v = @select(ds["SST"],$lonr[1] <= lon <= $lonr[2] && $latr[1] <= lat <= $latr[2])
# You can also select based on `ClosedInterval`s from `IntervalSets.jl`.
# Both 30..60 and 60 ± 20 construct `ClosedInterval`s, see their documentation for details.
lon_interval = 30..60
lat_interval = 60 ± 20
v = @select(ds["SST"], lon ∈ $lon_interval && lat ∈ $lat_interval)
# get the indices matching the select query
(lon_indices,lat_indices,time_indices) = parentindices(v)
# get longitude matchting the select query
v_lon = v["lon"]
# find the nearest time instance
v = @select(ds["SST"],time ≈ DateTime(2000,1,4))
# find the nearest time instance but not earlier or later than 2 hours
# an empty array is returned if no time instance is present
v = @select(ds["SST"],time ≈ DateTime(2000,1,3,1) ± Hour(2))
close(ds)
Any 1D variable with the same dimension name can be used in @select
. For example, if we have a time series of temperature and salinity, the temperature values can also be selected based on salinity:
using NCDatasets, Dates
using CommonDataModel: @select
fname = "sample_series.nc"
# create a sample time series
time = DateTime(2000,1,1):Day(1):DateTime(2009,12,31)
salinity = randn(length(time)) .+ 35
temperature = randn(length(time))
NCDataset(fname,"c") do ds
defVar(ds,"time",time,("time",));
defVar(ds,"salinity",salinity,("time",));
defVar(ds,"temperature",temperature,("time",));
end
ds = NCDataset(fname)
# load all temperature data from January where the salinity is larger than 35.
v = @select(ds["temperature"],Dates.month(time) == 1 && salinity >= 35)
# this is equivalent to:
v2 = ds["temperature"][findall(Dates.month.(time) .== 1 .&& salinity .>= 35)]
v == v2
# returns true
close(ds)
For optimal performance, one should try to load contiguous data ranges, in particular when the data is loaded over HTTP/OPeNDAP.
Fill values and missing values
In the NetCDF CF conventions there are the attributes _FillValue
(single scalar) and missing_value
(single scalar or possibly a vector with multiple missing values). Value in the netCDF file matching, these attributes are replaced by the Missing
type in Julia per default. However, for some applications, it is more convenient to use another special value like the special floating point number NaN
.
For example, this is a netCDF file where the variable var
contains a missing
which is automatically replaced by the fill value 9999.
using NCDatasets
data = [1. 2. 3.; missing 20. 30.]
ds = NCDataset("example.nc","c")
defVar(ds,"var",data,("lon","lat"),fillvalue = 9999.)
The raw data as stored in the NetCDF file is available using the property .var
:
ds["var"].var[:,:]
# 2×3 Matrix{Float64}:
# 1.0 2.0 3.0
# 9999.0 20.0 30.0
Per default, the fill value is replaced by missing
when indexing ds["var"]
:
ds["var"][:,:]
# 2×3 Matrix{Union{Missing, Float64}}:
# 1.0 2.0 3.0
# missing 20.0 30.0
The function nomissing
allows to replace all missing value with a different value like NaN
:
var_nan = nomissing(ds["var"][:,:],NaN)
# 2×3 Matrix{Float64}:
# 1.0 2.0 3.0
# NaN 20.0 30.0
close(ds)
Such substitution can also be made more automatic using the experimental parameter maskingvalue
that can be user per variable:
ds = NCDataset("example.nc","r")
ncvar_nan = cfvariable(ds,"var",maskingvalue = NaN)
ncvar_nan[:,:]
# 2×3 Matrix{Float64}:
# 1.0 2.0 3.0
# NaN 20.0 30.0
close(ds)
Or per data-set:
ds = NCDataset("example.nc","r", maskingvalue = NaN)
ds["var"][:,:]
# 2×3 Matrix{Float64}:
# 1.0 2.0 3.0
# NaN 20.0 30.0
close(ds)
Note, choosing the maskingvalue
affects the element type of the NetCDF variable using julia type promotion rules, in particular note that following vector:
[1, NaN]
# 2-element Vector{Float64}:
# 1.0
# NaN
is a vector with the element type Float64
and not Union{Float64,Int}
. All integers are thus promoted to floating point number as NaN
is a Float64
. Since NaN is considered as a Float64
in Julia, we have also a promotion to Float64
in such cases:
[1f0, NaN]
# 2-element Vector{Float64}:
# 1.0
# NaN
where 1f0
is the Float32
number 1. Consider to use NaN32
to avoid this promotion (which is automatically converted to 64-bit NaN for a Float64
array):
using NCDatasets
data32 = [1f0 2f0 3f0; missing 20f0 30f0]
data64 = [1. 2. 3.; missing 20. 30.]
ds = NCDataset("example_float32_64.nc","c")
defVar(ds,"var32",data32,("lon","lat"),fillvalue = 9999f0)
defVar(ds,"var64",data64,("lon","lat"),fillvalue = 9999.)
close(ds)
ds = NCDataset("example_float32_64.nc","r", maskingvalue = NaN32)
ds["var32"][:,:]
# 2×3 Matrix{Float32}:
# 1.0 2.0 3.0
# NaN 20.0 30.0
ds["var64"][:,:]
# 2×3 Matrix{Float64}:
# 1.0 2.0 3.0
# NaN 20.0 30.0
Promoting an integer to a floating point number can lead to loss of precision. These are the smallest integers that cannot be represented as 32 and 64-bit floating numbers:
Float32(16_777_217) == 16_777_217 # false
Float64(9_007_199_254_740_993) == 9_007_199_254_740_993 # false
NaN
should not be used for an array of dates, character or strings as it will result in an array with the element type Any
following julia's promotion rules. The use of missing
as fill value, is thus preferable in the general case.
Experimental functions
CommonDataModel.ancillaryvariables
— Functionncvar = CommonDataModel.ancillaryvariables(ncv::CFVariable,modifier)
Return the first ancillary variables from the NetCDF (or other format) variable ncv
with the standard name modifier modifier
. It can be used for example to access related variable like status flags.
Base.filter
— Functiondata = CommonDataModel.filter(ncv, indices...; accepted_status_flags = nothing)
Load and filter observations by replacing all variables without an acepted status flag to missing
. It is used the attribute ancillary_variables
to identify the status flag.
# da["data"] is 2D matrix
good_data = NCDatasets.filter(ds["data"],:,:, accepted_status_flags = ["good_data","probably_good_data"])