Glider Tools: profile data processing

Glider Tools is a Python 3.8+ package designed to process data from the first level of processing to a science-ready dataset. The package is designed to easily import data to a standard column format: numpy.ndarray, pandas.DataFrame or xarray.DataArray (we recommend the latter, which has full support for metadata). Cleaning and smoothing functions are flexible and can be applied as required by the user. We provide examples and demonstrate best practices as developed by the SOCCO Group.

For the original publication of this package see: https://doi.org/10.3389/fmars.2019.00738.

For recommendations or bug reports, please visit https://github.com/GliderToolsCommunity/GliderTools/issues/new

Installation

Notes on how to install

Conda

The easiest way to install the package is with conda: conda install -c conda-forge glidertools.

PyPI

You can also install with pip: pip install glidertools.

GitHub

For the most up-to-date version of GliderTools, you can install directly from the online repository hosted on GitHub.

  1. Clone glidertools to your local machine: git clone https://github.com/GliderToolsCommunity/GliderTools
  2. Change to the parent directory of GliderTools
  3. Install glidertools with pip install -e ./GliderTools. This will allow changes you make locally to be reflected when you import the package in Python
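
To check that the installation worked, you can import the package in Python and print its version (this assumes the package exposes the conventional __version__ attribute):

import glidertools as gt

print(gt.__version__)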

Cheat Sheet

_images/package_overview.png (cheat sheet)

Loading data

To start using Glider Tools you first need to import the package to the interactive workspace.

Import GliderTools

# pylab for more MATLAB like environment and inline displays plots below cells
%pylab inline

# if gsw Warning shows, manually install gsw if possible - will still work without
import glidertools as gt
from cmocean import cm as cmo  # we use this for colormaps
Populating the interactive namespace from numpy and matplotlib

Working with Seaglider base station files

GliderTools supports loading Seaglider files, including scicon data (different sampling frequencies). There is a function that makes it easier to find variable names that you’d like to load: gt.load.seaglider_show_variables

This function is demonstrated in the cell below. The function accepts a list of file names and can also receive a string with a wildcard placeholder (*); basic regular expressions are also supported. In the example below we use a simple asterisk placeholder for all the files.

Note that the function chooses only one file from the passed list or glob string - this file name will be shown. The returned table shows the variable name, dimensions, units and brief comment if it is available.

filenames = '/Users/luke/Work/Data/sg542/p5420*.nc'

gt.load.seaglider_show_variables(filenames)
information is based on file: /Users/luke/Work/Data/sg542/p5420177.nc

<table will be displayed here>

Working with VOTO Seaexplorer files or xarray-datasets

GliderTools supports loading SeaExplorer files. This is currently implemented and tested with VOTO datasets (https://observations.voiceoftheocean.org) in mind, but we are happy to receive feedback or pull requests on how it works for other SeaExplorer datasets. VOTO data can either be downloaded from the website using a browser or, more conveniently, from the ERDDAP server (https://erddap.observations.voiceoftheocean.org/erddap/index.html). See the demo notebook (https://github.com/voto-ocean-knowledge/download_glider_data) to get started with downloads over the API.

After downloading a .nc file or building an xarray Dataset, it can be read into GliderTools by calling gt.load.voto_seaexplorer_nc or gt.load.voto_seaexplorer_dataset respectively. The resulting datasets can be merged by calling gt.load.voto_concat_datasets. This completes the import of the data into GliderTools; the remaining steps on this page are optional.
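
As a minimal sketch (the file names here are hypothetical placeholders, and we assume gt.load.voto_concat_datasets accepts a list of datasets), loading and merging two downloaded VOTO netCDF files could look like this:

import glidertools as gt

# load two downloaded VOTO SeaExplorer netCDF files (file names are placeholders)
ds1 = gt.load.voto_seaexplorer_nc('voto_mission_01.nc')
ds2 = gt.load.voto_seaexplorer_nc('voto_mission_02.nc')

# merge the two deployments into a single dataset for further processing
ds = gt.load.voto_concat_datasets([ds1, ds2])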

Load variables

From the variable listing, one can choose multiple variables to load. Note that one only needs the variable name to load the data. Below, we’ve created a list of variables that we’ll be using for this demo.

The gt.load.seaglider_basestation_netCDFs function is used to load a list of variables. It requires the filename string or list (as described above) and the list of variable names (keys). These variables may not all be sampled at the same frequency. In that case, the loading function will load each sampling frequency dimension separately. The function will try to find a time variable for each sampling frequency/dimension.

Coordinates and automatic time fetching

All associated coordinate variables will also be loaded with the data if coordinates are documented. These may include latitude, longitude, depth and time (naming may vary). If time cannot be found for a dimension, a time variable from a different dimension with the same number of observations is used instead. This ensures that data can be merged based on the time of sampling.

Merging data based on time

If return_merged is set to True, the function will merge the dimensions that have an associated time variable.

The function returns a dictionary of xarray.Datasets. xarray is a Python package that deals with coordinate-indexed multi-dimensional arrays; we recommend that you read its documentation (http://xarray.pydata.org/en/stable/) as the package is used throughout GliderTools. This allows the original metadata to be copied with the data. The dictionary keys are the names of the dimensions. If return_merged is set to True, an additional entry under the key merged will be included.

The structure of a dimension output is shown below. Note that the merged data will use the largest dimension as the primary dataset and the other data will be merged onto that time index. Data is linearly interpolated to the nearest time measurement of the primary index, but only by one measurement to ensure transparency.

names = [
    'ctd_depth',
    'ctd_time',
    'ctd_pressure',
    'salinity',
    'temperature',
    'eng_wlbb2flvmt_Chlsig',
    'eng_wlbb2flvmt_wl470sig',
    'eng_wlbb2flvmt_wl700sig',
    'aanderaa4330_dissolved_oxygen',
    'eng_qsp_PARuV',
]

ds_dict = gt.load.seaglider_basestation_netCDFs(
    filenames, names,
    return_merged=True,
    keep_global_attrs=False
)
DIMENSION: sg_data_point
{
    ctd_pressure, eng_wlbb2flvmt_wl470sig, eng_wlbb2flvmt_wl700sig, temperature,
    ctd_time, ctd_depth, latitude, aanderaa4330_dissolved_oxygen, salinity,
    eng_wlbb2flvmt_Chlsig, longitude
}


100%|██████████| 336/336 [00:04<00:00, 73.66it/s]



DIMENSION: qsp2150_data_point
{eng_qsp_PARuV, time}


100%|██████████| 336/336 [00:01<00:00, 181.67it/s]



Merging dimensions on time indicies: sg_data_point, qsp2150_data_point,

The returned data contains the dimensions of the requested variables; a merged object is also returned if return_merged=True.


print(ds_dict.keys())
dict_keys(['sg_data_point', 'qsp2150_data_point', 'merged'])

Metadata handling

If the keyword argument keep_global_attrs=True, the global attributes from the original netCDF files (those that are the same across all files) are passed on to the output Datasets. The variable attributes (units, comments, axis…) are passed on by default, but this can also be set to False if not wanted. GliderTools functions will automatically pass these attributes on to function outputs if an xarray.DataArray with attributes is given. All functions applied to the data will also be recorded under the variable attribute processing.
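
As a small illustration (attribute names other than processing depend on what is present in the source files), the attributes on a loaded variable can be inspected with standard xarray attribute access:

# units and comments are carried over from the original netCDF files
print(ds_dict['merged'].salinity.attrs.get('units'))

# the record of applied GliderTools functions is stored under 'processing'
print(ds_dict['merged'].salinity.attrs.get('processing', 'no processing steps recorded yet'))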

The merged dataset contains all the data interpolated to the nearest observation of the longest dimension. The metadata is also shown in the example below.

ds_dict['merged']
<xarray.Dataset>
Dimensions:                        (merged: 382151)
Coordinates:
    ctd_depth                      (merged) float64 -0.08821 0.018 ... -0.1422
    latitude                       (merged) float64 -42.7 -42.7 ... -43.0 -43.0
    longitude                      (merged) float64 8.744 8.744 ... 8.5 8.5
    ctd_time_dt64                  (merged) datetime64[ns] 2015-12-08T07:36:16 ...

Dimensions without coordinates: merged
Data variables:
    ctd_pressure                   (merged) float64 -0.08815 0.01889 ... -0.1432
    eng_wlbb2flvmt_wl470sig        (merged) float64 375.0 367.0 ... 98.0 91.0
    eng_wlbb2flvmt_wl700sig        (merged) float64 2.647e+03 ... 137.0
    temperature                    (merged) float64 11.55 11.54 ... 11.06 10.97
    ctd_time                       (merged) float64 1.45e+09 ... 1.455e+09
    aanderaa4330_dissolved_oxygen  (merged) float64 nan nan nan ... 269.1 269.1
    salinity                       (merged) float64 nan nan nan ... 34.11 34.11
    eng_wlbb2flvmt_Chlsig          (merged) float64 145.0 126.0 ... 215.0 215.0
    dives                          (merged) float64 1.0 1.0 1.0 ... 344.5 344.5
    eng_qsp_PARuV                  (merged) float64 0.551 0.203 ... 0.021 0.023
    time                           (merged) float64 1.45e+09 ... 1.455e+09
    time_dt64                      (merged) datetime64[ns] 2015-12-08T07:36:16 ...

Attributes:
    date_created:             2019-07-11 14:08:40
    number_of_dives:          344.0
    files:                    ['p5420001.nc', 'p5420002.nc', 'p5420004.nc', '...
    time_coverage_start:      2015-12-08 07:36:16
    time_coverage_end:        2016-02-08 04:39:04
    geospatial_vertical_min:  -0.6323553853732649
    geospatial_vertical_max:  1011.1000623417478
    geospatial_lat_min:       -43.085757609206
    geospatial_lat_max:       -42.70088638031523
    geospatial_lon_min:       8.29983279020758
    geospatial_lon_max:       8.7753734452125
    processing:               [2019-07-11 14:08:40] imported data with Glider...

Renaming for ease of access

When renaming, just be sure that no existing variables already use the names you are renaming to. In the example below we remove time in case it exists in the files.

# Here we drop the time variables imported for the PAR variable
# we don't need these anymore. You might have to change this
# depending on the dataset
merged = ds_dict['merged']
if 'time' in merged:
    merged = merged.drop(["time", "time_dt64"])


# To make it easier and clearer to work with, we rename the
# original variables to something that makes more sense. This
# is done with the xarray.Dataset.rename({}) function.
# We only use the merged dataset as this contains all the
# imported dimensions.
# NOTE: The renaming has to be specific to the dataset otherwise an error will occur
dat = merged.rename({
    'salinity': 'salt_raw',
    'temperature': 'temp_raw',
    'ctd_pressure': 'pressure',
    'ctd_depth': 'depth',
    'ctd_time_dt64': 'time',
    'ctd_time': 'time_raw',
    'eng_wlbb2flvmt_wl700sig': 'bb700_raw',
    'eng_wlbb2flvmt_wl470sig': 'bb470_raw',
    'eng_wlbb2flvmt_Chlsig': 'flr_raw',
    'eng_qsp_PARuV': 'par_raw',
    'aanderaa4330_dissolved_oxygen': 'oxy_raw',
})

print(dat)

# variable assignment for convenient access
depth = dat.depth
dives = dat.dives
lats = dat.latitude
lons = dat.longitude
time = dat.time
pres = dat.pressure
temp = dat.temp_raw
salt = dat.salt_raw
par = dat.par_raw
bb700 = dat.bb700_raw
bb470 = dat.bb470_raw
fluor = dat.flr_raw

# name coordinates for quicker plotting
x = dat.dives
y = dat.depth

Quality Control

Note that this summary carries on from the Loading data page.

The cleaning module contains several tools that help to remove erroneous data - profiles or points. These filters can be applied globally (IQR and standard deviation limits), vertically (running average filters) or horizontally (horizontal filters on gridded data only).

There are two approaches one can use to clean data: 1) filtering out bad points/dives; 2) smoothing the data.

Original Data

Below we use salinity to demonstrate the different functions available to users.

dives = dat.dives
depth = dat.depth
salt = dat.salt_raw

x = np.array(dives)  # ensures these are arrays
y = np.array(depth)

gt.plot(dives, depth, salt, cmap=cmo.haline, robust=True)
plt.xlim(50, 150)
plt.title('Original Data')
plt.show()

_images/output_14_0.pngpng

Global filtering: outlier limits (IQR & STD)

These functions find upper and lower limits for data outliers using the interquartile range or the standard deviation of the entire dataset. Multipliers can be set to make the filters more or less strict.

salt_iqr = gt.cleaning.outlier_bounds_iqr(salt, multiplier=1.5)
salt_std = gt.cleaning.outlier_bounds_std(salt, multiplier=1.5)

# Plotting
gt.plot(x, y, salt_iqr, cmap=cmo.haline, robust=True)
plt.title('Outlier Bounds IQR Method')
plt.xlim(50,150)

gt.plot(x, y, salt_std, cmap=cmo.haline, robust=True)
plt.title('Outlier Bounds Stdev Method')
plt.xlim(50,150)

plt.show()

_images/output_16_0.pngpng _images/output_16_1.pngpng

Horizontal filtering: differential outliers

Erroneous measurements often occur sequentially - i.e. in the vertical. Vertical filtering approaches may thus miss these outliers, as rolling windows are often used. It is therefore useful to have an approach that compares dives in the horizontal. horizontal_diff_outliers first grids the data and then calculates where gradients (rolling mean - measurement) are outliers (same as outlier_bounds_std). If a certain fraction of measurements in a dive exceeds the threshold, that dive is deemed a bad dive. The example below shows three dives that have anomalous measurements. These fall well within the global bounds of acceptable data, but horizontally they are outliers and are masked out.

salt_horz = gt.cleaning.horizontal_diff_outliers(
    x, y, salt,
    multiplier=3,
    depth_threshold=400,
    mask_frac=0.1
)

gt.plot(x, y, salt, cmap=cmo.haline)
plt.title('Original dataset')
plt.xlim(150,250)
plt.show()

gt.plot(x, y, salt_horz, cmap=cmo.haline)
plt.title('Horizontal Differential Outliers removed')
plt.xlim(150,250)
plt.show()

_images/output_19_0.pngpng _images/output_19_1.pngpng

Vertical smoothing approaches

Despiking

This approach was used by Briggs et al. (2011). The idea is to apply a rolling filter to the data (along the time dimension). This forms the baseline. The difference between the original data and the baseline gives the spikes.

There are two rolling filters that can be applied to the data. The median approach is the equivalent of a rolling median. The minmax approach first applies a rolling minimum and then rolling maximum to data. This is useful particularly for optics data where spikes are particles in the water column and are not normally distributed.

In the case of salinity, the median approach is likely best, as “spikes” would be positive and negative (Gaussian distribution).

salt_base, salt_spike = gt.cleaning.despike(salt, window_size=5, spike_method='median')

fig, ax = plt.subplots(2, 1, figsize=[9, 6], sharex=True, dpi=90)

gt.plot(x, y, salt_base, cmap=cmo.haline, ax=ax[0])
ax[0].set_title('Despiked using median filter')
ax[0].cb.set_label('Salinity baseline')
ax[0].set_xlim(50,150)
ax[0].set_xlabel('')

gt.plot(x, y, salt_spike, cmap=cm.RdBu_r, vmin=-6e-3, vmax=6e-3, ax=ax[1])
ax[1].cb.set_label('Salinity spikes')
ax[1].set_xlim(50,150)

plt.xticks(rotation=0)
plt.show()

_images/output_22_0.pngpng

Rolling window

The rolling window method simply applies an aggregating function (mean, median, std, min, max) to the dataset. Because the above example is equivalent to a rolling median, we show what a rolling 75th percentile looks like instead.

This could be used to create additional filters by users. Note that in this more complex example we create a wrapper function for the percentile so that we can tell the percentile function that we want the 75th percentile and we want to calculate this along the nth axis.

def seventy_fifth(x, axis=0):
    # wrapper function so we can pass axis and percentile to
    # the input function
    return np.percentile(x, 75, axis=axis)

# other numpy functions also work: np.mean, np.median, np.std
salt_roll75 = gt.cleaning.rolling_window(salt, seventy_fifth, window=5)
salt_rollavg = gt.cleaning.rolling_window(salt, np.mean, window=5)

# PLOTTING
fig, ax = plt.subplots(2, 1, figsize=[9, 6], sharex=True, dpi=90)

gt.plot(x, y, salt_roll75, cmap=cmo.haline, ax=ax[0])
ax[0].set_title('75$^{th}$ for a rolling window with size 5')
ax[0].cb.set_label('Salinity baseline')
ax[0].set_xlim(50,150)
ax[0].set_xlabel('')

gt.plot(x, y, salt_roll75 - salt, cmap=cm.RdBu_r, vmin=-6e-3, vmax=6e-3, ax=ax[1])
ax[1].cb.set_label('Difference from original data')
ax[1].set_xlim(50,150)

plt.xticks(rotation=0)
plt.show()

_images/output_24_0.pngpng

Savitzky-Golay

The Savitzky-Golay function fits a low-order polynomial to a rolling window of the time series. This has the effect of smoothing the data. A larger window with a lower-order polynomial will give a smoother fit.

We recommend a 2nd order kernel. Here we use first order to show that the difference can be quite big.

salt_savgol = gt.cleaning.savitzky_golay(salt, window_size=11, order=1)

# PLOTTING
fig, ax = plt.subplots(2, 1, figsize=[9, 6], sharex=True, dpi=90)

gt.plot(x, y, salt_savgol, cmap=cmo.haline, ax=ax[0])
ax[0].set_title('Smoothing the data with Savitzky-Golay')
ax[0].cb.set_label('Smoothed salinity')
ax[0].set_xlim(50,150)
ax[0].set_xlabel('')

gt.plot(x, y, salt_savgol - salt, cmap=cm.RdBu, vmin=-6e-3, vmax=6e-3, ax=ax[1])
ax[1].cb.set_label('Difference from original')
ax[1].set_xlim(50,150)

plt.show()

_images/output_26_0.pngpng

Wrapper functions

Wrapper functions have been designed to make this process more efficient; this is demonstrated below with temperature and salinity.

temp_qc = gt.calc_physics(temp, x, y,
                          iqr=1.5, depth_threshold=0,
                          spike_window=5, spike_method='median',
                          savitzky_golay_window=11, savitzky_golay_order=2)

# PLOTTING
fig, ax = plt.subplots(3, 1, figsize=[9, 8.5], sharex=True, dpi=90)

gt.plot(x, y, temp, cmap=cmo.thermal, ax=ax[0])
gt.plot(x, y, temp_qc, cmap=cmo.thermal, ax=ax[1])
gt.plot(x, y, temp_qc - temp, cmap=cm.RdBu_r, vmin=-0.05, vmax=0.05, ax=ax[2])

[a.set_xlabel('') for a in ax]

ax[0].cb.set_label('Original Data')
ax[1].cb.set_label('Cleaned Data')
ax[2].cb.set_label('Difference from Original')

plt.show()
==================================================
Physics Variable:
    Removing outliers with IQR * 1.5: 0 obs
    Removing spikes with rolling median (spike window=5)
    Smoothing with Savitzky-Golay filter (window=11, order=2)

_images/output_28_1.pngpng

salt_qc = gt.calc_physics(salt, x, y,
                          mask_frac=0.2, iqr=2.5,
                          spike_window=5, spike_method='median',
                          savitzky_golay_window=11, savitzky_golay_order=2)

# PLOTTING
fig, ax = plt.subplots(3, 1, figsize=[9, 8.5], sharex=True, dpi=90)

gt.plot(x, y, salt, cmap=cmo.haline, ax=ax[0])
gt.plot(x, y, salt_qc, cmap=cmo.haline, ax=ax[1])
gt.plot(x, y, salt_qc - salt, cmap=cm.RdBu_r, vmin=-0.02, vmax=0.02, ax=ax[2])

[a.set_xlabel('') for a in ax]

ax[0].cb.set_label('Original Data')
ax[1].cb.set_label('Cleaned Data')
ax[2].cb.set_label('Difference from Original')

plt.show()
==================================================
Physics Variable:
    Removing outliers with IQR * 2.5: 1551 obs
    Removing spikes with rolling median (spike window=5)
    Removing horizontal outliers (fraction=0.2, multiplier=2.5)
    Smoothing with Savitzky-Golay filter (window=11, order=2)

_images/output_29_1.pngpng

dat['temp_qc'] = temp_qc
dat['salt_qc'] = salt_qc

Secondary physical variables

Density

GliderTools provides a wrapper to calculate potential density. This is done by calculating absolute salinity and potential temperature, from which potential density is derived. A reference pressure of 0 dbar is used by default.

dens0 = gt.physics.potential_density(salt_qc, temp_qc, pres, lats, lons)
dat['density'] = dens0
gt.plot(dat.dives, dat.depth, dens0, cmap=cmo.dense)
plt.xlim(50,150)
plt.show()

_images/output_36_0.pngpng

Mixed Layer Depth

import matplotlib.pyplot as plt
mld = gt.physics.mixed_layer_depth(dat, 'density', verbose=False)
mld_smoothed = mld.rolling(10, min_periods=3).mean()

mld_mask = gt.utils.mask_below_depth(dat, mld)
mld_grid = gt.grid_data(dat.dives, dat.depth, mld_mask, verbose=False)

fig, ax = plt.subplots(1, 2, figsize=[9, 3], dpi=100, sharey=True)

mld_smoothed.plot(ax=ax[0])
gt.plot(mld_grid, ax=ax[1])

[a.set_ylim(100, 0) for a in ax]

ax[0].set_ylabel('Depth (m)')
[a.set_xlabel('Dives') for a in ax]
plt.xticks(rotation=0)

fig.tight_layout()
/Users/luke/Git/GliderTools/glidertools/helpers.py:61: GliderToolsWarning:

Primary input variable is not xr.DataArray data type - no metadata to pass on.

_images/output_38_1.pngpng

Optics (BB, PAR, Chl)

The optics module contains functions that process backscatter, PAR and fluorescence.

There is a wrapper function for each of these variables that applies several functions related to cleaning and processing. We show each step of the wrapper function separately and then summarise with the wrapper function.

Backscatter

theta = 124
xfactor = 1.076

gt.plot(x, y, bb700, cmap=cmo.delta, vmin=60, vmax=200)
xlim(200,340)
title('Original Data')
show()

_images/output_41_0.pngpng

Outlier bounds method

See the cleaning section for more information on the gt.cleaning.outlier_bounds_* functions.

bb700_iqr = gt.cleaning.outlier_bounds_iqr(bb700, multiplier=3)
bb700_std = gt.cleaning.outlier_bounds_std(bb700, multiplier=3)

fig, ax = plt.subplots(2, 1, figsize=[9, 6], sharex=True, dpi=90)

gt.plot(x, y, bb700_iqr, cmap=cmo.delta, ax=ax[0], vmin=60, vmax=200)
gt.plot(x, y, bb700_std, cmap=cmo.delta, ax=ax[1], vmin=60, vmax=200)

[a.set_xlabel('') for a in ax]
[a.set_xlim(200, 340) for a in ax]

ax[0].set_title('Outlier IQR')
ax[1].set_title('Outlier STD')

plt.show()

_images/output_43_0.pngpng

Removing bad profiles

This function masks bad dives based on the mean + std × multiplier or median + std × multiplier at a reference depth (in this example, the median with a multiplier of 1 is used).

# find_bad_profiles returns boolean mask and dive numbers
# we index only the mask
bad_profiles = gt.optics.find_bad_profiles(dives, depth, bb700,
                                           ref_depth=300,
                                           stdev_multiplier=1,
                                           method='median')[0]

fig, ax = plt.subplots(2, 1, figsize=[9, 6], sharex=True, dpi=90)
# ~ reverses True to False and vice versa - i.e. we keep only the good profiles
gt.plot(x, y, bb700, cmap=cmo.delta, ax=ax[0], vmin=60, vmax=200)
gt.plot(x, y, bb700.where(~bad_profiles), cmap=cmo.delta, ax=ax[1], vmin=60, vmax=200)

[a.set_xlabel('') for a in ax]
[a.set_xlim(40, 120) for a in ax]

ax[0].set_title('All backscatter data')
ax[1].set_title('Bad profiles masked')

plt.show()

_images/output_45_0.pngpng

Conversion from counts to total backscatter

The scale and offset function uses the factory calibration dark count and scale factor.

The bback total function uses the coefficients from Zhang et al. (2009) to convert the raw counts into total backscatter (m$^{-1}$), correcting for temperature and salinity. The $\chi$ factor and $\theta$ in this example were taken from Sullivan et al. (2013) and Slade & Boss (2015).

beta = gt.flo_functions.flo_scale_and_offset(bb700.where(~bad_profiles), 49, 3.217e-5)
bbp = gt.flo_functions.flo_bback_total(beta, temp_qc, salt_qc, theta, 700, xfactor)

fig, ax = plt.subplots(2, 1, figsize=[9, 6], sharex=True, dpi=90)

gt.plot(x, y, beta, cmap=cmo.delta, ax=ax[0], robust=True)
gt.plot(x, y, bbp, cmap=cmo.delta, ax=ax[1], robust=True)

[a.set_xlabel('') for a in ax]
[a.set_xlim(200, 340) for a in ax]
[a.set_ylim(400, 0) for a in ax]

ax[0].set_title('$\u03B2$')
ax[1].set_title('b$_{bp}$ (m$^{-1}$)')

plt.show()

_images/output_47_0.pngpng

Correcting for an in situ dark count

Sensor drift from factory calibration requires an additional correction, the calculation of a dark count in situ. This is calculated from the 95th percentile of backscatter measurements between 200 and 400m.

bbp = gt.optics.backscatter_dark_count(bbp, depth)

gt.plot(x, y, bbp, cmap=cmo.delta, robust=True)
xlim(200,340)
title('b$_{bp}$ (m$^{-1}$)')
show()

_images/output_49_0.pngpng

Despiking

The methods outlined in Briggs et al. (2011) are followed to both identify spikes in backscatter and remove them from the baseline backscatter signal. The spikes are retained, as these data can be used to address specific science questions, but their presence can decrease the accuracy of the fluorescence quenching function.

bbp_horz = gt.cleaning.horizontal_diff_outliers(x, y, bbp, depth_threshold=10, mask_frac=0.05)
bbp_baseline, bbp_spikes = gt.cleaning.despike(bbp_horz, 7, spike_method='minmax')


fig, ax = plt.subplots(2, 1, figsize=[9, 6], sharex=True, dpi=90)

gt.plot(x, y, bbp_baseline, cmap=cmo.delta, ax=ax[0], robust=True)
gt.plot(x, y, bbp_spikes, ax=ax[1], cmap=cm.Spectral_r, vmin=0, vmax=0.004)

[a.set_xlabel('') for a in ax]
[a.set_xlim(200, 340) for a in ax]

ax[0].set_title('Despiked b$_{bp}$ (m$^{-1}$)')
ax[1].set_title('b$_{bp}$ (m$^{-1}$) spikes')

plt.show()

_images/output_51_0.pngpng

Adding the corrected variables to the original dataframe

dat['bbp700'] = bbp_baseline
dat['bbp700_spikes'] = bbp_spikes

Wrapper function demonstration

A wrapper function was also designed, which is demonstrated below with the second wavelength (700 nm). The default option is for verbose to be True, which will provide an output of the different processing steps.

bbp_baseline, bbp_spikes = gt.calc_backscatter(
    bb700, temp_qc, salt_qc, dives, depth, 700, 49, 3.217e-5,
    spike_window=11, spike_method='minmax', iqr=2., profiles_ref_depth=300,
    deep_multiplier=1, deep_method='median', verbose=True)

dat['bbp700'] = bbp_baseline
dat['bbp700_spikes'] = bbp_spikes

ax = gt.plot(x, y, dat.bbp700, cmap=cmo.delta)

ax.set_xlim(200, 340)

plt.show()
==================================================
bb700:
	Removing outliers with IQR * 2.0: 8606 obs
	Mask bad profiles based on deep values (depth=300m)
	Number of bad profiles = 27/672
	Zhang et al. (2009) correction
	Dark count correction
	Spike identification (spike window=11)

_images/output_55_1.pngpng

bbp_baseline, bbp_spikes = gt.calc_backscatter(
    bb470, temp_qc, salt_qc, dives, depth, 470, 47, 1.569e-5,
    spike_window=7, spike_method='minmax', iqr=3, profiles_ref_depth=300,
    deep_multiplier=1, deep_method='median', verbose=True)

dat['bbp470'] = bbp_baseline
dat['bbp470_spikes'] = bbp_spikes

gt.plot(x, y, dat.bbp470, cmap=cmo.delta)
plt.show()
==================================================
bb470:
	Removing outliers with IQR * 3: 2474 obs
	Mask bad profiles based on deep values (depth=300m)
	Number of bad profiles = 16/672
	Zhang et al. (2009) correction
	Dark count correction
	Spike identification (spike window=7)

_images/output_56_1.pngpng

PAR

PAR Scaling

This function uses the factory calibration to convert from $\mu$V to $\mu$E m$^{-2}$ s$^{-1}$.

par_scaled = gt.optics.par_scaling(par, 6.202e-4, 10.8)

fig, ax = plt.subplots(2, 1, figsize=[9, 6], sharex=True, dpi=90)

gt.plot(x, y, par, cmap=cmo.solar, ax=ax[0], robust=True)
gt.plot(x, y, par_scaled, cmap=cmo.solar, ax=ax[1], robust=True)

[a.set_xlabel('') for a in ax]
[a.set_xlim(200, 340) for a in ax]
[a.set_ylim(70, 0) for a in ax]

ax[0].set_title('PAR ($\mu$V)')
ax[1].set_title('PAR ($\mu$E m$^{-2}$ s$^{-1}$)')

plt.show()

_images/output_59_0.pngpng

Correcting for an in situ dark count

Sensor drift from factory calibration requires an additional correction, the calculation of a dark count in situ. This is calculated from the median of PAR measurements, with additional masking applied for values before 23:01 and outside the 90th percentile.

par_dark = gt.optics.par_dark_count(par_scaled, dives, depth, time)

gt.plot(x, y, par_dark, robust=True, cmap=cmo.solar)
xlim(200,340)
ylim(70,0)
title('PAR ($\mu$E m$^{-2}$ s$^{-1}$)')
show()

_images/output_61_0.pngpng

PAR replacement

This function removes the top 5 metres from each dive profile, and then algebraically recalculates the surface PAR using an exponential equation.

par_filled = gt.optics.par_fill_surface(par_dark, dives, depth, max_curve_depth=80)
par_filled[par_filled < 0] = 0
par_filled = par_filled.fillna(0)
i = dives == 232

fig, ax = subplots(1, 2, figsize=[6,6], dpi=100)

ax[0].plot(par_dark[i], depth[i], lw=0.5, marker='o', ms=5)
ax[0].plot(par_filled[i], depth[i], lw=0.5, marker='o', ms=3)
ax[1].plot(par_filled[i] - par_dark[i], depth[i], lw=0, marker='o')

ax[0].set_ylim(80,0)
ax[0].set_ylabel('Depth (m)')
ax[0].set_xlabel('PAR ($\mu$E m$^{-2}$ s$^{-1}$)')

ax[1].set_ylim(80,0)
ax[1].set_xlim(-350,350)
ax[1].set_yticklabels('')
ax[1].set_xlabel('Difference between profiles')

fig.tight_layout()
plt.show()

_images/output_64_0.pngpng

gt.plot(x, y, par_filled, robust=True, cmap=cmo.solar)
xlim(200,340)
ylim(100,0)
title('PAR ($\mu$E m$^{-2}$ s$^{-1}$)')
show()

_images/output_65_0.pngpng

Wrapper function demonstration

par_qc = gt.calc_par(par, dives, depth, time,
                     6.202e-4, 10.8,
                     curve_max_depth=80,
                     verbose=True).fillna(0)

gt.plot(x, y, par_qc, robust=True, cmap=cmo.solar)
ylim(80, 0)
show()
==================================================
PAR
	Dark correction
	Fitting exponential curve to data

_images/output_67_1.pngpng

Deriving additional variables

Euphotic Depth and Light attenuation coefficient

euphotic_depth, kd = gt.optics.photic_depth(
    par_filled, dives, depth,
    return_mask=False,
    ref_percentage=1
)
fig, ax = subplots(1, 1, figsize=[6,4], dpi=100)
p1 = plot(euphotic_depth.index, euphotic_depth, label='Euphotic Depth')
ylim(120,0)
ylabel('Euphotic Depth (m)')
xlabel('Dives')
ax2 = ax.twinx()
p2 = plot(kd.index, kd, color='orange', lw=0, marker='o', ms=2, label='K$_d$')
ylabel('K$_d$', rotation=270, labelpad=20)

lns = p1+p2
labs = [l.get_label() for l in lns]
ax2.legend(lns, labs, loc=3, numpoints=1)

show()

_images/output_71_0.pngpng

Fluorescence

Quenching Correction Method as outlined in Thomalla et al. (2017)

gt.plot(x, y, fluor, cmap=cmo.delta, robust=True)
xlim(150,300)
title('Original Data')
show()

_images/output_74_0.pngpng

Outlier bounds method

flr_iqr = gt.cleaning.outlier_bounds_iqr(fluor, multiplier=3)

gt.plot(x, y, flr_iqr, cmap=cmo.delta, robust=True)
title('Outlier Bounds IQR Method')
xlim(150,300)
show()

_images/output_76_0.pngpng

Removing bad profiles

This function masks bad dives based on the mean + std × multiplier or median + std × multiplier at a reference depth (in this example, the mean with a multiplier of 4 is used).

bad_profiles = gt.optics.find_bad_profiles(dives, depth, flr_iqr,
                                           ref_depth=300,
                                           stdev_multiplier=4,
                                           method='mean')
flr_goodprof = flr_iqr.where(~bad_profiles[0])

fig, ax = plt.subplots(2, 1, figsize=[9, 6], sharex=True, dpi=90)

gt.plot(x, y, flr_iqr, cmap=cmo.delta, ax=ax[0], robust=True)
gt.plot(x, y, flr_goodprof, cmap=cmo.delta, ax=ax[1], robust=True)

[a.set_xlabel('') for a in ax]
[a.set_xlim(90, 150) for a in ax]
[a.set_ylim(300, 0) for a in ax]

ax[0].set_title('Bad Profiles Included')
ax[1].set_title('Bad Profiles Discarded')

plt.show()

_images/output_78_0.pngpng

Correcting for an in situ dark count

Sensor drift from factory calibration requires an additional correction, the calculation of a dark count in situ. This is calculated from the 95th percentile of fluorescence measurements between 300 and 400m.

flr_dark = gt.optics.fluorescence_dark_count(flr_iqr, dat.depth)

gt.plot(x, y, flr_dark, cmap=cmo.delta, robust=True)
xlim(150,300)
show()

_images/output_80_0.pngpng

Despiking

flr_base, flr_spikes = gt.cleaning.despike(flr_dark, 11, spike_method='median')

fig, ax = plt.subplots(2, 1, figsize=[9, 6], sharex=True, dpi=90)

gt.plot(x, y, flr_base, cmap=cmo.delta, ax=ax[0], robust=True)
gt.plot(x, y, flr_spikes, cmap=cm.RdBu_r, ax=ax[1], vmin=-5, vmax=5)

[a.set_xlabel('') for a in ax]
[a.set_xlim(150, 300) for a in ax]
[a.set_ylim(300, 0) for a in ax]

ax[0].set_title('Despiked Fluorescence')
ax[1].set_title('Fluorescence spikes')

plt.show()

_images/output_82_0.pngpng

Quenching Correction

This function uses the method outlined in Thomalla et al. (2017). Briefly, it calculates the quenching depth and then performs the quenching correction based on the fluorescence to backscatter ratio. The quenching depth is calculated from the difference between night and daytime fluorescence.

The default setting is for the preceding night to be used to correct the following day’s quenching (night_day_group=True). This can be changed so that the following night is used to correct the preceding day. The quenching depth is then found from the difference between the night and daytime fluorescence, using the steepest gradient of the five smallest differences and the points where the difference changes sign (positive/negative).

The function calculates the backscatter/fluorescence ratio from the quenching depth to the surface, and then calculates a mean nighttime ratio for each night. The quenching ratio is calculated from the nighttime and daytime ratios, and is then applied to the fluorescence to correct for quenching. If the corrected value is less than the raw value, the function returns the original raw data.

flr_qc, quench_layer = gt.optics.quenching_correction(
    flr_base, dat.bbp470, dives, depth, time, lats, lons,
    sunrise_sunset_offset=1, night_day_group=True)

fig, ax = plt.subplots(2, 1, figsize=[9, 6], sharex=True, dpi=90)

gt.plot(x, y, flr_qc, cmap=cmo.delta, ax=ax[0], robust=True)
gt.plot(x, y, quench_layer, cmap=cm.RdBu_r, ax=ax[1], vmin=-.5, vmax=2)

[a.set_xlabel('') for a in ax]
[a.set_xlim(150, 300) for a in ax]
[a.set_ylim(100, 0) for a in ax]

ax[0].set_title('Quenching Corrected Fluorescence')
ax[1].set_title('Quenching Layer')

plt.show()

_images/output_84_0.pngpng

Wrapper function

flr_qnch, flr, qnch_layer, [fig1, fig2] = gt.calc_fluorescence(
    fluor, dat.bbp700, dives, depth, time, lats, lons, 53, 0.0121,
    profiles_ref_depth=300, deep_method='mean', deep_multiplier=1,
    spike_window=11, spike_method='median', return_figure=True,
    night_day_group=False, sunrise_sunset_offset=2, verbose=True)

dat['flr_qc'] = flr
==================================================
Fluorescence
	Mask bad profiles based on deep values (ref depth=300m)
	Number of bad profiles = 19/672
	Dark count correction
	Quenching correction
	Spike identification (spike window=11)
	Generating figures for despiking and quenching report

_images/output_86_1.pngpng

_images/output_86_2.pngpng

Calibration with bottle samples

Bottle calibration can also be done using the calibration module.

The bottle file needs to be in a specific format with dates (datetime64 format), depth and the variable values. It can be imported with any method available; we recommend pandas.read_csv as shown in the example below. Note that latitude and longitude are not taken into account, so the user needs to make sure that the CTD cast was at the correct location (time is used to match the glider).

import pandas as pd

fname = '/Users/luke/Work/Publications/2019_Gregor_Front_glider/figures/SOSCEX 3 PS1.csv'
cal = pd.read_csv(fname, parse_dates=['datetime'], dayfirst=True)

The calibration.bottle_matchup function returns an array that matches the size of the ungridded glider data. The matching is done based on depth and time from both the glider and the CTD. The function will show how many samples have been matched and the smallest time difference between a CTD rosette cast and a dive (any time on the dive).

Using depth

%autoreload 2

dat['bottle_sal'] = gt.calibration.bottle_matchup(
    dat.dives, dat.depth, dat.time,
    cal.depth, cal.datetime, cal.sal)

model = gt.calibration.robust_linear_fit(dat.salt_qc, dat.bottle_sal, fit_intercept=True, epsilon=1.5)
dat['salinity_qc'] = model.predict(dat.salt_qc)
[stn 0/5]  FAILED: 2015-07-28 10:25 Couldn't find samples within constraints
[stn 1/5]  FAILED: 2015-07-28 16:15 Couldn't find samples within constraints
[stn 2/5]  FAILED: 2015-12-08 03:23 Couldn't find samples within constraints
[stn 3/5] SUCCESS: 2016-01-05 17:46 (15 of 15 samples) match-up within 0.0 minutes
[stn 4/5] SUCCESS: 2016-02-08 03:14 (12 of 17 samples) match-up within 0.0 minutes
(13, 1) (100, 1)

_images/output_92_1.pngpng

Using Density

%autoreload 2

dat['bottle_sal'] = gt.calibration.bottle_matchup(
    dat.dives, dat.density, dat.time,
    cal.density, cal.datetime, cal.sal)

model = gt.calibration.robust_linear_fit(dat.salt_qc, dat.bottle_sal, fit_intercept=True, epsilon=1.5)
dat['salinity_qc'] = model.predict(dat.salt_qc)
[stn 0/5]  FAILED: 2015-07-28 10:25 Couldn't find samples within constraints
[stn 1/5]  FAILED: 2015-07-28 16:15 Couldn't find samples within constraints
[stn 2/5]  FAILED: 2015-12-08 03:23 Couldn't find samples within constraints
[stn 3/5] SUCCESS: 2016-01-05 17:46 (15 of 15 samples) match-up within 0.0 minutes
[stn 4/5] SUCCESS: 2016-02-08 03:14 (16 of 17 samples) match-up within 0.0 minutes
(6, 1) (100, 1)

_images/output_94_1.pngpng

Gridding and interpolation

Vertical gridding

It is often more convenient and computationally efficient to work with data that has been gridded to a standard vertical grid (i.e. depths have been binned). GliderTools offers very easy to use and efficient tools to grid data once all the processing has been completed.

The first task is to select the bin size of the data that will be gridded. GliderTools automatically selects bin sizes according to the sampling frequency of the dataset in 50 m depth intervals. This is shown in the figure below, where the 2D histogram shows the sampling frequency (by depth) and the line shows the automatically selected bin size rounded up to the nearest 0.5 m.

ax = gt.plot.bin_size(dat.depth, cmap=cm.Blues)
ax.set_xlim(0, 6)
line = ax.get_children()[1]
line.set_linewidth(6)
line.set_color('orange')

legend = ax.get_children()[-2]
legend.set_visible(False)

_images/output_97_0.pngpng

Gridding with automatic bin sizes

Gridding the data then becomes easy with automatic binning. Note that the x-coordinate has to be semi-discrete, e.g. the dive number or the average dive time stamp (an example using the average dive time stamp follows the figure below). You’ll see that the gridding function also returns the mean bin size and the average sampling frequency.

The function can return either an xr.DataArray or a pd.DataFrame. The DataArray is the default as metadata can be stored in these files (including coordinate information).

Gridded data can be passed to the plot function without x- and y-coordinates, as these are contained in the gridded data.

In fact, data is silently passed through the gridding function when x- and y-coordinates are included in the gt.plot function

flr_gridded = gt.grid_data(dives, depth, flr)

ax = gt.plot(flr_gridded, cmap=cmo.delta)
ax.set_ylim(200, 0)
Mean bin size = 1.99
Mean depth binned (50 m) vertical sampling frequency = 2.53

_images/output_99_2.pngpng
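
Since the x-coordinate only needs to be semi-discrete, the per-dive average time stamp can be used instead of the dive number. Below is a minimal sketch using gt.utils.time_average_per_dive (see the API reference); the call signature follows the listing there and the plotting step mirrors the example above:

# average time stamp per dive as a semi-discrete x-coordinate
time_avg = gt.utils.time_average_per_dive(dives, time)

flr_time_gridded = gt.grid_data(time_avg, depth, flr)

ax = gt.plot(flr_time_gridded, cmap=cmo.delta)
ax.set_ylim(200, 0)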

Gridding with manually defined bins

There is also the option to manually define your bins if you’d prefer. A custom bin array needs to be created. Use np.arange to create sections of the bins and combine them with np.r_ as shown below:

custom_bin = np.r_[
    np.arange(0, 100, 0.5),
    np.arange(100, 400, 1.0),
    np.arange(400, 1000, 2.0)]

flr_gridded = gt.grid_data(x, y, flr, bins=custom_bin)

# The plot below is the standard plotting procedure for an xarray.DataArray
gt.plot(flr_gridded, cmap=cmo.delta)
ylim(200, 0)
Mean bin size = 1.25
Mean depth binned (50 m) vertical sampling frequency = 2.53





(200, 0)

_images/output_101_2.pngpng

2D interpolation with objective mapping (Kriging)

Users may want to interpolate data horizontally when working with finescale gradients. Several studies have used the objmap MATLAB function, which performs objective mapping (a.k.a. Kriging). Kriging is an advanced form of inverse-distance-weighted interpolation, where each point’s influence on the interpolation depends on its distance from the interpolation point, with the influence falling off as a Gaussian function. This is an expensive operation when the dataset is large (due to a matrix inversion). The computational cost is reduced by breaking the problem into smaller pieces with a quadtree that iteratively splits the data into smaller subsets.

GliderTools provides a Python implementation of the MATLAB function. We have added parallel capability to speed the processing up, but this operation is still costly and could take several hours if an entire section is interpolated. We thus recommend that smaller sections are interpolated.

# first we select a subset of data (50k points)
subs = dat.isel(merged=slice(0, 50000))

# we then get time values - this makes creating the interpolation grid easier
var = subs.flr_qc
time = subs.time.values
depth = subs.depth
dives = subs.dives
dist = np.r_[0, gt.utils.distance(subs.longitude, subs.latitude).cumsum()]

Part 1: Semivariance

Interpolating any variable requires some knowledge about the spatial autocorrelation of that variable. A semivariogram allows one to get this information from the data. The basic idea of a semivariogram is to assess the similarity between data at different length scales (lags), where a low semivariance shows coherence and a large semivariance shows a mismatch. This information is required to interpolate data with sensible estimates and error estimates.

GliderTools offers a variogram tool (gt.mapping.variogram) that makes the process of finding these parameters a little easier, though a fair deal of subjectivity (depending on the scale of the question at hand) and tinkering is required to make a sensible interpolation.

1.1. Choosing a subset of the data for semivariance estimation

The variogram function selects a number of dives (the number depends on max_points) and performs the analysis on that subset of dives rather than selecting random points. We thus recommend that a subset of the data is used to perform the analysis. In the example below, we take a subset of the data that has particularly high variability that we are interested in preserving; it is limited to the upper 150 m and to dives 30-46 (see the mask in the code below). This should be tailored to the variable that you’re interested in.

m = (depth<150) & (dives > 30) & (dives < 46)
ax = gt.plot(dives, depth, var)
ax.plot(dives[m], depth[m], '-m', ms=3, alpha=0.7)
[<matplotlib.lines.Line2D at 0x1c728526d8>]

_images/output_106_1.pngpng

1.2. Initial estimate of semivariance

We can now find an initial estimate of the semivariance. This initial estimate will not scale the x/y coordinates for anisotropy (different scales of variability). The variogram function also accepts a boolean mask as a keyword argument; this reduces the input data to the subset that you’ve chosen.

The example below shows this initial estimate. We’re looking for an estimate where the Gaussian model fits the semivariance as well as possible, given that the variance parameters are acceptable. These variance parameters are: sill, nugget, and the x and y length scales. The function automatically adjusts the range to be one and scales the x and y parameters accordingly.

The variogram function can take time (datetime64), but we use distance (in metres) to demonstrate the anisotropic scaling.

vargram = gt.mapping.variogram(var, dist, depth, dives, mask=m)

_images/output_108_0.pngpng

The example above shows that x and y are scaled, but the Gaussian model does not fit the semivariance very well. The range is 1, because it is scaled accordingly. The sill and nugget are very similar - this is not a good result.

1.3. Finding the correct x and y length scales (anisotropy)

We can now scale the data with the xy_ratio. The ratio represents the scaling of x/y. For example, if x and y are both in metres (as in this case), we need to set a small xy_ratio as x has a much longer lengthscale. With some trial and error we choose a ratio of 0.0005, which fits the semivariogram relatively well and has a reasonably low y scaling estimate.

You’ll see that the Gaussian model does not fit the semivariance exactly - this is OK. The important thing is that the first plateau matches the sill.

We can now use these values for interpolating.

vargram = gt.mapping.variogram(var, dist, depth, dives, mask=m, xy_ratio=0.0005)

_images/output_111_0.pngpng

2. Interpolation

2.1 Preparing the interpolation grid

To perform the interpolation we first need to create the grid onto which data will be interpolated. In the example below we use distance from the origin as the x-coordinate. Time can also be used and has to be in a np.datetime64 format - we show a commented example of this. The y-coordinate is depth.

# creating the x- and y-interpolation coordinates
# and a 1m vertical grid and a horizontal grid with 500 points
xi = np.linspace(dist.min(), dist.max(), 500)
yi = np.arange(0, depth[var.notnull()].max(), 1, dtype=float)

# time can also be used. This is a commented example of how to create
# a time grid for interpolation.
# xi = np.arange(time.min(), time.max(), 30, dtype='datetime64[m]')

2.2 Interpolation with the semivariance parameters

The interpolation has a number of parameters that can be changed or adapted to the dataset at hand. The commented inputs below describe these inputs.

%autoreload 2

interpolated = gt.mapping.interp_obj(
    dist, depth, var, xi, yi,

    # Kriging interpolation arguments
    partial_sill=1.1e4,  # taken from the semivariogram (sill - nugget)
    nugget=3e3,  # taken from the semivariogram
    lenscale_x=98942,  # in hours if x and xi are in datetime64
    lenscale_y=50,  # the vertical gridding influence
    detrend=True,  # if True use linear regression (z - z_hat), if False use average (z - z_mean)

    # Quadtree arguments
    max_points_per_quad=65,  # an optimisation setting; ~100 is good
    min_points_per_quad=8,  # if neighbours have < N points, look at their neighbours

    # Parallel calculation inputs.
    n_cpus=3,  # the number of CPUs to use for the calculation - default is n-1
    parallel_chunk_size=512,  # when the dataset is very large, memory can become an issue
                              # this prevents large buildup of parallel results
)
Starting Interpolation with quadtree optimal interpolation
----------------------------------------------------------

Preparing for interpolations:
	Finding and removing nans
	Removing data trend with linear regression
	Building QuadTree

Interpolation information:
	basis points:        25226
	interp grid:         500, 404
	max_points_per_quad: 65
	min_points_per_quad: 8
	number of quads:     952
	detrend_method:      linear_regression
	partial_sill:        11000.0
	nugget:              3000.0
	lengthscales:        X = 98942
	                     Y = 50 m

Processing interpolation chunks in 2 parts over 3 CPUs:
	chunk 1/2 completed in 12s
	chunk 2/2 completed in 10s

Finishing off interoplation
	Adding back the trend
	Creating xarray dataset for output
fig, ax = plt.subplots(3, 1, figsize=[9, 9], sharex=True, dpi=90)

error_mask = (interpolated.variance / interpolated.nugget) < 1.05
interp_robust = interpolated.z.where(error_mask)

props = dict(vmin=0, vmax=300, cmap=cmo.delta)
gt.plot.scatter(dist, depth, var, ax=ax[0], **props)
gt.plot.pcolormesh(interp_robust, ax=ax[1], **props)
gt.plot.pcolormesh(interpolated.variance, ax=ax[2], vmin=interpolated.nugget, vmax=interpolated.nugget*1.08)

ax[2].plot(dist, depth, 'w-', zorder=40, alpha=0.8, lw=0.4)

[a.set_ylim(400, 0) for a in ax]
[a.set_xlabel('  ') for a in ax]

ax[0].get_children()[0].set_sizes([20])
ax[0].set_title('Uninterpolated data')
ax[1].set_title('Interpolated data')
ax[2].set_title('Interpolation variance with dives shown in white')
ax[2].set_xlabel('Distance (m)')

tks = xticks(rotation=0)

_images/output_116_0.pngpng

Saving data

We have not created an explicit way to save data in GliderTools. This is primarily because the package is built on top of two packages that already do this very well: pandas and xarray. pandas is widely used and deals with tabular (2D) data. xarray is widely used in the earth sciences as it supports multi-dimensional indexing (3D+). We highly recommend that you read through the documentation for these packages, as they are incredibly powerful and you would benefit from knowing these tools regardless of whether you use GliderTools or not!

We have written GliderTools primarily with xarray as the backend, due to the ability to store attributes (or metadata) alongside the data - something that pandas does not yet do. Moreover, we have also created the tools so that metadata is passed to the output of each function, while appending the function call to the history attribute. This ensures that the user of the data knows when and what functions (and arguments) were called and for which version of GliderTools this was done.

Examples

First we give an example of how to save and read files to netCDF (which we recommend).

import xarray as xr

# xds is an xarray.Dataset with a record of dimensions, coordinates and variables
xds.to_netcdf('data_with_meta.nc')

# this file can simply be loaded in the same way, without using GliderTools
# all the information that was attached to the data is still in the netCDF
xds = xr.open_dataset('data_with_meta.nc')

In this second example we show how to save the data to a CSV. While this is a common and widely used format, we do not recommend this as the go to format, as all metadata is lost when the file is saved.

import pandas as pd

# If you prefer to save your data as a text file, you can easily do this with Pandas
# note that converting the file to a dataframe discards all the metadata
df = xds.to_dataframe()
df.to_csv('data_without_meta.csv')

# this file can simply be loaded in the same way, without using GliderTools
# there will be no more metadata attached to each variable
df = pd.read_csv('data_without_meta.csv')

# finally, you can also convert the file back to an xarray.Dataset
# however, the metadata will still be lost
xds = df.to_xarray()

Other tools and utilities

3D interactive plot

This is purely for investigative purposes, but provides a good way to interact with the data.

plotly_figure = gt.plot.section3D(
    dat.dives, dat.depth, dat.longitude, dat.latitude, dat.salt_qc,
    zmin=-500, vmax=.999, vmin=.005
)

_images/interactive_plot.pngpng

API Reference

The API reference is automatically generated from the function docstrings in the GliderTools package. Refer to the examples in the sidebar for reference on how to use the functions.

Loading Data

load.seaglider_basestation_netCDFs(files, …) Load a list of variables from the SeaGlider object as a pandas.DataFrame.
load.seaglider_show_variables(files)
load.ego_mission_netCDF(filename) Loads an EGO formatted glider mission file.
load.slocum_geomar_matfile(filename[, verbose]) Load .mat file generated with the geomar MATLAB script for Slocum data.
load.voto_seaexplorer_nc
load.voto_seaexplorer_dataset
load.voto_concat_datasets

High level processing

processing.calc_physics(variable, dives, depth) A standard setup for processing physics variables (temperature, salinity).
processing.calc_oxygen(o2raw, pressure, …) This function processes oxygen.
processing.calc_backscatter(bb_raw, tempC, …) The function processes the raw backscattering data in counts into total backscatter (bbp) in m-1.
processing.calc_fluorescence(flr_raw, bbp, …) This function processes raw fluorescence and corrects for quenching using the Thomalla et al.
processing.calc_par(par_raw, dives, depth, …) Calculates the theoretical PAR based on an exponential curve fit.

Cleaning

cleaning.outlier_bounds_std(arr[, multiplier]) Mask values outside the upper and lower outlier limits by standard deviation
cleaning.outlier_bounds_iqr(arr[, multiplier]) Mask values outside the upper/lower outlier limits by interquartile range:
cleaning.horizontal_diff_outliers(dives, …) Find Z-score outliers (> 3) on the horizontal.
cleaning.mask_bad_dive_fraction(mask, dives, var) Find bad dives - where more than a fraction of the dive is masked
cleaning.data_density_filter(x, y[, …]) Use the 2D density cloud of observations to find outliers for any variables
cleaning.despike(var, window_size[, …]) Return a smooth baseline of data and the anomalous spikes
cleaning.despiking_report(dives, depth, raw, …) A report for the results of cleaning.despike.
cleaning.rolling_window(var, func, window) A rolling window function that is nan-resiliant
cleaning.savitzky_golay(var, window_size, order) Smooth (and optionally differentiate) data with a Savitzky-Golay filter.

Physics

physics.mixed_layer_depth(dives, depth, …) Calculates the MLD for ungridded glider array.
physics.potential_density(salt_PSU, temp_C, …) Calculate density from glider measurements of salinity and temperature.
physics.brunt_vaisala

Optics

optics.find_bad_profiles(dives, depth, var) Find profiles that exceed a threshold at a reference depth.
optics.par_dark_count(par, dives, depth, time) Calculates an in situ dark count from the PAR sensor.
optics.backscatter_dark_count(bbp, depth) Calculates an in situ dark count from the backscatter sensor.
optics.fluorescence_dark_count(flr, depth) Calculates an in situ dark count from the fluorescence sensor.
optics.par_scaling(par_uV, …) Scaling correction for par with factory calibration coefficients.
optics.par_fill_surface(par, dives, depth[, …]) Algebraically calculates the top 5 metres of the par profile.
optics.photic_depth(par, dives, depth[, …]) Algebraically calculates the euphotic depth.
optics.sunset_sunrise(time, lat, lon) Calculates the local sunrise/sunset of the glider location.
optics.quenching_correction(flr, bbp, dives, …) Corrects the fluorescence data based upon Thomalla et al.
optics.quenching_report(flr, flr_corrected, …) A report for the results of optics.quenching_correction.

Calibration

calibration.bottle_matchup(gld_dives, …[, …]) Performs a matchup between glider and bottle samples based on time and depth (or density).
calibration.model_figs(bottle_data, …[, ax]) Creates the figure for a linear model fit.
calibration.robust_linear_fit(gld_var, …) Perform a robust linear regression using a Huber Loss Function to remove outliers.

Gridding and Interpolation

mapping.interp_obj(x, y, z, xi, yi[, …]) Performs objective interpolation (or Kriging) of a 2D field.
mapping.grid_data(x, y, var[, bins, how, …]) Grids the input variable to bins for depth/dens (y) and time/dive (x).
mapping.variogram(variable, horz, vert, dives) Find the interpolation parameters and x and y scaling of the horizontal and vertical coordinate paramters for objective interpolation (Kriging).

Plotting

plot.plot_functions Plot data (gridded or not) as a section and more.

General Utilities

utils.time_average_per_dive(dives, time) Gets the average time stamp per dive.
utils.mask_above_depths
utils.mask_below_depths
utils.mask_profile_depth
utils.merge_dimensions(df1, df2[, interp_lim]) Merges variables measured at different time intervals.
utils.calc_glider_vert_velocity(time, depth) Calculate glider vertical velocity in cm/s
utils.calc_dive_phase(time, depth) Determine the glider dive phase
utils.calc_dive_number(time, depth) Determine the glider dive number (based on dive phase)
utils.dive_phase_to_number(phase)
utils.distance(lon, lat[, ref_idx]) Great-circle distance in m between lon, lat points.
utils.group_by_profiles

Package Structure

_images/package_structure.png (package structure)

What’s New

v2023.xx (unreleased)

New Features

Breaking changes

  • GliderTools defaults for figure creation were changed. Automatic application of plt.tight_layout was dropped in favour of more flexible embedding of GliderTools plots into existing layouts/subplots.
  • The mixed layer depth algorithm was corrected (GH#169, GH#168). By Martin Mohrmann. API change! Existing mixed layer computation code must be adapted.
  • Changed the behaviour of find_dive_phase and calc_dive_number to use a smaller depth threshold when determining a valid dive (15 dbar, down from 200 dbar). This is also now adjustable. (GH#134) By Tom Hull.

Internal changes

v2022.12.13 (2022/12/13)

Internal changes

  • Refactoring and update of testing and development framework, update of flake, black and almost all python dependencies

Breaking changes

  • Fixed processing/calc_oxygen (GH#116, GH#112). By Callum Rollo.

Internal Changes

Documentation

Bug fixes

  • Replaced skyfield dependency with astral, fixing sunrise/sunset problems at high latitudes. By Isabelle Sindiswa Giddy.

v2021.03 (2021/3/30)

Documentation

Internal Changes

  • Migration of CI to conda based workflow with multiple python versions. (GH#54) By Julius Busecke.
  • Revamp distribution actions. (GH#82) By Julius Busecke.
  • Migrate from astral to skyfield (GH#121). By Isabelle Giddy (https://github.com/isgiddy).

Citing GliderTools


If you would like to cite or reference Glider Tools, please use:

Gregor, L., Ryan-Keogh, T. J., Nicholson, S.-A., du Plessis, M., Giddy, I., & Swart, S. (2019). GliderTools: A Python Toolbox for Processing Underwater Glider Data. Frontiers in Marine Science, 6(December), 1–13. https://doi.org/10.3389/fmars.2019.00738

Project Contributors

The following people have made contributions to the project (in alphabetical order by last name) and are considered “The GliderTools Developers”. These contributors will be added as authors upon the next major release of GliderTools (i.e. new DOI release).

Contribution Guide

Contributions are highly welcomed and appreciated. Every little help counts, so do not hesitate! You can make a high impact on glidertools just by using it, being involved in discussions and reporting issues.

The following sections cover some general guidelines regarding development in glidertools for maintainers and contributors.

Nothing here is set in stone. Feel free to suggest improvements or changes in the workflow.

Feature requests and feedback

We are eager to hear about your requests for new features and any suggestions about the API, infrastructure, and so on. Feel free to start a discussion about these on the discussions tab on github under the “ideas” section.

After discussion with a few community members, and agreement that the feature should be added and who will work on it, a new issue should be opened. In the issue, please make sure to explain in detail how the feature should work and keep the scope as narrow as possible. This will make it easier to implement in small PRs.

Report bugs

Report bugs for glidertools in the issue tracker with the label “bug”.

If you can write a demonstration test that currently fails but should pass that is a very useful commit to make as well, even if you cannot fix the bug itself.

Fix bugs

Look through the GitHub issues for bugs.

Talk to developers to find out how you can fix specific bugs.

Preparing Pull Requests

  1. Fork the glidertools GitHub repository. It’s fine to use glidertools as your fork repository name because it will live under your username.

  2. Clone your fork locally using git, connect your repository to the upstream (main project), and create a branch:

    $ git clone git@github.com:YOUR_GITHUB_USERNAME/glidertools.git # clone to local machine
    $ cd glidertools
    $ git remote add upstream git@github.com:GliderToolsCommunity/GliderTools.git # connect to upstream remote
    
    # now, to fix a bug or add feature create your own branch off "master":
    
    $ git checkout -b your-bugfix-feature-branch-name master # Create a new branch where you will make changes
    

    If you need some help with Git, follow this quick start guide: https://git.wiki.kernel.org/index.php/QuickStart

  3. Set up a conda environment with all the necessary dependencies:

    $ conda env create -f ci/environment-py3.8.yml
    
  4. Activate your environment:

    $ conda activate test_env_glidertools
    

    Make sure you are in this environment when working on changes in the future too.

  5. Install the GliderTools package:

    $ pip install -e . --no-deps
    
  6. Before you modify anything, ensure that the setup works by executing all tests:

    $ pytest
    

    You want to see an output indicating no failures, like this:

    $ ========================== n passed, j warnings in 17.07s ===========================
    
  7. Install pre-commit and its hook on the glidertools repo:

    $ pip install --user pre-commit
    $ pre-commit install
    

    Afterwards pre-commit will run whenever you commit. If some errors are reported by pre-commit you should format the code by running:

    $ pre-commit run --all-files
    

    and then try to commit again.

    https://pre-commit.com/ is a framework for managing and maintaining multi-language pre-commit hooks to ensure code-style and code formatting is consistent.

    You can now edit your local working copy and run/add tests as necessary. Please follow PEP-8 for naming. When committing, pre-commit will modify the files as needed, or will generally be quite clear about what you need to do to pass the commit test.

  8. Break your edits up into reasonably sized commits:

    $ git commit -a -m "<commit message>"
    $ git push -u
    

    Committing will run the pre-commit hooks (isort, black and flake8). Pushing will run the pre-push hooks (pytest and coverage)

    We highly recommend using test driven development, but our coverage requirement is low at the moment due to lack of tests. If you are able to write tests, please stick to xarray’s testing recommendations.

  9. Add yourself to the

    Project Contributors list via ./docs/authors.md.

  10. Finally, submit a pull request (PR) through the GitHub website using this data:

    head-fork: YOUR_GITHUB_USERNAME/glidertools
    compare: your-branch-name
    
    base-fork: GliderToolsCommunity/GliderTools
    base: master
    

    The merged pull request will undergo the same testing that your local branch had to pass when pushing.

  11. After your pull request is merged into GliderTools/master, you will need to fetch those changes and rebase your master so that it reflects the latest version of GliderTools. The changes should also be fetched and incorporated (rebased) right before you plan to introduce new changes:

    $ git checkout master # switch back to master branch
    $ git fetch upstream  # Download all changes from central upstream repo
    $ git rebase upstream/master  # Apply the changes that have been made to central repo,
    $ # since your last fetch, onto your master.
    $ git branch -d your-bugfix-feature-branch-name  # to delete the branch after PR is approved
    

Release Instructions

This is a documentation repo for people in the group on how to do the integrated deployment.

NB RULE! Never commit to master.

  1. Change the version in the setup.py file. Must be format YYYY.<release number>
  2. Create a release with a tag that has the same format as the version above.
  3. The distribution will be built automatically and pushed to PyPI
  4. The DOI will also be updated on Zenodo. (untested, see #165)

Wishlist

A list of things we’d love to add to GliderTools and the work involved.

  1. Support for raw files from Slocum gliders and Seagliders with the following additional functionality:
    • Thermal lag correction for each of the gliders supported in the suggestion above.
    • Support for hardware modules by model and manufacturer
  2. Make final data output compatible with www.OceanGliders.org data format, https://www.oceangliders.org/taskteams/data-management/