Xarray for ND-arrays and tensor functionality

Tips, discussions, how-tos, libraries...


Paul Hockett
Posts: 22
Joined: Mon May 25, 2020 5:17 pm
Location: Ottawa
Contact:


Post by Paul Hockett »


Over the last few months, I've become a big fan of Xarray, and plan to use it in most (all?) of my future Python projects. Xarray is, essentially, a library for defining and handling labelled ND-arrays, and takes a lot of the pain out of doing ND calculations (no more composing tricky vector calls, or deeply nested loops!).

Here's what the docs say:

Overview: Why xarray?

Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience.
What labels enable

Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”) are an essential part of computational science. They are encountered in a wide range of fields, including physics, astronomy, geoscience, bioinformatics, engineering, finance, and deep learning. In Python, NumPy provides the fundamental data structure and API for working with raw ND arrays. However, real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc.

Xarray doesn’t just keep track of labels on arrays – it uses them to provide a powerful and concise interface. For example:

  • Apply operations over dimensions by name: x.sum('time').

  • Select values by label (or logical location) instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01').

  • Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape.

  • Easily use the split-apply-combine paradigm with groupby: x.groupby('time.dayofyear').mean().

  • Database-like alignment based on coordinate labels that smoothly handles missing values: x, y = xr.align(x, y, join='outer').

  • Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs.
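
To make the list above a bit more concrete, here is a minimal sketch of those operations on a small made-up array. The data, the dimension names ("time", "site") and the "units" attribute are all assumptions for illustration only, not taken from the docs.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: 4 daily samples at 3 made-up sites.
times = pd.date_range("2014-01-01", periods=4, freq="D")
x = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "site"),
    coords={"time": times, "site": ["a", "b", "c"]},
    attrs={"units": "arbitrary"},  # free-form metadata dictionary
)

# Reduce over a dimension by name rather than by axis number.
total_over_time = x.sum("time")

# Select by coordinate label rather than integer position.
first_day = x.sel(time="2014-01-01")

# Split-apply-combine: group on a datetime component, then average.
by_day = x.groupby("time.dayofyear").mean()

# Outer alignment: labels missing from one array are filled with NaN.
y = x.isel(time=slice(2, 4))
x_aligned, y_aligned = xr.align(x, y, join="outer")
```

After the align call both arrays share the full set of four time labels, with NaN in y_aligned wherever y had no data.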

The N-dimensional nature of xarray’s data structures makes it suitable for dealing with multi-dimensional scientific data, and its use of dimension names instead of axis labels (dim='time' instead of axis=0) makes such arrays much more manageable than the raw numpy ndarray: with xarray, you don’t need to keep track of the order of an array’s dimensions or insert dummy dimensions of size 1 to align arrays (e.g., using np.newaxis).
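
As a rough illustration of the np.newaxis point, the sketch below builds the same outer product with xarray and with plain NumPy. The dimension names "x" and "y" are arbitrary choices for the example.

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.arange(3.0), dims="x")
b = xr.DataArray(np.arange(4.0), dims="y")

# xarray broadcasts on dimension names, so two 1D arrays with
# different names combine into a 2D result with dims ("x", "y").
outer_xr = a * b

# The NumPy version needs a dummy axis inserted by hand.
outer_np = np.arange(3.0)[:, np.newaxis] * np.arange(4.0)

# Reductions also take a name (dim="y") instead of a position (axis=1).
row_sums = outer_xr.sum(dim="y")
```

Both routes give the same numbers; the difference is that the xarray version keeps working if the arrays are later reordered or gain extra dimensions.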

The immediate payoff of using xarray is that you’ll write less code. The long-term payoff is that you’ll understand what you were thinking when you come back to look at it weeks or months later.

That last part is particularly true for any problems involving multiple tensors and/or deeply nested summations. Although there is a bit of leg-work involved in setting up the arrays in the first place, once that part is done large tensor sums become trivial: they are coded quite directly, and exactly as they appear formally (with a few caveats of course). I've now made use of this for photoionization calculations in ePSproc (see also here for the formalism) and, so far, am pretty happy with the structure, performance and code (aside from some residual mess in the development version!). The only downside, as far as I can tell so far, is that the current version of Xarray (v0.15.1) doesn't implement sparse arrays, so things can get quite RAM-heavy for large, but sparse, cases. This is, however, on their roadmap.
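
For a flavour of what "coded as they appear formally" looks like, here is a toy sketch of a double sum C(theta) = sum over l, m of A(l, m) B(l, m, theta). This is not the ePSproc code; the arrays, sizes and dimension names (l, m, theta) are made up purely for illustration.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Hypothetical tensors; note B is stored with its dims in a different order.
A = xr.DataArray(rng.normal(size=(3, 5)), dims=("l", "m"))
B = xr.DataArray(rng.normal(size=(5, 3, 4)), dims=("m", "l", "theta"))

# The product is matched up by dimension name, then summed over l and m.
C = (A * B).sum(["l", "m"])

# Plain NumPy equivalent, where the index order has to be tracked by hand.
C_np = np.einsum("lm,mlt->t", A.values, B.values)
```

One caveat: written this way, the full broadcast product A * B is built in memory before the sum is taken, which is part of why large cases can get RAM-heavy.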

Interested...? For a starter, see the Xarray intro doc.



Re: Xarray for ND-arrays and tensor functionality

Post by Paul Hockett »

A few interesting comments on the state of play as of April 2019 can be found in Thoughts on the state of Xarray within the broader scientific Python ecosystem by Benoît Bovy.

