
dcdf (k² Time Series Raster)

Background

At dClimate we are always researching ways of storing climate data that are not only cloud-optimized but can also leverage web3/p2p concepts such as trustlessness and immutability, without sacrificing performance. The traditional GIS space has tried-and-tested file formats such as NetCDF, GRIB, TIFF, and more recently Zarr, but we are curious which other novel data structures from the research space can help us address one of the biggest problems of our time.

Before we began our path to converting NetCDFs to Zarr natively on IPFS, which you can read about here, we had been exploring ways to store gridded data on IPFS without compromising query performance or storage space. In the course of that research we came across various implementations of quadtrees and then, via papers on storing sparse web graphs, the k² Time Series Raster. The papers reported 10-10,000x query speed improvements over NetCDF and GeoTIFF, with the data compressed out of the box. Some of the cons were that the approach was experimental, data would have to be pre-processed, schemas would have to be implemented in IPLD (still a todo!), and there is a general lack of import support elsewhere (ArcGIS, QGIS, etc.), which would require transformers/codecs in IPLD for deserialization. Nevertheless, since an implementation did not yet exist, we figured it would be a worthwhile endeavor to pursue.

Overview

The dcdf library is a Rust implementation (with a Python wrapper interface) of the Heuristic K²-Raster algorithm as outlined in the paper "Space-efficient representations of raster time series" by Silva-Coira, Paramá, de Bernardo, and Seco. This implementation is focused on encoding, publishing, and reading time series raster climate data on distributed content-addressable storage such as IPFS, leveraging IPLD for data modeling.
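The core idea behind the K²-Raster family is a quadtree in which every node stores the minimum and maximum value of its submatrix, and any submatrix whose min equals its max (a uniform region) is pruned, so large homogeneous areas cost almost nothing. The sketch below illustrates that idea in plain Python under our own naming; it is a conceptual toy, not the dcdf API.

```python
def encode_quadrants(grid, n, r=0, c=0, k=2):
    """Build a quadtree node (lo, hi, children) over an n x n block of grid
    starting at (r, c). Uniform blocks (lo == hi) are pruned: no children.

    Illustrative sketch of the K²-Raster min/max idea, not the dcdf API."""
    if n == 1:
        v = grid[r][c]
        return (v, v, None)
    sub = n // k
    kids = [encode_quadrants(grid, sub, r + i * sub, c + j * sub, k)
            for i in range(k) for j in range(k)]
    lo = min(kid[0] for kid in kids)
    hi = max(kid[1] for kid in kids)
    if lo == hi:
        return (lo, hi, None)   # uniform block: prune the whole subtree
    return (lo, hi, kids)

# A constant block collapses to a single node; a mixed block keeps children.
print(encode_quadrants([[5, 5], [5, 5]], 2))   # (5, 5, None)
```

The min/max pair at each node also lets range queries ("which cells exceed X?") skip entire subtrees whose `hi` is below the threshold, which is where the reported query speedups come from.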

Data is stored in a binary format which can be intelligently broken up into hash-addressable chunks for storage in any IPLD-based datastore, such as IPFS. The chunked tree structure allows queries to be performed without retrieving the entire dataset for the time period in question, and the use of compact data structures allows space-efficient files to be queried in place, without a decompression step.
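To make the "queried in place" point concrete, here is a minimal k²-tree (k = 2) over a binary raster, the compact bitmap structure the k² raster builds on: a `T` bitmap marks non-empty quadrants level by level, an `L` bitmap stores the last-level cells, and a cell lookup walks the bitmaps with rank operations without ever materializing the full matrix. The function names are our own illustration, not the dcdf API.

```python
from collections import deque

def build_k2tree(matrix, k=2):
    """Encode an n x n 0/1 matrix (n a power of k) as two bitmaps:
    T for internal levels (1 = quadrant has data), L for leaf cells.
    Empty quadrants are never expanded, which is where sparsity pays off."""
    n = len(matrix)
    def nonempty(r, c, size):
        return any(matrix[r + i][c + j]
                   for i in range(size) for j in range(size))
    T, L = [], []
    queue = deque([(0, 0, n)])          # breadth-first, so levels stay ordered
    while queue:
        r, c, size = queue.popleft()
        sub = size // k
        for i in range(k):
            for j in range(k):
                rr, cc = r + i * sub, c + j * sub
                if sub == 1:
                    L.append(matrix[rr][cc])
                else:
                    bit = 1 if nonempty(rr, cc, sub) else 0
                    T.append(bit)
                    if bit:
                        queue.append((rr, cc, sub))
    return T, L

def rank1(bits, pos):
    """Number of 1s in bits[0..pos] inclusive (0 for pos < 0).
    Real implementations answer this in O(1) with an auxiliary index."""
    return sum(bits[:pos + 1])

def query(T, L, n, row, col, k=2):
    """Read cell (row, col) directly from the compressed bitmaps."""
    pos = -1                            # virtual root
    size = n
    while True:
        size //= k
        child = (row // size) * k + (col // size)
        pos = rank1(T, pos) * k * k + child
        row %= size
        col %= size
        if size == 1:
            return L[pos - len(T)]      # leaf level lives in L
        if not T[pos]:
            return 0                    # empty quadrant: all zeros, stop early

# A sparse 4x4 raster compresses to just 8 bits:
matrix = [[0, 0, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 1, 0],
          [0, 0, 0, 1]]
T, L = build_k2tree(matrix)
print(T, L)                 # [0, 0, 0, 1] [1, 0, 0, 1]
print(query(T, L, 4, 2, 2)) # 1
```

Note how querying an empty region touches a single `T` bit; in the chunked IPLD layout the same property means only the hash-addressed chunks along one root-to-leaf path need to be fetched.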

This is critical for climate data applications, where data sizes can be massive (in the TBs) and analysis, or even visualization, is done in constrained environments. The format particularly shines on sparse datasets, which compress nicely. The dcdf library provides some examples that ingest existing dClimate Zarr datasets as native k² Time Series Rasters for querying.

Future Work

We have many plans for this library, as it seems like a promising alternative for particular dataset use cases such as visualization. You can see all of our plans on our issues page, and since our projects are open source we encourage anyone and everyone to contribute :)