Northlight: Declarative and Optimized Analysis of Atmospheric Datasets in SparkSQL

Abstract: Research in atmospheric physics, meteorology, and weather prediction requires the processing of very large multi-dimensional observational or modeled datasets on a daily basis. One of the numerous existing array engines looks like the natural choice for this task. Interestingly, the actual data analysis situation in the community looks surprisingly different: Researchers often process their data manually using hand-written Python or Julia scripts that directly operate on the raw data files. This results in poor performance due to a lack of data-driven optimizations, as well as poor scalability due to being restricted to a single physical machine. Reasons for this trend lie in the high complexity and upfront effort associated with any specialized system: Distributed large-scale engines must be set up carefully and data must be be converted/transferred into the the proprietary representation of the system. The users, who are typically not computer scientists or data management experts, must adopt and use a specialized multi-dimensional query language to formulate their analytical tasks. As a counter-measure, in this work, we present Northlight, a query processing engine for atmospheric datasets that is (a) easy to adopt for the Earth science community while (b) providing domain-specific automatic query optimization. Northlight is built on top of the established SparkSQL dataflow engine and connects to atmospheric datasets stored in multi-dimensional NetCDF files. As a consequence, it becomes possible to process these datasets simply via conventional SQL, which is sufficient for a large variety of analysis tasks in the community. At the same time, Northlight provides automatic query optimization specifically tailored towards the processing of observational datasets. We experimentally show that Northlight scales gracefully with the selectivity of the analysis tasks and outperforms a comparable pipeline by up to a factor of 6x.

Code available:

Paper available: