JODA - JSON On Demand Analysis | Efficient data wrangling for semi-structured JSON documents

What is JODA?

JODA is an efficient data wrangling tool for semi-structured JSON datasets. It handles data at any scale, from small files to big data, and is designed to use all available system resources to reach the best performance. JODA creates indices adaptively, based on the workload, to speed up iterative workloads.

If you are just getting started, check out the following resources:

Installation
Getting Started
Language Reference
Code Documentation
User Defined Modules

Latest Release - v0.14.0

The latest release can always be found on GitHub.

0.14.0 Query Execution Redesign (09-02-2023)

In this release, the query execution pipeline has been completely redesigned. This allowed us to extend JODA with many new features, but it may improve or degrade performance, depending on the query.

The most important changes are:

JOINS with sub-queries.
Streaming support, including window aggregations.
Multi-query optimization.
Iterator functions like MAP/FILTER/ANY/ALL.
Temporary (unnamed) datasets/intermediate results.
User-defined modules (functions, aggregators, importers, exporters, indices).
Removed support for the DELETE expression in queries.
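
To illustrate what a window aggregation over a stream computes, here is a minimal Python sketch of a tumbling (non-overlapping) window. It is not JODA's implementation or query syntax. The function name tumbling_window, the count-based window size, and the choice to flush a final partial window are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def tumbling_window(
    stream: Iterable[T], size: int, aggregate: Callable[[List[T]], R]
) -> Iterator[R]:
    """Yield one aggregate per non-overlapping window of `size` items.

    Hypothetical helper: a count-based tumbling window, not JODA's API.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    window: List[T] = []
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield aggregate(window)
            window = []  # tumbling: the next window starts empty, no overlap
    if window:
        # Assumption: emit the trailing partial window when the stream ends.
        yield aggregate(window)


# Sum the "value" attribute of JSON-like documents in windows of three.
docs = [{"value": v} for v in [1, 2, 3, 4, 5, 6, 7]]
sums = list(tumbling_window(docs, 3, lambda w: sum(d["value"] for d in w)))
print(sums)  # [6, 15, 7]
```

Because the function is a generator, it also works on an endless input: each result is emitted as soon as its window fills, which is the property that makes this kind of aggregation useful in a streaming setting.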

Breaking Changes

Removed support for DELETE expressions in queries. Please use CLI delete commands or HTTP delete endpoints instead.
Benchmark/Statistics output has changed to support the new query pipeline architecture.

Added

Support for temporary collections. LOAD FROM FILE ... is now a valid query that does not create a permanent dataset but passes the imported documents to the rest of the query.
Support for JOINS with sub-queries. Currently, equality joins (LOAD A JOIN B ON(<attr>[, <attr>])) and theta joins (LOAD A JOIN B WHERE(<attr>)) are implemented.
Iterator functions MAP/FILTER/ANY/ALL
All boolean binary operators can now also be used as functions (e.g. AND(a, b), NOT(a), OR(a, b))
- Additionally added XOR and IMPLICATION functions
All binary comparison operators can now also be used as functions (e.g. LESS(a, b), LESSEQ(a, b), EQUAL(a, b))
Streaming support. If JODA is invoked with preset queries and connected to a stream (pipe, TTY, …), the queries are executed in streaming mode. This allows the continuous evaluation of queries over a potentially endless stream. Query results can be printed to streams and other locations.
Window aggregations, which compute aggregations over a tumbling window. Mostly useful in streaming mode.
User-defined modules can now be loaded to extend JODA with custom features written in Python.
- Functions define custom functions in Python that can be used in queries
- Aggregators define custom aggregators in Python that can be used in aggregation steps
- Importers allow importing from and connecting to external data sources
- Exporters allow exporting data to external data sources
- Indices define custom indices that can be used to improve filter performance
Added new REGEX extract first function
Added CONCAT function to concatenate strings
Added SPLIT function to split strings
Added TRUTHY and FALSY functions to convert values to Boolean
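
The difference between the two join kinds named above can be sketched in plain Python. This is a conceptual illustration of equality joins and theta joins in general, not JODA's query syntax or its internal algorithm. The helper names equality_join and theta_join and the sample documents are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Doc = Dict[str, Any]


def equality_join(
    left: Iterable[Doc], right: Iterable[Doc], left_key: str, right_key: str
) -> Iterator[Tuple[Doc, Doc]]:
    """Pair documents whose key attributes are equal, using a hash index."""
    index: Dict[Any, List[Doc]] = {}
    for r in right:
        if right_key in r:
            index.setdefault(r[right_key], []).append(r)
    for l in left:
        if left_key in l:
            # Only documents with an equal key value are looked up.
            for r in index.get(l[left_key], []):
                yield l, r


def theta_join(
    left: Iterable[Doc], right: Iterable[Doc], predicate: Callable[[Doc, Doc], bool]
) -> Iterator[Tuple[Doc, Doc]]:
    """Pair documents that satisfy an arbitrary predicate (nested loop)."""
    right_docs = list(right)  # materialized so it can be scanned repeatedly
    for l in left:
        for r in right_docs:
            if predicate(l, r):
                yield l, r


users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"user": 1, "total": 10}, {"user": 1, "total": 25}, {"user": 3, "total": 5}]

eq_pairs = list(equality_join(users, orders, "id", "user"))
theta_pairs = list(theta_join(users, orders, lambda u, o: u["id"] < o["user"]))
print(len(eq_pairs), len(theta_pairs))  # 2 2
```

An equality join can use a hash lookup because only equal keys match, while a theta join accepts any condition and, in this naive form, compares every pair of documents.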

Citation

If you use this project in your research, please cite it using our ICDE 2020 demo paper.

Bibtex:

@inproceedings{DBLP:conf/icde/Schafer020,
  author    = {Nico Sch{\"{a}}fer and
               Sebastian Michel},
  title     = {{JODA:} {A} Vertically Scalable, Lightweight {JSON} Processor for
               Big Data Transformations},
  booktitle = {36th {IEEE} International Conference on Data Engineering, {ICDE} 2020,
               Dallas, TX, USA, April 20-24, 2020},
  pages     = {1726--1729},
  publisher = {{IEEE}},
  year      = {2020},
  url       = {https://doi.org/10.1109/ICDE48307.2020.00155},
  doi       = {10.1109/ICDE48307.2020.00155},
  timestamp = {Fri, 05 Jun 2020 17:54:57 +0200},
  biburl    = {https://dblp.org/rec/conf/icde/Schafer020.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}