What is JODA?
JODA is an efficient data wrangling tool for semi-structured JSON datasets. It handles data at any scale, from small datasets to big data, and uses all available system resources to maximize performance. JODA creates indices adaptively, based on the workload, to speed up iterative workloads.
If you are just getting started, check out the following resources:
Latest Release - v0.14.0
The latest release can always be found on GitHub.
0.14.0 Query Execution Redesign (09-02-2023)
In this release, the query execution pipeline has been completely redesigned. The redesign allowed us to extend JODA with many new features, but performance may improve or degrade depending on the query.
The most important changes are:
- JOINS with sub-queries.
- Streaming support, including window aggregations
- Multi-query optimization.
- Iterator functions like MAP/FILTER/ANY/ALL.
- Temporary (unnamed) datasets/intermediate results.
- User-Defined modules (functions, aggregators, importers, exporters, indices)
- Removed support for the `DELETE` expression in queries.
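To make the join item above more concrete, here is a minimal Python sketch of how an equality join and a theta join differ when applied to JSON documents. This is not JODA's implementation or API. The helper names (`equality_join`, `theta_join`), the sample documents, and the `{"left": ..., "right": ...}` output shape are all assumptions chosen for illustration.

```python
from typing import Any, Callable, Iterable

Doc = dict[str, Any]


def equality_join(left: Iterable[Doc], right: Iterable[Doc],
                  left_key: str, right_key: str) -> list[Doc]:
    """Pair documents whose key values are equal.

    Builds a hash index over `right`, so each left document needs only
    one lookup. Assumes the join values are scalars (hashable).
    """
    index: dict[Any, list[Doc]] = {}
    for r in right:
        if right_key in r:
            index.setdefault(r[right_key], []).append(r)
    return [
        {"left": l, "right": r}  # hypothetical output shape
        for l in left
        if left_key in l
        for r in index.get(l[left_key], [])
    ]


def theta_join(left: Iterable[Doc], right: Iterable[Doc],
               predicate: Callable[[Doc, Doc], bool]) -> list[Doc]:
    """Pair every combination of documents for which `predicate` holds.

    An arbitrary predicate cannot use a hash index, so this falls back
    to a nested loop over both inputs.
    """
    right_docs = list(right)
    return [
        {"left": l, "right": r}
        for l in left
        for r in right_docs
        if predicate(l, r)
    ]


# Made-up sample documents.
orders = [{"id": 1, "user": "a", "total": 30}, {"id": 2, "user": "b", "total": 5}]
users = [{"name": "a", "limit": 20}, {"name": "b", "limit": 10}]

matched = equality_join(orders, users, "user", "name")
over_limit = theta_join(orders, users, lambda o, u: o["total"] > u["limit"])
```

In this sketch, the equality join can answer each lookup from an index, while the theta join has to test every pair of documents. That difference is one reason the two join kinds are usually treated separately.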
Breaking Changes
- Removed support for `DELETE` expressions in queries. Please use CLI `delete` commands or HTTP `delete` endpoints instead.
- Benchmark/Statistics output has changed to support the new query pipeline architecture.
Added
- Support for temporary collections. `LOAD FROM FILE ...` is now a valid query which will not create a permanent dataset, but pass the imported documents to the rest of the query
- Support for JOINS with sub-queries. Currently equality joins (`LOAD A JOIN B ON(<attr>[, <attr>])`) and theta joins (`LOAD A JOIN B WHERE(<attr>)`) are implemented
- Iterator functions `MAP`/`FILTER`/`ANY`/`ALL`
- All boolean operators can now also be used as functions (e.g. `AND(a, b)`, `NOT(a)`, `OR(a, b)`)
  - Additionally added `XOR` and `IMPLICATION` functions
- All binary comparison operators can now also be used as functions (e.g. `LESS(a, b)`, `LESSEQ(a, b)`, `EQUAL(a, b)`)
- Streaming support. If JODA is invoked with preset queries and connected to a stream (pipe, TTY, …), the queries will be executed in streaming mode. This allows the continuous evaluation of queries over a potentially endless stream. Query results can be printed to streams and other locations.
- Window aggregations, which compute aggregations over a tumbling window. Mostly useful in streaming mode
- User-defined modules can now be loaded to extend JODA with custom features written in Python
  - Functions: custom functions written in Python that can be used in queries
  - Aggregators: custom aggregators written in Python that can be used in aggregation steps
  - Importers: import from and connect to external data sources
  - Exporters: export data to external data sources
  - Indices: custom indices that can be used to improve filter performance
- Added new REGEX extract first function
- Added `CONCAT` function to concatenate strings
- Added `SPLIT` function to split strings
- Added `TRUTHY` and `FALSY` functions to convert values to Boolean
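The tumbling-window aggregation listed above can be illustrated with a short Python sketch. It is a conceptual illustration only, not JODA's implementation. It assumes count-based windows and assumes that a final, partially filled window is still emitted. The notes above do not specify either behavior. The function names `tumbling_windows` and `aggregate_windows` are hypothetical.

```python
from typing import Any, Callable, Iterable, Iterator


def tumbling_windows(stream: Iterable[Any], size: int) -> Iterator[list[Any]]:
    """Yield consecutive, non-overlapping batches of `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    window: list[Any] = []
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield window
            window = []  # start a fresh window; windows never overlap
    if window:
        # Assumption: emit the last window even if it is not full.
        yield window


def aggregate_windows(stream: Iterable[Any], size: int,
                      aggregate: Callable[[list[Any]], Any]) -> Iterator[Any]:
    """Apply `aggregate` to each tumbling window as soon as it closes."""
    for window in tumbling_windows(stream, size):
        yield aggregate(window)


# Made-up sample documents; any iterable (including an endless one) works.
readings = [{"temp": 20}, {"temp": 22}, {"temp": 21}, {"temp": 25}, {"temp": 27}]
sums = list(aggregate_windows(readings, 2, lambda docs: sum(d["temp"] for d in docs)))
print(sums)  # [42, 46, 27]
```

Because both functions are generators, each result is produced as soon as its window closes. This is what makes the approach suitable for a potentially endless stream.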
Citation
If you use this project in your research, please cite it using our ICDE 2020 demo paper.
Bibtex:
@inproceedings{DBLP:conf/icde/Schafer020,
author = {Nico Sch{\"{a}}fer and
Sebastian Michel},
title = {{JODA:} {A} Vertically Scalable, Lightweight {JSON} Processor for
Big Data Transformations},
booktitle = {36th {IEEE} International Conference on Data Engineering, {ICDE} 2020,
Dallas, TX, USA, April 20-24, 2020},
pages = {1726--1729},
publisher = {{IEEE}},
year = {2020},
url = {https://doi.org/10.1109/ICDE48307.2020.00155},
doi = {10.1109/ICDE48307.2020.00155},
timestamp = {Fri, 05 Jun 2020 17:54:57 +0200},
biburl = {https://dblp.org/rec/conf/icde/Schafer020.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}