<- Back | |
# json2tsv: a JSON to TSV converter | |
Last modification on 2021-09-25 | |
Convert JSON to TSV or separated output. | |
json2tsv reads JSON data from stdin. It outputs each JSON type to a TAB- | |
Separated Value format per line by default. | |
## TAB-Separated Value format | |
The output format per line is: | |
nodename<TAB>type<TAB>value<LF> | |
Control-characters such as a newline, TAB and backslash (\n, \t and \\) are | |
escaped in the nodename and value fields. Other control-characters are | |
removed. | |
The type field is a single byte and can be: | |
* a for array | |
* b for bool | |
* n for number | |
* o for object | |
* s for string | |
* ? for null | |
Filtering on the first field "nodename" is easy using awk for example. | |
## Features | |
* Accepts all **valid** JSON. | |
* Designed to work well with existing UNIX programs like awk and grep. | |
* Straightforward and not much lines of code: about 475 lines of C. | |
* Few dependencies: C compiler (C99), libc. | |
* No need to learn a new (meta-)language for processing data. | |
* The parser supports code point decoding and UTF-16 surrogates to UTF-8. | |
* It does not output control-characters to the terminal for security reasons by | |
default (but it has a -r option if needed). | |
* On OpenBSD it supports »pledge(2)« for syscall restriction: | |
pledge("stdio", NULL). | |
* Supports setting a different field separator and record separator with the -F | |
and -R option. | |
## Cons | |
* For the tool there is additional overhead by processing and filtering data | |
from stdin after parsing. | |
* The parser does not do complete validation on numbers. | |
* The parser accepts some bad input such as invalid UTF-8 | |
(see »RFC8259 - 8.1. Character Encoding«). | |
json2tsv reads from stdin and does not do assumptions about a "closed | |
ecosystem" as described in the RFC. | |
* The parser accepts some bad JSON input and "extensions" | |
(see »RFC8259 - 9. Parsers«). | |
* Encoded NUL bytes (\u0000) in strings are ignored. | |
(see »RFC8259 - 9. Parsers«). | |
"An implementation may set limits on the length and character contents of | |
strings." | |
* The parser is not the fastest possible JSON parser (but also not the | |
slowest). For example: for ease of use, at the cost of performance all | |
strings are decoded, even though they may be unused. | |
## Why Yet Another JSON parser? | |
I wanted a tool that makes parsing JSON easier and work well from the shell, | |
similar to »jq«. | |
sed and grep often work well enough for matching some value using some regex | |
pattern, but it is not good enough to parse JSON correctly or to extract all | |
information: just like parsing HTML/XML using some regex is not good (enough) | |
or a good idea :P. | |
I didn't want to learn a new specific »meta-language« which jq has and wanted | |
something simpler. | |
While it is more efficient to embed this query language for data aggregation, | |
it is also less simple. In my opinion it is simpler to separate this and use | |
pattern-processing by awk or an other filtering/aggregating program. | |
For the parser, there are many JSON parsers out there, like the efficient | |
»jsmn parser«, however a few parser behaviours I want to have are: | |
* jsmn buffers data as tokens, which is very efficient, but also a bit | |
annoying as an API as it requires another layer of code to interpret the | |
tokens. | |
* jsmn does not handle decoding strings by default. Which is very efficient | |
if you don't need parts of the data though. | |
* jsmn does not keep context of nested structures by default, so may require | |
writing custom utility functions for nested data. | |
This is why I went for a parser design that uses a single callback per "node" | |
type and keeps track of the current nested structure in a single array and | |
emits that. | |
## Clone | |
git clone git://git.codemadness.org/json2tsv | |
## Browse | |
You can browse the source-code at: | |
* https://git.codemadness.org/json2tsv/ | |
* gopher://codemadness.org/1/git/json2tsv | |
## Download releases | |
Releases are available at: | |
* https://codemadness.org/releases/json2tsv/ | |
* gopher://codemadness.org/1/releases/json2tsv | |
## Build and install | |
$ make | |
# make install | |
## Examples | |
An usage example to parse posts of the JSON API of »reddit.com« and format th… | |
to a plain-text list using awk: | |
#!/bin/sh | |
curl -s -H 'User-Agent:' 'https://old.reddit.com/.json?raw_json=1&limit… | |
json2tsv | \ | |
awk -F '\t' ' | |
function show() { | |
if (length(o["title"]) == 0) | |
return; | |
print n ". " o["title"] " by " o["author"] " in r/" o["subreddi… | |
print o["url"]; | |
print ""; | |
} | |
$1 == ".data.children[].data" { | |
show(); | |
n++; | |
delete o; | |
} | |
$1 ~ /^\.data\.children\[\]\.data\.[a-zA-Z0-9_]*$/ { | |
o[substr($1, 23)] = $3; | |
} | |
END { | |
show(); | |
}' | |
## References | |
* Sites: | |
* seriot.ch - Parsing JSON is a Minefield | |
* A comprehensive test suite for RFC 8259 compliant JSON parsers | |
* json.org | |
* Current standard: | |
* RFC8259 - The JavaScript Object Notation (JSON) Data Interchange Format | |
* Standard ECMA-404 - The JSON Data Interchange Syntax (2nd edition (December… | |
* Historic standards: | |
* RFC7159 - The JavaScript Object Notation (JSON) Data Interchange Format (ob… | |
* RFC7158 - The JavaScript Object Notation (JSON) Data Interchange Format (ob… | |
* RFC4627 - The JavaScript Object Notation (JSON) Data Interchange Format (ob… |