| <- Back | |
| # json2tsv: a JSON to TSV converter | |
| Last modification on 2021-09-25 | |
| Convert JSON to TSV or separated output. | |
| json2tsv reads JSON data from stdin. It outputs each JSON type to a TAB- | |
| Separated Value format per line by default. | |
| ## TAB-Separated Value format | |
| The output format per line is: | |
| nodename<TAB>type<TAB>value<LF> | |
| Control-characters such as a newline, TAB and backslash (\n, \t and \\) are | |
| escaped in the nodename and value fields. Other control-characters are | |
| removed. | |
| The type field is a single byte and can be: | |
| * a for array | |
| * b for bool | |
| * n for number | |
| * o for object | |
| * s for string | |
| * ? for null | |
| Filtering on the first field "nodename" is easy using awk for example. | |
| ## Features | |
| * Accepts all **valid** JSON. | |
| * Designed to work well with existing UNIX programs like awk and grep. | |
| * Straightforward and not much lines of code: about 475 lines of C. | |
| * Few dependencies: C compiler (C99), libc. | |
| * No need to learn a new (meta-)language for processing data. | |
| * The parser supports code point decoding and UTF-16 surrogates to UTF-8. | |
| * It does not output control-characters to the terminal for security reasons by | |
| default (but it has a -r option if needed). | |
| * On OpenBSD it supports »pledge(2)« for syscall restriction: | |
| pledge("stdio", NULL). | |
| * Supports setting a different field separator and record separator with the -F | |
| and -R option. | |
| ## Cons | |
| * For the tool there is additional overhead by processing and filtering data | |
| from stdin after parsing. | |
| * The parser does not do complete validation on numbers. | |
| * The parser accepts some bad input such as invalid UTF-8 | |
| (see »RFC8259 - 8.1. Character Encoding«). | |
| json2tsv reads from stdin and does not do assumptions about a "closed | |
| ecosystem" as described in the RFC. | |
| * The parser accepts some bad JSON input and "extensions" | |
| (see »RFC8259 - 9. Parsers«). | |
| * Encoded NUL bytes (\u0000) in strings are ignored. | |
| (see »RFC8259 - 9. Parsers«). | |
| "An implementation may set limits on the length and character contents of | |
| strings." | |
| * The parser is not the fastest possible JSON parser (but also not the | |
| slowest). For example: for ease of use, at the cost of performance all | |
| strings are decoded, even though they may be unused. | |
| ## Why Yet Another JSON parser? | |
| I wanted a tool that makes parsing JSON easier and work well from the shell, | |
| similar to »jq«. | |
| sed and grep often work well enough for matching some value using some regex | |
| pattern, but it is not good enough to parse JSON correctly or to extract all | |
| information: just like parsing HTML/XML using some regex is not good (enough) | |
| or a good idea :P. | |
| I didn't want to learn a new specific »meta-language« which jq has and wanted | |
| something simpler. | |
| While it is more efficient to embed this query language for data aggregation, | |
| it is also less simple. In my opinion it is simpler to separate this and use | |
| pattern-processing by awk or an other filtering/aggregating program. | |
| For the parser, there are many JSON parsers out there, like the efficient | |
| »jsmn parser«, however a few parser behaviours I want to have are: | |
| * jsmn buffers data as tokens, which is very efficient, but also a bit | |
| annoying as an API as it requires another layer of code to interpret the | |
| tokens. | |
| * jsmn does not handle decoding strings by default. Which is very efficient | |
| if you don't need parts of the data though. | |
| * jsmn does not keep context of nested structures by default, so may require | |
| writing custom utility functions for nested data. | |
| This is why I went for a parser design that uses a single callback per "node" | |
| type and keeps track of the current nested structure in a single array and | |
| emits that. | |
| ## Clone | |
| git clone git://git.codemadness.org/json2tsv | |
| ## Browse | |
| You can browse the source-code at: | |
| * https://git.codemadness.org/json2tsv/ | |
| * gopher://codemadness.org/1/git/json2tsv | |
| ## Download releases | |
| Releases are available at: | |
| * https://codemadness.org/releases/json2tsv/ | |
| * gopher://codemadness.org/1/releases/json2tsv | |
| ## Build and install | |
| $ make | |
| # make install | |
| ## Examples | |
| An usage example to parse posts of the JSON API of »reddit.com« and format th… | |
| to a plain-text list using awk: | |
| #!/bin/sh | |
| curl -s -H 'User-Agent:' 'https://old.reddit.com/.json?raw_json=1&limit… | |
| json2tsv | \ | |
| awk -F '\t' ' | |
| function show() { | |
| if (length(o["title"]) == 0) | |
| return; | |
| print n ". " o["title"] " by " o["author"] " in r/" o["subreddi… | |
| print o["url"]; | |
| print ""; | |
| } | |
| $1 == ".data.children[].data" { | |
| show(); | |
| n++; | |
| delete o; | |
| } | |
| $1 ~ /^\.data\.children\[\]\.data\.[a-zA-Z0-9_]*$/ { | |
| o[substr($1, 23)] = $3; | |
| } | |
| END { | |
| show(); | |
| }' | |
| ## References | |
| * Sites: | |
| * seriot.ch - Parsing JSON is a Minefield | |
| * A comprehensive test suite for RFC 8259 compliant JSON parsers | |
| * json.org | |
| * Current standard: | |
| * RFC8259 - The JavaScript Object Notation (JSON) Data Interchange Format | |
| * Standard ECMA-404 - The JSON Data Interchange Syntax (2nd edition (December… | |
| * Historic standards: | |
| * RFC7159 - The JavaScript Object Notation (JSON) Data Interchange Format (ob… | |
| * RFC7158 - The JavaScript Object Notation (JSON) Data Interchange Format (ob… | |
| * RFC4627 - The JavaScript Object Notation (JSON) Data Interchange Format (ob… |