Also, news transmissions from agencies to newspapers or TV stations used (and maybe still use in some places) a format called IPTC 7901, which also makes use of the SOH, STX, ETX and EOT codes:
https://www.iptc.org/std/IPTC7901/1.0/specification/7901V5.p...
This stems from the news coming over a serial wire (which is why news updates are also called “wires” in that context) to a TTY.
(Nowadays, you’d have a server receiving everything over the Internet and spitting it out in this format via a serial port or Telnet connection if needed.)
According to Wikipedia, fancier news messages are possible using some more codes, but I’ve never seen them in the wild in recent years:
https://en.wikipedia.org/wiki/IPTC_7901#C0_control_codes
What software did you use to write that, and how did other dev tools behave with these characters? Just curious
1vuio0pswjnm7 · 3d ago
1. tcpclient, yy025 (a program I wrote that generates HTTP), sed, cut, paste, nvi or vim 4.6 if I need to edit, depending on whether I'm using NetBSD or Linux; formatted text is copied and pasted using tmux buffers since I do not use X11
For example
program|sed or grep|tmux loadb /dev/stdin
where program is a script that outputs a custom ASCII table
2. The UNIX software I'm using works with these characters
The idiom I use for inserting the FS character via command line or shell script is
x=$(echo x|tr x '\34')
For example,
echo select \* from t1|sqlite3 -separator $x 0.db > 0.fsv
"To import data with arbitrary delimiters and no quoting, first set ascii mode (".mode ascii"), then set the field and record delimiters using the ".separator" command. This will suppress dequoting. Upon ".import", the data will be split into fields and records according to the delimiters so specified."^1
But I use tr.
tr '\34' , < 0.fsv > 0.csv
If there is a risk that field values might contain the FS character, then I delete or replace that character in the data before I create fields. For example,
1. https://sqlite.org/cli.html
https://en.wikipedia.org/wiki/ASCII#Character_groups
https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C0_con...
Is there a text format like TSV/CSV that can represent nested/repeating sub-structures?
We have YAML but it's too complex. JSON is rather verbose with all the repeated keys and quoting, XML even more so. I'd also like to see a 'schema tree' corresponding to a header row in TSV/CSV. I'd even be fine with a binary format with standard decoding to see the plain-text contents. Something for XML like what MessagePack does for JSON would work, since we already have schema specifications.
culi · 3d ago
Well there's JSONL which is used heavily in scientific programs (especially in biology)
But CSV represented as JSON is usually accomplished like so:
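(The example itself seems to have been lost in transcription; a typical shape, sketched here in Python rather than raw JSON, is one object per row with the header keys repeated — the sample data is my own.)

```python
import csv, io, json

# "CSV as JSON" is usually a list of objects, one per row, with the
# header keys repeated in every object.
raw = "name,age\nAda,36\nGrace,45\n"
rows = list(csv.DictReader(io.StringIO(raw)))
print(json.dumps(rows))
# → [{"name": "Ada", "age": "36"}, {"name": "Grace", "age": "45"}]
```

This repetition of keys is exactly the verbosity the parent comment complains about.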
Every textual data format that is not originally S-expressions eventually devolves into an informally-specified, bug-ridden, slow implementation of half of S-expressions.
ilyagr · 3d ago
It's a clever format, especially if the focus is on machines generating it and humans or machines reading it. It might even work for humans occasionally making minor edits without having to load the file in the spreadsheet.
I think it can encode anything except for something matching the regex `(\t+\|)+` at the end of cells (*Update:* Maybe `\n?(\t+\|)+`, but that doesn't change my point much) including newlines and even newlines followed by `\` (with the newline extension, of course).
For a cell containing `cell<newline>\`, you'd have:
|cell<tab>|
\\<tab >|
(where `<tab >` represents a single tab character regardless of the number of spaces)
Moreover, if you really needed it, you could add another extension to specify tabs or pipes at the end of cells. For a POC, two cells with contents `a<tab>|` and `b<tab>|` could be represented as:
|a<tab ><tab>|b
~tab pipe<tab>|tab pipe
(with literal words "tab" and "pipe"). Something nicer might also be possible.
*Update:* Though, if the focus is on humans reading it, it might also make sense to allow a single row of the table to wrap and span multiple lines in the file, perhaps as another extension.
ctenb · 3d ago
For multiline cell contents, there is rule 7, the multi-line extension. Newlines are not allowed in cells otherwise because of rule 2: it's a line-based format.
I personally use it to write tabular data manually, which we use to define our data model. Because this format is editor agnostic, colleagues can easily read and edit it as well. So in my case the focus is on human read/write and machine read.
TheTaytay · 3d ago
I have been using TSV a LOT lately for batch inputs and outputs for LLMs. Imagine categorizing 100 items. Give it a 100 row tsv with an empty category column, and have it emit a 100 row tsv with the category column filled in.
It has some nice properties:
1) it’s many fewer tokens than JSON.
2) it’s easier to edit prompts and examples in something like Google sheets, where the default format of a copied group of cells is in TSV.
3) have I mentioned how many fewer tokens it is? It’s faster, cheaper, and less brittle than a format that requires the redefinition of every column name for every row.
Obviously this breaks down for nested object hierarchies or other data that is not easily represented as a 2d table, but otherwise we’ve been quite happy. I think this format solves some other things I’ve wanted, including header comments, inline comments, better alignment, and markdown support.
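The round trip described above can be sketched with the stdlib csv module's excel-tab dialect (the column names and the model reply here are made up for illustration):

```python
import csv, io

items = ["banana", "hammer", "trout"]

# Build the prompt table: an item column plus an empty category
# column for the model to fill in.
buf = io.StringIO()
w = csv.writer(buf, dialect="excel-tab")
w.writerow(["item", "category"])
for it in items:
    w.writerow([it, ""])
prompt_tsv = buf.getvalue()

# Parse the (hypothetical) model reply the same way.
reply = "item\tcategory\r\nbanana\tfruit\r\nhammer\ttool\r\ntrout\tfish\r\n"
rows = list(csv.DictReader(io.StringIO(reply), dialect="excel-tab"))
print(rows[0]["category"])  # → fruit
```

The header row is emitted once, which is where the token savings over JSON's per-row keys come from.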
Rhapso · 3d ago
The poor delimiter special characters in the ASCII table never get any love.
ctenb · 3d ago
Yeah :) Though I can think of two reasons why: they're not typable on most keyboards, and most programs aren't designed to deal with them or render them properly in an aligned way, as they do for tab characters.
Hackbraten · 4d ago
Good on you to leverage EditorConfig settings. Almost every modern IDE or editor supports it either out of the box or with a plug-in.
DrillShopper · 3d ago
Or we could use the actual characters for this purpose - the FS (file separator), GS (group separator), RS (record separator), and US (unit separator).
ASCII (and through it, Unicode) has these values specifically for this purpose.
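A minimal sketch of what that looks like in practice (assuming, as always with these separators, that the data itself never contains the separator bytes):

```python
# ASCII separator characters: unit, record, group, file.
US, RS, GS, FS = "\x1f", "\x1e", "\x1d", "\x1c"

# Serialize a small table: US between fields, RS between records.
table = [["Ada", "36"], ["Grace", "45"]]
blob = RS.join(US.join(fields) for fields in table)

# Parsing needs no quoting or escaping at all.
parsed = [rec.split(US) for rec in blob.split(RS)]
assert parsed == table
```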
ilyagr · 3d ago
I don't think popularizing these ASCII characters would solve the problem in its entirety.
If RS and US were in common use, there would be a need to have a visible representation for them in the terminal, and a way to enter RS on the keyboard. Pretty soon, strings that contain RS would become much more common in the wild.
Then, one day somebody would need to store one of those strings in a table, and there would be no way to do so without escaping.
I do think that having RS display in the terminal (like a newline followed by some graphic?) and using it would be an improvement over TSV's use of newline for this purpose, but considering that it's not a perfect solution, I can understand why people are not overly motivated to make this happen. The time for this may have been 40+ years ago when a standard for how to display or type it would be feasible to agree upon.
eviks · 3d ago
> there would be a need to have a visible representation for them in the terminal, and a way to enter RS on the keyboard.
Both already possible, they have official symbols representing them
> Then, one day somebody would need to store one of those strings in a table, and there would be no way to do so without escaping.
Why? But also, yes, escaping also exists, just like in the alternative formats
ilyagr · 3d ago
> Both already possible, they have official symbols representing them.
I'm not sure what you mean. For an illustration, my terminal does not print anything for them.
$ printf "qq\36\37text\n"
qqtext
*Update/Aside:* "My terminal", in this case, was `tmux`. Ghostty, OTOH, prints spaces instead of RS or US.
Unicode does have some symbols for every non-printable ASCII character, which you can see as follows with https://github.com/sharkdp/bat (assuming your font has the right characters, which it probably does):
$ printf "qq\36\37text\n" | bat -A --decorations never
qq␞␟text␊
Here, `␞` is https://www.compart.com/en/unicode/U+241E, one of the symbols for non-printable characters that Unicode has; different fonts display it differently. See also https://www.compart.com/en/unicode/block/U+2400. Is there some better representation it has?
eviks · 3d ago
Yes, I did mean the ␞ U+241E Unicode symbols that represent separators. And as your `| bat` example shows, they can also be displayed in the terminal.
If you meant the default should always be symbolic: not sure. The newline separator isn't displayed in the terminal as a symbol either, but maybe that's just a matter of extra terminal config.
EvanAnderson · 3d ago
I did an ETL project for an ERP system that used these separators years ago. It was ridiculously easy because I didn't have to worry about escaping. Parsing was an easy state machine.
Notepad++ handles the display and entry of these characters fairly easily. I think they're nowhere as unergonomic as people say they are.
addoo · 3d ago
I’m pretty sure part of the intent is that it should be easy to write (type) in this format. Separator characters are not that. Depending on the editor, they’re not especially readable either.
helix278 · 4d ago
I like that there is plenty of room for comments, and the multiline extension is also cool. The backslash almost looks like what I would write on paper if I wanted to sneak something into the previous line :)
ctenb · 3d ago
Thanks :)
montroser · 3d ago
This is pretty under-specified...
> A cell starts with | and ends with one or more tabs.
|one\t|two|three
How many cells is this? Seems like just one, with garbage at the end, since there are no closing tabs after the first cell? Should this line count as a valid row?
> A line that starts with a cell is a row. Any other lines are ignored.
Well, I guess it counts. Either way, how should one encode a value containing a tab followed by a pipe?
jasonthorsness · 3d ago
The spec says the last cell does not need to end in a tab, so this would be two cells IMO
ctenb · 3d ago
That's correct
montroser · 3d ago
Is the "spec" the rules listed under the "syntax" heading? Or is it the Python script? Or something else I missed? Because the Python script conflicts with the syntax rules: the rules say a cell ends with one or more tabs, but the Python says a cell can end at the end of the line, even with no trailing tabs.
ctenb · 3d ago
It's rule number 6, which basically makes an exception to rule 1 :)
bvrmn · 3d ago
I think the spec tries and fails to translate the code implementation into human language. In the code, the cell separator is `\t|`.
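That reading can be sketched in a few lines (my interpretation of the rules, not the reference implementation):

```python
import re

# A row starts with "|"; cells are separated by a run of tabs
# followed by "|"; the last cell may simply end at the newline
# (the rule-6 exception). Other lines are ignored.
def parse_row(line: str):
    line = line.rstrip("\n")
    if not line.startswith("|"):
        return None  # not a row
    return re.split(r"\t+\|", line[1:])

print(parse_row("|one\t|two\t|three"))  # → ['one', 'two', 'three']
```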
ctenb · 3d ago
That's also correct. In what way does it fail?
imtringued · 3d ago
The problem with CSV is that there are too many variations and that quoting is still a mess. The vast majority of people on this planet do not know CSV, so they invent a new ad hoc format on the fly and falsely call it CSV.
TPSV solves none of that and makes things worse.
ctenb · 3d ago
What does it make worse exactly?
Hashex129542 · 3d ago
We need binary formats. In this era we are capable of it. Throw away the text formats.
baby_souffle · 3d ago
> We need binary formats. In this era we are capable for it.
We have them, they're used where appropriate.
> Throw away the text formats.
I would argue that _most_ of the time tsv or csv are used it's because either:
a) it's the lowest common denominator for interchange. Oh, you don't have _my specific version of $program_? How about I give you the data in CSV? _everything_ can read that...
b) a human is expected to inspect/view/adjust the data and they'd be using a bin -> text tool anyways. The move to binary based log formats (`journald`) is still somewhat controversial. It would have been a non-starter if the tooling to make the binary "readable" wasn't on-par with (or, in a few cases, better!) than the contemporary text based tooling we've been used to for the prior 30+ years..
account-5 · 3d ago
Text is universally accessible and widely supported. Binary has its benefits, but human-facing, it has to be text.
zzo38computer · 3d ago
Which formats are helpful can depend on the use. I think DER (which is a binary format) is not so bad (although I added a few additional types (such as key/value list, BCD string, and TRON string), but not all uses are required to use them). I had also made up Multi-DER, which is simply any number of DER concatenated together (there are formats of JSON like that too). (I had also made up TER which is a text format and a program to convert TER to DER. I also wrote a program to convert JSON to DER. It would also be possible to convert CSV, etc.)
It was also my idea of an operating system design, it will have a binary format used for most stuff, similar to DER but different in some ways (including different types are available), which is intended to be interoperable among most of the programs on the system.
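For context, DER's tag-length-value framing is simple enough to sketch; this is a minimal reader handling only low tag numbers and short/long definite lengths (real DER has more cases):

```python
# Minimal DER TLV header reader (sketch): returns
# (tag, length, value, rest).
def read_tlv(data: bytes):
    tag = data[0]
    first = data[1]
    if first < 0x80:                      # short form: length in one byte
        length, off = first, 2
    else:                                 # long form: next N bytes hold length
        n = first & 0x7F
        length = int.from_bytes(data[2:2 + n], "big")
        off = 2 + n
    return tag, length, data[off:off + length], data[off + length:]

# 0x04 = OCTET STRING, length 3, followed by a second TLV.
tag, length, value, rest = read_tlv(b"\x04\x03abc\x02\x01\x07")
print(tag, length, value)  # → 4 3 b'abc'
```

Note that `rest` is itself a valid DER stream, which is what makes the "Multi-DER" idea of simply concatenating DER values work.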
Hashex129542 · 3d ago
Same vibe. Yes, I am using DER files in my apps too. So we can have more distinguished universal value types than just text.
My very next step is OS development too, but I'm not sure where to learn OS development at the opcode level. I thought I'd get started with the Intel docs for my CPU.
chthonicdaemon · 3d ago
The idea that binary formats are the way because "you're going to use a program to interact with the format anyway" ignores the network effects of having things like text editors and unix commands that handle text as a universal intermediate, while having bespoke programs for every format dooms you to developing a full set of tooling for every format (or more likely, writing code that converts the binary format to text formats).
More recently though, consider that LLMs are terrible at emitting binary files, but amazing at emitting text. I can have a GPT spit out a nice diagram in Mermaid, or create calendar entries from a photo of an event program in ical format.
voidfunc · 3d ago
Yep, and if you stuff it into a SQLite DB too, you have a query interface already built.
culi · 3d ago
What's a good program that non-technical people can use to write SQLite data? I think it's a great idea in theory but lacking in support.
ndsipa_pomu · 3d ago
Binary formats are generally poorly defined, buggy as hell and have limited tooling that can deal with them. Also, they're not human readable so when you inevitably hit a problem with a binary format, you can't just eyeball it to see that you've got an "O'Connor" in the names.
bsder · 3d ago
Data will always outlive the program that originally produced it.
This is why you should almost always use text formats.
smallerize · 3d ago
Ok but how do you type them? How do you search them? How do you copy-and-paste between documents?
Hashex129542 · 3d ago
We would just be changing the encoder and decoder programs from text to binary. The front-end software remains the same, as with what we use for CSV, for example.
The binary form has a lot of benefits over the plain text form for editing. For example, when you change a uint8 value from 0 to 100, you just replace one byte at a position instead of rewriting the whole document.
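The in-place edit described here is a one-liner in most languages; a Python sketch (file layout and offsets are made up for illustration):

```python
import os, tempfile

# Create a tiny binary file; the field at offset 1 holds the value 0.
path = os.path.join(tempfile.mkdtemp(), "record.bin")
with open(path, "wb") as f:
    f.write(bytes([1, 0, 9]))

# Patch a single byte at a known offset instead of rewriting the file.
with open(path, "r+b") as f:
    f.seek(1)
    f.write(bytes([100]))

with open(path, "rb") as f:
    print(list(f.read()))  # → [1, 100, 9]
```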
rr808 · 3d ago
parquet?
stevage · 3d ago
I hate this kind of format. It's trying to be both a data format for computers and a display format for humans. Much better off just using a tool that can edit CSV files as tables.
Also it doesn't seem to say anything about the header row?
nmz · 3d ago
CSV is also a display format for humans as well as for computers. It's also a terrible one because it's too variable: the field separator varies, escapes may exist, "" may be used, and all of this slows down parsing.
stevage · 3d ago
I wouldn't say CSV is a display format. Attempting to edit it by hand is pretty error prone, and reading it is hard work.
CJefferson · 3d ago
Honestly at this point my favorite format is JSONLines (one JSON object per line).
It instinctively feels horrible, but it’s easy to create and parse in basically every language, easy to fully specify, recovers well from one broken line in large datasets, chops up and concatenates easily.
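The recovery property is the nice part: since each record is its own line, a corrupt line only loses that record (sketch with made-up data):

```python
import json

# JSON Lines: one object per line.
lines = [
    '{"id": 1, "word": "kissa"}',
    '{"id": 2, "word": "koira"',   # truncated / corrupt line
    '{"id": 3, "word": "talo"}',
]
records = []
for line in lines:
    try:
        records.append(json.loads(line))
    except json.JSONDecodeError:
        continue  # skip the broken record, keep going
print([r["id"] for r in records])  # → [1, 3]
```

Concatenating two JSONL files is likewise just file concatenation, which a single JSON array can't offer.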
hiAndrewQuinn · 3d ago
I second this. I'm using JSONL to bake in the data for my single binary Finnish to English pocket dictionary ( https://github.com/hiAndrewQuinn/tsk ). It just makes things like data transformations so easy, especially with jq.
bvrmn · 3d ago
According to the spec, it's nearly impossible to correctly edit files in this format by hand.
mkl · 3d ago
How so? All you need is a text editor that preserves tabs.
bvrmn · 3d ago
1. It's quite easy to miss a tab and use only `|`.
2. Generated TPSV would look like an unreadable hard to edit mess. I doubt any tool would calculate max column length to adjust tab count for all cells. It basically kills any streaming.
mkl · 3d ago
You have a very strange definition of "nearly impossible".
> 1. It's quite easy to miss a tab and use only `|`.
Any format is hard to edit manually if you don't follow the requirements of the format (which are very simple in this case).
> 2. Generated TPSV would look like an unreadable hard to edit mess.
CSVs are much less readable than this, but still entirely possible to edit.
bvrmn · 3d ago
My wording is bad, I agree. My original thought was: complicated editing makes TPSV adoption nearly impossible.
AstroJetson · 3d ago
> A row with too many cells has the superfluous cells ignored.
Ummm, how do you figure out what row has too many cells? Can all the rows before this one have too few cells?
benjaminl · 3d ago
> 3. The first row defines the number of columns
The spec says that the first row specifies the number of columns.