How to implement tracking of changes in text documents à la MS-Word/Apple Pages

Question

I want to implement tracking of changes in plain-text documents, in a way similar to how it works in MS Word or Apple Pages. What I am unsure of is the data model and how to store it.

Goal

The expected properties:

1. disk space frugality
2. it should be possible to revert all the changes and get the original document
3. multiple users can suggest changes
4. the suggested changes can stack (it should be possible to do an edit of an edit)

Source-code version control systems like Git orient themselves by newlines. In plain-text documents, there can be a page-long text without a single newline character, which would make diffs unnecessarily big. So Git's out. Because of (1.), I want to avoid storing full copies of the document and computing the deltas between them. Desktop publishing software like MS Word or LibreOffice uses an abstract tree of document nodes instead of plain text, so I can’t use their approach. I really like Critic Markup (found it mentioned in this discussion).

Suggested solution

My plan is to use a modified version of Critic Markup where:

I. nesting would be allowed

II. each suggested change would also include an id pointing to a DB entry with change metadata: author_id, created, modified, accepted, accepted_by, rejected, rejected_by etc.
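A minimal sketch of what such a DB entry could look like, written as a Python dataclass. The field names come from the list in (II.); the types (integer user ids, timestamps for accepted/rejected) and the status helper are assumptions made only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional
from uuid import UUID


@dataclass
class ChangeMeta:
    """Metadata row for one suggested change (types are assumed, not given)."""
    id: UUID                              # the id embedded in the markup
    author_id: int
    created: datetime
    modified: Optional[datetime] = None
    accepted: Optional[datetime] = None   # when it was accepted, if ever
    accepted_by: Optional[int] = None
    rejected: Optional[datetime] = None   # when it was rejected, if ever
    rejected_by: Optional[int] = None

    @property
    def status(self) -> str:
        # Hypothetical helper: derive a single state from the timestamps.
        if self.accepted is not None:
            return "accepted"
        if self.rejected is not None:
            return "rejected"
        return "pending"
```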

Due to (I.) a parser will be required, but that’s ok. An example of such a document:

Don't go around saying{-- to people that--|d76b979c-d7a8-11e8-9f8b-f2801f1b9fd1} the world owes you a living. The world owes you nothing. It was here first. {~~One~>Only one~~|d21ef228-9cdb-447c-97b9-cccdee58e36c} thing is impossible for God: To find sense in any copyright law on the planet. {++ Truth is stranger {++(at least that’s what they say)++|cc6e3998-d649-4b85-8ba1-25c1aaeb1d91}than fiction ++|f41939f6-f203-43a4-9ac4-97690ebf1c8f}, but it is because Fiction is obliged to stick to possibilities; Truth isn’t.

Notice the nesting in the addition {++ … ++} block.

Nesting would be allowed only in accepted changes. Those would render as normal text (maybe with some additional info shown on mouse hover), whereas rejected edits, although still part of the document source, would not be rendered at all. When loading such a document, the parser would first build a tree of text and change nodes. The metadata for all changes would then be fetched from the DB, and with this information the document would be rendered.
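To make the parse-then-render step concrete, here is a rough sketch of a recursive parser for the modified syntax, inferred from the example document above. The names Change, parse and render are hypothetical, the ids are assumed to be 36-character UUIDs, and the set of accepted ids stands in for the metadata that would come from the DB. It does not handle escaping of the delimiter sequences inside normal text.

```python
import re
from dataclasses import dataclass, field
from typing import List, Set, Tuple, Union


@dataclass
class Change:
    kind: str                     # "add", "del" or "sub"
    change_id: str
    old: List["Node"] = field(default_factory=list)  # text removed by the change
    new: List["Node"] = field(default_factory=list)  # text introduced by the change


Node = Union[str, Change]

# opener -> (kind, closing token that precedes the id)
OPENERS = {"{++": ("add", "++|"), "{--": ("del", "--|"), "{~~": ("sub", "~~|")}
ID_RE = re.compile(r"([0-9a-fA-F-]{36})\}")  # assumption: ids are UUIDs


def _parse(text: str, pos: int, stops: Tuple[str, ...]) -> Tuple[List[Node], int]:
    """Collect nodes until one of `stops` is found (or end of input at top level)."""
    nodes: List[Node] = []
    buf: List[str] = []
    while pos < len(text):
        for stop in stops:
            if text.startswith(stop, pos):
                if buf:
                    nodes.append("".join(buf))
                return nodes, pos + len(stop)
        opener = text[pos:pos + 3]
        if opener in OPENERS:
            if buf:
                nodes.append("".join(buf))
                buf = []
            change, pos = _parse_change(text, pos + 3, *OPENERS[opener])
            nodes.append(change)
        else:
            buf.append(text[pos])
            pos += 1
    if stops:
        raise ValueError("unterminated change block")
    if buf:
        nodes.append("".join(buf))
    return nodes, pos


def _parse_change(text: str, pos: int, kind: str, closer: str) -> Tuple[Change, int]:
    old: List[Node] = []
    new: List[Node] = []
    if kind == "sub":
        old, pos = _parse(text, pos, ("~>",))
        new, pos = _parse(text, pos, (closer,))
    elif kind == "add":
        new, pos = _parse(text, pos, (closer,))
    else:
        old, pos = _parse(text, pos, (closer,))
    match = ID_RE.match(text, pos)
    if match is None:
        raise ValueError(f"missing change id at offset {pos}")
    return Change(kind, match.group(1), old, new), match.end()


def parse(text: str) -> List[Node]:
    nodes, _ = _parse(text, 0, ())
    return nodes


def render(nodes: List[Node], accepted: Set[str]) -> str:
    """Flatten the tree, applying only the changes whose id is in `accepted`."""
    out = []
    for node in nodes:
        if isinstance(node, str):
            out.append(node)
        else:
            side = node.new if node.change_id in accepted else node.old
            out.append(render(side, accepted))
    return "".join(out)
```

Calling render(parse(source), set()) applies no change at all, which is one way to get back the original document as required by (2.). A change nested inside a block that is not accepted is never reached, so it is not rendered either.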

Problems

Changes that span two "change nodes", or belong partly to a "change node" and partly to normal text. Example, in which such a change should concern the following sub-text "plums. And she":

Then she ate the apple{++, pear and some plums++|cc6e3998-d649-4b85-8ba1-25c1aaeb1d91}. And she liked it.

Possible solutions:

1. Split the change into two "change nodes"
2. Give up (2.) and (4.) and after the user accepts or rejects all the suggested changes and marks the document as resolved, simply turn the document back into a simple plain-text document.

What are your thoughts on this? Are there any better ways of achieving the goal from the title? Any complications that I don’t see? Thanks!

amon · Accepted Answer · 2018-10-25 11:06:12Z

Your use of CriticMarkup is flawed. It is suitable for showing the differences between two versions or for showing suggested changes. It is not a suitable format for representing the complete history of a document, for the reasons you listed.

You have rejected building a solution on top of Git because Git uses line-based diffs. This is incorrect:

There is a difference between Git's internal data model and the user interface presented through its command line tools.
Git's internal data model stores snapshots of the project in a compressed format, and occasionally applies delta-compression to its internal database. The resulting packfiles use a very compact binary format. Because plain text generally compresses well, the complete git history of a project can even be smaller than the checked-out project itself.
While git diff shows line-based diffs by default, you can switch to the --word-diff mode. Especially --word-diff=porcelain gives you a machine-readable format.

So instead of trying to invent your own version control system in the hope of beating Git, consider whether you can build an appropriate user interface on top of Git to track document changes. You may still want to incorporate CriticMarkup to add notes or suggested changes to the text.
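As a rough sketch of how a user interface could consume that machine-readable format: in porcelain mode each run is a line starting with a space (unchanged), + (added) or - (removed), and a newline in the input is shown as a lone ~ on its own line. The function names and the (op, text) tuple layout below are made up for illustration, and the repository path, revisions and file name are placeholders you would supply.

```python
import subprocess
from typing import List, Tuple

# Map the porcelain line prefix to a (hypothetical) operation name.
_PREFIX_TO_OP = {" ": "keep", "+": "add", "-": "del"}


def parse_word_diff(porcelain: str) -> List[Tuple[str, str]]:
    """Turn `git diff --word-diff=porcelain` output into (op, text) pairs."""
    ops: List[Tuple[str, str]] = []
    in_hunk = False
    for line in porcelain.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False          # a new file header begins
            continue
        if line.startswith("@@"):
            in_hunk = True           # hunk header; content follows
            continue
        if not in_hunk:
            continue                 # skip "index", "---" and "+++" header lines
        if line == "~":
            ops.append(("newline", "\n"))
        elif line[:1] in _PREFIX_TO_OP:
            ops.append((_PREFIX_TO_OP[line[0]], line[1:]))
    return ops


def word_diff(repo: str, old_rev: str, new_rev: str, path: str) -> List[Tuple[str, str]]:
    """Run git in `repo` and return the word-level changes of `path`."""
    result = subprocess.run(
        ["git", "-C", repo, "diff", "--word-diff=porcelain", old_rev, new_rev, "--", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_word_diff(result.stdout)
```

The resulting list can be rendered however the UI likes, for example "del" runs struck through and "add" runs highlighted, without the stored document containing any markup.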

Thank you, my knowledge of git guts wasn't that deep.

– Adam Libuša, Commented Nov 7, 2018 at 14:02
