I'm working on an app that enables users to collaborate on text articles, typically by highlighting or annotating specific spans of text.
I'll have an API that serves up the documents in some form (they're in .doc format right now, but I'd like to deliver them in something like Markdown). I can safely assume the articles' content will not change.
I'm currently stuck on the encoding format of these highlights. The problem is that these articles have some formatting (e.g., blockquotes where the author cites from an external article, as well as the typical line breaks and paragraph spacing), so the client would interpret that formatting differently than the server would.
For example, Markdown uses > characters to denote content in a blockquote, while HTML uses <blockquote>. So when a user highlights text that lives in a <blockquote>, my JavaScript code would need to do some messy calculations to get the correct character offsets.
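To make the mismatch concrete, here is a minimal sketch of the kind of calculation I mean. The sample text and the `plainOffsetToMarkdownOffset` helper are made up for illustration, and it only accounts for `> ` blockquote prefixes, not Markdown in general:

```javascript
// Hypothetical helper: map an offset in the rendered plain text back to an
// offset in the Markdown source. Only "> " blockquote prefixes are handled.
function plainOffsetToMarkdownOffset(markdown, plainOffset) {
  let plainPos = 0; // characters consumed so far in the rendered text
  let mdPos = 0;    // characters consumed so far in the Markdown source
  for (const line of markdown.split("\n")) {
    const prefixLen = line.startsWith("> ") ? 2 : 0;
    const visibleLen = line.length - prefixLen;
    if (plainOffset <= plainPos + visibleLen) {
      // The offset falls on this line: skip the prefix, then step in.
      return mdPos + prefixLen + (plainOffset - plainPos);
    }
    plainPos += visibleLen + 1; // +1 for the newline
    mdPos += line.length + 1;
  }
  return -1; // offset is past the end of the text
}

const markdown = "Intro line\n> quoted text\nOutro";
// What the user sees once the blockquote marker is stripped:
const plain = "Intro line\nquoted text\nOutro";

const start = plain.indexOf("quoted");                        // offset seen by the client
const mdStart = plainOffsetToMarkdownOffset(markdown, start); // offset in the source
```

Even in this tiny case the two offsets differ by the length of the `> ` prefix, and the gap grows with every quoted line that precedes the selection.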
Ultimately, I'd like to always be working with character offsets on the server as follows:
    // e.g., from the 55th character to the 58th character
    // offset = [55, 3]  (i.e., [start, length])
I've briefly considered a couple of other ways:
- send the article to the client in HTML, although this would yield the same problem, as I'd need to add CSS classes and such to the HTML markup
- send the article content as an array of strings (split on each line break in the article) and give a type to each item (e.g., 'normal' or 'blockquote') - though this seems like a naive way to approach this problem
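For reference, the second idea could look roughly like the sketch below. The `toBlocks` and `toGlobalOffset` names and the `{ type, text }` shape are assumptions for illustration only, and it treats every line as its own block:

```javascript
// Hypothetical: split an article into typed blocks, one per line.
function toBlocks(markdown) {
  return markdown.split("\n").map((line) =>
    line.startsWith("> ")
      ? { type: "blockquote", text: line.slice(2) }
      : { type: "normal", text: line }
  );
}

// Convert a position inside one block into an offset into the plain text
// obtained by joining all block texts with "\n".
function toGlobalOffset(blocks, blockIndex, offsetInBlock) {
  let offset = 0;
  for (let i = 0; i < blockIndex; i++) {
    offset += blocks[i].text.length + 1; // +1 for the joining newline
  }
  return offset + offsetInBlock;
}

const blocks = toBlocks("Intro line\n> quoted text\nOutro");

// The user highlights "quoted" at the start of block 1; the client sends
// the highlight to the server as [start, length].
const highlight = [toGlobalOffset(blocks, 1, 0), "quoted".length];
```

With this shape the client only ever measures offsets inside a single block's text, and the formatting markers never enter the calculation.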
Is there some other cleaner way of encoding these highlights from the client that I'm missing?
EDIT: For more clarity - this would be a client-side app (requiring a modern browser).