Data shape for formatted publication data

SixStringsCoder · February 10, 2022, 1:38am

Can you tell me how, for example, a multi-paragraph newspaper article, magazine article or educational course would store its formatted text as data?

In the case of a single paragraph with no formatting (i.e. no italics or bold), storing it as a string makes sense. But if that same paragraph had specific words to italicize or make bold, short of storing it as a string with HTML tags, how would one store it so that, when consuming it, it would be easy to get the formatted parts to show correctly?

Further to that example, imagine an article with 5 separate paragraphs, each having some italicized or bolded words, maybe even some words with footnote superscripts. Again, short of storing that as a long string of HTML (which innerHTML would render), how would one store that kind of data so that it could be used in the browser, via an API in a Native app, or wherever else?

Thanks for your help.

I do have some ideas how I could do this, but I’m curious how others have done this.

I also realize that for the browser I could just hardcode HTML with CSS to format it. But if I want it to be available as data that can be consumed outside of a browser, then I assume I have to think in terms of strings to store. How do I get the formatting to show without lots of acrobatics, splitting strings and looping to find specific words to format? That just doesn’t sound efficient, especially for long articles.

DanCouper · February 10, 2022, 6:45pm

Often as something like Markdown, i.e. in a plain text form where the text carries certain indicators to say things are bold or italic or whatever.

If there’s an editor for users, then it may have formatting buttons, but they will translate to the plain text form. That will then be saved. That in turn can then be parsed to HTML when needed.

HTML doesn’t work very well for this; you always want HTML to be the target rather than the storage format.

Edit: as an obvious example, this forum. So I can write some stuff using ~~different meaningful tags~~ and you can see the result here, but

(what I actually wrote and how it will be stored:)

Edit: as an obvious example, this forum. So *I can* write **some stuff** using ~~different meaningful tags~~ and `you can` see the result here, but
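To make the "store plain text, parse to HTML when needed" idea concrete, here is a minimal TypeScript sketch. It only handles the four inline markers used in the example above, `renderInline` is a hypothetical name, and the input is assumed to be trusted. It is a toy, not a Markdown parser; a real project would normally rely on an established Markdown library instead.

```typescript
// Toy converter for a tiny Markdown-like subset (bold, strikethrough,
// italic, inline code). Assumes trusted input: nothing is escaped here.
function renderInline(source: string): string {
  return source
    // Bold must be handled before italic, because ** also contains *.
    .replace(/\*\*(.+?)\*\*/g, "<strong>$1</strong>")
    .replace(/~~(.+?)~~/g, "<del>$1</del>")
    .replace(/\*(.+?)\*/g, "<em>$1</em>")
    .replace(/`(.+?)`/g, "<code>$1</code>");
}

// What gets stored is the plain-text form...
const stored =
  "So *I can* write **some stuff** using ~~tags~~ and `you can` see it";

// ...and HTML is produced only when it is needed for display.
const renderedHtml = renderInline(stored);
console.log(renderedHtml);
```

The stored string stays readable on its own, and a non-browser client can run the same text through its own renderer.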

bradtaniguchi · February 10, 2022, 6:48pm

It sounds like you’re asking how to save “rich-text” into a database, which is where a user can write text that is “fancy” or formatted.

You’re right that you could just take whatever the user wrote and save it into the database as a giant string in a specific format. That format could be the actual HTML, or another format such as markdown, which is what this forum uses, or your own data type made of nested objects that could be represented as JSON.

Regardless, you’re still just saving large amounts of text in “some format” that gets parsed at some level and rendered to end-users. Taking this forum as an example, you write your comment in markdown, but it gets rendered as HTML for all to see.

However, something like Google Docs has you write “rich-text” in some “Google format” that gets saved as JSON to represent your document data. You can see this data if you interact with Google Docs via its programming API.
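As a rough illustration of the “complex objects represented as JSON” approach, here is a minimal TypeScript sketch. Every type and field name in it (`TextSpan`, `marks`, `footnoteId`, and so on) is made up for this example and is not Google Docs’ actual format; the sample text is placeholder data and is assumed to be trusted.

```typescript
// Hypothetical document shape: a paragraph is a list of spans, and each
// span carries its own formatting marks instead of embedded markup.
type Mark = "bold" | "italic";

interface TextSpan {
  text: string;
  marks?: Mark[];
  footnoteId?: string; // optional key into Article.footnotes
}

interface Paragraph {
  spans: TextSpan[];
}

interface Article {
  title: string;
  paragraphs: Paragraph[];
  footnotes: Record<string, string>;
}

const article: Article = {
  title: "Sample article",
  paragraphs: [
    {
      spans: [
        { text: "This sentence has an " },
        { text: "italic", marks: ["italic"], footnoteId: "note-1" },
        { text: " word and a " },
        { text: "bold", marks: ["bold"] },
        { text: " word." },
      ],
    },
  ],
  footnotes: { "note-1": "Placeholder footnote text." },
};

function hasMark(span: TextSpan, mark: Mark): boolean {
  return (span.marks ?? []).indexOf(mark) !== -1;
}

// HTML is one possible target; span text is assumed trusted here.
function spanToHtml(span: TextSpan): string {
  let out = span.text;
  if (hasMark(span, "italic")) out = `<em>${out}</em>`;
  if (hasMark(span, "bold")) out = `<strong>${out}</strong>`;
  if (span.footnoteId) out += `<sup data-footnote="${span.footnoteId}">*</sup>`;
  return out;
}

function paragraphToHtml(p: Paragraph): string {
  return `<p>${p.spans.map(spanToHtml).join("")}</p>`;
}

// A plain-text target simply ignores the marks.
function paragraphToText(p: Paragraph): string {
  return p.spans.map((s) => s.text).join("");
}

const paragraphHtml = paragraphToHtml(article.paragraphs[0]);
const paragraphText = paragraphToText(article.paragraphs[0]);
console.log(paragraphHtml);
console.log(paragraphText);
```

Because the formatting lives in data rather than in markup, a native client could map the same marks to its own text styles without parsing any HTML.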

The nice part about using something like markdown is that this job is done for you. There are plenty of ways to parse markdown and render it to individual platforms, not just HTML.

However, the issue with this is that writing markdown limits you to that specific spec. There are other specs for different use-cases, like LaTeX, which is used in academia. Or if your users don’t want to learn that ** creates bold text, then you might need to “hide away” the data in an object, which is what Google Docs does, and what some rich-text editors do for you.

Generally, unless you want to create your own data format, you’d leverage one of these tools.

Finally, it’s worth pointing out that there is a key security issue you must pay attention to when doing something like this, and that’s XSS, or Cross-Site Scripting. This is where a malicious user can run code in other people’s browsers by injecting it through these input boxes, whose contents are then rendered.

If for example we take this forum, I write the following code in my message:

<script>
  alert('HACKED!');
</script>

and post my message, and the forum has no XSS precautions, then the browser of any user who views my comment would execute it as HTML/JS, and that user would “get hacked”. This code is benign, but it could just as well be actually malicious code.

The correct way to prevent this is sanitization. Sanitization means the above code isn’t saved or run as HTML; instead it is treated as plain text that merely looks like HTML, which is safe. This should be done before saving it into your database, and possibly again later when the page is being rendered (in case it somehow didn’t get sanitized when being saved).
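The simplest form of this is escaping: replacing the characters HTML treats as markup with their entity equivalents, so the browser displays them instead of interpreting them. The sketch below is a minimal TypeScript illustration; `escapeHtml` is a hypothetical helper name, and it escapes everything rather than allowing a safe subset of tags, which is the harder job that maintained sanitizer libraries exist for.

```typescript
// Replace the characters that HTML treats as markup with entities.
// The ampersand must be handled first so that the entities produced by
// the later replacements are not escaped a second time.
function escapeHtml(untrusted: string): string {
  return untrusted
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

const comment = "<script>alert('HACKED!');</script>";
const safeComment = escapeHtml(comment);

// The result contains no literal "<" or ">", so a browser shows the
// text of the script tag instead of running it.
console.log(safeComment);
```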

Most client-side libraries that can handle rendering “rich-text”, whether from markdown or other formats, provide utilities and documentation to prevent this from happening. This is also why letting users write HTML directly and rendering it without any sanitization isn’t a good idea.

SixStringsCoder · February 11, 2022, 7:56am

Thank you, Dan and Brad, for your replies.

My data will not interface with any user input (i.e. no forms). It’s just data that will be used to populate a webpage or Native app via API calls, or just stored in the app as a JSON file. But it still sounds like HTML markup should not be used for storage (i.e. stored as strings to be interpolated) and should only be the target format.

The Markdown solution is the one I heard about, too. I suppose I would use something like Marked, which parses the markdown to HTML? Here’s one for Swift, so that could help me on the Native side.

I’ll also look into LaTeX. I’d heard of it before but know nothing about it. The data is actually for an online course about the Pali language, so it is educational content. I’ll see if this offers me more flexibility than markdown, since I do have lots of footnotes to account for, though I see that markdown can do footnotes. What I love about HTML is that I can create data attributes for these footnotes that JavaScript can access for pop-ups. So I’m trying to think about how to handle that with rich-text data.

I appreciate the replies. I have some hope now and will just do more investigation.

SixStringsCoder · February 14, 2022, 6:06am

As I’m looking into a Markdown solution, I’m coming up short because I can’t find a way to include ID and class attributes in the Markdown.

Example <div class="pali" id="ref-note-1">Some word</div>

Has this been your experience, too: that Markdown doesn’t normally allow attributes like ID and class to be written in Markdown syntax (as opposed to inline HTML, which the Markdown program can usually read)?

I’m working in SvelteKit, so I’d like to run the data through a Markdown module then render it in the component with the correct IDs and classes.
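For illustration, some Markdown flavors extend the syntax with attribute blocks; Pandoc’s bracketed spans, for instance, are written as `[Some word]{#ref-note-1 .pali}`. The TypeScript sketch below shows roughly what handling such a span involves. `renderAttributedSpans` is a hypothetical helper rather than part of any Markdown module, it produces a `span` rather than a `div`, it assumes trusted input, and whether a given parser supports this syntax has to be checked in that parser’s documentation.

```typescript
// Hypothetical helper: turns bracketed spans such as
// [Some word]{#ref-note-1 .pali} into HTML spans with id/class attributes.
// Only "#id" and ".class" tokens are recognised; anything else is ignored.
function renderAttributedSpans(source: string): string {
  return source.replace(
    /\[([^\]]+)\]\{([^}]*)\}/g,
    (_match: string, text: string, attrs: string) => {
      const tokens = attrs.split(/\s+/).filter((t) => t.length > 0);
      const ids = tokens
        .filter((t) => t.charAt(0) === "#")
        .map((t) => t.slice(1));
      const classes = tokens
        .filter((t) => t.charAt(0) === ".")
        .map((t) => t.slice(1));
      const idAttr = ids.length > 0 ? ` id="${ids[0]}"` : "";
      const classAttr =
        classes.length > 0 ? ` class="${classes.join(" ")}"` : "";
      return `<span${idAttr}${classAttr}>${text}</span>`;
    }
  );
}

const attributed = renderAttributedSpans(
  "A term: [Some word]{#ref-note-1 .pali}."
);
console.log(attributed);
```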

Would XML be a better solution since it can use attributes or is XML considered outdated?