Structuring HTML Content

26 January 2023

Updated: 03 September 2023

Isn’t HTML a structure?

HTML content is structured as a tree - while this is useful for the medium, this structure isn’t very convenient for transforming into data that can be used outside of the web or with libraries that use content in a more flat structure

While building Articly I wanted to use a library called EditorJS for displaying and making text content interactive. An immediate problem I ran into was importing content into the editor since it requires the data in a specific format - which is not HTML but rather an array of objects with simplified content

StackOverflow is helpful sometimes

In order to get the content into the EditorJS format, I needed to find a way to transform the HTML that I had from scraping the web and reading RSS feeds into something I could use as some kind of base

After a little bit of searching, I found this handy function on StackOverflow:

//Recursively loop through DOM elements and assign properties to object
function treeHTML(element, object) {
  object['type'] = element.nodeName
  var nodeList = element.childNodes
  if (nodeList != null) {
    if (nodeList.length) {
      object['content'] = []
      for (var i = 0; i < nodeList.length; i++) {
        if (nodeList[i].nodeType == 3) {
          object['content'].push(nodeList[i].nodeValue)
        } else {
          object['content'].push({})
          treeHTML(nodeList[i], object['content'][object['content'].length - 1])
        }
      }
    }
  }
  if (element.attributes != null) {
    if (element.attributes.length) {
      object['attributes'] = {}
      for (var i = 0; i < element.attributes.length; i++) {
        object['attributes'][element.attributes[i].nodeName] =
          element.attributes[i].nodeValue
      }
    }
  }
}

Using the above as a guideline, I concluded that the main thing that I need was to iterate over the HTML content in a way that would allow me to build a content array which is recursive - the idea at this point is not to remove the tree structure, but rather to transform it into something that’s a bit easier to work with
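To make the shape of that output concrete, here is a minimal typed sketch of the same traversal. It is not the StackOverflow code itself: it returns a new object instead of mutating one, and it works on a hypothetical NodeLike type (an assumption made so the sketch can run without a browser DOM). The names toTree, NodeLike and Tree are made up for illustration.

```typescript
// Minimal node shape, assumed here so the sketch runs without a DOM
type NodeLike = {
  nodeName: string
  nodeType: number // 1 = element, 3 = text
  nodeValue: string | null
  childNodes: NodeLike[]
  attributes?: { name: string; value: string }[]
}

type Tree = {
  type: string
  content?: (string | Tree)[]
  attributes?: Record<string, string>
}

const TEXT_NODE = 3

const toTree = (node: NodeLike): Tree => {
  const tree: Tree = { type: node.nodeName }

  // text nodes become plain strings, everything else recurses
  if (node.childNodes.length) {
    tree.content = node.childNodes.map((child) =>
      child.nodeType === TEXT_NODE ? child.nodeValue || '' : toTree(child)
    )
  }

  // attributes are copied into a plain key/value object
  if (node.attributes && node.attributes.length) {
    const attributes: Record<string, string> = {}
    for (const attr of node.attributes) {
      attributes[attr.name] = attr.value
    }
    tree.attributes = attributes
  }

  return tree
}

// Hand-built stand-in for <p class="intro">Hi <b>there</b></p>
const textNode = (value: string): NodeLike => ({
  nodeName: '#text',
  nodeType: TEXT_NODE,
  nodeValue: value,
  childNodes: [],
})

const sampleNode: NodeLike = {
  nodeName: 'P',
  nodeType: 1,
  nodeValue: null,
  attributes: [{ name: 'class', value: 'intro' }],
  childNodes: [
    textNode('Hi '),
    { nodeName: 'B', nodeType: 1, nodeValue: null, childNodes: [textNode('there')] },
  ],
}

const sampleTree = toTree(sampleNode)
// { type: 'P', content: ['Hi ', { type: 'B', content: ['there'] }], attributes: { class: 'intro' } }
```

The result is still a tree, which is the point made above: the traversal gives a simpler tree to work with, not a flat list.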

A different tree

Now that I had an idea of how to approach the problem, the next step was to define the structure I wanted, below is the final structure I decided on:

  1. The source HTML element object needs to be stored as this may be useful to downstream processors
  2. The tagName of the element will be needed to determine how a specific element needs to be handled
  3. The textContent and innerHTML of the element to be stored - this is the core data of the element
  4. The attributes of the HTML element should be unwrapped to make it easier for downstream code to read
  5. The children of the element need to be converted into the same type by recursively following steps 1-5

Based on the above, the type definition can be seen below along with the implementation:

type TagName = Uppercase<keyof HTMLElementTagNameMap>

type HTMLStringContent = string

type TransformResult = {
  element: Element
  tagName: TagName
  textContent?: string
  htmlContent?: HTMLStringContent
  attrs: Record<string, string>
  children: TransformResult[]
}

const transform = (el: Element): TransformResult => ({
  element: el,
  // tag names are strings internally but that's not very informative downstream
  tagName: el.tagName as TagName,
  textContent: el.textContent || undefined,
  htmlContent: el.innerHTML,
  children: Array.from(el.children).map(transform),
  attrs: Array.from(el.attributes).reduce(
    (acc, { name, value }) => ({ ...acc, [name]: value }),
    {}
  ),
})

Now that I have the data as a tree structure, we can pass in some simple HTML to see what pops out the other end.

Given the following HTML content:

<div>
  <p>Hello World</p>

  <section>
    <img src="hello.jpg" alt="this is an image" />
  </section>
</div>

We can transform it like so:

// use the DOM to parse it from a string (`html` holds the markup above)
const el = new DOMParser().parseFromString(html, 'text/html').body

// the transform function takes an HTML Element
const transformed = transform(el)

The transformed data looks something like this:

{
  "element": {},
  "tagName": "BODY",
  "textContent": "\n Hello World\n\n \n \n \n",
  "htmlContent": "<div>\n <p>Hello World</p>\n\n <section>\n <img src=\"hello.jpg\" alt=\"this is an image\">\n </section>\n</div>",
  "children": [
    {
      "element": {},
      "tagName": "DIV",
      "textContent": "\n Hello World\n\n \n \n \n",
      "htmlContent": "\n <p>Hello World</p>\n\n <section>\n <img src=\"hello.jpg\" alt=\"this is an image\">\n </section>\n",
      "children": [
        {
          "element": {},
          "tagName": "P",
          "textContent": "Hello World",
          "htmlContent": "Hello World",
          "children": [],
          "attrs": {}
        },
        {
          "element": {},
          "tagName": "SECTION",
          "textContent": "\n \n ",
          "htmlContent": "\n <img src=\"hello.jpg\" alt=\"this is an image\">\n ",
          "children": [
            {
              "element": {},
              "tagName": "IMG",
              "htmlContent": "",
              "children": [],
              "attrs": { "src": "hello.jpg", "alt": "this is an image" }
            }
          ],
          "attrs": {}
        }
      ],
      "attrs": {}
    }
  ],
  "attrs": {}
}

This is not too dissimilar to the structure we would have had if we had just used DOMParser directly, however it now has a lot less noise
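When checking a result like this, a compact outline can be easier to scan than the full JSON. Below is a minimal sketch of one way to print one; the outline helper and the trimmed-down OutlineNode type are hypothetical names for illustration, and the sample tree is written by hand so that the snippet runs without a DOM.

```typescript
// Only the fields the outline needs; the real TransformResult has more
type OutlineNode = {
  tagName: string
  attrs: Record<string, string>
  children: OutlineNode[]
}

// Returns one line per element, indented by depth
const outline = (node: OutlineNode, indent = ''): string[] => {
  const attrNames = Object.keys(node.attrs)
  const label = attrNames.length
    ? `${node.tagName} [${attrNames.join(', ')}]`
    : node.tagName

  const childLines = node.children.reduce<string[]>(
    (lines, child) => lines.concat(outline(child, indent + '  ')),
    []
  )

  return [indent + label].concat(childLines)
}

// Hand-written equivalent of the transformed sample HTML
const sampleOutlineTree: OutlineNode = {
  tagName: 'BODY',
  attrs: {},
  children: [
    {
      tagName: 'DIV',
      attrs: {},
      children: [
        { tagName: 'P', attrs: {}, children: [] },
        {
          tagName: 'SECTION',
          attrs: {},
          children: [
            {
              tagName: 'IMG',
              attrs: { src: 'hello.jpg', alt: 'this is an image' },
              children: [],
            },
          ],
        },
      ],
    },
  ],
}

console.log(outline(sampleOutlineTree).join('\n'))
// BODY
//   DIV
//     P
//     SECTION
//       IMG [src, alt]
```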

Deforestation

As I mentioned earlier, we need to transform the data into a flat array of items - so the question that comes up now is - how do we do that?

Looking at the input HTML we used, we can break things up into two types of elements - containers and content

Containers are pretty much useless to the content structure we’re trying to build - the content is all we really care about

This is an important distinction because it tells us what we can throw away

Secondly, we can think of the content inside of a container as an array of content, once we remove all the containers, this content will become flat

So for example, this HTML:

<div>
  <p>Hello World</p>

  <section>
    <img src="hello.jpg" alt="this is an image" />
  </section>
</div>

When we remove the containers, it can be thought of as:

<p>Hello World</p>

<img src="hello.jpg" alt="this is an image" />

Which can be thought of as an array like so:

[paragraph, image]

This is our end goal. In order to get here we still have to figure out two things:

  1. How can we separate out the containers from the content
  2. How can we transform the content into the structure that’s useful to us

Separating the leaves from the wood

If we think of containers as having no meaningful data, and just being containers for content, then we can conclude that a way to view the data structure is as an array of content - the container is the array, and the content is the items in the array

So, we can write a transformer for a container as something that just returns an array of content

const removeContainer = (data: TransformResult) => data.children

The above is pretty useful, since this means that the following HTML:

<div>
  <p>Hello World</p>

  <section>
    <img src="hello.jpg" alt="this is an image" />
  </section>
</div>

Will essentially be converted to this:

<p>Hello World</p>

<section>
  <img src="hello.jpg" alt="this is an image" />
</section>

Now, we still see that there’s a section left over since we only returned one layer of children. Since we already have a way to get rid of the wrapper, we can just apply that to the child that’s a section.

So we could have something like this:

const removeContainer = (data: TransformResult) =>
  data.children.map((child) =>
    isContainer(child) ? removeContainer(child) : child
  )

Now the function is a bit weird because the inner map is either returning an array if the child is a container, or a single child if it’s not - for consistency, let’s just always return an array:

const removeContainer = (data: TransformResult) =>
  data.children.map((child) =>
    isContainer(child) ? removeContainer(child) : [child]
  )

Much better, but now we’ve introduced something weird - instead of returning a TransformResult[] we’re now returning nested arrays: a TransformResult[][], or deeper still when containers are nested inside each other. Let’s leave this here for now since we can always unwrap the arrays later. Importantly, we now know that we’ve eliminated the wrappers, so the content we’re left with represents:

<p>Hello World</p>

<img src="hello.jpg" alt="this is an image" />

So this is pretty great, and is the general idea of how we can unwrap things - next up we can talk about transforming the specific elements into useful data blocks
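The unwrapping idea from this section can also be sketched so that it produces a flat array directly, by using flatMap instead of map. This is an alternative sketch and not the version used later in this post; TreeNode, isContainerTag and unwrap are hypothetical names, and the tree is built by hand so the snippet runs without a DOM.

```typescript
// Trimmed-down stand-in for TransformResult (assumption: only these fields matter here)
type TreeNode = { tagName: string; children: TreeNode[] }

const CONTAINER_TAGS = ['BODY', 'DIV', 'SECTION']

const isContainerTag = (node: TreeNode): boolean =>
  CONTAINER_TAGS.indexOf(node.tagName) !== -1

// flatMap merges each child's array into the result, so nothing ends up nested
const unwrap = (node: TreeNode): TreeNode[] =>
  node.children.flatMap((child) =>
    isContainerTag(child) ? unwrap(child) : [child]
  )

const leaf = (tagName: string): TreeNode => ({ tagName, children: [] })

// Hand-built equivalent of the sample HTML: div > (p, section > img)
const sampleDiv: TreeNode = {
  tagName: 'DIV',
  children: [leaf('P'), { tagName: 'SECTION', children: [leaf('IMG')] }],
}

const flatTags = unwrap(sampleDiv).map((node) => node.tagName)
// flatTags is ['P', 'IMG']
```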

Building blocks

EditorJS has different sections - blocks as it calls them - of content. These are simple JavaScript objects that represent the data for the block

For the sake of our discussion, we’re going to consider two blocks - ParagraphBlock and ImageBlock

The simplified types that represent their data can be seen below:

export type ParagraphBlock = {
  type: 'paragraph'
  data: {
    text: string
  }
}

export type ImageBlock = {
  type: 'image'
  data: {
    url: string
    caption: string
  }
}

We can look at the TransformResult for each of these elements from when we passed the HTML into our transform function previously

For the paragraph:

{
  "element": {},
  "tagName": "P",
  "textContent": "Hello World",
  "htmlContent": "Hello World",
  "children": [],
  "attrs": {}
}

Which can be translated to the ParagraphBlock data as:

{
  type: "paragraph",
  data: {
    text: "Hello World"
  }
}

A function for doing this could look something like:

const convertParagraph = (data: TransformResult): ParagraphBlock | undefined =>
  data.textContent
    ? {
        type: 'paragraph',
        data: {
          text: data.textContent,
        },
      }
    : undefined

Cool, this lets us transform a paragraph into some structured data.

We can do something similar for images:

const convertImage = (data: TransformResult): ImageBlock | undefined =>
  data.attrs.src
    ? {
        type: 'image',
        data: {
          url: data.attrs.src,
          caption: data.attrs.alt || '',
        },
      }
    : undefined
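To see how the two converters behave on missing data, here is a small self-contained sketch. ConverterInput is a hypothetical stand-in for TransformResult that only keeps the fields the converters read, so the snippet can run without a DOM.

```typescript
type ParagraphBlock = { type: 'paragraph'; data: { text: string } }
type ImageBlock = { type: 'image'; data: { url: string; caption: string } }

// Only the fields the converters read; a stand-in for TransformResult
type ConverterInput = { textContent?: string; attrs: Record<string, string> }

const convertParagraph = (data: ConverterInput): ParagraphBlock | undefined =>
  data.textContent
    ? { type: 'paragraph', data: { text: data.textContent } }
    : undefined

const convertImage = (data: ConverterInput): ImageBlock | undefined =>
  data.attrs.src
    ? { type: 'image', data: { url: data.attrs.src, caption: data.attrs.alt || '' } }
    : undefined

// A paragraph with text becomes a block
const helloParagraph = convertParagraph({ textContent: 'Hello World', attrs: {} })
// { type: 'paragraph', data: { text: 'Hello World' } }

// A paragraph without text is skipped
const emptyParagraph = convertParagraph({ attrs: {} })
// undefined

// A missing alt attribute falls back to an empty caption
const uncaptionedImage = convertImage({ attrs: { src: 'hello.jpg' } })
// { type: 'image', data: { url: 'hello.jpg', caption: '' } }

// An image without a src is skipped
const sourcelessImage = convertImage({ attrs: { alt: 'no source' } })
// undefined
```

Returning undefined lets the caller decide what to do with elements that carry no usable content.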

Putting it all together

So now that we know how to transform the HTML into something useful, remove the wrappers, and represent individual HTML sections as content blocks, we can put it all together into something that lets us convert a section of HTML fully:

First, we can update the removeContainer function to call the converter on the type of tag that it finds:

// note that we need a handler for BODY since the `DOMParser` will always add a body element when parsing
const isContainer = (data: TransformResult) =>
  data.tagName === 'BODY' ||
  data.tagName === 'DIV' ||
  data.tagName === 'SECTION'

const removeContainer = (data: TransformResult) =>
  data.children.map((child) => {
    if (isContainer(child)) {
      return removeContainer(child)
    } else {
      if (child.tagName === 'IMG') {
        const block = convertImage(child)

        return block ? [block] : []
      } else if (child.tagName === 'P') {
        const block = convertParagraph(child)

        return block ? [block] : []
      }

      // any other tag is ignored rather than left as `undefined`
      return []
    }
  })

Now, you can probably see a pattern that’s going to arise as we add more and more elements that we want to handle - so it may be better to create a list of handlers for different tag types:

type Block = ParagraphBlock | ImageBlock

const handlers: Partial<Record<TagName, (data: TransformResult) => Block[]>> = {
  // content blocks
  IMG: convertImage,
  P: convertParagraph,

  // container blocks - we will always want to remove these
  DIV: removeContainer,
  SECTION: removeContainer,
  BODY: removeContainer,
}

Using the above structure, we can tweak the convertImage and convertParagraph functions a bit so that they return the Block[] consistently:

const convertParagraph = (data: TransformResult): ParagraphBlock[] =>
  data.textContent
    ? [
        {
          type: 'paragraph',
          data: {
            text: data.textContent,
          },
        },
      ]
    : []

const convertImage = (data: TransformResult): ImageBlock[] =>
  data.attrs.src
    ? [
        {
          type: 'image',
          data: {
            url: data.attrs.src,
            caption: data.attrs.alt || '',
          },
        },
      ]
    : []

And we can update the removeContainer function to handle things a bit more generically:

const removeContainer = (data: TransformResult): Block[] => {
  const contentArr = data.children.map((child) => {
    const handler = handlers[child.tagName]

    if (!handler) {
      return []
    }

    return handler(child)
  })

  return contentArr.flat(1)
}

If you’re really attentive, you’ll notice the contentArr.flat(1) that was added in the above snippet. This flattens the Block[][] into a Block[]
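The flattening step can be seen in isolation in the sketch below. The perChild array is made-up sample data standing in for what the map over the children produces. A depth of 1 is enough because every handler already returns a flat Block[], including removeContainer itself when it handles a nested container.

```typescript
// Simplified block type for the sake of the example
type SimpleBlock = { type: string }

// One array per child: a content element yields one block, an unknown tag
// yields none, and a nested container yields several
const perChild: SimpleBlock[][] = [
  [{ type: 'paragraph' }],
  [],
  [{ type: 'image' }, { type: 'paragraph' }],
]

const flattened: SimpleBlock[] = perChild.flat(1)

const flattenedTypes = flattened.map((block) => block.type)
// flattenedTypes is ['paragraph', 'image', 'paragraph']
```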

Once we’ve got that, we can define a convert function that will take the HTML and output the structured blocks like so:

const convert = (el: Element): Block[] => {
  const transformed = transform(el)

  const initialHandler = handlers[transformed.tagName]

  if (!initialHandler) {
    throw new Error('No handler found for top-level wrapper')
  }

  return initialHandler(transformed)
}
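To check the whole pipeline on the sample HTML from earlier, here is a self-contained sketch that runs without a browser. It assumes a hypothetical PlainNode type, which is the TransformResult without its DOM element field, and a hand-written tree in place of the output of transform. The convertPlain name is also made up for this example.

```typescript
type ParagraphBlock = { type: 'paragraph'; data: { text: string } }
type ImageBlock = { type: 'image'; data: { url: string; caption: string } }
type Block = ParagraphBlock | ImageBlock

// Stand-in for TransformResult without the DOM `element` field (an assumption
// made so the sketch runs outside the browser)
type PlainNode = {
  tagName: string
  textContent?: string
  attrs: Record<string, string>
  children: PlainNode[]
}

type Handler = (node: PlainNode) => Block[]

const convertParagraph: Handler = (node) =>
  node.textContent ? [{ type: 'paragraph', data: { text: node.textContent } }] : []

const convertImage: Handler = (node) =>
  node.attrs.src
    ? [{ type: 'image', data: { url: node.attrs.src, caption: node.attrs.alt || '' } }]
    : []

const removeContainer: Handler = (node) =>
  node.children
    .map((child) => {
      const handler = handlers[child.tagName]
      return handler ? handler(child) : []
    })
    .flat(1)

const handlers: Record<string, Handler | undefined> = {
  IMG: convertImage,
  P: convertParagraph,
  DIV: removeContainer,
  SECTION: removeContainer,
  BODY: removeContainer,
}

const convertPlain = (root: PlainNode): Block[] => {
  const initialHandler = handlers[root.tagName]

  if (!initialHandler) {
    throw new Error('No handler found for top-level wrapper')
  }

  return initialHandler(root)
}

// Hand-written equivalent of body > div > (p, section > img)
const sampleBody: PlainNode = {
  tagName: 'BODY',
  attrs: {},
  children: [
    {
      tagName: 'DIV',
      attrs: {},
      children: [
        { tagName: 'P', textContent: 'Hello World', attrs: {}, children: [] },
        {
          tagName: 'SECTION',
          attrs: {},
          children: [
            { tagName: 'IMG', attrs: { src: 'hello.jpg', alt: 'this is an image' }, children: [] },
          ],
        },
      ],
    },
  ],
}

const sampleBlocks = convertPlain(sampleBody)
// [
//   { type: 'paragraph', data: { text: 'Hello World' } },
//   { type: 'image', data: { url: 'hello.jpg', caption: 'this is an image' } }
// ]
```

The output is the [paragraph, image] array that was set as the goal earlier on.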

Adding more content types

That’s about it. Handling more specific types of content or other HTML elements is just a matter of following the same recipe as above:

  1. Define if an element is a container or content
  2. If it’s a container, just use the removeContainer handler
  3. If it’s content then define a handler for the specific kind of content

If you roll this out for loads of elements you’ll eventually have a pretty robust converter
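As an illustration of the recipe, the sketch below adds headings as a new content type and treats ARTICLE as an extra container. The HeaderBlock shape is an assumption for this example and should be checked against the block tool you actually use. PlainNode is a hypothetical DOM-free stand-in for TransformResult, and the input tree is written by hand.

```typescript
type ParagraphBlock = { type: 'paragraph'; data: { text: string } }
// Hypothetical block shape for headings; check it against the block tool you use
type HeaderBlock = { type: 'header'; data: { text: string; level: number } }
type Block = ParagraphBlock | HeaderBlock

// DOM-free stand-in for TransformResult with only the fields used here
type PlainNode = { tagName: string; textContent?: string; children: PlainNode[] }
type Handler = (node: PlainNode) => Block[]

const convertParagraph: Handler = (node) =>
  node.textContent ? [{ type: 'paragraph', data: { text: node.textContent } }] : []

// Step 3: a content handler; the level comes from the tag name, e.g. H2 gives 2
const convertHeading: Handler = (node) =>
  node.textContent
    ? [
        {
          type: 'header',
          data: { text: node.textContent, level: Number(node.tagName.charAt(1)) },
        },
      ]
    : []

// Step 2: every container shares the same handler
const removeContainer: Handler = (node) =>
  node.children
    .map((child) => {
      const handler = handlers[child.tagName]
      return handler ? handler(child) : []
    })
    .flat(1)

const handlers: Record<string, Handler | undefined> = {
  P: convertParagraph,
  H1: convertHeading,
  H2: convertHeading,
  H3: convertHeading,
  DIV: removeContainer,
  ARTICLE: removeContainer, // step 1: classified as a container
}

const sampleArticle: PlainNode = {
  tagName: 'ARTICLE',
  children: [
    { tagName: 'H2', textContent: 'Title', children: [] },
    {
      tagName: 'DIV',
      children: [{ tagName: 'P', textContent: 'Body text', children: [] }],
    },
    // ASIDE has no handler, so it is dropped
    { tagName: 'ASIDE', textContent: 'ignored', children: [] },
  ],
}

const articleBlocks = removeContainer(sampleArticle)
// [
//   { type: 'header', data: { text: 'Title', level: 2 } },
//   { type: 'paragraph', data: { text: 'Body text' } }
// ]
```

Tags without a handler are dropped silently, so it is worth checking which elements your real input contains before relying on the output.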

Conclusion

That’s it! We’ve covered the basics for building a transformer like this, and once you have a good feel for how this works, the concepts can be applied to loads of different use cases

If you’d like to see the completed version of my converter, you can take a look at the html-editorjs GitHub repo and if you’d like to look at it in action in an application then take a look at Articly