The general layout of articles on mucka.pro is based on the technical article style I am familiar with. After writing several article drafts, I noticed a recurring problem. I want figures and listings to have a number and a caption. Markdown does not support my idea of figures, and neither does mistletoe. I have to manually wrap every figure or listing in HTML so that it renders properly. This is quite tedious, especially when editing or making corrections. I decided to help myself and add a new feature to the generator.

Idea

I would like an additional function block in Markdown that embeds a figure or a listing. The function block should be flexibly parametrised to avoid backward-incompatible changes in the near future.

Assumptions

Function block should link to an external file with an image or code. Function block should be marked with a text identifier, which will allow linking. Function block should include a caption. Function block should handle enumeration automatically. Function block should contain the content format, even if it is not used for now.

Source code function block should indicate the language of the code snippet. Source code function block should allow selecting lines from the source file.

Video function block should allow setting HTML attributes like autoplay or loop.

Design

A function block starts with a ! marker, followed by the block type: image, video or code. The rest of the function block is a YAML dictionary. A function block is separated from the surrounding text by empty lines. Function block examples are shown in listing 1.


!image
id: fig2
format: png
link: image.png
caption: Lorem ipsum

!video
id: fig1
format: webm
attributes:
    autoplay: on
    muted: on
    loop: on
link: video.webm
caption: Lorem ipsum

!code
id: lis2
language: python
file: code.py
lines: 1,4-5,8-10
caption: Ipsum lorem

Listing 1. Function block example

A graphic function block contains a file name, a format and a caption. A video function block is extended with special attributes. A source code function block is extended with a list of lines.

The standard link syntax [listing #](#lis2) should be used to refer to a function block. The # in the link caption will be replaced with the actual number. The link reference, e.g. #lis2, should match the id parameter.

Implementation

I want to replicate the HTML that I wrote by hand before. To achieve this I will use a simple parametrised template with str.format(). I will try to make template rendering multilingual. I am using Latin (la) as a test environment for new features. An example HTML template is shown in listing 2.


figure_names = {
    "la" : "Figure",
    "en" : "Figure",
    "pl" : "Rysunek",
}

figure_template = """
<figure id="{id}">
  <img src="{link}" />
  <figcaption>{name} {index}. {caption}</figcaption>
</figure>
"""

Listing 2. Graphic HTML template
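
As a usage sketch, the template from listing 2 can be filled in with str.format(). The helper render_figure and all values passed to it are made-up sample data for illustration, not part of the generator.

```python
figure_names = {"la": "Figure", "en": "Figure", "pl": "Rysunek"}

figure_template = """
<figure id="{id}">
  <img src="{link}" />
  <figcaption>{name} {index}. {caption}</figcaption>
</figure>
"""


def render_figure(language, index, figure_id, link, caption):
    """Fill the figure template (hypothetical helper, sample data only)."""
    return figure_template.format(
        id=figure_id,
        link=link,
        caption=caption,
        name=figure_names[language],  # translated word for "Figure"
        index=index,
    )


html_output = render_figure("pl", 1, "fig1", "image.png", "Lorem ipsum")
print(html_output)
```

Only the name lookup depends on the language, so adding a new language should only require a new entry in figure_names.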

Token parsing

The mistletoe library, which I currently use for rendering, has a simple but slightly tricky extension interface. I need to implement a new token parser and extend the existing renderer.

First I need to add a class for the new token parser. The block token base class is named BlockToken. I need to override two methods: start and read. Both should be declared as @classmethod or @staticmethod. The start method should accept a line and return true if the line matches the first line of the block. The read method should return the block extracted from the text as a list of lines.

The existing implementation of read extracts text until it finds an empty line. It is an almost perfect solution. I want the same behaviour, but in addition I want to ignore the first line. The first line is just a marker; the rest of the block is YAML.

I decided to declare an intermediary class. The implementation is shown in listing 3. The FigureToken class contains the set of necessary methods. The start method looks for the text given by the cls.pattern attribute. The read method skips the first line and then collects text until it finds an empty line. The class is effectively abstract, because it does not define cls.pattern. I define three concrete child classes that define cls.pattern: ImageToken, VideoToken and CodeToken.

The __init__ method of FigureToken is called right after start finds the beginning of the block and read extracts it. __init__ accepts one argument, the list returned by the read static method. The list is joined and parsed using pyyaml.


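import yaml
from mistletoe import block_token

class FigureToken(block_token.BlockToken):

    def __init__(self, lines):
        # Lines returned by read() are parsed as a YAML dictionary.
        self.content = yaml.safe_load("\n".join(lines))

    @classmethod
    def start(cls, line):
        # The block starts with a line containing only the marker.
        return line == f'{cls.pattern}\n'

    @staticmethod
    def read(lines):
        # Skip the marker line, then read until the first empty line.
        next(lines)
        return block_token.BlockToken.read(lines)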

class ImageToken(FigureToken):
    pattern = "!image"

class VideoToken(FigureToken):
    pattern = "!video"

class CodeToken(FigureToken):
    pattern = "!code"

Listing 3. FigureToken implementation
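
To illustrate what read is expected to do, here is a minimal sketch that does not use mistletoe at all. The function read_block is a hypothetical stand-in, and a plain Python iterator plays the role of the line reader that the library passes in; the real BlockToken.read may differ in details.

```python
def read_block(lines):
    """Skip the marker line, then collect lines until the first empty one."""
    next(lines)  # drop the "!image" / "!video" / "!code" marker line
    body = []
    for line in lines:
        if line == "\n":  # an empty line ends the block
            break
        body.append(line)
    return body


source = iter([
    "!image\n",
    "id: fig2\n",
    "link: image.png\n",
    "\n",
    "Next paragraph.\n",
])

block = read_block(source)  # YAML lines that belong to the block
rest = list(source)         # whatever is left for the next tokens
print(block, rest)
```

The iterator is consumed in place, so everything after the empty line stays available for the tokens that follow.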

Token rendering

New token parsers need to be registered in the renderer. The HTML renderer base class is called HTMLRenderer.

Extending the renderer base class involves calling the base class __init__ method with the classes that should be treated as token parsers, here ImageToken, VideoToken and CodeToken.

Next I need to add token renderer methods, following the proper naming convention. According to the library documentation, a token renderer method should be named after the token parser class, but in lower case, with camel case replaced by _ characters, and prefixed with render_. For example, the renderer method for ImageToken should be named render_image_token.

I declared the ProjectRenderer class, shown in listing 4, which follows the described interface. The __init__ method calls the base class __init__ with the additional token parsers. It also resets the enumeration counters and sets the output language. The renderer object stores an index list as self.index_list. The index list collects (id, index) pairs during rendering. The list is used in post-processing to replace # characters in links.
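
The naming rule can be sketched in a few lines of plain Python. The function renderer_method_name is hypothetical and only mirrors the convention described above; it is not the code mistletoe uses internally.

```python
import re


def renderer_method_name(token_class_name):
    """Convert a CamelCase token class name to a render_snake_case name."""
    # Insert "_" before every capital letter except the first one.
    snake = re.sub(r"(?<!^)(?=[A-Z])", "_", token_class_name).lower()
    return f"render_{snake}"


for name in ("ImageToken", "VideoToken", "CodeToken"):
    print(name, "->", renderer_method_name(name))
```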


from mistletoe import block_token, html_renderer

class ProjectRenderer(html_renderer.HTMLRenderer):

    def __init__(self, source_path, language):
        super().__init__(VideoToken, ImageToken, CodeToken)
        self.source_path = source_path
        self.figure_index = 0
        self.video_index = 0
        self.listing_index = 0
        self.language = language
        self.index_list = []

    def render_image_token(self, token):
        pass

    def render_video_token(self, token):
        pass

    def render_code_token(self, token):
        pass

Listing 4. ProjectRenderer implementation

ProjectRenderer.render_image_token is responsible for generating HTML for every ImageToken object. The input for this method is the parsed YAML dictionary loaded by ImageToken. The implementation is shown in listing 5. The method increments the figure index and adds the id to the index list. Based on figure_template, the method generates HTML.


class ProjectRenderer(html_renderer.HTMLRenderer):

    def render_image_token(self, token):
        content = token.content

        self.figure_index += 1
        self.index_list.append((content["id"], self.figure_index))

        return figure_template.format(
            id=content["id"],
            link=content["link"],
            caption=content["caption"],
            name=figure_names[self.language],
            index=self.figure_index
        )

Listing 5. ProjectRenderer.render_image_token implementation

ProjectRenderer.render_video_token is responsible for generating HTML for every VideoToken object. The implementation is shown in listing 6. It is almost the same as render_image_token, although the HTML template is a bit more complex.


class ProjectRenderer(html_renderer.HTMLRenderer):

    def render_video_token(self, token):
        content = token.content

        self.video_index += 1
        self.index_list.append((content["id"], self.video_index))

        return video_template.format(
            id=content["id"],
            link=content["link"],
            format=content["format"],
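            # "attributes" is optional, so fall back to an empty dictionary.
            autoplay=(content.get("attributes") or {}).get("autoplay", "on"),
            muted=(content.get("attributes") or {}).get("muted", "on"),
            loop=(content.get("attributes") or {}).get("loop", "on"),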
            caption=content["caption"],
            name=video_names[self.language],
            index=self.video_index
        )

Listing 6. ProjectRenderer.render_video_token implementation
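
The video template itself is not shown here. As a rough sketch of one way such attributes could end up in HTML, the hypothetical helper below turns an attribute dictionary into a string of boolean HTML attributes. It assumes that values arrive either as Python booleans (PyYAML loads an unquoted on as True) or as the string "on"; the actual video_template may handle this differently.

```python
def render_boolean_attributes(attributes):
    """Return e.g. 'autoplay muted' for the attributes that are switched on."""
    # Assumption: True or the string "on" both mean "enabled".
    enabled = [name for name, value in attributes.items() if value in (True, "on")]
    return " ".join(enabled)


attrs = render_boolean_attributes({"autoplay": True, "muted": "on", "loop": False})
video_tag = f'<video {attrs}><source src="video.webm" type="video/webm" /></video>'
print(video_tag)
```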

ProjectRenderer.render_code_token is responsible for generating HTML for every CodeToken object. The implementation is shown in listing 7. The method is a bit more complex than the previous ones. It loads the file content using pathlib and filters it using self.extract_selected_lines.


class ProjectRenderer(html_renderer.HTMLRenderer):

    def render_code_token(self, token):
        content = token.content

        self.listing_index += 1
        self.index_list.append((content["id"], self.listing_index))

        code_path = self.source_path.parent / content["file"]
        code = code_path.read_text().split("\n")
        code = self.extract_selected_lines(code, content["lines"])

        return listing_template.format(
            id=content["id"],
            code=code,
            caption=content["caption"],
            name=listing_names[self.language],
            index=self.listing_index
        )

Listing 7. ProjectRenderer.render_code_token implementation

I want a simple feature that lets the user select which lines are loaded from the source file. The implementation is shown in listing 8. Each list element can be a single line or a range in start-end format. I go through the list elements in a loop and check whether each one is a single line or a range. Selected lines are added to chunks. I call html.escape to escape characters that have special meaning in HTML, such as < or >.


class ProjectRenderer(html_renderer.HTMLRenderer):

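    def extract_selected_lines(self, code, lines_range):
        # YAML loads a single line number (e.g. "lines: 3") as int,
        # so convert to str before splitting.
        lines_range = str(lines_range).split(",")
        chunks = []
        for element in lines_range:
            if element.find("-") > 0:
                # Range "start-end", 1-based and inclusive.
                element_range = [int(x)-1 for x in element.split("-")]
                chunks.extend(code[element_range[0]:element_range[1]+1])
            else:
                # Single 1-based line number.
                chunks.append(code[int(element)-1])

        # Escape characters with special meaning in HTML, like < or >.
        return html.escape("\n".join(chunks))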

Listing 8. ProjectRenderer.extract_selected_lines implementation
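
The line selection can be tried outside the renderer. The sketch below is a standalone variant of the same idea; the function name select_lines and the sample data are made up for this example.

```python
import html


def select_lines(code, spec):
    """Pick 1-based lines from `code` using a spec such as '1,4-5,8-10'."""
    chunks = []
    for element in str(spec).split(","):
        if "-" in element:
            # Inclusive range "start-end".
            start, end = (int(x) for x in element.split("-"))
            chunks.extend(code[start - 1:end])
        else:
            # Single line number.
            chunks.append(code[int(element) - 1])
    # Escape <, > and & so the code can be embedded in HTML.
    return html.escape("\n".join(chunks))


sample = [f"line {n}" for n in range(1, 11)]
selected = select_lines(sample, "1,4-5,8-10")
print(selected)
```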

Extend existing implementation

The last thing left is to plug the new ProjectRenderer into the existing rendering procedure. Changes in the render_content function are shown in listing 9. Instead of directly calling mistletoe.markdown(), I create a mistletoe.Document object. This object is passed to the render method of ProjectRenderer. The final HTML is passed to ProjectRenderer.enumerate_links.


def render_content(output_path, source_path, metadata, template):
    '''Render content basing on template'''
    elements = {
        '<!-- DATE -->': metadata["date"],
        '<!-- TYPE -->': metadata["type"],
        '<!-- TITLE -->': metadata["title"],
        '<!-- LANGUAGE -->': metadata["language"],
        '<!-- STAGE -->': metadata["stage"],
        '<!-- TAGS -->': metadata["tags"],
    }

    with open(source_path, 'r') as source:
        with ProjectRenderer(source_path, metadata["language"]) as renderer:
            document = mistletoe.Document(source)
            rendered = renderer.render(document)
            rendered = renderer.enumerate_links(rendered)
            elements['<!-- CONTENT -->'] = rendered

    elements = {re.escape(k): v for k, v in elements.items()}
    pattern = re.compile("|".join(elements.keys()))

    output_path.write_text(pattern.sub(lambda x: elements[re.escape(x.group(0))], template))

Listing 9. render_content method changes
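
The placeholder substitution at the end of render_content can be shown in isolation. The sketch below uses a hypothetical fill_template function and a made-up template; it builds one alternation regex from the escaped keys and replaces all placeholders in a single pass.

```python
import re


def fill_template(template, elements):
    """Replace every placeholder key found in `template` in a single pass."""
    # Escape the keys so they are matched literally, then join with "|".
    pattern = re.compile("|".join(re.escape(key) for key in elements))
    # A function as replacement avoids backslash processing of the values.
    return pattern.sub(lambda match: elements[match.group(0)], template)


page = fill_template(
    "<h1><!-- TITLE --></h1><main><!-- CONTENT --></main>",
    {"<!-- TITLE -->": "Figures", "<!-- CONTENT -->": "<p>Hello</p>"},
)
print(page)
```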

The enumerate_links method is shown in listing 10. The method is responsible for replacing # characters with the actual numbers. This could be achieved by extending the renderer. Nevertheless, I decided to manipulate the output HTML.

First, I look for links using the <a href="#link">[\w\s#]+</a> regex. Second, I assume that nobody will put a similar link in the HTML with ill intent. It is a bit optimistic. Third, I assume a space before the # character. It is even more optimistic. I create a list of pairs (existing link, edited link). Using this list I replace all found links in the HTML.

One remark: I had to split the regex into two parts, because the format string did not mix well with the regex string in Python.


class ProjectRenderer(html_renderer.HTMLRenderer):

    def enumerate_links(self, document):
        replace_list = []
        for element in self.index_list:
            pattern = f"<a href=\"#{element[0]}\">" + "[\\w\\s#]+</a>"
            source = re.findall(pattern, document)
            for s in source:
                replace_list.append((s, s.replace(" #", f" {element[1]}")))

        for source, destination in replace_list:
            document = document.replace(source, destination)

        return document

Listing 10. ProjectRenderer.enumerate_links implementation
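
The post-processing step can also be checked on a small piece of HTML. The sketch below is a standalone function with the same intent; it takes the index list as an argument and uses re.sub with a replacement function, which is a variation on the approach described above rather than the code from the commits. The sample HTML and ids are made up.

```python
import re


def enumerate_links(document, index_list):
    """Replace ' #' in link captions with the number assigned to each id."""
    for element_id, index in index_list:
        # Match only links that point at this id.
        pattern = '<a href="#' + re.escape(element_id) + r'">[\w\s#]+</a>'
        document = re.sub(
            pattern,
            lambda match: match.group(0).replace(" #", f" {index}"),
            document,
        )
    return document


html_in = '<p>See <a href="#lis2">listing #</a> and <a href="#fig1">figure #</a>.</p>'
html_out = enumerate_links(html_in, [("fig1", 1), ("lis2", 2)])
print(html_out)
```

The # inside href is preceded by a quote, not a space, so only the caption is changed.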

The entire implementation can be found in the following commits:

9dd2e3f4 - main part of implementation described in this article
9301368e - small fixes that I added while writing this article

Summary

At first glance the project works exactly as it did before, so it is a win. I do not have a single unit test, so I have no idea whether my code works in every intended scenario. I have to live with that for now.

Figures and listings