The general layout of article on mucka.pro is based on technical article style I am familiar with. After writing several drafts of articles, I noticed recurring problem. I wanted figures and listings to have a number and a caption. Markdown do not support my idea of figures, neither does mistletoe. I need to manually wrap any figure or listing in HTML, so they can be properly rendered. It is quite tedious, especially when editing or making corrections. I decided to help myself and add new feature to generator.
Idea
I would like an additional function block in markdown, which will be used to embed figure or listing. Function block should be flexibly parametrised to avoid backward incompatible changes in the near future.
Assumptions
Function block should link to external file with image or code. Function block should me marked with text identifier, which will allow linking. Function block should include caption. Function block should handle enumeration automatically. Function block should contain content format, even if not used for now.
Source code function block should indicate language of code snippet. Source code function block should allow to select lines from source file.
Video function block should allow to set HTML attributes like
autoplay
or loop
.
Design
Function block starts with !
marker, followed by block type:
image
/video
/code
.
Rest of function block is yaml
dictionary.
Function block is separated from text with new lines.
Function block examples are shown on listing 1.
!image
id: fig2
format: png
link: image.png
caption: Lorem ipsum
!video
id: fig1
format: webm
attributes:
autoplay: on
muted: on
loop: on
link: video.webm
caption: Lorem ipsum
!code
id: lis2
language: python
file: code.py
lines: 1,4-5,8-10
caption: Imsum lorem
Graphic function block contain file name, format and caption. Video function block is extended with special attributes. Source code function block is extended with list of lines.
Standard link syntax [listing #](#lis2)
should be used with
function block.
#
in link caption will be replaced with actual enumeration.
Link reference e.g. #lis2
should match id
parameter.
Implementation
I want to replicate HTML that I explicitly used before.
To achieve this I will use simple parametrised template with
string.format()
.
I will try to make template rendering multi language.
I am using latin language (la
) as test environment for new
features.
HTML template example shown on listing 2.
figure_names = {
"la" : "Figure",
"en" : "Figure",
"pl" : "Rysunek",
}
figure_template = """
<figure id="{id}">
<img src="{link}" />
<figcaption>{name} {index}. {caption}</figcaption>
</figure>
"""
Token parsing
Library mistletoe, which I currently use for rendering, have simple but little bit tricky extension interface. I need to implement new token parser and extend existing renderer.
Firstly I need to add class for new token parser.
Block token base class is named BlockToken
.
I need to overload two functions: start
and read
.
Both should be overloaded as @classmethod
or @staticmethod
.
start
function should accept a line, and return true
if this line
matches as first line of the block.
read
function should return block string extracted from text.
Existing implementation of read
extract text until it finds an
empty line.
It is almost perfect solution.
I want to extract text until finding empty line, but in addition to
that I want to ignore first line.
First line is just a marker, rest of block is yaml
.
I decided to declare intermediary class.
Implementation is shown on listing 3.
FigureToken
class contain set of necessary methods.
Method start
looks for text indicated by cls.pattern
parameter.
Method read
ignores first line and then cut text until it finds an
empty line.
Class is technically virtual, it does not have cls.pattern
defined.
I define three concrete child classes that defines cls.pattern
:
ImageToken
, VideoToken
and CodeToken
.
Method __init__
of FigureToken
is called just after method
start
finds beginning of the block and read
cuts the block.
Method __init__
accepts one argument, the list returned by read
static method.
The returned list is parsed using pyyaml.
class FigureToken(block_token.BlockToken):
def __init__(self, lines):
self.content = yaml.safe_load("\n".join(lines))
@classmethod
def start(cls, line):
return line == f'{cls.pattern}\n'
@staticmethod
def read(lines):
next(lines)
return block_token.BlockToken.read(lines)
class ImageToken(FigureToken):
pattern = "!image"
class VideoToken(FigureToken):
pattern = "!video"
class CodeToken(FigureToken):
pattern = "!code"
FigureToken
implementationToken rendering
New token parsers need to be embedded into renderer.
HTML renderer base class is called HTMLRenderer
.
Extending renderer base class involves calling base class __init__
method, with a list of classes, that should be considered as token
parsers. Here ImageToken
, VideoToken
and CodeToken
.
Next I need to add token renderer method, maintaining proper method
naming.
According to library documentation, token renderer method should be
called same as token parser class, but lower case, with camel case
replaced with _
characters.
For example ImageToken
renderer method should be named
render_image_token
.
I declared ProjectRenderer
class shown on
listing 4, that follows described interface.
Method __init__
calls base class __init__
with a list of
additional token parsers.
It also resets enumerators counter and sets output language.
Renderer object will store index list as self.index_list
.
Index list will store pairs (id
, index
) during rendering.
List will be used in post processing to replace #
characters in
links.
from mistletoe import block_token, html_renderer
class ProjectRenderer(html_renderer.HTMLRenderer):
def __init__(self, source_path, language):
super().__init__(VideoToken, ImageToken, CodeToken)
self.source_path = source_path
self.figure_index = 0
self.video_index = 0
self.listing_index = 0
self.language = language
self.index_list = []
def render_image_token(self, token):
pass
def render_video_token(self, token):
pass
def render_code_token(self, token):
pass
ProjectRenderer
implementationProjectRenderer.render_image_token
is responsible for generating
HTML for every ImageToken
object.
Input for this method is parsed yaml dictionary loaded by
ImageToken
.
Implementation is shown on listing 5.
Method increments figure index and adds id
to index list.
Basing on figure_template
method generates HTML.
class ProjectRenderer(html_renderer.HTMLRenderer):
def render_image_token(self, token):
content = token.content
self.figure_index += 1
self.index_list.append((content["id"], self.figure_index))
return figure_template.format(
id=content["id"],
link=content["link"],
caption=content["caption"],
name=figure_names[self.language],
index=self.figure_index
)
ProjectRenderer.render_image_token
implementationProjectRenderer.render_video_token
is responsible for generating
HTML for every VideoToken
object.
Implementation is shown on listing 6.
It is almost the same as render_image_token
.
The HTML template is a bit more complex though.
class ProjectRenderer(html_renderer.HTMLRenderer):
def render_video_token(self, token):
content = token.content
self.video_index += 1
self.index_list.append((content["id"], self.video_index))
return video_template.format(
id=content["id"],
link=content["link"],
format=content["format"],
autoplay=content["attributes"].get("autoplay", "on"),
muted=content["attributes"].get("muted", "on"),
loop=content["attributes"].get("loop", "on"),
caption=content["caption"],
name=video_names[self.language],
index=self.video_index
)
ProjectRenderer.render_video_token
implementationProjectRenderer.render_code_token
is responsible for generating
HTML for every CodeToken
object.
Method is a bit more complex than previous ones.
It loads file content using pathlib
and filter content using
self.extract_selected_lines
.
class ProjectRenderer(html_renderer.HTMLRenderer):
def render_code_token(self, token):
content = token.content
self.listing_index += 1
self.index_list.append((content["id"], self.listing_index))
code_path = self.source_path.parent / content["file"]
code = code_path.read_text().split("\n")
code = self.extract_selected_lines(code, content["lines"])
return listing_template.format(
id=content["id"],
code=code,
caption=content["caption"],
name=listing_names[self.language],
index=self.listing_index
)
ProjectRenderer.render_code_token
implementationI want simple feature, where user can select list of lines which will
be loaded from source file.
Implementation is shown on listing 8.
List element can be single line, or range in start-end
format.
I check every list element in a loop, and check if it is a single
line or range.
Selected lines are added to chunk
.
I execute html.escape
method to remove characters that are not
allowed in HTML document like <
or >
.
class ProjectRenderer(html_renderer.HTMLRenderer):
def extract_selected_lines(self, code, lines_range):
lines_range = lines_range.split(",")
chunks = []
for element in lines_range:
if element.find("-") > 0:
element_range = [int(x)-1 for x in element.split("-")]
chunks.extend(code[element_range[0]:element_range[1]+1])
else:
chunks.append(code[int(element)-1])
return html.escape("\n".join(chunks))
ProjectRenderer.extract_selected_lines
implementationExtend existing implementation
Last thing that is left is to extend existing rendering procedure
with new ProjectRenderer
.
Changes in render_content
method are shown
on listing 9.
Instead of directly calling mistletoe.mistletoe()
I create
mistletoe.Document
object.
This object is passed to render
method of ProjectRenderer
.
Final HTML is passed to ProjectRenderer.enumerate_links
.
def render_content(output_path, source_path, metadata, template):
'''Render content basing on template'''
elements = {
'<!-- DATE -->': metadata["date"],
'<!-- TYPE -->': metadata["type"],
'<!-- TITLE -->': metadata["title"],
'<!-- LANGUAGE -->': metadata["language"],
'<!-- STAGE -->': metadata["stage"],
'<!-- TAGS -->': metadata["tags"],
}
with open(source_path, 'r') as source:
with ProjectRenderer(source_path, metadata["language"]) as renderer:
document = mistletoe.Document(source)
rendered = renderer.render(document)
rendered = renderer.enumerate_links(rendered)
elements['<!-- CONTENT -->'] = rendered
elements = {re.escape(k): v for k, v in elements.items()}
pattern = re.compile("|".join(elements.keys()))
output_path.write_text(pattern.sub(lambda x: elements[re.escape(x.group(0))], template))
render_content
method changesenumerate_links
method is shown on listing 10.
Method is responsible for replacing #
characters with actual
enumeration.
It might be achieved by extending used renderer.
Nevertheless I decided to manipulate output HTML.
Firstly I look for links using <a href=\"#link\">[\\w\\s#]+</a>
regex.
Secondly I assume that nobody will put similar link in HTML with ill
intent.
It is a bit opportunistic.
Thirdly I assume space before #
character.
It is even more opportunistic.
I create a list of pairs (existing link, edited link).
Using this list I replace all found links in HTML.
One remark, I had to split regex into two parts because format string is not compatible with regex string in python.
class ProjectRenderer(html_renderer.HTMLRenderer):
def enumerate_links(self, document):
replace_list = []
for element in self.index_list:
pattern = f"<a href=\"#{element[0]}\">" + "[\\w\\s#]+</a>"
source = re.findall(pattern, document)
for s in source:
replace_list.append((s, s.replace(" #", f" {element[1]}")))
for source, destination in replace_list:
document = document.replace(source, destination)
return document
ProjectRenderer.enumerate_links
implementationEntire implementation can be found in following commits:
- 9dd2e3f4 - main part of implementation described in this article
- 9301368e - small fixes that I added while writing this article
Summary
At first glance project works exactly like it worked before, so it is a win. I do not have a single unit test, so I have no idea if my code works in every intended scenario. But I need to live with it for now.