© Thomas Mailund 2019
T. Mailund, Introducing Markdown and Pandoc, https://doi.org/10.1007/978-1-4842-5149-2_11

11. Filters

Thomas Mailund, Aarhus N, Denmark

Filters let you manipulate your documents similarly to preprocessors. Unlike preprocessors, they do not modify the text before Pandoc gets hold of it, but instead, they are plugged into the text transformation that Pandoc does. Think of them as post-processors; it is not far from the truth except that Pandoc will run after they are done.

Both preprocessors and filters have strengths and weaknesses. They can do the same things to your files, but some things are easier to program in a preprocessor, while some things are easier to program in a filter.

Pandoc can read from standard input and write to standard output (it does this by default), and you can control the input and output formats using the --from and --to options. As with any pipeline, you can connect multiple programs, so instead of manipulating the input to Pandoc, as with a preprocessor, you can read the output from Pandoc, transform it through as many steps as you like, and pipe it back into Pandoc for the final formatting (see Figure 11-1).
Figure 11-1. Document formatting pipeline with filters

You can string together any number of programs this way, as long as the output format of one matches the input format of the next, but what Pandoc thinks of as filters, and what you can add using the --filter or -F options, should read and write in the JSON format. Consequently, any program that can read and write JSON can be used as a filter. You don’t want to parse JSON yourself, though, if you can avoid it, so to write filters, you will want to use a software package, module, or library that helps you rewrite the input.

There is support for filters in many languages, and some languages have more than one package for them; see a list at https://tinyurl.com/y2fsn89s. I am most familiar with Python, so in the following examples, I will use that language. The package I will use is panflute, which is my favourite for writing Pandoc filters. To install panflute, you can run
$ pip3 install panflute

The JSON representation of a document can be thought of as a tree. A document contains paragraphs, paragraphs contain words and spaces, some words are emphasized, and so on. What the Pandoc filter packages typically will do is that they will traverse this tree structure and apply to each node a function that you provide. This function can leave the local tree structure alone, or it can return a modified tree. The output of the filter will be the input tree with your modifications.
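To make the idea concrete, here is a minimal sketch of a filter body written with nothing but Python's standard library. It assumes the JSON layout that Pandoc emits with --to json, where every node is an object with a type "t" and, usually, contents "c". The names transform and main are made up for this illustration, and in practice you would lean on a filter package rather than walk the JSON by hand.

```python
import json
import sys

def transform(node):
    """Return a copy of the tree in which every Str node is upper-cased."""
    if isinstance(node, list):
        return [transform(child) for child in node]
    if isinstance(node, dict):
        if node.get("t") == "Str":
            return {"t": "Str", "c": node["c"].upper()}
        return {key: transform(value) for key, value in node.items()}
    # Numbers, strings, and booleans are leaves; leave them alone
    return node

def main():
    """Read a JSON document from standard input and write the result."""
    doc = json.load(sys.stdin)
    json.dump(transform(doc), sys.stdout)

# A tiny in-memory example instead of real Pandoc output
example = {"t": "Para", "c": [{"t": "Str", "c": "hello"}, {"t": "Space"}]}
print(json.dumps(transform(example)))
```

Calling main() at the bottom of such a script would turn it into a program that reads JSON on standard input and writes JSON on standard output, which is the contract a filter has to honour.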

You can use pandoc <inputfile> --to json to see the JSON representation of a file, but I find it easier to see the structure using pandoc <inputfile> --to native. The native format is a format used internally by Pandoc and is what you will traverse over with a filter. I suggest using it to recognize the structure a document will have, what to match on to recognize what you wish to rewrite, and what the rewritten structure should be.

Take a document like this:
# This is a level one header
This is a paragraph
The JSON format of this document is this:
{"blocks":[
    {"t":"Header",
        "c":[
            1,
            ["this-is-a-level-one-header",[],[]],
            [{"t":"Str","c":"This"},
                {"t":"Space"},
                {"t":"Str","c":"is"},
                {"t":"Space"},
                {"t":"Str","c":"a"},
                {"t":"Space"},
                {"t":"Str","c":"level"},
                {"t":"Space"},
                {"t":"Str","c":"one"},
                {"t":"Space"},
                {"t":"Str","c":"header"}
    ]]},
    {"t":"Para",
        "c":[{"t":"Str","c":"This"},
            {"t":"Space"},
            {"t":"Str","c":"is"},
            {"t":"Space"},
            {"t":"Str","c":"a"},
            {"t":"Space"},
            {"t":"Str","c":"paragraph"}]}
],
    "pandoc-api-version":[1,17,5,4],
    "meta":{}
}
It is a bit verbose, which is why I prefer the native format:
Pandoc (Meta {unMeta = fromList []})
[Header 1 ("this-is-a-level-one-header",[],[])
    [Str "This",Space,
     Str "is",Space,
     Str "a",Space,
     Str "level",Space,
     Str "one",Space,
     Str "header"],
 Para
    [Str "This",Space,
     Str "is",Space,
     Str "a",Space,
     Str "paragraph"]
]

Except for the API version, which we are not concerned with for our filters, the two formats contain precisely the same information (obviously since it is JSON that is used between filters and a filter pipeline usually begins and ends with Pandoc).

The document has a metadata header (it is empty in this document) and then a list of document nodes. There are two top-level nodes, the header and the paragraph. The header has level 1 and then a triplet of extra information. The first element in the triplet is its identifier; it is used for hyperlinks when formatted as HTML and LaTeX. The next two elements in the triplet are classes and options. Inside the header is a list of strings and spaces. It is the text in the document. We will use identifiers, classes, and options in the following examples. The paragraph contains a list of strings and spaces.
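The layout just described can be read directly off the JSON. The following sketch assumes the Header node has the shape shown in the JSON listing (level, then the identifier/classes/options triplet, then the inline text); the helper names inline_text and describe_header are invented for the example and are not part of any package.

```python
def inline_text(inlines):
    """Join Str and Space nodes into plain text; other inline types are skipped."""
    parts = []
    for node in inlines:
        if node["t"] == "Str":
            parts.append(node["c"])
        elif node["t"] in ("Space", "SoftBreak"):
            parts.append(" ")
    return "".join(parts)

def describe_header(block):
    """Unpack a Header node into its level, triplet, and text."""
    level, (identifier, classes, options), inlines = block["c"]
    return {
        "level": level,
        "identifier": identifier,
        "classes": classes,
        "options": options,
        "text": inline_text(inlines),
    }

header = {
    "t": "Header",
    "c": [
        1,
        ["this-is-a-level-one-header", [], []],
        [{"t": "Str", "c": "This"}, {"t": "Space"},
         {"t": "Str", "c": "is"}, {"t": "Space"},
         {"t": "Str", "c": "a"}, {"t": "Space"},
         {"t": "Str", "c": "header"}],
    ],
}
print(describe_header(header))
```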

The hierarchy in this document is not deep. We have sequences of strings and spaces nested in the header and the paragraph. They can get deeper but usually not much. Consider this example:
---
oh: my
---
This is *very* important
If we get its structure, we get a string inside an emphasis inside a paragraph, but the depth is still small.
Pandoc
(Meta
  {unMeta = fromList
       [("oh",MetaInlines [Str "my"])]}
)
[Para
    [Str "This",Space,
     Str "is",Space,
     Emph [Str "very"],
     Space,
     Str "important"]
]

Here you also see an example of metadata. It looks more complicated than what we would expect from a simple map from oh to my, but it is what it is. In the following Python code, the panflute model will translate it into a simple map from keys to values.
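To see why a translation step is needed at all, here is a rough sketch of how tagged metadata could be flattened into a plain map. It assumes the JSON form of the metadata uses the same "t"/"c" tagging as the rest of the document (MetaInlines, MetaString, MetaBool, MetaList, MetaMap); the functions simplify and simplify_meta are hypothetical and only approximate what a filter package does for you.

```python
def simplify(value):
    """Turn one tagged metadata value into a plain Python value."""
    kind, content = value["t"], value.get("c")
    if kind in ("MetaString", "MetaBool"):
        return content
    if kind == "MetaInlines":
        # Crude flattening: keep Str text, treat anything else as a space
        return "".join(n["c"] if n["t"] == "Str" else " " for n in content)
    if kind == "MetaList":
        return [simplify(item) for item in content]
    if kind == "MetaMap":
        return {key: simplify(item) for key, item in content.items()}
    raise ValueError("unhandled metadata type: " + kind)

def simplify_meta(meta):
    """Flatten a whole metadata object into a key-to-value map."""
    return {key: simplify(value) for key, value in meta.items()}

meta = {
    "oh": {"t": "MetaInlines", "c": [{"t": "Str", "c": "my"}]},
    "solutions": {"t": "MetaBool", "c": True},
}
print(simplify_meta(meta))
```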

To write a filter, try to make a small Markdown file containing input that you expect to transform and the output you want it to become. Then run the example through Pandoc to see what the native structure is. From that, and a package for Pandoc filters, you should be able to get what you want.

Exploring Panflute

Before we see concrete examples of filters, let us explore how panflute lets us traverse a document. As an example I will use this script:
import sys
from panflute import *
def print_structure(elem, doc):
    if type(elem) == Header:
        print("identifier:", elem.identifier,
              file = sys.stderr)
        print("classes:", elem.classes,
              file = sys.stderr)
        print("attributes:", elem.attributes,
              file = sys.stderr)
run_filter(print_structure)
The run_filter function traverses the entire tree structure, depth-first, and calls a function we provide it; here that is print_structure. I have written a function that lets me show some of the properties of the header in the previous example. Let me add a few more properties to the header and process the document:
# This is my header {#header-id
    .class1 .class2
    foo=bar baz=qux}
This is *very* important

I have broken the header classes over multiple lines; Pandoc doesn’t mind.

In the print_structure function, I ignore all document elements that are not Header, simply because I only do anything when the element is a header. I print the identifier, the classes, and the attributes of the header. I print them to standard error; if I printed them to standard out, I would mess up the JSON output that makes a filter work.

The text I print to standard error looks like this:
identifier: header-id
classes: ['class1', 'class2']
attributes: OrderedDict([('foo', 'bar'),
                         ('baz', 'qux')])
You can recognize the properties from the header.
# This is my header {#header-id
    .class1 .class2
    foo=bar baz=qux}

The identifier is first in the curly brackets, and it starts with #. The classes begin with a dot and otherwise stand alone, and the attributes are key-value mappings. The different document elements have different properties, but panflute is well documented, and you can look up the properties each document element has.
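The three kinds of properties can be told apart by their first character alone. As an illustration only, here is a simplified parser for such an attribute string; parse_attributes is a made-up helper, it ignores quoted values and other corner cases, and it is not how Pandoc itself parses attributes.

```python
def parse_attributes(text):
    """Split '{#id .class key=value ...}' into identifier, classes, attributes."""
    identifier, classes, attributes = "", [], {}
    for token in text.strip("{} \n").split():
        if token.startswith("#"):
            identifier = token[1:]
        elif token.startswith("."):
            classes.append(token[1:])
        elif "=" in token:
            key, value = token.split("=", 1)
            attributes[key] = value
    return identifier, classes, attributes

spec = """{#header-id
    .class1 .class2
    foo=bar baz=qux}"""
print(parse_attributes(spec))
```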

The print_structure function doesn’t explicitly return any values which means that it implicitly returns the None object. The panflute module will interpret that as saying that the function does not want to make any transformations but leave elem as it is. If we wanted to change anything, we must return an element to replace the input elem.
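The return-value convention is easiest to see in a toy model. The sketch below is not panflute code; it mimics, on a plain list of words, the rule that None keeps an element, a single value replaces it, and a list is spliced in (so an empty list deletes the element). The names apply_action and shout_or_drop are invented for the example.

```python
def apply_action(elements, action):
    """Apply action to each element, honouring the filter return conventions."""
    result = []
    for elem in elements:
        answer = action(elem)
        if answer is None:
            # No return value: keep the element as it is
            result.append(elem)
        elif isinstance(answer, list):
            # A list is spliced in; an empty list removes the element
            result.extend(answer)
        else:
            # Anything else replaces the element
            result.append(answer)
    return result

def shout_or_drop(word):
    if word == "very":
        return []
    if word == "important":
        return word.upper()
    # Falling off the end returns None, so other words are kept

print(apply_action(["This", "is", "very", "important"], shout_or_drop))
```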

If you translate the preceding document into HTML, you can see how the various components are used. The identifier becomes the id in the header tag, the classes become classes, and the attributes become data attributes.
<h1 id="header-id"
    class="class1 class2"
    data-foo="bar"
    data-baz="qux">
  This is my header
</h1>
<p>This is <em>very</em> important</p>
Not all of the components are used in all output formats. In LaTeX, for example, only the identifier is used.
\hypertarget{header-id}{%
\section{This is my header}\label{header-id}}
This is \emph{very} important

That classes and attributes are not used much in the output here does not mean that they are useless. You just need to find a use for them yourself. We can abuse them for our own nefarious purposes in our own filters.

Conditional Inclusion of Exercise Solutions

Consider this example from the previous chapter: we have text with exercises and solutions, and we want to compile it into documents where the solutions have been removed and documents where they have not. The preprocessing solution is excellent, but now we can see how we can achieve the same thing using a filter.

Let this be the input format:
Here is an exercise. Everyone can see it.
::: SOLUTION :::
Here is a solution to the exercise.
Do not give it to the students.
:::

The ::: SOLUTION ::: syntax is not one we have seen before because it is relatively rare. It creates a div tag in HTML (you can use it with a CSS file for formatting), and it adds a hyper reference target in LaTeX. We want neither of those, but we can use a Div structure in our filter.

You can see what the structure looks like by running this:
pandoc solutions.md -s --to native
where I assume that the Markdown is in the file solutions.md.
Pandoc (Meta {unMeta = fromList []})
[Para [
    Str "Here",Space,
    Str "is",Space,
    Str "an",Space,
    Str "exercise.",Space,
    Str "Everyone",Space,
    Str "can",Space,
    Str "see",Space,
    Str "it."
],
Div ("",["SOLUTION"],[])
 [Para [
    Str "Here",Space,
    Str "is",Space,
    Str "a",Space,
    Str "solution",Space,
    Str "to",Space,
    Str "the",Space,
    Str "exercise.",Space,
    Str "Do",Space,
    Str "not",Space,
    Str "give",Space,
    Str "it",Space,
    Str "to",SoftBreak,
    Str "the",Space,
    Str "students."]
 ]
]
The ::: SOLUTION ::: syntax sets the class of the Div structure to SOLUTION (the middle element in the Div object’s properties). Classes are always lists, so it really sets the classes to a list with a single element, which is SOLUTION. If you want to give the Div object more attributes, you can use an alternative syntax:
::: SOLUTION :::
Solution 1
:::
::: {#solution-2 .SOLUTION .advanced}
Solution 2
:::
::: {#solution-3 .SOLUTION level=difficult }
Solution 3
:::

The first solution here has the same syntax as before. It will set the class of the Div object to SOLUTION and leave the other two properties empty. The second solution sets an identifier and adds another class to the list. The third solution is back to a single class but adds one key-value mapping to the attributes.

If you use this filter
import sys
from panflute import *
def print_structure(elem, doc):
    if type(elem) == Div:
        print("id:", elem.identifier,
              file = sys.stderr)
        print("classes:", elem.classes,
              file = sys.stderr)
        print("attributes:", elem.attributes,
              file = sys.stderr)
run_filter(print_structure)
on the preceding Markdown file, you will get this output:
id:
classes: ['SOLUTION']
attributes: OrderedDict()
id: solution-2
classes: ['SOLUTION', 'advanced']
attributes: OrderedDict()
id: solution-3
classes: ['SOLUTION']
attributes: OrderedDict([('level', 'difficult')])

We can use this information to process the example. We want to include or exclude solutions based on metadata. We could not easily do that in the preprocessor, but there we could use preprocessor variables which are harder to do here. Each method has its pros and cons.

The preceding information tells us that 'SOLUTION' will be in the classes of the Div if we have a ::: SOLUTION ::: block (or, with the alternative syntax, a ::: {.SOLUTION} block). We want to check the metadata to see if we should include the solutions. We can get the meta information from the doc parameter that run_filter will give our filter function. If we call doc.get_metadata(), we will get a table from which we can get metavariables. We want our filter to remove Div objects that have SOLUTION as their class unless there is a metavariable called solutions and it is true.

This filter does that:
from panflute import *
def solution(elem, doc):
    if type(elem) == Div:
        if 'SOLUTION' not in elem.classes:
            # Return None to leave the node as it is
            return None
        meta = doc.get_metadata()
        if "solutions" not in meta:
            return Null
        if meta["solutions"] != True:
            return Null
        return None
run_filter(solution)

First, we check if 'SOLUTION' is in the classes. If not, then we don’t have a solution block and we leave the element alone by returning None. We could also have returned elem; it would make no difference. Otherwise, we get hold of the metadata. We check if "solutions" is in the metadata. If it is not in the metadata, then it definitely cannot be true, so we return Null as a replacement for elem in the output. Notice that it is Null and not None! The former is an element in Pandoc, while the latter is an element in Python; the former replaces elem with an empty block, effectively removing the solutions block, while the latter keeps elem as it is, that is, leaves the solutions block in the output. Finally, if "solutions" is in the metadata but not true, then we also remove the solutions block. The solutions metavariable can have more than true as a value, and those would be considered as true as well in the expression meta["solutions"], so I explicitly check for True.

If we do not get past the checks, we remove the solutions block: if "solutions" is not set in the metadata, or if it is set to anything but True, we return Null.

If we do get past all the checks, then we have a solutions block and "solutions" is set to true, so we are supposed to keep it, which we do by returning None.
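Because the decision depends only on the classes and the metadata, it can be pulled out into a small pure function and checked without running Pandoc at all. This is a sketch under that assumption; keep_block is a hypothetical helper, and meta stands for the plain dictionary of metavariables.

```python
def keep_block(classes, meta):
    """Return True if a Div with these classes should stay in the output."""
    if "SOLUTION" not in classes:
        # Not a solutions block: always keep it
        return True
    # Keep solutions only when the metavariable is the boolean True
    return meta.get("solutions") is True

cases = [
    (["note"], {}),
    (["SOLUTION"], {}),
    (["SOLUTION"], {"solutions": False}),
    (["SOLUTION"], {"solutions": "yes"}),
    (["SOLUTION", "advanced"], {"solutions": True}),
]
for classes, meta in cases:
    print(classes, meta, "->", keep_block(classes, meta))
```

Inside the filter, a False answer would correspond to replacing the element and a True answer to leaving it alone.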

If we go back to the example Markdown
Here is an exercise. Everyone can see it.
::: SOLUTION :::
Here is a solution to the exercise.
Do not give it to the students.
:::
then we can check how Pandoc reacts when the metavariable "solutions" is set to false:
pandoc --metadata solutions:false -F solutions.py \
          solutions.md --to markdown
This is the result:
Here is an exercise. Everyone can see it.
The solutions block is removed in the output. The same happens if you do not set the metavariable or if you set it to any other value, except true. If you do set it to true
pandoc --metadata solutions:true -F solutions.py \
          solutions.md --to markdown
the solution block is included.
Here is an exercise. Everyone can see it.
::: {.SOLUTION}
Here is a solution to the exercise.
Do not give it to the students.
:::

Pandoc uses this syntax in its output, but you can use either of the two ways to define a Div block.

Conditional Inclusions Based on Format

Sometimes we want a different text in the output conditional on the output format, for example, a different text for HTML and LaTeX.

You can insert raw text that is only included in the right output using text followed by {=format}. The delimiters for the text you write are slightly different depending on whether you want a block or inline text. For a block, you write
```{=html}
See examples
<ul>
  <li><a href="#ex:ex1">Exercise 1</a></li>
  <li><a href="#ex:ex2">Exercise 2</a></li>
</ul>
```
For inline text, you use backticks:
See examples
`<a href="#ex:ex1">Exercise 1</a> and
 <a href="#ex:ex2">Exercise 2</a>`{=html}
 `\cite{ex:ex1} and \cite{ex:ex2}`{=latex}.

In either case, the quoted text is only inserted if the output format matches. We saw this syntax back in Chapter 6.

The text we include here will not be processed by Pandoc, so you cannot use Pandoc’s features. That means that you cannot, for example, use the [text](link) syntax but must write the hyperlinks in HTML. We will write a filter that lets us use Pandoc’s syntax in format-specific text.

We will use two or more classes. The first tags the text as something that should only be output for some formats, and the others indicate which output formats we want the text for.

We have already seen how to add classes to a Div block of text:
:::{.out .html}
This is only included for HTML
:::
:::{.out .latex}
This is only included for LaTeX
:::
:::{.out .html .latex}
This is only included for HTML and LaTeX
:::
For inline text, you have to use square brackets:
[HTML only]{.out .html} [LaTeX only]{.out .latex}

For blocks of text, we have to capture Div objects, and for an inline text, we need to catch Span objects. In either case, we need to check if we have the out class. Otherwise, we leave the object alone—it could be used for something else in another filter. If we have the out class, we check if the output format is also a class. If not, we remove the object; if the format is a class, then we include it.

The filter looks like this:
from panflute import *
def format_include(elem, doc):
    if type(elem) == Span:
        if not "out" in elem.classes:
            return elem
        if doc.format not in elem.classes:
            return []
        else:
            return elem.content.list
    if type(elem) == Div:
        if not "out" in elem.classes:
            return elem
        if doc.format not in elem.classes:
            return Null
        else:
            return elem.content.list
run_filter(format_include)

When we return the elements we want to keep, we do not just return elem. We don’t necessarily want to have the Span and Div show up in the output. Instead, we get the object’s contents. The script wants this as a list, so we use elem.content.list.

If you use the filter on this Markdown:
[HTML only]{.out .html} [LaTeX only]{.out .latex}
:::{.out .html}
This is only included for HTML
:::
:::{.out .latex}
This is only included for LaTeX
:::
:::{.out .html .latex}
This is only included for HTML and LaTeX
:::
you will get this HTML output
<p>HTML only </p>
<p>This is only included for HTML</p>
<p>This is only included for HTML and LaTeX</p>
and this LaTeX output
LaTeX only
This is only included for LaTeX
This is only included for HTML and LaTeX

There is no formatting here because there is no formatting in the input.

If you use another output format, for example, Markdown, then you will not get any output with this input; all the text is only included for HTML and LaTeX.
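The outcome for every combination of block and output format can be tabulated with a small model of the filter's decision. The function decide below is a hypothetical stand-in that works on class lists only; "unwrap" means the contents are kept without the surrounding Div or Span.

```python
def decide(classes, output_format):
    """Model the filter's choice for one element and one output format."""
    if "out" not in classes:
        return "keep"      # not ours; leave it for other filters
    if output_format in classes:
        return "unwrap"    # keep the contents, drop the wrapper
    return "drop"          # tagged for other formats only

blocks = [
    ["out", "html"],
    ["out", "latex"],
    ["out", "html", "latex"],
]
for fmt in ("html", "latex", "markdown"):
    print(fmt, [decide(classes, fmt) for classes in blocks])
```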

Evaluating Code

Now take another example we considered in the previous chapter: running Python code while formatting a document and inserting the results.

In this version, we will use classes to distinguish between code blocks we want to evaluate and those we do not. We only consider Python code blocks, those whose classes contain "python", but to evaluate them, we will also require that they have the class "eval". For example, in the following Markdown, we will evaluate the second but not the first code block.
~~~{.python}
for i in range(10):
    print(i, end = "")
~~~
~~~{.python .eval}
for i in range(10):
    print(-i, end = "")
~~~

The filter is straightforward. Ignore for now the execute_code function we call; I list it in the following text, but it is not important how it works. Focus on eval_python. I realize that I have not been that inventive with the names, but execute_code runs the Python code in a code block, while eval_python is the filter.

Consider the filter, eval_python. Here, we only look at elements of type CodeBlock. When we have a CodeBlock, we get hold of the classes and check that both "python" and "eval" are in it. If they are not, we do not enter the inner if statement, so the function will, by default, return None, which leaves elem as it is. If we have the right classes, then we evaluate the Python code using the execute_code function, put the result in a new code block after elem, and return both.
import sys
from panflute import *
# definition of execute_code
def eval_python(elem, doc):
    if type(elem) == CodeBlock:
        classes = elem.classes
        code_body = elem.text
        if 'python' in classes and "eval" in classes:
            eval_res = execute_code(code_body)
            return [elem, CodeBlock(eval_res)]
run_filter(eval_python)

The execute_code function is slightly more complicated than the way we used exec in the preprocessor. It is simpler to use exec and let it print its output in the preprocessor compared to evaluating the code in a filter where we do not want any unwanted output.

If exec writes something to standard out, it will break the JSON format and this will break the rest of the pipeline.

Therefore, we need to capture the output of exec and then get hold of it again. Since the output of the code in exec gets sent to standard output, we need to change that into a file we can use, open that file when we execute code, close it again to flush it, open it, and read the result. It is not pretty, but it gets the job done, and you can do it like this:
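PYTHON_IO_FILE = "/tmp/eval-python-io"
real_stdout = sys.stdout
exec_env = {}
def execute_code(code):
    # Send everything the code prints to a file instead of standard out
    with open(PYTHON_IO_FILE, "w") as f:
        sys.stdout = f
        try:
            exec(code, exec_env)
        finally:
            # Restore the real standard out even if the code fails
            sys.stdout = real_stdout
    # The file is closed (and flushed) now; read back what was printed
    with open(PYTHON_IO_FILE, "r") as f:
        return f.read()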

You do not need to understand this part of the filter to understand how the filter itself works.
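If you would rather avoid the temporary file, the standard library offers another route. The sketch below is an alternative to the listing above, not the version used in the rest of the chapter: it captures the output in memory with contextlib.redirect_stdout and io.StringIO, and keeps the same shared exec_env so that definitions survive from one code block to the next.

```python
import contextlib
import io

# Shared environment, so names defined in one block are visible in the next
exec_env = {}

def execute_code(code):
    """Run code and return whatever it printed to standard output."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, exec_env)
    return buffer.getvalue()

first = execute_code("def foo():\n    print('hi', end='')\nfoo()")
second = execute_code("foo()")
print(repr(first), repr(second))
```

Standard output is restored automatically when the with block ends, even if the evaluated code raises an exception.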

We can run the filter and in this case get the result as Markdown:
pandoc -F eval-python.py eval-python.md --to markdown
The result is this:
``` {.python}
for i in range(10):
    print(i, end = "")
```
``` {.python .eval}
for i in range(10):
    print(-i, end = "")
```
    0-1-2-3-4-5-6-7-8-9

The output is not in a “backtick”-block but indented. This is just another way to write the same in Markdown.

If you use this document, you will see that we can define a function in one code block and use it in another:
~~~{.python}
for i in range(10):
    print(i, end = "")
~~~
~~~{.python .eval}
print("defining foo")
def foo():
    for i in range(10):
        print(-i, end = "")
foo()
~~~
~~~{.python .eval}
print("calling foo from a different block")
foo()
~~~
This is the output:
``` {.python}
for i in range(10):
    print(i, end = "")
```
``` {.python .eval}
print("defining foo")
def foo():
    for i in range(10):
        print(-i, end = "")
foo()
```
    defining foo
    0-1-2-3-4-5-6-7-8-9
``` {.python .eval}
print("calling foo from a different block")
foo()
```
    calling foo from a different block
    0-1-2-3-4-5-6-7-8-9

Numbering Exercises

As a final example, let us return to the exercise examples. This time, we are not concerned with including or excluding the solutions, but we want to put exercises in a LaTeX environment when the output is LaTeX and otherwise number them and add a header.

Consider an input like this:
::: Exercise :::
First exercise
:::
::: Exercise :::
Second exercise
:::

We have two Div blocks with class Exercise and some text within them. Those are the ones we want to modify. For HTML, say, we want to give them a header, and for LaTeX, we want to put them inside a LaTeX environment.

We can get the output format from the doc object. The input and output of filters are, as mentioned earlier, JSON, and if we just used shell pipes, we couldn’t know what the final output is. When we run a script as a filter, however, Pandoc knows what the final output will be, and we can get that information.

The first attempt at the filter looks like this:
 1   from panflute import *
 2
 3   no_exercise = 1
 4
 5   def number_exercises(elem, doc):
 6       global no_exercise
 7       if (type(elem) == Div and
 8               "Exercise" in elem.classes):
 9
10           meta = doc.get_metadata()
11
12           if doc.format == "latex":
13               exercise_env = "exercises"
14               if "exercise_env" in meta:
15                   exercise_env = meta["exercise_env"]
16               block = [
17                   RawBlock(r"\begin{" + exercise_env + "}",
18                            "latex"),
19                   elem,
20                   RawBlock(r"\end{" + exercise_env + "}",
21                            "latex")
22               ]
23               return block
24
25           level = 1
26           if "exercise_header_level" in meta:
27               level = int(meta["exercise_header_level"])
28
29           title = [Str("Exercise"),
30                    Space,
31                    Str(str(no_exercise))]
32           no_exercise += 1
33           return [Header(*title, level = level,
34                   classes = elem.classes), elem]
35
36   run_filter(number_exercises)

We use a global variable, no_exercise (line 3), for increasing the header number for each exercise. Inside the filter, we first check if we have a Div block with an Exercise class. If so, we get hold of the meta object from the doc element (line 10) and use it to check if its output format is LaTeX or HTML. If it is LaTeX (line 12), then we get the metavariable exercise_env (with exercises as default), and we create a new block as a replacement for the Div block.

The RawBlock is just verbatim text but only inserted if the output format is “latex.” Of course, we know that the output is LaTeX here and we could leave out the argument, but in other cases, a RawBlock can be useful when you output roughly the same text for all output formats and do not want to check for the output format.

For all other formats (line 25 and below), we use a default level of 1 unless the metavariable exercise_header_level is set (lines 26 and 27). We create the header text (lines 29–31), increment the no_exercise variable (line 32), and then create the Header element. Its first arguments are the text objects that comprise the header, followed by the header level and the classes, which we keep from the Div block. We put elem after the header.

Let us try it on HTML output (where the filter filename is exercises.py and the input Markdown is in exercises.md):
pandoc -F exercises.py exercises.md --to html
The output is this:
<h1 class="Exercise">Exercise 1</h1>
<div class="Exercise">
<p>First exercise</p>
</div>
<h1 class="Exercise">Exercise 2</h1>
<div class="Exercise">
<p>Second exercise</p>
</div>

As you can see, we have added a header to the exercises.

If we set the metavariable for the header level, we modify the level of the header:
pandoc --metadata=exercise_header_level=4 \
       -F exercises.py exercises.md --to html
<h4 class="Exercise">Exercise 1</h4>
<div class="Exercise">
<p>First exercise</p>
</div>
<h4 class="Exercise">Exercise 2</h4>
<div class="Exercise">
<p>Second exercise</p>
</div>
For LaTeX, we get this:
pandoc -F exercises.py exercises.md --to latex
\begin{exercises}
First exercise
\end{exercises}
\begin{exercises}
Second exercise
\end{exercises}

Here we do not create a header but put the exercises into an exercises environment. Since we do not add headers, the header level is ignored.

You need to define the environment in LaTeX for this to work. How to do this is beyond the scope of this book, but you can add an incantation like this in your YAML header:
header-includes: |
    \newcounter{exercounter}[section]
    \newcommand{\theexercise}%
    {\thesection.\arabic{exercounter}}
    \makeatletter
    \newenvironment{exercises}{%
    \par\refstepcounter{exercounter}%
    \protected@edef\@currentlabel{\theexercise}%
    \noindent\textbf{Exercise \theexercise}}{}
    \makeatother
and then have
$for(header-includes)$
$header-includes$
$endfor$

in your template before \begin{document}.

We can add references to the Div blocks.
::: {#ex1 .Exercise}
First exercise
:::
::: {#ex2 .Exercise}
Second exercise
:::
These are automatically kept for the elem Div blocks.
<h4 class="Exercise">Exercise 1</h4>
<div id="ex1" class="Exercise">
<p>First exercise</p>
</div>
<h4 class="Exercise">Exercise 2</h4>
<div id="ex2" class="Exercise">
<p>Second exercise</p>
</div>
You can now use the link syntax, [text](#ex1), to create hyperlinks to the exercises. If you want the reference to be in the header instead of the Div block, you can replace lines 33 and 34 with this:
identifier = elem.identifier
if not identifier:
    identifier = ""
header = Header(*title,
                identifier = identifier,
                level = level,
                classes = elem.classes)
elem.identifier = ""
return [header, elem]

It gives the header the element’s identifier and sets the element’s identifier to the empty string, which means that it will not be inserted in the output.

For LaTeX we want to use \ref commands; we do get hyper reference targets, but that is not what we want. To make \ref work, we need to add a \label command inside the environments. Doing this is straightforward. Simply replace lines 12 to 23 with this:
if doc.format == "latex":
    exercise_env = "exercises"
    if "exercise_env" in meta:
        exercise_env = meta["exercise_env"]
    if elem.identifier:
        label = r"\label{" + elem.identifier + "}"
    else:
        label = ""
    block = [
        RawBlock(r"\begin{" + exercise_env + "}" +
                 label,
                 "latex"),
        elem,
        RawBlock(r"\end{" + exercise_env + "}",
                 "latex")
    ]
    return block

We get the identifier from the Div block (we have seen identifiers earlier), and then we insert the label after the \begin command. If there is no identifier, we insert the empty string.
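The string building is worth isolating, because the backslashes are easy to lose. In an ordinary Python string, "\b" is a backspace character, so the LaTeX commands need raw strings (or doubled backslashes). The helper exercise_wrappers below is a hypothetical name used only to illustrate this; it returns the two pieces of text that would go into the raw LaTeX blocks.

```python
def exercise_wrappers(env="exercises", identifier=""):
    """Build the LaTeX text that opens and closes one exercise."""
    # Only emit a label when the block actually has an identifier
    label = r"\label{" + identifier + "}" if identifier else ""
    begin = r"\begin{" + env + "}" + label
    end = r"\end{" + env + "}"
    return begin, end

print(exercise_wrappers("exercises", "ex1"))
print(exercise_wrappers())

# Without the r prefix, "\b" is read as a single backspace character
print(len("\begin"), len(r"\begin"))
```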

Now we have labels we can use with \ref{} commands in LaTeX (and we can insert those conditional on the output format) and we can insert links for other formats (dependent on those). LaTeX will automatically number the environments (if you have the LaTeX magic listed earlier to define the environment type), and it will automatically insert their reference number. For other formats, you need to insert the numbers yourself in the link.

This is not desirable. It means you have to manually update all numbers if you add an exercise inside your text. We need a better solution. We are going to use syntax similar to pandoc-crossref and pandoc-citeproc, and we need to run our filter before pandoc-citeproc for the same reason that pandoc-crossref must. We don’t want to interfere with references handled by these two filters, so we will give our references an ex: prefix (like pandoc-crossref). Our references will look like this:
[@ex:identifier].

Before we can handle references, however, we want to collect a map from identifiers to exercise numbers. LaTeX will handle this for environments and \ref{} commands, but for other formats we must do it ourselves. Since we are not guaranteed that we see an exercise before we reference it, we must traverse the entire document and make the map before we traverse it again and modify the document.

The easiest way to traverse the document is with the run_filter function, but we cannot run it more than once. There is another function, run_filters (notice the plural), that handles that. One filter will provide the input to the next; we don’t want to modify anything with the filter that collects the map, so we just let it return None (implicitly by not returning another value).

The filter for making the map is straightforward and looks like this:
ex_dict = {}
no_exercise = 1
def collect_numbers(elem, doc):
    global no_exercise
    if (type(elem) == Div and
            "Exercise" in elem.classes):
        if elem.identifier:
            ex_dict[elem.identifier] = no_exercise
        no_exercise += 1
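
The counting logic can be tried out without Pandoc. The following is a minimal sketch in plain Python; fake_div and collect_exercise_numbers are hypothetical names, and the stand-in objects only carry the two attributes the filter looks at (classes and identifier), so this is an illustration of the idea rather than a panflute filter.

```python
from types import SimpleNamespace

def fake_div(classes, identifier=""):
    """Hypothetical stand-in for a Div with only the two attributes we use."""
    return SimpleNamespace(classes=classes, identifier=identifier)

def collect_exercise_numbers(divs):
    """Map identifiers of Exercise divs to their position among the exercises."""
    numbers = {}
    counter = 1
    for div in divs:
        if "Exercise" in div.classes:
            if div.identifier:
                numbers[div.identifier] = counter
            counter += 1  # unlabelled exercises still use up a number
    return numbers

divs = [
    fake_div(["Exercise"], "ex:first"),
    fake_div(["Note"], "note:aside"),            # not an exercise, so it is skipped
    fake_div(["Exercise"]),                      # exercise without an identifier
    fake_div(["Exercise", "hard"], "ex:third"),
]
numbers = collect_exercise_numbers(divs)
print(numbers)
```

Note that the exercise without an identifier is still counted, so the last exercise gets number 3 even though only two entries end up in the map.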

When we number the exercises, in the filter that modifies the document, we need to reset no_exercise. This is not easy when the two filters are called one after another, but a straightforward solution is to use a second counter. If you do this, then the preceding number_exercises filter will work as before. It adds the numbers, but it didn’t need the map earlier, and it doesn’t need it now.

Instead, we will write a third filter that uses the map to resolve the references; let us call it handle_citations. I describe this function in the following text.

We can call the three filters using
run_filters([
    collect_numbers,
    number_exercises,
    handle_citations
])

We must run collect_numbers before handle_citations, but number_exercises can go anywhere in the list.
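
The ordering constraint can be illustrated with a small stand-in for run_filters. The apply_passes function below is hypothetical; it only mimics the convention described here (each action runs over the whole document before the next one starts, and returning None keeps the element) and is not how panflute is implemented. The document is modelled as a list of strings purely for illustration.

```python
def apply_passes(elements, actions):
    """Run each action over all elements before starting the next action."""
    for action in actions:
        updated = []
        for elem in elements:
            result = action(elem)
            # Returning None means "leave this element unchanged".
            updated.append(elem if result is None else result)
        elements = updated
    return elements

labels = {}

def collect(elem):
    # Read-only pass: record labels and implicitly return None.
    if elem.startswith("label:"):
        labels[elem[len("label:"):]] = len(labels) + 1

def resolve(elem):
    # Rewriting pass: replace references using the collected labels.
    if elem.startswith("ref:"):
        return "exercise " + str(labels.get(elem[len("ref:"):], "??"))

doc = ["ref:b", "label:a", "label:b"]

right_order = apply_passes(doc, [collect, resolve])
labels.clear()
wrong_order = apply_passes(doc, [resolve, collect])
print(right_order)
print(wrong_order)
```

With the collecting pass first, the forward reference resolves to a number; with the passes swapped, the map is still empty when the reference is handled.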

If you get the native format for a file that contains these [@ref] references, you get a complex text, but you will see that we have a Cite object that contains a Citation element (there can be more than one, but we will only handle one here). We want to translate these elements. We really want to work with Citation, but if we filter on that, we will create an object that goes into the Cite that encloses it, so we will handle Cite objects and extract the Citation object from it.

A Citation object contains several attributes including the identifier (the @reference text), a prefix (text that goes inside the square brackets but before the reference), and a suffix (text that goes after the reference). We will only use the prefix and the identifier here and ignore (effectively remove) the suffix. See the exercises for including the suffix.
 1   def handle_citations(elem, doc):
 2       if type(elem) == Cite:
 3           actual_cite = elem.citations[0]
 4           identifier = actual_cite.id
 5           if not identifier.startswith("ex:"):
 6               return elem
 7
 8           prefix_text = actual_cite.prefix.list
 9           prefix_text.extend([Space(), Str("exercise")])
10
11           if doc.format == "latex":
12               return actual_cite.prefix.list + [
13                   RawInline(r"~\ref{" + identifier + "}",
14                             "latex")
15               ]
16
17           if identifier in ex_dict:
18               ex_num = ex_dict[identifier]
19               prefix_text.extend([
20                   Space(), Str(str(ex_num))
21               ])
22           return [
23               Link(*actual_cite.prefix.list,
24               url = "#" + identifier)
25           ]

In the first line in the filter, we check if we have a Cite element. There is nothing new there. Then we extract the element we are actually interested in, which is a Citation element.

We get the identifier, which is the citation label (line 4), and check if it is an exercise label, that is, whether it starts with "ex:" (line 5). If it is not, we return the element; we do not want to modify other citation objects since these could be used by other filters.

Now we extract the prefix of the Citation object; again we get the actual content using .list. We add the text "exercise" to the prefix (line 9), so the references will contain this text as well. We need to add a space as well to prevent the prefix from being concatenated with the "exercise" text. We also need to add a space after "exercise", but I want a non-breaking space in the LaTeX output, so there I want a tilde rather than a space (see the following text).

I am assuming that there is already a space before the reference, that is, that the reference looks like this [see @ex:ref] rather than [see@ex:ref]; if not we need to add a space before "exercise". We will need to add it after the reference. Otherwise, the prefix and the reference number will be concatenated.

We now handle LaTeX output separately (lines 11 to 15). We append the \ref{} command to the prefix and return the result. We put the LaTeX code in a RawInline object. This is similar to a RawBlock object but for inline text. We do not add a space here but a tilde.

For other output formats, lines 17 to 25, we look up the exercise number from our map, assuming that there is an identifier, and then we add a space and the number to the prefix. We add a space before the number, so it isn’t concatenated to "exercise". Finally, we create a link from the reference.
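
To make the two branches concrete, here is a minimal sketch that produces the text a reference would turn into. It assumes plain strings instead of panflute elements, and render_reference is a hypothetical helper, not part of the filter. The Markdown-style link in the last line is only there to make the link target visible.

```python
def render_reference(prefix, identifier, numbers, output_format):
    """Return the text a reference such as [see @ex:first] would become.

    Plain strings stand in for the document elements; this is an
    assumption made for illustration only.
    """
    if not identifier.startswith("ex:"):
        return None  # not one of ours: leave it for other filters

    words = list(prefix) + ["exercise"]
    if output_format == "latex":
        # A tilde is a non-breaking space; LaTeX fills in the number itself.
        return " ".join(words) + r"~\ref{" + identifier + "}"

    if identifier in numbers:
        words.append(str(numbers[identifier]))
    # Other formats get a link to the anchor named by the identifier.
    return "[" + " ".join(words) + "](#" + identifier + ")"

numbers = {"ex:first": 1}
latex_text = render_reference(["see"], "ex:first", numbers, "latex")
html_text = render_reference(["see"], "ex:first", numbers, "html")
print(latex_text)
print(html_text)
```

Joining the words with single spaces plays the role of the Space elements in the filter, and the tilde replaces the last of them in the LaTeX branch.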

Exercises

Conditional Inclusion

Modify the filter so you can use a metavariable to determine which difficulties should be included and which should be removed.

Conditional on Output

We removed the Span and Div objects when we modified the input. If the objects had more than the output and format classes, we might want to keep them (but with the .out and .format classes removed). Modify the script to do this.

Evaluating Code

Add a class to the code blocks that will determine whether the original code and the block should remain in the output, while you still evaluate the code in the block.

Numbering Exercises

Can you add the reference suffixes to the output as well?
