Scripting with pandoc

This document is based on pandoc version 1.6. If you are using 1.8, please see Pandoc 1.8 changes, below.

A simple example

Suppose you wanted to replace all level 2+ headers in a markdown document with regular paragraphs, with text in italics. How would you go about doing this?

A first thought would be to use regular expressions. Something like this:

perl -pe 's/^##+ (.*)$/\*$1\*/' source.txt

This should work most of the time. But don’t forget that ATX style headers can end with a sequence of #s that is not part of the header text:

## My header ##

And what if your document contains a line starting with ## in an HTML comment or delimited code block?

<!--
## This is just a comment
-->

~~~~
### A third level header in standard markdown
~~~~

We don’t want to touch these lines. Moreover, what about setext style second-level headers?

A header
--------

We need to handle those too. Finally, can we be sure that adding asterisks to each side of our string will put it in italics? What if the string already contains asterisks around it? Then we’ll end up with bold text, which is not what we want. And what if it contains a regular unescaped asterisk?

How would you modify your regular expression to handle these cases? It would be hairy, to say the least. What we need is a real parser.

Well, pandoc has a real markdown parser, the library function readMarkdown. This transforms markdown text to an abstract syntax tree (AST) that represents the document structure. Why not manipulate the AST directly in a short Haskell script, then convert the result back to markdown using writeMarkdown?

First, let’s see what this AST looks like. We can use pandoc’s native output format:

% cat test.txt
### my header

text with *italics*
% pandoc -t native test.txt
Pandoc (Meta {docTitle = [], docAuthors = [], docDate = []})
[ Header 3 [Str "my",Space,Str "header"]
, Para [Str "text",Space,Str "with",Space,Emph [Str "italics"]] ]

A Pandoc document consists of a Meta block (with title, authors, and date) and a list of Block elements. In this case, we have two Blocks, a Header and a Para. Each has as its content a list of Inline elements. For more details on the pandoc AST, see the haddock documentation for Text.Pandoc.Definition.
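
To make the shape of this structure concrete, here is a minimal sketch using stand-in types. They are cut-down imitations of what Text.Pandoc.Definition provides (only the constructors that appear in the native output above, and no Meta), so the type definitions and the helper names plain and headersFrom are assumptions for illustration, not the library's own.

```haskell
-- ast-sketch.hs
-- Stand-in types (assumptions for illustration, not the real pandoc
-- definitions): just enough constructors to model the output above.
data Inline = Str String | Space | Emph [Inline]
  deriving (Show, Eq)

data Block = Header Int [Inline] | Para [Inline]
  deriving (Show, Eq)

-- The body of test.txt, written out as a Haskell value.
testDoc :: [Block]
testDoc =
  [ Header 3 [Str "my", Space, Str "header"]
  , Para [Str "text", Space, Str "with", Space, Emph [Str "italics"]]
  ]

-- Flatten a list of inlines to plain text, dropping the emphasis markup.
plain :: [Inline] -> String
plain = concatMap go
  where
    go (Str s)   = s
    go Space     = " "
    go (Emph xs) = plain xs

-- Collect the text of every top-level header at level n or deeper.
headersFrom :: Int -> [Block] -> [String]
headersFrom n bs = [plain xs | Header lvl xs <- bs, lvl >= n]

main :: IO ()
main = mapM_ putStrLn (headersFrom 2 testDoc)
```

Once the document is a value like testDoc, questions such as "which lines are level 2+ headers?" become ordinary pattern matches, with none of the ambiguity the regular expression had to fight.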

Here’s a short Haskell script that reads markdown, changes level 2+ headers to regular paragraphs, and writes the result as markdown. If you save it as behead.hs, you can run it using runhaskell behead.hs. It will act like a unix pipe, reading from stdin and writing to stdout. Or, if you want, you can compile it, using ghc --make behead, then run the resulting executable behead.[1]

-- behead.hs
import Text.Pandoc

behead :: Block -> Block
behead (Header n [Emph xs]) | n >= 2 = Para [Emph xs] -- don't double Emph
behead (Header n xs) | n >= 2 = Para [Emph xs]
behead x = x

transformDoc :: Pandoc -> Pandoc
transformDoc = processWith behead

readDoc :: String -> Pandoc
readDoc = readMarkdown defaultParserState

writeDoc :: Pandoc -> String
writeDoc = writeMarkdown defaultWriterOptions

main :: IO ()
main = interact (writeDoc . transformDoc . readDoc)

The magic here is the processWith function, which converts our behead function (a function from Block to Block) to a transformation on whole Pandoc documents.
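
If you are curious how a function on Blocks can reach Blocks buried anywhere in a document, here is a rough sketch of the idea using the generic programming tools in Data.Data (part of base). I am assuming the library does something along these lines; the helper everywhereSketch and the cut-down AST are hypothetical stand-ins, not pandoc's code.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
-- processwith-sketch.hs
import Data.Data (Data, cast, gmapT)

-- Stand-in AST (an assumption for illustration, not pandoc's own types).
data Inline = Str String | Space | Emph [Inline]
  deriving (Show, Eq, Data)

data Block = Header Int [Inline] | Para [Inline] | BlockQuote [Block]
  deriving (Show, Eq, Data)

-- Hypothetical helper: apply f to every value of type a found anywhere
-- inside a value of type b, visiting children before their parents.
everywhereSketch :: (Data a, Data b) => (a -> a) -> b -> b
everywhereSketch f = go
  where
    go :: Data c => c -> c
    go = applyIfMatching . gmapT go   -- rewrite the children, then this node
    applyIfMatching :: Data c => c -> c
    applyIfMatching x = maybe x ($ x) (cast f)  -- f fires only when c is a

behead :: Block -> Block
behead (Header n [Emph xs]) | n >= 2 = Para [Emph xs]
behead (Header n xs)        | n >= 2 = Para [Emph xs]
behead x = x

main :: IO ()
main = print (everywhereSketch behead
                [BlockQuote [Header 2 [Str "nested"]], Header 1 [Str "top"]])
```

The header nested inside the block quote is rewritten even though behead itself never mentions BlockQuote; the traversal does the walking, and behead only has to describe what happens at a single node.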

Digression: reader and writer options

The behead.hs script uses default options for the parser and writer. If you want more control, you can modify these defaults. (See the definitions for ParserState and WriterOptions in the haddock documentation for Text.Pandoc.Shared.)

For example, the following variants will disable pandoc markdown extensions (“strict mode”) and write markdown using reference-style links instead of inline links:

readDoc = readMarkdown defaultParserState{
                           stateStrict = True  -- disable pandoc extensions
                         }

writeDoc = writeMarkdown defaultWriterOptions{
                           writerStrictMarkdown = True -- disable pandoc exts 
                         , writerReferenceLinks = True -- use ref-style links
                         }

Queries: listing URLs

We can use this same technique to do much more complex transformations and queries. Here’s how we could extract all the URLs linked to in a markdown document (again, not an easy task with regular expressions):

-- extracturls.hs
import Text.Pandoc

extractURL :: Inline -> [String]
extractURL (Link _ (u,_)) = [u]
extractURL (Image _ (u,_)) = [u]
extractURL _ = []

extractURLs :: Pandoc -> [String]
extractURLs = queryWith extractURL

readDoc :: String -> Pandoc
readDoc = readMarkdown defaultParserState

main :: IO ()
main = interact (unlines . extractURLs . readDoc)

queryWith is the query counterpart of processWith: it lifts a function that operates on Inline elements to one that operates on the whole Pandoc AST, concatenating the lists returned for each Inline into a single result list.
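
Here is a sketch of how such a query could be built from Data.Data (part of base). It is only my guess at the general approach: querySketch and the reduced AST below are hypothetical stand-ins, and the sample URLs are made up.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
-- querywith-sketch.hs
import Data.Data (Data, cast, gmapQ)

type Target = (String, String)  -- (URL, title), as in the patterns above

-- Stand-in AST (an assumption for illustration, not pandoc's own types).
data Inline = Str String | Space | Emph [Inline]
            | Link [Inline] Target | Image [Inline] Target
  deriving (Show, Eq, Data)

data Block = Para [Inline] | BlockQuote [Block]
  deriving (Show, Eq, Data)

-- Hypothetical helper: run f on every value of type a inside x and
-- concatenate the resulting lists, each node before its children.
querySketch :: (Data a, Data b) => (a -> [r]) -> b -> [r]
querySketch f x = here ++ concat (gmapQ (querySketch f) x)
  where here = maybe [] f (cast x)   -- empty unless x itself has type a

extractURL :: Inline -> [String]
extractURL (Link _ (u, _))  = [u]
extractURL (Image _ (u, _)) = [u]
extractURL _                = []

sample :: [Block]
sample =
  [ Para [Str "see", Space, Link [Str "pandoc"] ("http://example.com/", "")]
  , BlockQuote [Para [Emph [Image [Str "logo"] ("logo.png", "")]]]
  ]

main :: IO ()
main = mapM_ putStrLn (querySketch extractURL sample)
```

Note that the image is found even though it sits inside an Emph inside a block quote, which is exactly the kind of nesting a line-oriented regular expression cannot see.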

LaTeX for WordPress

Another easy example. WordPress blogs require a special format for LaTeX math. Instead of $e=mc^2$, you need: $LaTeX e=mc^2$. How can we convert a markdown document accordingly?

Again, it’s difficult to do the job reliably with regexes. A $ might be a regular currency indicator, or it might occur in a comment or code block or inline code span. We just want to find the $s that begin LaTeX math. If only we had a parser…

We do. Pandoc already extracts LaTeX math, so:

-- wordpressify.hs
import Text.Pandoc

wordpressify (Math x y) = Math x ("LaTeX " ++ y)
wordpressify x = x

readDoc = readMarkdown defaultParserState

writeDoc = writeMarkdown defaultWriterOptions

main = interact (writeDoc . processWith wordpressify . readDoc)

Mission accomplished. (I’ve omitted type signatures here, just to show it can be done.)

Include files

So far, none of our transforms have involved IO. How about a script that reads a markdown document, finds all the delimited code blocks with attribute include, and replaces their contents with the contents of the file given?

-- includes.hs
import Text.Pandoc

doInclude :: Block -> IO Block
doInclude cb@(CodeBlock (ident, classes, namevals) _) =
  case lookup "include" namevals of
       -- replace the block's contents with the contents of the named file
       Just f     -> fmap (CodeBlock (ident, classes, namevals)) (readFile f)
       Nothing    -> return cb
doInclude x = return x

readDoc :: String -> Pandoc
readDoc = readMarkdown defaultParserState

writeDoc :: Pandoc -> String
writeDoc = writeMarkdown defaultWriterOptions

main :: IO ()
main = getContents >>= processWithM doInclude . readDoc >>= putStrLn . writeDoc

Try this on the following:

Here's the pandoc README:

~~~~ {include="README"}
this will be replaced by contents of README
~~~~

The trick here is processWithM, which is just a monadic version of processWith, and lifts doInclude from Block -> IO Block to Pandoc -> IO Pandoc.
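
One design choice worth considering: if the file-reading action is passed in as a parameter, the same transformation can be tried out without touching the disk. The sketch below is my own variation, not part of pandoc. The types are cut-down stand-ins, doIncludeWith, includeAll, and fakeFiles are hypothetical names, and mapM only visits top-level blocks, whereas processWithM would reach nested ones too.

```haskell
-- include-sketch.hs
-- Stand-in types (assumptions for illustration, not pandoc's definitions).
type Attr = (String, [String], [(String, String)])

data Block = Para String | CodeBlock Attr String
  deriving (Show, Eq)

-- Like doInclude, but the action that fetches a file is a parameter,
-- so it works in IO (with readFile) or in a pure monad such as Maybe.
doIncludeWith :: Monad m => (FilePath -> m String) -> Block -> m Block
doIncludeWith fetch cb@(CodeBlock attr@(_, _, namevals) _) =
  case lookup "include" namevals of
    Just f  -> do contents <- fetch f
                  return (CodeBlock attr contents)
    Nothing -> return cb
doIncludeWith _ x = return x

-- Apply the transformation to each top-level block in turn.
includeAll :: Monad m => (FilePath -> m String) -> [Block] -> m [Block]
includeAll fetch = mapM (doIncludeWith fetch)

-- A pretend file system for testing: unknown paths yield Nothing.
fakeFiles :: FilePath -> Maybe String
fakeFiles path = lookup path [("README", "contents of README\n")]

main :: IO ()
main = do
  let doc = [ Para "Here's the pandoc README:"
            , CodeBlock ("", [], [("include", "README")]) "placeholder" ]
  print (includeAll fakeFiles doc)
  -- With real files instead: includeAll readFile doc >>= print
```

In the Maybe version a missing file makes the whole result Nothing, which is a convenient way to test the failure case that would otherwise be an IO exception.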

Documentation on processWith, queryWith, processWithM, and queryWithM can be found in the haddock documentation for Text.Pandoc.Definition.

Removing links

What if we want to remove every link from a document, retaining the link’s text?

-- delink.hs
import Text.Pandoc

delink :: [Inline] -> [Inline]
delink ((Link txt _) : xs) = txt ++ delink xs
delink (x : xs)            = x : delink xs
delink []                  = []

readDoc = readMarkdown defaultParserState

writeDoc = writeMarkdown defaultWriterOptions

main = interact (writeDoc . processWith delink . readDoc)

Note that delink can’t be a function of type Inline -> Inline, because the thing we want to replace the link with is not a single Inline element, but a list of them. So we make delink a function from lists of Inline elements to lists of Inline elements. processWith can still lift this function to a transformation of type Pandoc -> Pandoc.

Exercises

  1. Put all the regular text in a markdown document in ALL CAPS (without touching text in URLs or link titles).

  2. Remove all horizontal rules from a document.

  3. Renumber all enumerated lists with roman numerals.

  4. Replace each delimited code block with class dot with an image generated by running dot -Tpng (from graphviz) on the contents of the code block.

Pandoc 1.8 changes

processWith

processWith f traverses the Pandoc AST from the bottom up. Suppose you have a BulletList that contains another BulletList in one of its items. Then processWith f will operate first on the nested list, and only later on the containing list.

To make this more salient, pandoc 1.8 renames processWith as bottomUp. (processWith will still work, but it is deprecated and may be removed in a later version.) Pandoc 1.8 also adds a function topDown. topDown f will traverse the tree from the top down.
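
The order of traversal can change the result. Here is a small sketch with hand-written traversals over a toy Inline type. The type and the names bottomUpInline, topDownInline, and collapse are hypothetical, and I am assuming that a top-down traversal applies the function to a node first and then descends into the children of the result.

```haskell
-- traversal-order.hs
-- Toy type for illustration only (not pandoc's Inline).
data Inline = Str String | Emph [Inline]
  deriving (Show, Eq)

-- Rewrite the children first, then the node that contains them.
bottomUpInline :: (Inline -> Inline) -> Inline -> Inline
bottomUpInline f (Emph xs) = f (Emph (map (bottomUpInline f) xs))
bottomUpInline f x         = f x

-- Rewrite the node first, then descend into the children of the result.
topDownInline :: (Inline -> Inline) -> Inline -> Inline
topDownInline f x =
  case f x of
    Emph xs -> Emph (map (topDownInline f) xs)
    y       -> y

-- Remove one level of directly nested emphasis.
collapse :: Inline -> Inline
collapse (Emph [Emph xs]) = Emph xs
collapse x                = x

nested :: Inline
nested = Emph [Emph [Emph [Str "x"]]]

main :: IO ()
main = do
  print (bottomUpInline collapse nested)  -- Emph [Str "x"]
  print (topDownInline collapse nested)   -- Emph [Emph [Str "x"]]
```

Bottom up, each level has already been simplified by the time its parent is examined, so all the nesting disappears. Top down, the outermost node is rewritten once and the traversal moves on, leaving one level of nesting behind.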

JSON reader and writer

Starting with version 1.8, the pandoc command-line tool has a json reader and writer. Both are very fast (especially compared to the native reader and writer). So, instead of writing a script (as shown above) that converts from one format to another, you may want to write a general purpose script that accepts JSON and produces JSON. This can then be used in a pipe with pandoc.

Here is an example, using the convenience function jsonFilter from Text.Pandoc:

import Text.Pandoc

main = interact $ jsonFilter $ bottomUp removeLink

removeLink :: Inline -> Inline
removeLink (Link xs _) = Emph xs
removeLink x = x

To use it to remove the links from an rst document (each link is replaced by its text, emphasized):

pandoc -f rst -t json | runghc myscript.hs | pandoc -f json -t rst

If you need more speed, you can compile the script:

ghc --make myscript.hs
pandoc -f rst -t json | ./myscript | pandoc -f json -t rst

New raw constructs

The old RawHtml, HtmlInline, and TeX constructors have been removed and replaced with the more generic RawBlock and RawInline. This opens up new possibilities for scripting. For example, a script could take a specially marked code block, run it through pygments to highlight it in a way that is appropriate for the output format, then insert the result into the document as RawBlock format code, where format is the output format (e.g. "html") and code is the highlighted output (HTML, in this example).

One could also use this technique to change how pandoc renders a particular document element, say images, in a particular output format.
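
As a rough sketch of the code block idea, here is a toy version that needs no external tools. Everything in it is an assumption for illustration: the types are cut-down stand-ins, the class name highlight is just a marker I picked, and renderFor is a hypothetical placeholder that merely escapes the text and wraps it in a pre element where a real script would call pygments.

```haskell
-- rawblock-sketch.hs
-- Stand-in types (assumptions for illustration, not pandoc's definitions).
type Attr = (String, [String], [(String, String)])

data Block
  = Para String
  | CodeBlock Attr String
  | RawBlock String String   -- output format, then the raw text
  deriving (Show, Eq)

-- Escape the characters that are special in HTML text.
escapeHtml :: String -> String
escapeHtml = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc c   = [c]

-- Hypothetical placeholder for a real highlighter: it only knows how to
-- produce HTML, and returns Nothing for any other format.
renderFor :: String -> String -> Maybe String
renderFor "html" code =
  Just ("<pre class=\"highlight\">" ++ escapeHtml code ++ "</pre>")
renderFor _ _ = Nothing

-- Replace marked code blocks with raw output for the given format,
-- leaving the block untouched when the format is not supported.
highlightBlock :: String -> Block -> Block
highlightBlock format cb@(CodeBlock (_, classes, _) code)
  | "highlight" `elem` classes =
      maybe cb (RawBlock format) (renderFor format code)
highlightBlock _ x = x

main :: IO ()
main = print (highlightBlock "html"
                (CodeBlock ("", ["highlight"], []) "a < b"))
```

Falling back to the original code block for unsupported formats means the same script can sit in a pipeline no matter which writer runs afterwards.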


  1. I’m assuming here that you’re using GHC 6.12, which handles character encodings and line endings automatically. If you’re not, you should import System.IO.UTF8 and (if on Windows) filter \r characters out of the input.