Automated Web Publishing
No-Tags Markup
February 2005: This document was written in 1995 and modified only slightly since then. Meanwhile, a few people have adopted No-Tags Markup, and several similar systems have been developed. (I've added some links below.) Many are used to enter relatively small amounts of text into a blog editor, wiki, or content management system, which converts the text into HTML or XML (and perhaps from there to PDF, RTF, etc.). The opposite path is also useful: create an enhanced plain-text version from an original stored as XML or HTML.
Similar systems that appear to have an active base of users:
- wiki markup.
- STX (Structured Text) by Jim Fulton, e.g. for Zope and ZWiki.
- reStructuredText for Python's DocUtils.
- Markdown by John Gruber of Daring Fireball, including an active mailing list.
- Seth Golub's (and now Kathryn Andersen's) txt2html, a filter that will convert plain, ordinary text to HTML.
Plus a few that may not have caught on much beyond the original developer:
Creating formatted text without a markup language
Although the desktop publishing revolution began 10 years ago, we still have not left plain text behind. It has a few compelling advantages that are not likely to go away soon.
- Universal: any computer, any software, any online service
- Simple: easy to create & edit (in any software)
- Compact: small file size (any disk, any phone connection, any modem speed)
And yet, formatted text and integrated graphics generally make a document much easier to read. Lists can be set off, as above. Words can be emphasized, headlines made to stand out, and literal text marked appropriately. Plain text also cannot adequately capture the structure of a document, e.g. the section break that appears a few paragraphs below.
The rapid growth of the World Wide Web brings the problem into sharp focus. A Web browser shows nicely formatted text & graphics, but the underlying document is a text file that contains special HTML markup commands. (Graphics are stored as separate files, and placed by the browser.) The Web is a great vehicle for delivering new information rapidly, and providing access to broad & deep historical data.
How will this information, old and new, flow onto the Web? What new tools have to be purchased and learned? How can that information be flowed automatically from or to other places?
One answer is text. Writers can write, using any software on any computer. Text databases can be a source or result. E-mail can be captured, or text compilations can be e-mailed. But plain text is not a good answer; it's simply inadequate. Having writers learn HTML markup is also not a good answer. We need something new.
Note: Ian Feldman's setext addresses some of these issues. However, as discussed below, I think it is not an adequate solution.
No-Tags Markup
The No-Tags Markup (tm) system provides a simple way to "markup" text documents to indicate format and structure, without using formal "tags" from a traditional markup language such as HTML. Instead, a few unobtrusive characters are used to indicate format, e.g. _ (underscore) for italics and * (asterisk) for bold. Document structure is conveyed with a similar approach, e.g. a - (dash) for list items, or a line containing only "---" for a section break. These elements do not intrude on the text. In fact, the basic ones described so far are quite common in e-mail and other online communication.
The second aspect of No-Tags Markup is a simple way to handle line breaks. There are two competing goals: to let the reader's software wrap paragraphs to any desired width, and to enable the writer to indicate text that should not be wrapped (and also not formatted as a standard paragraph). As with traditional markup, No-Tags Markup uses two "new line" characters (carriage return, line feed or both depending on computer) to indicate a paragraph break. Unlike traditional markup, it treats one or more spaces or tabs at the beginning of a line as an indicator that the line should not be wrapped. (The "For more information" section at the end of this document is a real-world example.)
Finally, there are a few important types of format and structure that have no obvious or commonly-accepted indicators, including different levels of headers, blocked quotations (which should wrap, but appropriately) and blocks of text in a fixed-width typeface (e.g. for code or simple tables). _Because there are no existing ad hoc standards, this area of the draft specification is the most subject to change. Let me know what you think!_
The current No-Tags Markup specification is a draft, submitted to the online community for feedback. This entire, unmodified document may be freely distributed.
A parser for No-Tags Markup exists today. The "source" for many documents on this web site (including this document) is a "No-Tags" text file.
Draft Specification
July 10, 1995
One brief aside: I dislike the extra newline that Netscape puts before a list, but I have not discovered a workaround.
Design goals
- Where possible, formalize existing practice.
- Where needed, create a simple, unobtrusive markup scheme.
Character format
- indicate bold text by surrounding with an * (asterisk).
To be compatible with "setext", No-Tags parsers should treat ** as equivalent.
- indicate "underlined" text by surrounding with an _ (underscore).
Note that typographers do not usually distinguish underline from italic. Netscape, for example, renders "emphasized" text as italic (saving underlines for links). No-Tags parsers should have a flag to map underlined text to italic.
- indicate italic text by surrounding with a ~ (tilde). Where there is no difference, the _ is preferred since it is more readable.
- indicate fixed-character-width text by surrounding with a ` (backwards single quote).
- NOTE: character formats may not cross paragraph boundaries.
- NOTE: the markup is not a simple "first character turns on, next turns off". A single markup character does nothing, and will be ignored. (For example, the above characters were typed literally; no "escaping" was used.) Only when a matching pair is found in a single paragraph will the text be marked.
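The matched-pair rule above can be sketched in a few lines of Python. This is only an illustration, not the author's parser: the names `format_paragraph` and `format_document` are hypothetical, mapping ~ to the I tag is an assumption, nested formats and the experimental doubled-character escape are not handled, and no HTML escaping is done. STRONG, EM and TT follow the translation notes later in this document.

```python
import re

# Marker -> HTML tag. STRONG, EM and TT follow this document's translation
# notes; mapping ~ to I is an assumption made for this sketch.
TAGS = {"*": "STRONG", "_": "EM", "~": "I", "`": "TT"}

# A marker, then text containing no marker of any kind, then the same marker.
# Excluding markers from the inner text means nesting is not supported here.
PAIR = re.compile(r"([*_~`])([^*_~`]+)\1")


def format_paragraph(paragraph: str) -> str:
    """Wrap matched pairs in tags; an unmatched marker is left as typed."""
    def wrap(match):
        tag = TAGS[match.group(1)]
        return f"<{tag}>{match.group(2)}</{tag}>"
    return PAIR.sub(wrap, paragraph)


def format_document(text: str) -> str:
    """Format each paragraph separately so no pair crosses a blank line."""
    return "\n\n".join(format_paragraph(p) for p in text.split("\n\n"))


print(format_paragraph("This is *bold* and _underlined_ text."))
# This is <STRONG>bold</STRONG> and <EM>underlined</EM> text.
print(format_paragraph("A lone * is ignored."))
# A lone * is ignored.
```

Splitting on blank lines before matching is what enforces the rule that character formats may not cross paragraph boundaries.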
Lists
- leading - (dash) indicates a standard list item (Netscape renders as a bullet)
- at present, a leading * (asterisk) with no matching asterisk also indicates a standard list item
- leading # (variously called number or pound sign, hash mark, sharp sign) indicates a numbered list item
- leading spaces indicate that indentation should be preserved; somewhat like a "list" with no list marker. Netscape ignores leading spaces even when formatted as PRE, so the indentation is currently achieved with a transparent GIF image.
- NOTE: Leading spaces & tabs before a list marker are ignored (i.e. lists supersede "no wrap" text).
- NOTE: There must be one and only one list marker; 2 or more will be treated as plain text (or, with leading whitespace, as "no wrap" text).
- NOTE: To include paragraph text without breaking the list, precede it with whitespace.
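As a rough sketch of how the list rules above interact, the hypothetical function below classifies a single line. It assumes that "2 or more" markers means the marker character repeated at the start of the line, and that a leading asterisk counts as a bullet only when no other asterisk appears later on the same line; the real parser may well decide these cases differently.

```python
def classify_line(line: str) -> str:
    """Return 'blank', 'bullet', 'numbered', 'no-wrap' or 'text' for one line."""
    stripped = line.lstrip(" \t")
    if not stripped:
        return "blank"
    indented = len(stripped) < len(line)
    marker = stripped[0]
    if marker in ("-", "*", "#"):
        # Assumption: a doubled marker ("--", "**", "##") is not a list item.
        doubled = stripped[1:2] == marker
        if not doubled:
            if marker == "#":
                return "numbered"
            # A leading * with a matching * later on is bold, not a bullet.
            if marker == "-" or "*" not in stripped[1:]:
                return "bullet"
    # Whitespace before a list marker was ignored above; otherwise leading
    # whitespace marks the line as "no wrap".
    return "no-wrap" if indented else "text"


for sample in ["- item", "   # first", "*bold* start", "    indented", "  -- x"]:
    print(repr(sample), "->", classify_line(sample))
```

Checking for a list marker before checking indentation is what makes lists take precedence over "no wrap" text.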
"No wrap" text
- tabs will be converted to 4 spaces. No-Tags parsers should have an option to set the number.
- leading whitespace will prevent a line from being wrapped (unless superseded by another No-Tags Markup rule).
- NOTE: See the above note on Netscape.
- NOTE: Single "new lines" (carriage returns) are passed through unchanged. However, processing through traditional markup languages such as HTML will strip them out.
- NOTE: I did not think of a simple way to have no-wrap lines without indentation. If the source text comes from a word processor, any hard carriage return could be preserved. If the text has been e-mailed, clearly that would not apply.
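A minimal sketch of the tab and "no wrap" rules, assuming each tab is replaced by a fixed number of spaces (rather than advancing to a tab stop). The function name `prepare_line` and the parameter `tab_width` are hypothetical, and the list-marker rules that can override "no wrap" are not handled here.

```python
def prepare_line(line: str, tab_width: int = 4):
    """Return (no_wrap, text) for one source line.

    no_wrap is True when the line begins with a space or a tab; text is the
    line with every tab replaced by tab_width spaces.
    """
    no_wrap = line[:1] in (" ", "\t")
    return no_wrap, line.replace("\t", " " * tab_width)


print(prepare_line("\tcolumn"))      # (True, '    column')
print(prepare_line("plain text"))    # (False, 'plain text')
```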
Special blocks
- use ` (backwards single quote) as the only thing on a line to begin, and another to end, a block of "no wrap", fixed-width-typeface (mono-spaced text).
- use `` (2 backwards single quotes) as the only thing on a line to begin, and another to end, a blocked quotation.
- NOTE: if either block is "introduced" but not "closed", it will remain in effect until the end of the document.
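The block rules above amount to a small state machine, sketched below. The function name `split_blocks` and the block names 'normal', 'fixed' and 'quote' are made up for this example. Two assumptions: trailing whitespace after a block marker is tolerated, and a marker of the other kind inside an open block is kept as ordinary content (blocks do not nest).

```python
def split_blocks(text: str):
    """Split text into (kind, lines) pairs; kind is 'normal', 'fixed' or 'quote'."""
    fences = {"`": "fixed", "``": "quote"}
    blocks = []
    kind, current = "normal", []
    for line in text.splitlines():
        fence = fences.get(line.rstrip())
        if kind == "normal" and fence:
            # A marker alone on a line opens a block.
            if current:
                blocks.append((kind, current))
            kind, current = fence, []
        elif kind != "normal" and fence == kind:
            # The same marker closes the block that is currently open.
            blocks.append((kind, current))
            kind, current = "normal", []
        else:
            current.append(line)
    # A block that was never closed stays in effect until the end of the text.
    if current or kind != "normal":
        blocks.append((kind, current))
    return blocks


sample = "intro\n`\ncode 1\ncode 2\n`\nafter\n``\nquoted"
for block in split_blocks(sample):
    print(block)
```

In the sample, the final blocked quotation is opened but never closed, so it runs to the end of the input, as the last note describes.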
Structure
- use --- as the only thing on a line for a section break (often: a horizontal line).
Perhaps the spec should include flags to partially relax this rule, e.g. a minimum and/or maximum number of characters would allow for parsing internet digests.
- indicate a heading by surrounding with 3, 5 or 7+ asterisks (*) or underscores (_).
The values go up by 2 so that the marks will be distinct. This spec is intended for simple, page-oriented documents, so 3 levels of headings seemed the best compromise between simple and flexible. The current parser allows heading text to appear anywhere within a line (though Web browsers may force the header to start on a new line). Although up-by-two is preferred, matched pairs of intermediate values are "rounded up", i.e. a matching pair of 4 (****) is treated as 5, and a matching pair of 6 is lumped in with 7+.
Hypertext Links
- A link is specified with a leading ^ (caret), the text to be linked, a comma, an internal or external reference, and a trailing caret. If the text and the reference are the same, only one need be included. (To form a reference, spaces will be stripped from the text and ".html" will be added.) The internal reference must begin with #. The linked text may not contain a comma.
- An internal link destination or label is specified with a leading ^ (caret), a # sign, the label text, and a trailing caret. The text may not contain a comma. At present, the link text is treated strictly as a label; the text will not appear in the final document.
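The caret syntax for links and labels can be sketched as follows. The names `render_caret` and `convert_links` are hypothetical, and the HTML produced (A HREF for links, an empty A NAME for labels) is an assumption about what a translator might emit, since the spec above does not say. The "image" and "ftp" keywords are not handled.

```python
import re

# A caret, a body with no caret or newline in it, and a closing caret.
CARET = re.compile(r"\^([^^\n]+)\^")


def render_caret(body: str) -> str:
    """Turn the text between two carets into an HTML link or label."""
    if body.startswith("#") and "," not in body:
        # ^#label^ marks a link destination; the label text is not displayed.
        return f'<A NAME="{body[1:]}"></A>'
    text, comma, ref = body.partition(",")
    text = text.strip()
    if comma:
        ref = ref.strip()
    else:
        # Only the text was given: strip its spaces and add ".html".
        ref = text.replace(" ", "") + ".html"
    return f'<A HREF="{ref}">{text}</A>'


def convert_links(paragraph: str) -> str:
    return CARET.sub(lambda m: render_caret(m.group(1)), paragraph)


print(convert_links("See the ^Draft Specification^."))
# See the <A HREF="DraftSpecification.html">Draft Specification</A>.
print(convert_links("Back to ^top, #top^"))
# Back to <A HREF="#top">top</A>
```

Splitting at the first comma reflects the rule that the linked text may not contain a comma, while the reference is left free to contain one.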
Images, Files
- The caret mechanism has been extended for embedding images (include the keyword "image" followed by a colon and the name), including a "centered image" and a special "title image" used by the StageThree automated web publishing system.
- The caret mechanism also specifies FTP links (keyword "ftp"; StageThree adds an icon and, for local files, the file size).
Notes on translation to HTML
- "bold" is optionally rendered as STRONG, "underline" as EMphasis (which Netscape displays as italic)
- "fixed-width" is rendered as TT (TypewriterText)
- "fixed-width" blocks are rendered as PRE, blocked quotes as BLOCKQUOTE
- section break is rendered as HorizontalRule
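The section-break and heading rules from the Structure section can be sketched as two small helpers. The names `is_section_break` and `find_heading` are hypothetical. The spec does not say which mark count is the most prominent heading or which HTML heading tag each maps to, so the sketch only reports the rounded mark count (3, 5 or 7) and leaves that mapping open. It also assumes the heading text itself contains no asterisks or underscores.

```python
import re
from typing import Optional, Tuple

# The same run of 3+ asterisks or 3+ underscores on both sides of the text.
# The lookbehind and lookahead reject pairs whose lengths do not match.
HEADING = re.compile(r"(?<![*_])(\*{3,}|_{3,})\s*([^*_]+?)\s*\1(?![*_])")


def is_section_break(line: str) -> bool:
    """True for a line holding only ---, which is rendered as HR in HTML."""
    return line.strip() == "---"


def find_heading(line: str) -> Optional[Tuple[int, str]]:
    """Return (marks, text) for a heading anywhere in the line, else None."""
    match = HEADING.search(line)
    if match is None:
        return None
    count = len(match.group(1))
    # Round intermediate counts up: 4 is treated as 5, and 6 or more as 7+.
    marks = 3 if count == 3 else 5 if count <= 5 else 7
    return marks, match.group(2)


print(find_heading("*** Design goals ***"))   # (3, 'Design goals')
print(find_heading("**** Lists ****"))        # (5, 'Lists')
print(find_heading("*bold* text"))            # None
```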
Planned Features
- nested lists (using + and/or whitespace count)
- NOTE: to help avoid unnecessary complexity, no feature will be added to the formal specification until I have a working parser that supports it.
Optional "setext" extensions under consideration
- leading > as blockquote (or PRE?) ... have to think about multiple occurrences: >>, >>> etc.
- handle ** for bold (implemented as an experimental feature)
- user preference to wrap text despite n leading spaces
- look ahead one line to handle chapter & section breaks
- what else might be needed???
Features under consideration
- a single-character "escape" mechanism using \ (backslash)
Because the spec requires matched pairs, a single-character escape may not be required. An experimental feature of the current parser treats two * _ ~ or ` in a row as the literal character.
- a block "escape" mechanism, perhaps ///
- add === (3 equals signs) for a chapter break (often: thicker separator line)
- additional character formats
- begin a line with | (pipe, vertical bar) to center it ... and possibly [] for left & right justified (and both for filled?)
- block centering with | on its own line ... and perhaps left with [, right with ]
- definition lists, perhaps /term/
Contrast to "setext"
No-Tags Markup shares with Ian Feldman's setext (structure-enhanced text) the important goal of marked up text that is easy to read, indeed where the markup either enhances the plain text itself or is almost invisible. However, I see two main problems with setext:
- the tags are not as straightforward as they could be
- many simple and useful elements (such as this numbered list!) are not included
Although I can see some reasons for setext's choice of tags, I disagree with the tradeoffs. Why invent ** for bold? A single * is simpler, more consistent, and more commonly used ad hoc in e-mail and other online communication. The current No-Tags Markup parser can differentiate the most common uses of an asterisk as a bullet vs. as bold, including bold at the beginning of the line. No absolute tradeoff is required. Limiting italic to single words makes an editorial decision that should be left to authors. "===" for title and "---" for subhead may be fine for newsletters, but that choice ignores their ad hoc use as section breaks before or independent of any heading. The requirement that the heading and "tag" have the same number of characters also increases the possibility of mistakes in creating a setext document (and incidentally complicates the parser). The extra potential for error may be acceptable for the newsletter editors that were (I believe) setext's original target, but is not satisfactory for more ad hoc authorship. The tag for hypertext link is simply strange: no intrinsic meaning and difficult to distinguish from underlined text. It also fails to specify the link destination (or did I miss that?).
No-Tags Markup adds support for many common document elements that are missing from setext, including mono-spaced text (in context), plain and numbered lists, block quotes (that are not mono-spaced), and a block definition for mono-spaced lines. It also follows the typical ad hoc rule that indented text should not be wrapped.
The setext format is used by the popular TidBITS online newsletter, EFF's newsletter, MacWeek articles on ZiffNet, and probably elsewhere. For details on setext, check out:
- An introduction, table of "typotags", 2 sermons, discussion, etc.
- The introduction is also available by e-mail from setext@tidbits.com.
- EFF's short description
- David Martland's LaTeX converter
- Andrew Pam of Serious Cybernetics offers HTML to setext source code
Other text formats
I've found a few other interesting approaches to this problem:
- Scott J. Kleper's HTML Markup, a very cool Mac solution
- Meng Weng Wong's txt2html, same name & goals as the other txt2html, different program.
- Man Kam Kwong's htxp macro processor.
[February 2005: newer solutions are listed at the top.]
Challenges
The underscore (_) and tilde (~) characters often appear in FTP "addresses" (URLs). When they appear in pairs within the same paragraph, either the parser must be sophisticated enough to understand the URL context, or the human author/editor must "escape" them (by doubling). I don't like either solution, but am also not ready to toss these characters.
For more information:
I use No-Tags Markup to maintain documents (such as this one) that can be e-mailed directly or automatically processed into Web pages using my StageThree system. Although this spec is a draft, I encourage you to use it where it fits your needs. If you have questions or comments, or scripts to share, please let me know.
Internet: ssl@prefab.com
The No-Tags Markup specification is Copyright 1995 by Scott S. Lawton. All Rights Reserved. No-Tags Markup is a trademark owned by Scott S. Lawton.
No-Tags Markup Example.ntm
Copyright 1995-2005, Scott S. Lawton. All Rights Reserved.
This site (mostly) built and maintained using Stage Three, a set of custom Frontier scripts.