
Javascript: Extending our markup language parser, part seven

Previously we refactored our markup parser to get rid of the regular expressions and to let tags have different start and end markers ( https://newfivefour.com/javascript-extending-our-regex-less-markup-parser-part-six.html ).

But it still doesn’t deal with ##, since each marker is a single character.

We were looking at a single character of the input string at a time through txt.split("").forEach(function(letter, pos) {. We should refactor our code into a for loop with an iterator.

This means we won’t automatically be handed a letter but a position, from which we can look at more than one letter if we want to. And we can increase the iterator when we want to skip letters.

We were using the extra argument to the forEach loop as a state object that contained the marker’s last position, all the tokens we can match, and so on. That state will now live in a separate object, parseState, which the loop will alter.

Here’s the code now that we’ve refactored the forEach loop into a for loop:

let parseState = { 
    endtag: undefined, pos: -1, isGrabbingOther: false, 
    tokens: [["*", "*"], 
             ["/", "/"], 
             ["_", "_"], 
             ["`", "`"],
             ["[", "]"]]}
for(var pos = 0; pos < txt.length; pos++) {
  let letter = txt[pos];
  if(hasOtherGrabEnded(letter, parseState)) {
    tokens.push(txt.substr(parseState.pos, pos - parseState.pos))
    parseState.isGrabbingOther = false
    startTag(letter, pos, parseState)
  } else if(hasTagEnded(letter, parseState) || endOfInput(txt, pos)) {
    tokens.push(txt.substr(parseState.pos, pos - parseState.pos + 1))
    endTag(parseState)
  } else if(shouldStartNewTag(parseState)) {
    if(hasStartTag(letter, parseState)) startTag(letter, pos, parseState)
    else startOtherGrab(pos, parseState)
  }
}

( You can play with it here: https://codepen.io/newfivefour/pen/OBXXdp )

We can also change the helper functions. They close over the parseState so there’s no need to pass it in. And we can make them deal not with letters but with positions in a string. We’ll have a look at some of the functions:

  let parseState = { 
    endtag: undefined, pos: -1, isGrabbingOther: false, 
    tokens: [["*", "*"], 
             ["/", "/"], 
             ["_", "_"],
             ["`", "`"],
             ["[", "]"]]}
  let startTokens       = parseState.tokens.map(t => t[0])
  let endTokens         = parseState.tokens.map(t => t[1])
  let getStartTagOrNull = (pos) => startTokens.filter(t => t == txt.substr(pos, t.length))[0]
  let getEndTagOrNull   = (pos) => endTokens.filter(t => t == txt.substr(pos, t.length))[0]
  let endingMarkFor     = (start) => parseState.tokens.filter(t => t[0] == start)[0][1]

This defines our parse state with the starting and ending tokens, creates one array of all the start markers and one of all the end markers, finds a start marker at a given position in the txt string (by comparing txt at that position against each marker for the marker’s length), finds an end marker at a position in the same way, and returns the ending marker for a given starting marker.

The remaining functions alter the state of parseState during the loop.
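
For reference, here’s a sketch of how those remaining helpers could look once they close over parseState and take positions instead of letters. It’s adapted from the part-six versions further down this page, so the bodies in the CodePen may differ slightly:

let startTag          = (tag, pos) => { parseState.endtag = endingMarkFor(tag); parseState.pos = pos }
let endTag            = () => parseState.endtag = undefined
let hasTagEnded       = (pos) => parseState.endtag && parseState.endtag == txt.substr(pos, parseState.endtag.length)
let startOtherGrab    = (pos) => { parseState.isGrabbingOther = true; parseState.pos = pos }
let isEndOfPrevTag    = () => parseState.endtag == undefined
let endOfInput        = (txt, pos) => pos == txt.length - 1
let shouldStartNewTag = () => isEndOfPrevTag() && !parseState.isGrabbingOther
let hasOtherGrabEnded = (pos) => parseState.isGrabbingOther && getStartTagOrNull(pos) != undefined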

Now that we have these new functions that work not on a letter but on a position in a string, our main logic changes a little in the last if/else branch: we use getStartTagOrNull instead of passing in a letter. And we don’t need the letter variable:

for(var pos = 0; pos < txt.length; pos++) {
  if(hasOtherGrabEnded(pos)) {
    tokens.push(txt.substr(parseState.pos, pos - parseState.pos))
    parseState.isGrabbingOther = false
    startTag(getStartTagOrNull(pos), pos)
  } else if(hasTagEnded(pos) || endOfInput(txt, pos)) {
    tokens.push(txt.substr(parseState.pos, pos - parseState.pos + 1))
    endTag()
  } else if(shouldStartNewTag()) {
    let knownTag = getStartTagOrNull(pos)
    if(knownTag) startTag(knownTag, pos)
    else startOtherGrab(pos)
  }
}

Here’s the updated version: https://codepen.io/newfivefour/pen/YJWjwM

We’re nearly done, but we can’t just enter ["##", "##"] into our list of tokens just yet. This is because we’ll match an end marker, ##, while our position is still at the first #, and then carry on from there, which means we won’t grab the entire ending marker.

Let’s look at this code:

let extraEndTagMarkers = endMarkLengthMin1(pos)
tokens.push(txt.substr(parseState.pos, pos - parseState.pos + 1 + extraEndTagMarkers))
pos = pos + extraEndTagMarkers 
endTag(pos)

The first line gets the length of the end marker (minus 1), or returns 0 if there’s no ending mark at that position.

Then we grab the text from the position stored in parseState, adding on the extra ending-marker length if we need to.

Then we add that potential extra length to our loop iterator (so we skip the remainder of the tag in the loop).

Then end the tag as usual.
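
The post doesn’t show endMarkLengthMin1 itself, but from that description it might look something like this sketch: it returns the current end marker’s length minus one when that marker starts at this position, and 0 otherwise.

let endMarkLengthMin1 = (pos) =>
  parseState.endtag && parseState.endtag == txt.substr(pos, parseState.endtag.length)
    ? parseState.endtag.length - 1
    : 0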

Here’s the final example which deals with markers with multiple characters, # and ## in our case: https://codepen.io/newfivefour/pen/GYqXyv

It also deals with our bullet points: * a point and . a point.

But there’s a slight problem if we match until the end of the line, i.e. with a token like ["\n* ", "\n"]: the parser grabs the trailing \n as part of the tag, so if the next tag needs a \n at its start (which ["\n* ", "\n"] does) it won’t see one. In that case we need an extra newline between * points. (We’ll see a way around this later.)

Next we’ll look at integrating our new parser into the old quickText function.


Javascript: Removing the regular expression from our new markup language, part six

Previously we decided the regular expressions for our markup language were unmaintainable, and we wrote a function to replace them ( https://newfivefour.com/javascript-removing-regex-from-new-markup-language.html ).

But that function didn’t deal with markup tags with different start and end markers; [hello there|https://url] was our example.

We can remedy that by specifying the start and end tags for our markup, i.e.

tokens: [["*", "*"], 
         ["/", "/"], 
         ["_", "_"], 
         ["`", "`"],
         ["[", "]"]]

And in our helper functions we’ll look for the starting tag and the ending tag separately. This is the function that gets the ending tag for a certain letter:

let endingTagFor = (letter, ob) => ob.tokens.filter(t => t[0] == letter)[0][1]

And here’s the function that checks if we’ve found a starting tag:

let hasStartTag = (letter, ob) => ob.tokens.map(t => t[0]).indexOf(letter) != -1
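
For example, with our token list these behave like this (a quick stand-alone check):

let ob = { tokens: [["*", "*"], ["/", "/"], ["_", "_"], ["`", "`"], ["[", "]"]] }
let endingTagFor = (letter, ob) => ob.tokens.filter(t => t[0] == letter)[0][1]
let hasStartTag  = (letter, ob) => ob.tokens.map(t => t[0]).indexOf(letter) != -1
console.log(hasStartTag("[", ob))  // true
console.log(hasStartTag("x", ob))  // false
console.log(endingTagFor("[", ob)) // "]"
console.log(endingTagFor("*", ob)) // "*"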

Our entire function is now here:

function quickText(txt, ignoreNewlines) {
  // strArray isn't shown in the post; presumably it just splits the
  // input string into an array of single characters.
  let strArray          = (s) => s.split("")
  let startTag          = (lett, pos, ob) => { ob.endtag = endingTagFor(lett, ob); ob.pos = pos }
  let endTag            = (ob) => ob.endtag = undefined
  let hasTagEnded       = (letter, ob) => ob.endtag && ob.endtag == letter
  let startOtherGrab    = (pos, ob) => { ob.isGrabbingOther = true; ob.pos = pos}
  let endOtherGrab      = (ob) => ob.endOther = false
  let isEndOfPrevTag    = (ob) => ob.endtag == undefined
  let hasStartTag       = (letter, ob) => ob.tokens.map(t => t[0]).indexOf(letter) != -1
  let endingTagFor      = (letter, ob) => ob.tokens.filter(t => t[0] == letter)[0][1]
  let endOfInput        = (txt, pos) => pos == txt.length -1
  let shouldStartNewTag = (ob) => isEndOfPrevTag(ob) && !ob.isGrabbingOther
  let hasOtherGrabEnded = (lett, ob) => ob.isGrabbingOther && hasStartTag(lett, ob)
  let tokens = []; strArray(txt).forEach(function(letter, pos) {
    if(hasOtherGrabEnded(letter, this)) {
      tokens.push(txt.substr(this.pos, pos - this.pos))
      this.isGrabbingOther = false
      startTag(letter, pos, this)
    } else if(hasTagEnded(letter, this) || endOfInput(txt, pos)) {
      tokens.push(txt.substr(this.pos, pos - this.pos + 1))
      endTag(this)
    } else if(shouldStartNewTag(this)) {
      if(hasStartTag(letter, this)) startTag(letter, pos, this)
      else startOtherGrab(pos, this)
    }
  }, { endtag: undefined, pos: -1, isGrabbingOther: false, 
      tokens: [["*", "*"], 
               ["/", "/"], 
               ["_", "_"], 
               ["`", "`"],
               ["[", "]"]],  })
  return tokens
}

And you can play with it here: https://codepen.io/newfivefour/pen/bmepZV

We still don’t deal with multiple-letter tags though, ## for example, and we will next.


Javascript: Extending our new markup language, part 4

Previously ( https://newfivefour.com/javascript-extending-our-new-markup-languge-part-3.html ) we said there were two annoying things about our new markup language: one, the regular expressions are getting out of hand, and two, we can’t do _hello *bold* hello_. We’ll deal with the latter now.

In our last example we did <b>${replaceSpaceAndNL(token[1])}</b> when we found some bold text. That meant everything within that bold text went unprocessed.

We can fix that easily by making our function recursive – it will call itself on that text. So instead we will have: <b>${quickText(token[1])}</b>. We can use this recursive nature in all our if statements, for bold text, for italic, for list items, etc.

Eventually the text inside it will be normal text. And the normal text will be processed with replaceSpaceAndNL(token[1]), that is, the spaces and newlines will be converted into HTML spaces and newlines.

But in the case of headings and bullet points we don’t want the newlines to be processed. So instead we will pass our quickText function a boolean that says whether we should process newlines or not, and have the final map in our function look at that.

So our function will now look like this:

function quickText(txt, ignoreNewlines) {
  var notThisButThat = (dis, that, name) => dis != name && that == name
  var dot = "[\\s\\S]"
  var tokens = ["[^*_/\\^`\\[\\]#][^*_/\\^`\\[\\]#]*[^*_/\\^`\\[\\]#]", 
                `_${dot}*?_`,
                `[*][^ ]${dot}*?[*]`,
                `##.*?\\n`,`#[^#].*?\\n`,
                `[/]${dot}*?[/]`,
                `\\[${dot}*?\\]`,
                `^[*] .*\\n`,
                `^[\\^] .*\\n`,
                "[`][\\s\\S]*?[`]"]
  return txt.match(new RegExp(tokens.join("|"), "gm"))
    .map(t => {
      if(t.startsWith("/")) return ["italic", t.slice(1, -1)]
      else if (t.startsWith("_")) return ["underline", t.slice(1, -1)]
      else if (t.startsWith("* ")) return ["ulistitem", t.slice(1)]
      else if (t.startsWith("^ ")) return ["olistitem", t.slice(1)]
      else if (t.startsWith("##")) return ["heading2", t.slice(2)]
      else if (t.startsWith("#")) return ["heading1", t.slice(1)]
      else if (t.startsWith("*")) return ["bold", t.slice(1, -1)]
      else if (t.startsWith("`")) return ["pre", t.slice(1, -1)]
      else if (t.startsWith("[")) return ["link", t.slice(1, -1).split("|")]
      else return ["normal", t]
    })
    .map(function(token) {
      let retValue = ""
      if(notThisButThat(token[0], this.prev, "olistitem")) retValue += "</ol>"
      else if(notThisButThat(token[0], this.prev, "ulistitem")) retValue += "</ul>"
      if(notThisButThat(this.prev, token[0], "olistitem")) retValue += "<ol style='margin:0px'>"
      else if(notThisButThat(this.prev, token[0], "ulistitem")) retValue += "<ul style='margin:0px'>"      
      if(token[0] == "italic") retValue += `<i>${quickText(token[1])}</i>`
      else if (token[0] == "underline") retValue += `<u>${quickText(token[1])}</u>`
      else if (token[0] == "bold") retValue += `<b>${quickText(token[1])}</b>`
      else if (token[0] == "ulistitem") retValue += `<li>${quickText(token[1], true)}</li>`
      else if (token[0] == "olistitem") retValue += `<li>${quickText(token[1], true)}</li>`
      else if (token[0] == "heading1") retValue += `<h1>${quickText(token[1], true)}</h1>`
      else if (token[0] == "heading2") retValue += `<h2>${quickText(token[1], true)}</h2>`
      else if (token[0] == "pre") retValue += `<pre style="display: inline">${token[1]}</pre>`
      else if (token[0] == "link") retValue += `<a href="${token[1][1]}">${quickText(token[1][0])}</a>`
      else if(ignoreNewlines) retValue += token[1].replace(/  /g, "&nbsp;")
      else retValue += token[1].replace(/  /g, "&nbsp;").replace(/\n/g, '<br>')
      this.prev = token[0]
      return retValue
    }, { prev : ""})
    .join("")
}

You can play with it here: https://codepen.io/newfivefour/pen/XxdZEe

We still have the fairly unmaintainable regular expressions to remove. And we’ll do that in a later post.


Javascript: Extending our new markup language, part 3

Although our new markup language ( https://newfivefour.com/javascript-tokenising-parsing-new-markup-language-part-2.html ) doesn’t deal with _hello *there* again_ yet, let’s extend it with headings, links, formatted code and lists beforehand.

Our markup language uses *, _ and / to format the text. This means we can’t use those characters in normal text lest they be interpreted as markup. But there is a way: we’ll mark anything between ` characters as not to be interpreted.

Our regex for that will be [`][\\s\\S]*?[`], and we’ll add the ` to our regex for non-markup text so the ` marks aren’t eaten up as normal text: [^*_/`][^*_/`]*[^*_/`]. Our function now looks like this:

document.body.innerHTML = quickText(`I /can almost/ definitely *visually see* a _thingy thing_ there.

   The end!

*Or is
it?*

\`special characters */_\`

`)

function quickText(txt) {
  var replaceSpaceAndNL = t => t.replace(/  /g, "&nbsp;").replace(/\n/g, ' <br> ')
  var dot = "[\\s\\S]"
  var tokens = [`_${dot}*?_`, 
                "[`][\\s\\S]*?[`]",
                `[*]${dot}*?[*]`, 
                `[/]${dot}*?[/]`, 
                "[^*_/`][^*_/`]*[^*_/`]"]
  return txt.match(new RegExp(tokens.join("|"), "g"))
    .map(t => {
      console.log(t)
      if(t.startsWith("/")) return ["italic", t.slice(1, -1)]
      else if (t.startsWith("_")) return ["underline", t.slice(1, -1)]
      else if (t.startsWith("*")) return ["bold", t.slice(1, -1)]
      else if (t.startsWith("`")) return ["pre", t.slice(1, -1)]
      else return ["normal", t]
    })
    .map(token => {
      if(token[0] == "italic") return `<i>${replaceSpaceAndNL(token[1])}</i>`
      else if (token[0] == "underline") return `<u>${replaceSpaceAndNL(token[1])}</u>`
      else if (token[0] == "bold") return `<b>${replaceSpaceAndNL(token[1])}</b>`
      else if (token[0] == "pre") return `<pre style="display: inline">${token[1]}</pre>`
      else return replaceSpaceAndNL(token[1])
    })
    .join("")
}

(note: that still means we can’t use a ` in our text without writing &#96;.)

We’ll next add links. They will be represented like this: [link text|https://linkurl.com]. The regex for that will be \\[${dot}*?\\]. We’ll add the [ and ] symbols to our normal-characters regex so those characters aren’t eaten up as normal text. We’ll split the contents on the | character and display the two parts in an <a>.
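
So when the tokeniser matches a link, the mapping stage strips the brackets and splits on |, roughly like this:

let t = "[link text|https://linkurl.com]"
let token = ["link", t.slice(1, -1).split("|")]
// token is ["link", ["link text", "https://linkurl.com"]]
// which is later rendered as <a href="https://linkurl.com">link text</a>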

Next we’ll add headings. They’ll be represented by #a heading and ##a smaller heading and will last until the end of the line. The regexes will be ##.*?\\n and #[^#].*?\\n. The \\n says match until the end of the line, and the #[^#] is so # and ## can be differentiated. Again, we’ll add the # to our normal-text regex so that # isn’t eaten up as normal text.

Finally we’ll add ordered and unordered bullet points. The regexes for these are easy. We’ll use * for unordered and ^ for ordered: ^[*] .*\\n and ^[\\^] .*\\n. (A ^ at the start of a regex means match the start of the line, so our regex flags will need to include "m" to give us multiline matching and therefore match the start of a line.) Again, we’ll add these symbols to our regex for normal characters so they’re not eaten up as normal text.

The final part of our function is now more complex. We’re saving what the previous token was (via the extra argument to the map function): if the previous token was not an ordered list item but this one is, we add an <ol> to the output, and if this one is not an ordered list item but the previous one was, we add a </ol>.

Our full function now looks like this:

function quickText(txt) {
  var replaceSpaceAndNL = t => t.replace(/  /g, "&nbsp;").replace(/\n/g, '<br>')
  var replaceSpace = t => t.replace(/  /g, "&nbsp;")
  var notThisButThat = (dis, that, name) => dis != name && that == name
  var dot = "[\\s\\S]"
  var tokens = ["[^*_/\\^`\\[\\]#][^*_/\\^`\\[\\]#]*[^*_/\\^`\\[\\]#]", 
                `_${dot}*?_`,
                `[*][^ ]${dot}*?[*]`,
                `##.*?\\n`,`#[^#].*?\\n`,
                `[/]${dot}*?[/]`,
                `\\[${dot}*?\\]`,
                `^[*] .*\\n`,
                `^[\\^] .*\\n`,
                "[`][\\s\\S]*?[`]"]
  return txt.match(new RegExp(tokens.join("|"), "gm"))
    .map(t => {
      if(t.startsWith("/")) return ["italic", t.slice(1, -1)]
      else if (t.startsWith("_")) return ["underline", t.slice(1, -1)]
      else if (t.startsWith("* ")) return ["ulistitem", t.slice(1)]
      else if (t.startsWith("^ ")) return ["olistitem", t.slice(1)]
      else if (t.startsWith("##")) return ["heading2", t.slice(2)]
      else if (t.startsWith("#")) return ["heading1", t.slice(1)]
      else if (t.startsWith("*")) return ["bold", t.slice(1, -1)]
      else if (t.startsWith("`")) return ["pre", t.slice(1, -1)]
      else if (t.startsWith("[")) return ["link", t.slice(1, -1).split("|")]
      else return ["normal", t]
    })
    .map(function(token) {
      console.log(token)
      let retValue = ""
      if(notThisButThat(token[0], this.prev, "olistitem")) retValue += "</ol>"
      else if(notThisButThat(token[0], this.prev, "ulistitem")) retValue += "</ul>"
      if(notThisButThat(this.prev, token[0], "olistitem")) retValue += "<ol style='margin:0px'>"
      else if(notThisButThat(this.prev, token[0], "ulistitem")) retValue += "<ul style='margin:0px'>"      
      if(token[0] == "italic") retValue += `<i>${replaceSpaceAndNL(token[1])}</i>`
      else if (token[0] == "underline") retValue += `<u>${replaceSpaceAndNL(token[1])}</u>`
      else if (token[0] == "bold") retValue += `<b>${replaceSpaceAndNL(token[1])}</b>`
      else if (token[0] == "ulistitem") retValue += `<li>${replaceSpace(token[1])}</li>`
      else if (token[0] == "olistitem") retValue += `<li>${replaceSpace(token[1])}</li>`
      else if (token[0] == "heading1") retValue += `<h1>${replaceSpaceAndNL(token[1])}</h1>`
      else if (token[0] == "heading2") retValue += `<h2>${replaceSpaceAndNL(token[1])}</h2>`
      else if (token[0] == "pre") retValue += `<pre style="display: inline">${token[1]}</pre>`
      else if (token[0] == "link") retValue += `<a href="${token[1][1]}">${token[1][0]}</a>`
      else retValue += replaceSpaceAndNL(token[1])
      this.prev = token[0]
      return retValue
    }, { prev : ""})
    .join("")
}

It works, as you can see here ( https://codepen.io/newfivefour/pen/ReWEvE ), but we have a few problems:

  • The regexes are getting unmaintainable.
  • We can’t deal with _hello *there* again_

The first problem will be fixed when we remove the regular expressions.

We’ll deal with the last issue next.


Javascript: Parsing and tokenising a new markup language

Making a new markup language, or any language in general, normally involves lex and yacc, at least on UNIX. But the concepts are universal and can be implemented in Javascript.

A simple version of this means breaking the text down into symbols, by splitting on spaces, and then applying meaning to those symbols.

Take for example:

I /can/ definitely *see* a _thing_ there.

  The end.

We can break this down by saying each token is delimited by a space, and then give each token a meaning:

  1. Normal text, i.e. I, definitely, etc.
  2. Italic text, i.e. /can/
  3. Bold text, i.e. *see*
  4. Underlined text i.e. _thing_
  5. A newline
  6. A space

And we can write a function that does the above steps and converts the tokens into very, very old HTML (i.e. <i> and <b>):

function quickText(txt) {
  var startsAndEnds = (txt, token) => txt.endsWith(token) && txt.startsWith(token)
  return txt.replace(/\n/g, ' \n ') // so when we break up by spaces we can see the newline
  .split(/ /)
  .map(t => {
    if(t == "\n") return ["newline", ""]
    if(t == "") return ["space", ""]
    else if(startsAndEnds(t, "/")) return ["italic", t.slice(1, -1)]
    else if (startsAndEnds(t, "_")) return ["underline", t.slice(1, -1)]
    else if (startsAndEnds(t, "*")) return ["bold", t.slice(1, -1)]
    else return ["normal", t]
  })
  .map(token => {
    if(token[0] == "space") return `&nbsp;`
    else if(token[0] == "newline") return `<br>`
    else if(token[0] == "italic") return `<i>${token[1]}</i>`
    else if (token[0] == "underline") return `<u>${token[1]}</u>`
    else if (token[0] == "bold") return `<b>${token[1]}</b>`
    else return token[1]
  })
  .join(" ")
}

You can now run:

document.body.innerHTML = quickText(`I /can/ definitely *see* a _thing_ there.

   The end!

Here's a newline: &bsol;n
And underlined text: &lowbar;stuff&lowbar;`)

And you will get

I <i>can</i> definitely <b>see</b> a <u>thing</u> there. <br> <br> 
&nbsp; &nbsp; &nbsp; The end! <br>
Here's a newline: \n &nbsp; <br> 
And underlined: _stuff_

Of course, since we’re splitting on a space, _hello there_ won’t work. But we can come to that in a later post.
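
You can see why with a quick check:

// Neither piece both starts and ends with "_", so startsAndEnds
// treats both as normal text.
console.log("_hello there_".split(/ /)) // ["_hello", "there_"]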
