Announcing Megaparsec

Hello!

I’d like to announce the release of Megaparsec, a fork of Parsec that has been in the works for the past two months. There are a lot of improvements and bugfixes under the hood – as well as a new test suite with 128 QuickCheck tests covering 80% of the code (Parsec has only 3 tests, by the way) – but first I’d like to explain why a fork was needed, since forking a popular library is a pretty drastic measure and should be accompanied by an explanation. (A disclaimer: I’ve been given permission to announce the library, but I’m neither the author nor an expert on parsing.)

Some links:

Megaparsec on Hackage: https://hackage.haskell.org/package/megaparsec
Megaparsec on Github: https://github.com/mrkkrp/megaparsec

If you’ve ever had ideas about what Parsec should’ve done differently, or what amazing new combinators it should include, etc., post them here: https://github.com/mrkkrp/megaparsec/issues.

Why fork Parsec instead of writing a new library

There are many parsing libraries on Hackage; Parsec was one of the first to appear, and it’s still the one that is recommended to beginners when they ask about doing parsing in Haskell. Most other libraries aren’t exactly trying to compete with Parsec – instead they explore new directions.

Like it or not, it remains a fact that a lot of people are being recommended Parsec, a lot of people are using Parsec, and a lot of people will probably continue to use Parsec, since there’s no clear alternative to it. Writing a new and different library probably won’t change that. Even new and similar libraries (trifecta and attoparsec are similar enough) haven’t removed the need for Parsec, only mitigated it somewhat. Perhaps one day trifecta or something else will completely replace Parsec, but right now we still have to put up with it. So, what we need (or at least what would be nice to have) is simply a better Parsec.

What’s wrong with original Parsec

Parsec’s development stagnated long ago; you’d have a hard time getting a pull request merged, even if the only thing it does is fix a typo (and I’m not even talking about anything more controversial than that). Parsec isn’t perfect or bug-free, and there is a need for those pull requests – just look at its Github page, where you’ll find 8 unmerged PRs (and 9 open issues): https://github.com/aslatter/parsec.

In addition to things that could be fixed but simply aren’t, there are some inconveniences (and bugs!) that are hard to fix without breaking backwards compatibility – the broken longest match rule, described below, is one example.

In short, there may be a lot of value in improving Parsec – and that’s where Megaparsec comes in.

What is Megaparsec and how is it different

Megaparsec “reboots” the development of Parsec. It’s not backwards compatible, but it’s compatible enough that Parsec tutorials don’t have to be rewritten, and code written for Parsec can be converted to Megaparsec pretty mechanically. Given that, I would recommend using Megaparsec instead of Parsec from now on, unless you need compatibility with GHCs older than 7.10.
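
To give a rough idea of what “mechanically” means – a sketch, assuming the module layout mirrors Parsec’s (Text.Megaparsec.String playing the role of Text.Parsec.String) – a port often starts with little more than renaming imports:

 -- Parsec:
 import Text.Parsec
 import Text.Parsec.String (Parser)

 -- Megaparsec:
 import Text.Megaparsec
 import Text.Megaparsec.String (Parser)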

Here is a detailed account of some of the bigger changes:

Error reporting

Megaparsec’s error messages are significantly more accurate than Parsec’s in some cases, as the following demonstration shows.

Parsec:

 > parseTest (try (string "let") <|> string "lexical") "le"
 parse error at (line 1, column 1):
 unexpected end of input
 expecting "lexical"

Megaparsec:

 > parseTest (try (string "let") <|> string "lexical") "le"
 parse error at line 1, column 1:
 unexpected "le"
 expecting "let" or "lexical"

And here’s another one, showing how try a <|> b can sometimes be a bit less harmful in Megaparsec than usual. Just in case, I’m talking about this blog post: http://blog.ezyang.com/2014/05/parsec-try-a-or-b-considered-harmful/.

If you don’t want to read it in full, the gist of it is that if you use try too liberally, Parsec’s error messages become much worse than they could be. In this example, we’re trying to parse a simple version of Haskell’s import statements, which can look like either import Foo or import qualified Foo as B. If you write a parser like this:

 try pQualifiedImport <|> pImport

you will get an uninformative error message on the following input:

 import qualified Foo s B

Specifically, the try pQualifiedImport branch will fail, and then the pImport branch will stumble upon qualified – while the actual error isn’t there, but in the misspelled as.

The advice given in the blog post (“The scope of backtracking try should be minimized”) is good, and applies to Megaparsec just as well as it applies to Parsec. However, a curious thing is that in this particular case Parsec should be able to tell you that the error is in the misspelled as, because Parsec implements the longest match rule – errors that occur later in the text take precedence over errors that occur earlier. The code is there, but somehow it doesn’t do what it’s supposed to do (and Megaparsec fixes that).

Here’s the code from the article, rewritten for Megaparsec:

 import Control.Applicative (liftA2)
 import Text.Megaparsec
 import qualified Text.Megaparsec.Lexer as L

 data Stmt = QualifiedImport String String | Import String
   deriving (Show)

 pStmt = try pQualifiedImport <|> pImport

 pImport = do
   keyword "import"
   fmap Import upperCaseIdentifier

 pQualifiedImport = do
   keyword "import"
   keyword "qualified"
   liftA2 QualifiedImport
     upperCaseIdentifier
     (keyword "as" *> upperCaseIdentifier)

 upperCaseIdentifier = lexeme $
   liftA2 (:)
     upperChar
     (many (alphaNumChar <|> oneOf "_."))

 -- 'hidden' keeps white space out of the "expecting" lists in errors
 lexeme  = L.lexeme (hidden space)
 keyword = L.symbol (hidden space)

And here are the error messages it produces. Megaparsec:

 > parseTest (pStmt >> eof) "import qualified Foo s B"
 parse error at line 1, column 22:
 unexpected 's'
 expecting "as"

Parsec (with minor modifications):

 > parseTest (pStmt >> eof) "import qualified Foo s B"
 parse error at (line 1, column 8):
 unexpected "q"
 expecting uppercase letter

Integration with monad transformers

The key type of Parsec is ParsecT (others, such as Parsec and Parser, are just type synonyms). It lets you use parsers with other monads – for instance, if you use ParsecT String () IO, you can have IO in your parsers. This is the reason, by the way, why char has the type

 char :: Stream s m Char => Char -> ParsecT s u m Char

instead of a simpler

 char :: Char -> Parser Char

 (If it were the latter, you wouldn’t be able to use char and IO in the same parser.)
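
For instance, here’s a toy parser that prints every character it consumes (debugChar is just an illustrative name):

 import Text.Parsec
 import Control.Monad.Trans (lift)

 -- the 'm' in ParsecT s u m is instantiated to IO here
 debugChar :: ParsecT String () IO Char
 debugChar = do
   c <- anyChar
   lift (putStrLn ("consumed: " ++ show c))
   return c

 -- run it with: runParserT (many debugChar) () "<input>" "abc"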

However, even this type isn’t general enough for all things you might want to do. Imagine that you want some parsers to generate warnings when they are run, and later you want to collect those warnings and do something with them. This sounds like what Writer was invented for, so you try to use it. Now all your parsers have this type:

 ParsecT String () (Writer [Warning])

(Where Warning is, say, a synonym for String.)

Unfortunately, this is not how you should’ve composed ParsecT and Writer, because this way you don’t get backtracking – in other words, if you try to do

 optional (try someParser)

and someParser generates warnings but then fails, optional won’t be able to make them disappear – they will still be recorded. Same with State – if you do

 x <* notFollowedBy y

and y changes the state, that change will be recorded even though it’s not what you want most of the time. (You can use Parsec’s built-in user state and it will work, but that doesn’t help when you want to use Writer or something else instead of State.)
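
To see the problem concretely, here’s a sketch with the problematic stacking order (pWarn is a made-up name; I’m using Parsec here, but Megaparsec’s own ParsecT behaves the same way):

 import Control.Monad.Writer
 import Text.Parsec

 type P = ParsecT String () (Writer [String])

 pWarn :: P ()
 pWarn = do
   lift (tell ["entered pWarn"])  -- recorded in the underlying Writer
   _ <- string "let"              -- fails on the input "lexical"
   return ()

 -- > runWriter (runParserT (optional (try pWarn)) () "" "lexical")
 -- the log still contains "entered pWarn", even though pWarn failed
 -- and 'optional' recovered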

What you want in such situations is

 WriterT [Warning] Parser ()
 StateT YourState Parser ()
 ...

but you can’t get it because char and all other primitive parsers simply don’t have those types. With Parsec, the only solution is to apply lift to all parsers you want to use, which is pretty annoying. Megaparsec solves this by introducing an mtl-style MonadParsec class, making primitive parsers members of this class, and providing instances of MonadParsec for various monad transformers. (If you have ever used the parsers library, you may recognise this approach.) I think being able to get backtracking behavior without relying on inelegant ways like “Parsec user state” is pretty neat, even if it’s not something every Parsec user needs.
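
Here’s a sketch of what this buys you, assuming Megaparsec’s MonadParsec instance for WriterT (number and runP are made-up names for the example):

 import Control.Applicative (some)
 import Control.Monad.Writer
 import Text.Megaparsec
 import Text.Megaparsec.String (Parser)

 type Warning = String

 -- WriterT sits on top of the parser, not the other way around
 number :: WriterT [Warning] Parser Integer
 number = do
   ds <- some digitChar       -- no 'lift' needed: primitive parsers
                              -- work in any MonadParsec instance
   when (length ds > 1 && head ds == '0') $
     tell ["number with a leading zero: " ++ ds]
   return (read ds)

 -- if 'number' fails inside 'try', its warnings are discarded too
 runP :: WriterT [Warning] Parser a
      -> String -> Either ParseError (a, [Warning])
 runP p = runParser (runWriterT p) "<input>"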

Lexing

Parsec has a lexing module (Text.Parsec.Token): http://hackage.haskell.org/package/parsec/docs/Text-Parsec-Token.html.

If you’re not familiar with lexing, the idea is as follows: when you’re parsing a programming language, you often have to solve the same set of problems – parsing numbers, string literals (with all those escaping rules), identifiers/operators/keywords, comments, making all parsers handle white space (since there can be white space between pretty much any two tokens), and so on. With Text.Parsec.Token you just specify what counts as white space, as a comment, as an identifier character, etc., and get a set of parsers “for free”.

Parsec achieves this by defining a huge structure called GenLanguageDef that contains the specification of your language:

 data GenLanguageDef s u m =
   LanguageDef {
     commentStart    :: String,
     commentEnd      :: String,
     commentLine     :: String,
     nestedComments  :: Bool,
     identStart      :: ParsecT s u m Char,
     identLetter     :: ParsecT s u m Char,
     opStart         :: ParsecT s u m Char,
     opLetter        :: ParsecT s u m Char,
     reservedNames   :: [String],
     reservedOpNames :: [String],
     caseSensitive   :: Bool }

Then you use makeTokenParser on it to generate another huge structure containing lots of useful parsers:

 makeTokenParser :: GenLanguageDef ... -> GenTokenParser ...

 data GenTokenParser s u m =
   TokenParser {
     identifier    :: ParsecT s u m String,
     operator      :: ParsecT s u m String,
     charLiteral   :: ParsecT s u m Char,
     stringLiteral :: ParsecT s u m String,
     natural       :: ParsecT s u m Integer,
     integer       :: ParsecT s u m Integer,
     float         :: ParsecT s u m Double,
     lexeme        :: forall a. ParsecT s u m a -> ParsecT s u m a,
     parens        :: forall a. ParsecT s u m a -> ParsecT s u m a,
     braces        :: forall a. ParsecT s u m a -> ParsecT s u m a,
     comma         :: ParsecT s u m String,
     colon         :: ParsecT s u m String,
     ... }
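
Typical usage looks something like this (here with the ready-made haskellStyle definition from Text.Parsec.Language):

 import Text.Parsec
 import Text.Parsec.String (Parser)
 import Text.Parsec.Language (haskellStyle)
 import qualified Text.Parsec.Token as P

 -- generate the whole record of parsers in one go
 lexer :: P.TokenParser ()
 lexer = P.makeTokenParser haskellStyle

 -- then pick out the fields you need
 identifier :: Parser String
 identifier = P.identifier lexer

 parens :: Parser a -> Parser a
 parens = P.parens lexer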

What’s the problem with this approach? It’s very inflexible – the moment you want to change something that wasn’t supposed to be changed, you’re on your own. Do you want to special-case -- | comments to use them as doc strings? The easiest solution is to copy the whole module into your own project: https://github.com/aslatter/parsec/issues/15.

Do you want to handle newlines by yourself (for instance, to allow them to be expression separators)? The easiest solution is to fork Parsec: https://github.com/aslatter/parsec/issues/24.

(Even if those turn out not to be the easiest solutions, it’s still somewhat telling that they were what the authors ended up with.)

Megaparsec uses a simpler, more flexible approach. See the docs here: https://hackage.haskell.org/package/megaparsec/docs/Text-Megaparsec-Lexer.html.

The linked module provides 3 categories of functions: functions for dealing with white space and comments (space, lexeme, symbol), a parser for character literals (charLiteral), and parsers for numbers (integer, decimal, float, signed, and so on).

So, instead of having parsers passed to each other under the hood, you now have to pass them around yourself – except that you don’t actually have to do much passing, because you can just write

 import qualified Text.Megaparsec.Lexer as L

 lexeme = L.lexeme space
 symbol = L.symbol space

and off you go:

 keyword = label "keyword" . symbol
 parens = between (symbol "(") (symbol ")")
 ... etc ...
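
And since the white space parser is just an argument, customizing it is trivial – here’s a sketch that makes every lexeme swallow Haskell-style comments too, using the skipLineComment and skipBlockComment helpers (sc is just a conventional name for the space consumer):

 import Control.Monad (void)
 import Text.Megaparsec
 import Text.Megaparsec.String (Parser)
 import qualified Text.Megaparsec.Lexer as L

 -- space consumer: white space plus two kinds of comments
 sc :: Parser ()
 sc = L.space (void spaceChar)
              (L.skipLineComment "--")
              (L.skipBlockComment "{-" "-}")

 lexeme :: Parser a -> Parser a
 lexeme = L.lexeme sc

 symbol :: String -> Parser String
 symbol = L.symbol sc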

How you can help