How to parse coq statements from a coq .v file the official way?

The parser I wrote doesn’t seem to be good enough to go through a coq statement from a coq file. We assumed that a coq statement would always end in a ".\s" with the exception of strings and comments.
However after parsing though some of the simple files it looks like also “- intros.” is really two statements, one for “-” and another for whatever comes after. “*” has a similar problem.


Option 1: use sertop/coq serapi’s parsing abilities

I’m not sure if we should try to modify the parser every time one of these different edge cases shows up or try to use the parsing ability integrated into sertop (it returns an (Added …) for every statement it parses with information about where it is located) instead…but since it’s deprecated I think I will avoid this option

Option 2: since coq is an extensible language, use it’s own parse

Perhaps the best way is to use Coqc to do get the coq stmts from a coq file – especially since coq is an extensible language.

cross: How to parse coq statements from a coq .v file the official way? - Stack Overflow
cross2: How to parse coq statements from a coq .v file the official way?
cross3: Reddit - Dive into anything

Hello,

Indeed, the officially recommended way is not to try to build your own Coq parser because you will always encounter limitations and use Coq’s parser instead.

This is what happens when you use sertop/SerAPI. (Internally, it links to Coq and thus runs its parser).

You should not restrict yourself to using SerAPI at the moment. It’s “just” in maintenance-mode (not getting new features) and not deprecated yet. It will keep being maintained for as long as its users cannot switch to a newer protocol that provides the same functionality. SerAPI is too important to get removed suddenly.

1 Like

If SerAPI can serve your needs, that’s probably your best option.

The parser relies on dynamically adding and removing parsing rules to parse notations. Replicating that in another parser would be a huge effort; it’s not a simple feature. If you only want to split proofs into sentences, that shouldn’t be too difficult, but as you discovered, you need to handle curly braces and bullets (-, +, *). See here. The source file fullGrammar gives an almost-complete grammar for Coq. It doesn’t fully describe lookahead processing needed to resolve ambiguities. orderedGrammar is an edited version of the grammar used in the documentation to make it more readable for users.

The parser is based on camlp5, which is unlike most other parsers I’ve seen. It is not an LALR(k) parser.

What are you trying to accomplish?

PS A couple years ago I wrote code to construct full parse trees that exactly reflect the parser’s processing of each production. I never submitted it as a PR because it was complex, brittle and hard to package for others’ use.

In addition to @Zimmi48 's accurate reply, coq-lsp does provide SerAPI parsing capabilities via the getDocument request, so you can use both.

SerAPI is not deprecated (and coq-lsp does use it essentially), but as Théo correctly pointed out we don’t expect to add new features to it.