Revision | ea1a174e665a21e9a912a60fedbc773f9383c2af (tree) |
---|---|
Date | 2022-10-27 20:06:19 |
Author | Albert Mietus < albert AT mietus DOT nl > |
Committer | Albert Mietus < albert AT mietus DOT nl > |
QuickNote: PEGEN (more)
@@ -1,5 +1,5 @@ | ||
1 | 1 | ================ |
2 | -Some short blogs | |
2 | +Some Quick Blogs | |
3 | 3 | ================ |
4 | 4 | |
5 | 5 | .. toctree:: |
@@ -1,31 +1,39 @@ | ||
1 | +.. include:: /std/localtoc.irst | |
2 | + | |
1 | 3 | ================ |
2 | -The PEGEN parser | |
4 | +QuickNote: PEGEN | |
3 | 5 | ================ |
4 | 6 | |
5 | -.. post:: 2022/10/23 | |
7 | +.. post:: 2022/10/27 | |
6 | 8 | :category: CastleBlogs, rough |
7 | 9 | :tags: Grammar, PEG, DRAFT |
8 | 10 | |
9 | - To implement CCastle we need a parser (as part of ther compiler). Eventually, that parser will be writen in Castle; | |
10 | - but for now we kickstart it in python. Which has several packages that can assist us. As we like to use an PEG one, | |
11 | - there are a few options. `Arpeggio <https://textx.github.io/Arpeggio/2.0/>`__ is well known, and has some nice | |
12 | - options -- but can’t handle `left recursion <https://en.wikipedia.org/wiki/Left_recursion>`__ -- like most | |
13 | - PEG-parsers. | |
11 | + To implement CCastle we need a parser, as part of the compiler. Eventually, that parser will be written in Castle. For | |
12 | + now we kickstart it in Python, which has several packages that can assist us. As we like to use a PEG one, there | |
13 | + are a few options. `Arpeggio <https://textx.github.io/Arpeggio/2.0/>`__ is well known, and has some nice options -- | |
14 | + but can’t handle `left recursion <https://en.wikipedia.org/wiki/Left_recursion>`__ -- like most PEG-parsers. | |
14 | 15 | |
15 | - Recently python uses a PEG parser, that supports `left recursion <https://en.wikipedia.org/wiki/Left_recursion>`__ | |
16 | - (which is a recent development). That parser is also available as a (hardly documented) package: `pegen | |
17 | - <https://we-like-parsers.github.io/pegen/index.html>`__ | |
16 | + Recently, Python itself uses a PEG parser that supports `left recursion | |
17 | + <https://en.wikipedia.org/wiki/Left_recursion>`__ (which is a recent development). That parser is also available as a | |
18 | + package: `pegen <https://we-like-parsers.github.io/pegen/index.html>`__; but it is hardly documented. | |
18 | 19 | |
19 | - This blog is writen to remember some leassons learned when playing with in | |
20 | + This blog is written to record some lessons learned while playing with it, and to serve as a kind of informal docs. | |
20 | 21 | |
21 | 22 | |
22 | 23 | Built-In Lexer |
23 | 24 | ============== |
24 | 25 | |
25 | -Pegen is specially writen for Python and use a specialized lexer; unlike most PEG-parser, that uses PEG for both. Pegen | |
26 | +Pegen is specially written for Python and uses a specialized lexer; unlike most PEG-parsers, which use PEG for lexing too. Pegen | |
26 | 27 | uses the `tokenizer <https://docs.python.org/3/library/tokenize.html>`__ that is part of Python. This comes with some |
27 | 28 | restrictions. |
28 | 29 | |
30 | +.. hint:: | |
31 | + | |
32 | + This applies mostly when we use pegen as a module: ``python -m pegen ...``; that calls `simple_parser_main()`. | |
33 | + |BR| | |
34 | + When using it in code, by importing pegen (``from pegen.parser import Parser ...``), one has more options (not studied yet). | |
35 | + | |
36 | + | |
29 | 37 | Tokens |
30 | 38 | ------ |
31 | 39 |
@@ -36,26 +44,36 @@ | ||
36 | 44 | them differently, possibly combined with other characters. Then those will not be found; nor will the literal-strings as set
37 | 45 | in the grammar. |
38 | 46 | |
47 | +.. note:: | |
48 | + | |
49 | + Pegen speaks about *(soft)* **keywords** for all kinds of literal terminals; even when they are more like operators | |
50 | + than *words*. | |
51 | + | |
39 | 52 | .. warning:: |
40 | 53 | |
54 | + When the grammar defines (literal) terminals (or keywords) --especially for operators-- make sure the lexer will not | |
55 | + break them into predefined tokens! | |
56 | + |BR| | |
57 | + This will not give an error, but it does not work! | |
58 | + | |
41 | 59 | .. code-block:: PEG |
42 | 60 | |
43 | - Left_arrow_BAD: '<-' ## This is WRONG, as ``<`` is sees as a token | |
61 | + Left_arrow_BAD: '<-' ## This is WRONG, as ``<`` is seen as a token. And so, `<-` is never found | |
44 | 62 | Left_arrow_OKE: '<' '-' ## This is acceptable |
45 | 63 | |
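[Editorial sketch, not part of the original commit: the ``<-`` pitfall above can be verified with only the stdlib ``tokenize`` module (the lexer pegen builds on), without pegen itself.]

```python
# Sketch: confirm that Python's tokenizer splits '<-' into two OP tokens,
# which is why a grammar literal '<-' can never be matched as one token.
import io
import tokenize


def op_tokens(text: str) -> list[str]:
    """Return the OP token strings the stdlib tokenizer finds in `text`."""
    readline = io.StringIO(text).readline
    return [t.string for t in tokenize.generate_tokens(readline)
            if t.type == tokenize.OP]


print(op_tokens("a <- b"))  # '<-' comes out as '<' and '-'
```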
46 | -.. seealso:: https://docs.python.org/3/library/token.html for an overiew of the predefied tokens | |
64 | +.. seealso:: https://docs.python.org/3/library/token.html, for an overview of the predefined tokens | |
47 | 65 | |
48 | 66 | .. tip:: |
49 | 67 | |
50 | 68 | A quick trick to see how a file is split into tokens: use ``python -m tokenize [-e] filename.peg``. |
51 | 69 | |BR| |
52 | - Make sure you do not use string-literals that (eg) are composed of two tokens. | |
70 | + Make sure you do not use string-literals that (e.g.) are composed of two tokens, like the above-mentioned ``<-``. | |
53 | 71 | |
54 | 72 | |
55 | 73 | |
56 | -.. sidebar:: Reserverd Names | |
74 | +.. sidebar:: Reserved | |
75 | + :class: localtoc | |
57 | 76 | |
58 | - - start | |
59 | 77 | - showpeek |
60 | 78 | - name |
61 | 79 | - number |
@@ -75,3 +93,42 @@ | ||
75 | 93 | The *GeneratedParser* inherits from and calls the base ``pegen.parser.Parser`` class, and has methods for all |
76 | 94 | rule-names. This implies some names should not be used as rule-names (in all cases) -- see the sidebar. |
77 | 95 | |
96 | + | |
97 | +Meta Syntax (issues) | |
98 | +==================== | |
99 | + | |
100 | +No: regexps | |
101 | +----------- | |
102 | + | |
103 | +PEGEN has **no** support for regular expressions, probably because it uses a custom lexer. | |
104 | + | |
105 | +Unordered Group starts a comment | |
106 | +-------------------------------- | |
107 | + | |
108 | +PEGEN (or its lexer) uses the ``#`` to start a comment. This implies that an **unordered group** ``( sequence )#`` --as in | |
109 | +`Arpeggio <https://textx.github.io/Arpeggio/2.0/grammars/#grammars-written-in-peg-notations>`__-- is not recognized. | |
110 | + | |
111 | +A workaround is to use another character, like ``@``, instead of the hash (``#``). | |
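[Editorial sketch, not part of the original commit: that the ``#`` really vanishes into a comment can also be checked with the stdlib ``tokenize`` module alone.]

```python
# Sketch: show that '#' and everything after it becomes one COMMENT token,
# so a trailing '#' (Arpeggio's unordered-group marker) never reaches the
# parser as an operator.
import io
import tokenize


def token_kinds(text: str) -> list[tuple[str, str]]:
    """Return (token-name, token-string) pairs, skipping layout bookkeeping."""
    readline = io.StringIO(text).readline
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER, tokenize.INDENT}
    return [(tokenize.tok_name[t.type], t.string)
            for t in tokenize.generate_tokens(readline)
            if t.type not in skip]


print(token_kinds("( a b )# rest\n"))
```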
112 | + | |
113 | + | |
114 | +Result/Output | |
115 | +============= | |
116 | + | |
117 | +cmd-tool | |
118 | +-------- | |
119 | + | |
120 | +The command-line tool ``python -m pegen ...`` only prints the parsed tree: a list (shown as ``[`` ... ``]``) with | |
121 | +sub-lists and/or `TokenInfo` namedtuples. Each `TokenInfo` has 5 elements: a token type (an int and its enum-name), the | |
122 | +token-string (that was parsed), the begin & end locations (line- & column-numbers), and the full line that is being | |
123 | +parsed. | |
124 | + | |
125 | +No info about the matched grammar-rule (e.g. the rule-name) is shown. Actually, that info is not part of the parsed tree. | |
126 | + | |
127 | +.. seealso:: This `structure is described <https://docs.python.org/3/library/tokenize.html?highlight=TokenInfo>`__ in | |
128 | + the tokenize module, without mentioning its name: TokenInfo. | |
129 | + | |
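[Editorial sketch, not part of the original commit: the five fields of such a `TokenInfo` namedtuple can be inspected directly, since pegen hands back the stdlib tokenizer's own objects.]

```python
# Sketch: the five fields of a TokenInfo namedtuple, as produced by the
# stdlib tokenizer that pegen builds on.
import io
import tokenize

first = next(tokenize.generate_tokens(io.StringIO("x = 42\n").readline))
print(tokenize.tok_name[first.type])  # the type's enum-name: NAME
print(first.string)                   # the parsed token-string: 'x'
print(first.start, first.end)         # begin & end (line, column) locations
print(first.line)                     # the full source line being parsed
```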
130 | +The parser | |
131 | +---------- | |
132 | + | |
133 | +The GeneratedParser (and/or its baseclass ``pegen.parser.Parser``) returns only (lists of) tokens from the tokenizer (an | |
134 | +OO wrapper around tokenize). And so, the same TokenInfo objects as described above. |