Revision | ea1a174e665a21e9a912a60fedbc773f9383c2af (tree) |
---|---|
Date | 2022-10-27 20:06:19 |
Author | Albert Mietus < albert AT mietus DOT nl > |
Committer | Albert Mietus < albert AT mietus DOT nl > |
QuickNote: PEGEN (more)
@@ -1,5 +1,5 @@ | ||
1 | 1 | ================ |
2 | -Some short blogs | |
2 | +Some Quick Blogs | |
3 | 3 | ================ |
4 | 4 | |
5 | 5 | .. toctree:: |
@@ -1,31 +1,39 @@ | ||
1 | +.. include:: /std/localtoc.irst | |
2 | + | |
1 | 3 | ================ |
2 | -The PEGEN parser | |
4 | +QuickNote: PEGEN | |
3 | 5 | ================ |
4 | 6 | |
5 | -.. post:: 2022/10/23 | |
7 | +.. post:: 2022/10/27 | |
6 | 8 | :category: CastleBlogs, rough |
7 | 9 | :tags: Grammar, PEG, DRAFT |
8 | 10 | |
9 | - To implement CCastle we need a parser (as part of ther compiler). Eventually, that parser will be writen in Castle; | |
10 | - but for now we kickstart it in python. Which has several packages that can assist us. As we like to use an PEG one, | |
11 | - there are a few options. `Arpeggio <https://textx.github.io/Arpeggio/2.0/>`__ is well known, and has some nice | |
12 | - options -- but can’t handle `left recursion <https://en.wikipedia.org/wiki/Left_recursion>`__ -- like most | |
13 | - PEG-parsers. | |
11 | + To implement CCastle we need a parser, as part of the compiler. Eventually, that parser will be written in Castle. For | |
12 | + now we kickstart it in Python, which has several packages that can assist us. As we like to use a PEG one, there | |
13 | + are a few options. `Arpeggio <https://textx.github.io/Arpeggio/2.0/>`__ is well known, and has some nice options -- | |
14 | + but can’t handle `left recursion <https://en.wikipedia.org/wiki/Left_recursion>`__ -- like most PEG-parsers. | |
14 | 15 | |
15 | - Recently python uses a PEG parser, that supports `left recursion <https://en.wikipedia.org/wiki/Left_recursion>`__ | |
16 | - (which is a recent development). That parser is also available as a (hardly documented) package: `pegen | |
17 | - <https://we-like-parsers.github.io/pegen/index.html>`__ | |
16 | + Recently, Python itself uses a PEG parser that supports `left recursion | |
17 | + <https://en.wikipedia.org/wiki/Left_recursion>`__ (which is a recent development). That parser is also available as a | |
18 | + package: `pegen <https://we-like-parsers.github.io/pegen/index.html>`__; but it is hardly documented. | |
18 | 19 | |
19 | - This blog is writen to remember some leassons learned when playing with in | |
20 | + This blog is written to record some lessons learned while playing with it, and to serve as a kind of informal docs. | |
20 | 21 | |
21 | 22 | |
22 | 23 | Built-In Lexer |
23 | 24 | ============== |
24 | 25 | |
25 | -Pegen is specially writen for Python and use a specialized lexer; unlike most PEG-parser, that uses PEG for both. Pegen | |
26 | +Pegen is specially written for Python and uses a specialized lexer; unlike most PEG-parsers, which use PEG for lexing too. Pegen | |
26 | 27 | uses the `tokenizer <https://docs.python.org/3/library/tokenize.html>`__ that is part of Python. This comes with some |
27 | 28 | restrictions. |
28 | 29 | |
30 | +.. hint:: | |
31 | + | |
32 | + This applies mostly when we use pegen as a module: ``python -m pegen ...``; that calls `simple_parser_main()`. | |
33 | + |BR| | |
34 | + When using it in code, by importing pegen (``from pegen.parser import Parser ...``), one has more options (not studied yet). | |
35 | + | |
36 | + | |
29 | 37 | Tokens |
30 | 38 | ------ |
31 | 39 |
@@ -36,26 +44,36 @@ | ||
36 | 44 | them differently, possibly combined with other characters. Then those will not be found; nor will the literal-strings as set
37 | 45 | in the grammar. |
38 | 46 | |
47 | +.. note:: | |
48 | + | |
49 | + Pegen speaks about *(soft)* **keywords** for all kinds of literal terminals; even when they are more like operators | |
50 | + than *words*. | |
51 | + | |
39 | 52 | .. warning:: |
40 | 53 | |
54 | + When the grammar defines (literal) terminals (or keywords) --especially for operators-- make sure the lexer will not | |
55 | + break them into predefined tokens! | |
56 | + |BR| | |
57 | + This will not give an error, but it does not work! | |
58 | + | |
41 | 59 | .. code-block:: PEG |
42 | 60 | |
43 | - Left_arrow_BAD: '<-' ## This is WRONG, as ``<`` is sees as a token | |
61 | + Left_arrow_BAD: '<-' ## This is WRONG, as ``<`` is seen as a token. And so, `<-` is never found | |
44 | 62 | Left_arrow_OKE: '<' '-' ## This is acceptable |
45 | 63 | |
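[Editorial sketch, not part of the original commit: the ``<-`` pitfall above can be verified with only the stdlib ``tokenize`` module (the lexer pegen builds on), without pegen itself.]

```python
# Sketch: confirm that Python's tokenizer splits '<-' into two OP tokens,
# which is why a grammar literal '<-' can never be matched as one token.
import io
import tokenize


def op_tokens(text: str) -> list[str]:
    """Return the OP token strings the stdlib tokenizer finds in `text`."""
    readline = io.StringIO(text).readline
    return [t.string for t in tokenize.generate_tokens(readline)
            if t.type == tokenize.OP]


print(op_tokens("a <- b"))  # '<-' comes out as '<' and '-'
```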
46 | -.. seealso:: https://docs.python.org/3/library/token.html for an overiew of the predefied tokens | |
64 | +.. seealso:: https://docs.python.org/3/library/token.html, for an overview of the predefined tokens | |
47 | 65 | |
48 | 66 | .. tip:: |
49 | 67 | |
50 | 68 | A quick trick to see how a file is split into tokens: use ``python -m tokenize [-e] filename.peg``. |
51 | 69 | |BR| |
52 | - Make sure you do not use string-literals that (eg) are composed of two tokens. | |
70 | + Make sure you do not use string-literals that (e.g.) are composed of two tokens, like the above-mentioned ``<-``. | |
53 | 71 | |
54 | 72 | |
55 | 73 | |
56 | -.. sidebar:: Reserverd Names | |
74 | +.. sidebar:: Reserved | |
75 | + :class: localtoc | |
57 | 76 | |
58 | - - start | |
59 | 77 | - showpeek |
60 | 78 | - name |
61 | 79 | - number |
@@ -75,3 +93,42 @@ | ||
75 | 93 | The *GeneratedParser* inherits from and calls the base ``pegen.parser.Parser`` class, and has methods for all |
76 | 94 | rule-names. This implies some names should not be used as rule-names (in all cases) -- see the sidebar. |
77 | 95 | |
96 | + | |
97 | +Meta Syntax (issues) | |
98 | +==================== | |
99 | + | |
100 | +No: regexps | |
101 | +----------- | |
102 | + | |
103 | +PEGEN has **no** support for regular expressions, probably because it uses a custom lexer. | |
104 | + | |
105 | +Unordered Group starts a comment | |
106 | +-------------------------------- | |
107 | + | |
108 | +PEGEN (or its lexer) uses the ``#`` to start a comment. This implies that an **unordered group** ``( sequence )#`` --as in | |
109 | +`Arpeggio <https://textx.github.io/Arpeggio/2.0/grammars/#grammars-written-in-peg-notations>`__-- is not recognized. | |
110 | + | |
111 | +A workaround is to use another character, like ``@``, instead of the hash (``#``). | |
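[Editorial sketch, not part of the original commit: that the ``#`` really vanishes into a comment can also be checked with the stdlib ``tokenize`` module alone.]

```python
# Sketch: show that '#' and everything after it becomes one COMMENT token,
# so a trailing '#' (Arpeggio's unordered-group marker) never reaches the
# parser as an operator.
import io
import tokenize


def token_kinds(text: str) -> list[tuple[str, str]]:
    """Return (token-name, token-string) pairs, skipping layout bookkeeping."""
    readline = io.StringIO(text).readline
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER, tokenize.INDENT}
    return [(tokenize.tok_name[t.type], t.string)
            for t in tokenize.generate_tokens(readline)
            if t.type not in skip]


print(token_kinds("( a b )# rest\n"))
```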
112 | + | |
113 | + | |
114 | +Result/Output | |
115 | +============= | |
116 | + | |
117 | +cmd-tool | |
118 | +-------- | |
119 | + | |
120 | +The command-line tool ``python -m pegen ...`` only prints the parsed tree: a list (shown as ``[`` ... ``]``) with | |
121 | +sub-lists and/or `TokenInfo` namedtuples. Each `TokenInfo` has 5 elements: a token type (an int and its enum-name), the | |
122 | +token-string (that was parsed), the begin & end locations (line- & column-numbers), and the full line that is being | |
123 | +parsed. | |
124 | + | |
125 | +No info about the matched grammar-rule (e.g. the rule-name) is shown. Actually, that info is not part of the parsed tree. | |
126 | + | |
127 | +.. seealso:: This `structure is described <https://docs.python.org/3/library/tokenize.html?highlight=TokenInfo>`__ in | |
128 | + the tokenize module, without mentioning its name: TokenInfo. | |
129 | + | |
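[Editorial sketch, not part of the original commit: the five fields of such a `TokenInfo` namedtuple can be inspected directly, since pegen hands back the stdlib tokenizer's own objects.]

```python
# Sketch: the five fields of a TokenInfo namedtuple, as produced by the
# stdlib tokenizer that pegen builds on.
import io
import tokenize

first = next(tokenize.generate_tokens(io.StringIO("x = 42\n").readline))
print(tokenize.tok_name[first.type])  # the type's enum-name: NAME
print(first.string)                   # the parsed token-string: 'x'
print(first.start, first.end)         # begin & end (line, column) locations
print(first.line)                     # the full source line being parsed
```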
130 | +The parser | |
131 | +---------- | |
132 | + | |
133 | +The GeneratedParser (and/or its baseclass ``pegen.parser.Parser``) returns only (lists of) tokens from the tokenizer (an | |
134 | +OO wrapper around tokenize). And so, the same TokenInfo objects as described above. |