python - Creating a Lexer
Hey guys, I'm trying to understand some concepts regarding lexers. I understand that lexers are used in compilers to separate the individual characters of a string into a form known as tokens. The thing that confuses me is the matching part: I don't understand the logic of why we need to match characters at their corresponding position.
```python
import re
import sys

def lex(characters, token_exprs):
    pos = 0
    tokens = []
    while pos < len(characters):
        match = None
        for token_expr in token_exprs:
            pattern, tag = token_expr
            regex = re.compile(pattern)
            match = regex.match(characters, pos)
            if match:
                text = match.group(0)
                if tag:
                    token = (text, tag)
                    tokens.append(token)
                break
        if not match:
            sys.stderr.write('Illegal character: %s\n' % characters[pos])
            sys.exit(1)
        else:
            pos = match.end(0)
    return tokens
```
This is the code I don't understand. After the while loop, I don't quite grasp what the code is trying to do. Why does it have to match characters at a position?
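To make the matching concrete, here is a cleaned-up, runnable version of that function together with a small token table and sample input (the table and input are my own illustration, not from the question). `regex.match(characters, pos)` anchors each attempt at `pos`, which is why the loop can walk through the string one token at a time.

```python
import re
import sys

def lex(characters, token_exprs):
    """Scan `characters` left to right, producing (text, tag) tokens."""
    pos = 0
    tokens = []
    while pos < len(characters):
        match = None
        for pattern, tag in token_exprs:
            regex = re.compile(pattern)
            match = regex.match(characters, pos)  # try to match exactly at pos
            if match:
                text = match.group(0)
                if tag:                           # a tag of None means "discard"
                    tokens.append((text, tag))
                break
        if not match:
            sys.stderr.write('Illegal character: %s\n' % characters[pos])
            sys.exit(1)
        pos = match.end(0)                        # advance past the matched text
    return tokens

# A made-up token table for illustration; rules are tried in order,
# and the first one that matches at the current position wins.
token_exprs = [
    (r'[ \t\n]+', None),                  # whitespace: matched, then dropped
    (r'[0-9]+', 'INT'),
    (r'[A-Za-z_][A-Za-z0-9_]*', 'ID'),
    (r'\+', 'PLUS'),
]

print(lex('x + 42', token_exprs))
# [('x', 'ID'), ('+', 'PLUS'), ('42', 'INT')]
```

Each iteration either consumes one token starting at `pos` and moves `pos` past it, or reports the character at `pos` as illegal; without anchoring at `pos`, a rule could match text later in the string and the lexer would silently skip input.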
A pretty traditional lexer can work like this:
- Get a character from somewhere, e.g. a file or a buffer.
- Check what the current character is:
  - Is it whitespace? Skip the whitespace.
  - Is it a comment-introduction character? If so, skip the comment.
  - Is it a digit? Try to scan a number.
  - Is it a `"`? Try to scan a string.
  - Is it a letter? Try to scan an identifier.
    - Is the identifier a keyword/reserved word?
  - Otherwise, is it a valid operator sequence?
- Return the token type.
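The steps above can be sketched as a character-at-a-time lexer. This is only an illustration of the outline, not any particular implementation; the names `next_token`, `tokenize`, `KEYWORDS`, and `OPERATORS`, and the choice of `#` as the comment character, are all my own assumptions.

```python
# Illustrative tables; real lexers would have the full language's sets here.
KEYWORDS = {'if', 'else', 'while'}
OPERATORS = {'+', '-', '*', '/', '=', '(', ')'}

def next_token(text, pos):
    """Return (token_type, lexeme, new_pos), or None at end of input."""
    n = len(text)
    # 1. Skip whitespace.
    while pos < n and text[pos].isspace():
        pos += 1
    if pos >= n:
        return None
    ch = text[pos]
    # 2. Comment-introduction character? Skip to end of line, then retry.
    if ch == '#':
        while pos < n and text[pos] != '\n':
            pos += 1
        return next_token(text, pos)
    # 3. Digit? Try to scan a number.
    if ch.isdigit():
        start = pos
        while pos < n and text[pos].isdigit():
            pos += 1
        return ('NUMBER', text[start:pos], pos)
    # 4. Quote? Try to scan a string literal.
    if ch == '"':
        start = pos
        pos += 1
        while pos < n and text[pos] != '"':
            pos += 1
        return ('STRING', text[start:pos + 1], pos + 1)
    # 5. Letter? Scan an identifier, then check the keyword table.
    if ch.isalpha() or ch == '_':
        start = pos
        while pos < n and (text[pos].isalnum() or text[pos] == '_'):
            pos += 1
        word = text[start:pos]
        return ('KEYWORD' if word in KEYWORDS else 'ID', word, pos)
    # 6. Otherwise, is it a valid operator?
    if ch in OPERATORS:
        return ('OP', ch, pos + 1)
    raise SyntaxError('illegal character: %r' % ch)

def tokenize(text):
    tokens, pos = [], 0
    while True:
        tok = next_token(text, pos)
        if tok is None:
            return tokens
        kind, lexeme, pos = tok
        tokens.append((kind, lexeme))

print(tokenize('if x = 42  # comment'))
# [('KEYWORD', 'if'), ('ID', 'x'), ('OP', '='), ('NUMBER', '42')]
```

The dispatch on the first character is the whole trick: one look at `text[pos]` decides which scanning routine runs, and each routine consumes exactly the characters belonging to its token.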
Instead of checking single characters at a time, you can of course use regular expressions.
The best way to learn how a hand-written lexer works is (IMO) to find simple existing lexers and try to understand them.