Text Parsing in Python
- Gayatri Nittala
- Madhubala Vasireddy
Text Parsing
► The three W’s!
► Efficiency and Perfection
What is Text Parsing?
► A common programming task
► Extract or split a sequence of characters
Why Text Parsing?
► Simple file parsing
 A tab separated file
► Data extraction
 Extract specific information from a log file
► Find and replace
► Parsers – syntactic analysis
► NLP
 Extract information from a corpus
 POS Tagging
Text Parsing Methods
► String Functions
► Regular Expressions
► Parsers
String Functions
► The string module and built-in string methods in Python
 Faster, easier to understand and maintain
► If you can do it, DO IT!
► Different built-in functions
 Find-Replace
 Split-Join
 Startswith and Endswith
 Is methods
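For instance, a minimal Split-Join sketch (the sample line below is made up):

line = "alpha\tbeta\tgamma"   # one row of a tab separated file
fields = line.split("\t")     # ['alpha', 'beta', 'gamma']
print(",".join(fields))       # alpha,beta,gamma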
Find and Replace
► find, index, rindex, replace
► EX: Replace a string in all files in a directory
import glob, fileinput, sys

files = glob.glob(path)   # path, stext, rtext are defined elsewhere
for line in fileinput.input(files, inplace=1):
    # with inplace=1, anything written to stdout goes into the file
    if line.find(stext) >= 0:
        line = line.replace(stext, rtext)
    sys.stdout.write(line)
startswith and endswith
► Extract quoted words from the given text
myString = '"123"'
if myString.startswith('"'):
    print("string with double quotes")
► Find if the sentences are interrogative or exclamative
► What an amazing game that was!
► Do you like this?
endings = ('!', '?')
sentence.endswith(endings)
isMethods
► To check alphabets, numerals, character case, etc.
 m = 'xxxasdf '   # note the trailing space
 m.isalpha()
 False
Regular Expressions
► A concise way to express complex patterns
► Amazingly powerful
► Wide variety of operations
► When you go beyond simple, think about regular expressions!
Real world problems
► Match IP addresses, email addresses, URLs
► Match balanced sets of parentheses
► Substitute words
► Tokenize
► Validate
► Count
► Delete duplicates
► Natural Language processing
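As a taste, a minimal sketch matching email addresses (the pattern is deliberately simplified, not RFC-complete; the addresses are made up):

import re

email = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
text = 'write to alice@example.com or bob@example.org'
print(email.findall(text))   # ['alice@example.com', 'bob@example.org']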
RE in Python
► Unleash the power - the built-in re module
► Functions
 to compile patterns: compile
 to perform matches: match, search, findall, finditer
 to perform operations on a match object: group, start, end, span
 to substitute: sub, subn
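A quick sketch of these functions on a made-up string:

import re

p = re.compile(r'\d+')                 # compile a pattern
m = p.search('abc 123 def 45')         # scan for the first match
print(m.group())                       # 123
print(m.span())                        # (4, 7)
print(p.findall('abc 123 def 45'))     # ['123', '45']
print(p.subn('#', 'abc 123 def 45'))   # ('abc # def #', 2)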
Metacharacters
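 . ^ $ * + ? { } [ ] \ | ( )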
Compiling patterns
► re.compile()
► Patterns for an IP address
 ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$
 ^\d+\.\d+\.\d+\.\d+$
 ^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$
 ^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$
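A quick test of the strict pattern (the sample addresses are made up):

import re

octet = r'([01]?\d\d?|2[0-4]\d|25[0-5])'
ip = re.compile(r'^' + r'\.'.join([octet] * 4) + r'$')
print(bool(ip.match('192.168.0.1')))   # True
print(bool(ip.match('256.1.1.1')))     # False - 256 is out of range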
Compiling patterns
► Patterns for matching parentheses
 \(.*\)
 \([^)]*\)
 \([^()]*\)
Substitute
Perform several string substitutions on a given string
import re
def make_xlat(*args, **kwargs):
    adict = dict(*args, **kwargs)
    rx = re.compile('|'.join(map(re.escape, adict)))
    def one_xlate(match):
        return adict[match.group(0)]
    def xlate(text):
        return rx.sub(one_xlate, text)
    return xlate
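A usage sketch (the substitution map is made up):

translate = make_xlat({'cat': 'dog', 'sat': 'stood'})
print(translate('the cat sat on the mat'))   # the dog stood on the mat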
Count
► Split and count words in the given text
 p = re.compile(r'\W+')
 len(p.split('This is a test for split().'))
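One caveat: the trailing '().' leaves an empty string at the end of the split, so the count above is one too high; filtering fixes it:

 len([w for w in p.split('This is a test for split().') if w])   # 6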
Tokenize
► Parsing and Natural Language Processing
 s = 'tokenize these words'
 words = re.compile(r'\b\w+\b|\$')
 words.findall(s)
 ['tokenize', 'these', 'words']
Common Pitfalls
► For operations on fixed strings, a single character class, or with no case-sensitivity issues, string methods are faster
► re.sub() and string.replace()
► re.sub() and string.translate()
► match vs. search
► greedy vs. non-greedy
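A quick sketch of the last two pitfalls (the sample strings are made up):

import re

print(re.match(r'\d+', 'abc 123'))             # None - match anchors at the start
print(re.search(r'\d+', 'abc 123').group())    # 123 - search scans the string

print(re.search(r'<.*>', '<a> <b>').group())   # <a> <b> - greedy
print(re.search(r'<.*?>', '<a> <b>').group())  # <a> - non-greedy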
PARSERS
► Flat and Nested texts
► Nested tags, programming language constructs
► Better to do less than to do more!
Parsing Non-flat Texts
► Grammar
► States
► Generate tokens and act on them
► Lexer - generates a stream of tokens
► Parser - generates a parse tree out of the tokens
► Lex and Yacc
Grammar Vs RE
► Floating Point
#---- EBNF-style description of Python floats ----#
floatnumber   ::= pointfloat | exponentfloat
pointfloat    ::= [intpart] fraction | intpart "."
exponentfloat ::= (intpart | pointfloat) exponent
intpart       ::= digit+
fraction      ::= "." digit+
exponent      ::= ("e" | "E") ["+" | "-"] digit+
digit         ::= "0"..."9"
Grammar Vs RE
pat = r'''(?x)
    (                        # exponentfloat
        (                    # intpart or pointfloat
            (                # pointfloat
                (\d+)?[.]\d+     # optional intpart with fraction
                |
                \d+[.]           # intpart with period
            )                # end pointfloat
            |
            \d+              # intpart
        )                    # end intpart or pointfloat
        [eE][+-]?\d+         # exponent
    )                        # end exponentfloat
    |
    (                        # pointfloat
        (\d+)?[.]\d+         # optional intpart with fraction
        |
        \d+[.]               # intpart with period
    )                        # end pointfloat
'''
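Applying the pattern (the sample string is made up; finditer is used because the pattern contains capturing groups, so findall would return group tuples instead of whole matches):

import re

floats = re.compile(pat)
for m in floats.finditer('pi is 3.14, avogadro is 6.02e23, x is 7.'):
    print(m.group())   # 3.14, then 6.02e23, then 7.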
PLY - The Python Lex and Yacc
► Higher-level and cleaner grammar language
► LALR(1) parsing
► Extensive input validation, error reporting, and diagnostics
► Two modules: lex.py and yacc.py
Using PLY - Lex and Yacc
► Lex:
► Import the lex module
► Define a list or tuple variable 'tokens' naming the token types the lexer is allowed to produce
► Define tokens - by assigning to a specially named variable ('t_TOKENNAME')
► Build the lexer
 mylexer = lex.lex()
 mylexer.input(mytext) # handled by yacc
Lex
t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'

def t_NUMBER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print("Integer value too large: %s" % t.value)
        t.value = 0
    return t

t_ignore = " \t"
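To actually build this lexer, PLY also needs the tokens declaration and an error rule; a minimal completion sketch, assuming only these two token types:

tokens = ('NAME', 'NUMBER')   # token types the lexer may produce

def t_error(t):
    t.lexer.skip(1)           # skip characters the lexer does not recognize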
Yacc
► Import the 'yacc' module
► Get a token map from a lexer
► Define a collection of grammar rules
► Build the parser
 yacc.yacc()
 yacc.parse('x=3')
Yacc
► Specially named functions having a 'p_' prefix
def p_statement_assign(p):
    'statement : NAME "=" expression'
    names[p[1]] = p[3]

def p_statement_expr(p):
    'statement : expression'
    print(p[1])
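Putting the slides together - a minimal end-to-end sketch, modeled on the PLY calculator example (the '+' rule, precedence, and symbol table are illustrative additions):

import ply.lex as lex
import ply.yacc as yacc

tokens = ('NAME', 'NUMBER')
literals = ['=', '+']

t_NAME = r'[a-zA-Z_][a-zA-Z0-9_]*'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

precedence = (('left', '+'),)
names = {}                    # symbol table for assignments

def p_statement_assign(p):
    'statement : NAME "=" expression'
    names[p[1]] = p[3]

def p_statement_expr(p):
    'statement : expression'
    print(p[1])

def p_expression_plus(p):
    "expression : expression '+' expression"
    p[0] = p[1] + p[3]

def p_expression_number(p):
    'expression : NUMBER'
    p[0] = p[1]

def p_expression_name(p):
    'expression : NAME'
    p[0] = names.get(p[1], 0)

def p_error(p):
    print("Syntax error")

lex.lex()
yacc.yacc()
yacc.parse('x = 3 + 4')
yacc.parse('x + 1')           # prints 8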
Summary
► String Functions
A rule of thumb - if you can do it, do it.
► Regular Expressions
Complex patterns - something beyond simple!
► Lex and Yacc
Parse non-flat texts that follow some rules
References
► http://docs.python.org/
► http://code.activestate.com/recipes/langs/python/
► http://www.regular-expressions.info/
► http://www.dabeaz.com/ply/ply.html
► Mastering Regular Expressions by Jeffrey E. F. Friedl
► Python Cookbook by Alex Martelli, Anna Martelli Ravenscroft & David Ascher
► Text Processing in Python by David Mertz
Thank You
Q&A