
Slide 1
Googleology is bad science
Adam Kilgarriff
Lexical Computing Ltd
Universities of Sussex, Leeds

Slide 2: Web as language resource
• Replaceable or replacable?
  • check

Slide 3
• Very very large
• Most languages
• Most language types
• Up-to-date
• Free
• Instant access

Slide 4: How to use the web?
• Google or other commercial search engines (CSEs)
• not

Slide 5: Using CSEs
• No setup costs
• Start querying today
• Methods
  • Hit counts
  • ‘snippets’
    • Metasearch engines, WebCorp
  • Find pages and download

Slide 6: Googleology
• CSE hit counts for language modelling
  • 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista (Keller & Lapata 2003)
• finding noun-noun relations
  • “we issue exact phrase Google queries of type noun2 THAT * noun1” (Nakov and Hearst 2006)
• Small community of researchers
  • Corpora mailing list
  • Very interesting work
  • Intense interest in query syntax
  • Creativity and person-years

Slide 7: The Trouble with Google
• not enough instances
  • max 1000
• not enough queries
  • max 1000 per day with API
• not enough context
  • 10-word snippet around search term
• ridiculous sort order
  • search term in titles and headings
• untrustworthy hit counts
• limited search syntax
  • No regular expressions
• linguistically dumb
  • not lemmatised
    • aime/aimer/aimes/aimons/aimez/aiment …
  • not POS-tagged
  • not parsed

Slide 8
• Appeal
  • Zero-cost entry, just start googling
• Reality
  • High-quality work: high-cost methodology

Slide 9: Also:
• No replicability
• Methods, stats not published
• At mercy of commercial corporation

Slide 10: Also:
• No replicability
• Methods, stats not published
• At mercy of commercial corporation
• Bad science

Slide 11: The 5-grams
• A present from Google
• All 1-, 2-, 3-, 4-, 5-grams
  • with fr>=40
  • in a terabyte of English
• A large dataset

Slide 12: Prognosis
• Next 3 years
  • Exciting new ideas
  • Dazzlingly clever uses
  • Drives progress in NLP

Slide 13: Prognosis
• Next 3 years
  • Exciting new ideas
  • Dazzlingly clever uses
• After 5+ years
  • A chain round our necks
  • Cf Penn Treebank (others? Brickbats?)
  • Resource-led vs. ideas-led research

Slide 14: How to use the web?
• Google or other commercial search engines (CSEs)
• not

Slide 15: Language and the web
• Web is mostly linguistic
• Text on web << whole web (in GB)
• Not many TB of text
• Special hardware not needed
• We are the experts

Slide 16: Community-building
• ACL SIGWAC
• WAC Kool Ynitiative (WaCKY)
  • Mailing list
  • Open source
• WAC workshops
  • WAC1, Birmingham 2005
  • WAC2, Trento (EACL), April 2006
  • WAC3, Louvain, Sept 15-16 2007

Slide 17: Proof of concept: DeWaC, ItWaC
• 1.5 B words each, German and Italian
• Marco Baroni, Bologna (+ AK)

Slide 18: What is out there?
• What text types?
  • some are new: chatroom
  • proportions
    • is it overwhelmed by porn? How much?
• Hard question

Slide 19: What is out there
• The web
  • a social, cultural, political phenomenon
  • new, little understood
  • a legitimate object of science
  • mostly language
    • we are well placed
  • a lot of people will be interested
• Let’s
  • study the web
  • source of language data
  • apply our tools for web use (dictionaries, MT)
  • use the web as infrastructure

Slide 20: How to do it: Components
1. web crawler
2. filters and classifiers
   • de-duplication
3. linguistic processing
   • Lemmatise, pos-tag, parse
4. Database
   • Indexing
   • user interface

Slide 21: 1. Crawling
• How big is your hard disk?
• When will your sysadmin ban you?
• DeWaC/ItWaC
  • Open source crawler: heritrix

Slide 22: 1.1 Seeding the crawl
• Mid-frequency words
• Spread of text types
  • Formal and informal, not just newspaper
• DeWaC
  • Words from newspaper corpus
  • Words from list with “kitchen” vocab
• Use Google to get seeds for crawls

Slide 23: 2. Filtering
• non ‘running-text’ stripping
• Function word filtering
• Porn filtering
• De-duplication

Slide 24: 2.1 Filtering: Sentences
• What is the text that we want?
  • Lists? Links? Catalogues? …
• For linguistics, NLP
  • in sentences
• Use function words

Slide 25: 2.2 Filtering: CLEANEVAL
• “Text cleaning”
  • Lots to be done, not glamorous
  • Many kinds of dirt needing many kinds of filter
• Open Competition/shared task
  • Who can produce the cleanest text?!
  • Input: arbitrary web pages
  • “gold standard”
    • paragraph-marked plain text
    • Prepared by people
• Workshop Sept 2007. do join us!
• http://cleaneval.sigwac.org.uk

Slide 26: 3. Linguistic processing
• Lemmatise, POS-tag, parse
• Find leading NLP group for each language
• Be nice to them
• Use their tools

Slide 27: Database, interface
• Solved problem (at least for 1.5 BW)
• Sketch Engine

Slide 28
“Despite all the disadvantages, it’s still so much bigger”

Slide 29: How much bigger?
• Method
  • Sample words
    • 30
    • Mid-to-high freq
    • Not common words in other major lgs
    • Min 5 chars
  • Compare freqs, Google vs ItWaC/DeWaC

Slide 30: Google results (Italian)
• Arbitrariness
  • Repeat identical searches
    • 9/30: > 10% difference
    • 6/30: > 100% difference
  • API: typically 1/18th ‘manual’ figure
• Language filter
  • mista bomba clima
    • mostly non-Italian pages
  • use MAX and MIN of 6 lg-filtered results

Slide 31
• Clima=
  • Computational logic in multi-agent systems
  • Centre for Legumes in Mediterranean Agriculture
• (5-char limit too short)

Slide 32: Ratios, Google:DeWaC

WORD          MAX    MIN     RAW    CLEAN
-----------------------------------------
besuchte     10.5    3.8    81840   18228
stirn         3.38   0.62   32320   11137
gerufen       7.14   3.72   66720   27187
verringert    6.86   3.46   52160   15987
bislang      24.4   11.6   239000   90098
brach         4.36   2.26   44520   19824
-----------------------------------------
MAX/MIN: max/min of 6 Google values (millions)
RAW: DeWaC document frequency before filters, dedupe
CLEAN: DeWaC document frequency after filters, dedupe

Slide 33: ItWaC:Google ratio, best estimate
• For each of 30 words
  • Calculate ratio, max:raw
  • Calculate ratio, min:raw
  • Take mid-point and average: 1:33 or 3%
• Calculate raw:vert
  • Average = 4.4
  • half (for conservativeness/uncertainty) = 2.2
• 3% x 2.2 = 6.6%
• ItWaC:Google = 6.6%

Slide 34: Italian web size
• ItWaC = 1.67b words
• Google indexes 1.67/.066 = 25 bn words
  • sentential non-dupe Italian

Slide 35: German web size
• Analysis as for Italian
• DeWaC: 3% Google
• DeWaC = 1.41b words
• Google indexes 1.41/.03 = 44 bn words
  • sentential non-dupe German

Slide 36: Effort
• ItWaC, DeWaC
  • Less than 6 person months
  • Developing the method
• (EnWaC: in progress)

Slide 37: Plan
• ACL adopts it (like ACL Anthology) (LDC?)
• Say: 3 core staff, 3 years
• Goals could be:
  • English: 2% G-scale (still biggest part)
  • 6 other major languages: 30% G-scale
  • 30 other languages: 10% G-scale
• Online for
  • Searching as in SkE
  • Specifying, downloading subcorpora for intensive NLP
    • “corpora on demand”
• Don’t quote me

Slide 38: Logjams
• Cleaning
  • See CLEANEVAL
• Text type
  • “what kind of page is it?”
  • Critical but under-researched
  • WebDoc proposal
    • (with Serge Sharoff, Tony Hartley)
    • (a different talk)

Slide 39: Moral
• Google, CSEs are wonderful
  • Start today
  • but bad science
• Not
  • Good science, reliable counts
  • We (the NLP community) have the skills
  • With collective effort, mid-sized project
  • Google-scale is achievable

Slide 40: Thank you
• http://www.sketchengine.co.uk

Slide 41: Scale and speed, LSE
• Commercial search engines
  • banks of computers
  • highly optimised code
  • but this is for performance
    • no downtime
    • instant responses to millions of queries
• This proposal
  • crawling: once a year
  • downtime: acceptable
  • not so many users

Slide 42: …but it’s not representative
• The web is not representative
  • but nor is anything else
• Text type variation
  • under-researched, lacking in theory
    • Atkins Clear Ostler 1993 on design brief for BNC; Biber 1988, Baayen 2001, Kilgarriff 2001
• Text type is an issue across NLP
  • Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there

Slide 43: Oxford English Corpus
• Method as above
• Whole domains chosen and harvested
  • control over text type
• 1 billion words
• Public launch April 2006
• Loaded into Sketch Engine

Slide 44: Oxford English Corpus

Slide 45: Oxford English Corpus

Slide 46: Examples
• DeWaC, ItWaC
  • Baroni and Kilgarriff, EACL 2006
• Serge Sharoff, Leeds Univ UK
  • English Chinese Russian English French Spanish, all searchable online
• Oxford English corpus

Slide 47: Options for academics
• Give up
  • Niche markets, obscure languages
  • Leave the mainstream to the big guys
• Work out how to work on that scale
  • Web is free, data availability not a problem
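
Worked example (not part of the original talk): the arithmetic on the "ItWaC:Google ratio, best estimate", "Italian web size" and "German web size" slides can be retraced in a few lines of Python. This is only a sketch of the calculation as the slides state it. The function name `google_index_bn_words` and the variable names are illustrative assumptions, and all the inputs are the rounded figures shown on the slides. With the rounded 3% ratio the German figure comes out at about 47 bn rather than the 44 bn on the slide, which presumably reflects an unrounded ratio that the slides do not give.

```python
def google_index_bn_words(corpus_bn_words: float, corpus_to_google: float) -> float:
    """Estimate the size (in bn words) of the text Google indexes,
    given a corpus size and the corpus:Google ratio.
    Hypothetical helper; the name is not from the talk."""
    return corpus_bn_words / corpus_to_google


# Italian: figures as given on the slides.
raw_ratio = 0.03              # mid-point of max:raw and min:raw, averaged ("1:33 or 3%")
raw_to_clean = 4.4 / 2        # average raw:vert ratio, halved for conservativeness
itwac_to_google = raw_ratio * raw_to_clean      # 3% x 2.2 = 6.6%
italian_bn = google_index_bn_words(1.67, itwac_to_google)

# German: the slide gives only the final (rounded) ratio, "DeWaC: 3% Google".
german_bn = google_index_bn_words(1.41, 0.03)

print(f"ItWaC:Google = {itwac_to_google:.1%}")
print(f"Italian: about {italian_bn:.0f} bn words")
print(f"German (rounded 3% ratio): about {german_bn:.0f} bn words")
```

Because the result is a corpus size divided by a small ratio, it is sensitive to rounding in that ratio: moving from 3.0% to 3.2% shifts the German estimate by about 3 bn words.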
