Googleology is bad science
Adam Kilgarriff
Lexical Computing Ltd
Universities of Sussex, Leeds

Web as language resource
• Replaceable or replacable? check
• Very very large
• Most languages
• Most language types
• Up-to-date
• Free
• Instant access

How to use the web?
• Google or other commercial search engines (CSEs)
• not

Using CSEs
• No setup costs
• Start querying today
• Methods
  • Hit counts
  • ‘snippets’
  • Metasearch engines, WebCorp
  • Find pages and download

Googleology
• CSE hit counts for language modelling
  • 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista (Keller & Lapata 2003)
  • finding noun-noun relations: “we issue exact phrase Google queries of type noun2 THAT * noun1” (Nakov and Hearst 2006)
• Small community of researchers
  • Corpora mailing list
  • Very interesting work
  • Intense interest in query syntax
  • Creativity and person-years

The Trouble with Google
• not enough instances
  • max 1000
• not enough queries
  • max 1000 per day with API
• not enough context
  • 10-word snippet around search term
• ridiculous sort order
  • search term in titles and headings
• untrustworthy hit counts
• limited search syntax
  • No regular expressions
• linguistically dumb
  • not lemmatised: aime/aimer/aimes/aimons/aimez/aiment …
  • not POS-tagged
  • not parsed

• Appeal
  • Zero-cost entry, just start googling
• Reality
  • High-quality work: high-cost methodology

Also:
• No replicability
• Methods, stats not published
• At mercy of commercial corporation
• Bad science

The 5-grams
• A present from Google
• All 1-, 2-, 3-, 4-, 5-grams with freq >= 40 in a terabyte of English
• A large dataset

Prognosis
• Next 3 years
  • Exciting new ideas
  • Dazzlingly clever uses
  • Drives progress in NLP
• After 5+ years
  • A chain round our necks
  • Cf Penn Treebank (others? Brickbats?)
  • Resource-led vs. ideas-led research

How to use the web?
• Google or other commercial search engines (CSEs)
• not

Language and the web
• Web is mostly linguistic
• Text on web << whole web (in GB)
• Not many TB of text
• Special hardware not needed
• We are the experts

Community-building
• ACL SIGWAC
• WAC Kool Ynitiative (WaCKY)
  • Mailing list
  • Open source
• WAC workshops
  • WAC1, Birmingham 2005
  • WAC2, Trento (EACL), April 2006
  • WAC3, Louvain, Sept 15-16 2007

Proof of concept: DeWaC, ItWaC
• 1.5 B words each, German and Italian
• Marco Baroni, Bologna (+ AK)

What is out there?
• What text types?
  • some are new: chatroom
  • proportions
  • is it overwhelmed by porn? How much?
• Hard question

What is out there
• The web
  • a social, cultural, political phenomenon
  • new, little understood
  • a legitimate object of science
  • mostly language
  • we are well placed
  • a lot of people will be interested
• Let’s
  • study the web
  • source of language data
  • apply our tools for web use (dictionaries, MT)
  • use the web as infrastructure

How to do it: Components
1. web crawler
2. filters and classifiers
  • de-duplication
3. linguistic processing
  • Lemmatise, POS-tag, parse
4. Database
  • Indexing
  • user interface

1. Crawling
• How big is your hard disk?
• When will your sysadmin ban you?
• DeWaC/ItWaC
  • Open source crawler: Heritrix

1.1 Seeding the crawl
• Mid-frequency words
• Spread of text types
  • Formal and informal, not just newspaper
• DeWaC
  • Words from newspaper corpus
  • Words from list with “kitchen” vocab
• Use Google to get seeds for crawls

2. Filtering
• non ‘running-text’ stripping
• Function word filtering
• Porn filtering
• De-duplication

2.1 Filtering: Sentences
• What is the text that we want?
  • Lists? Links? Catalogues? …
• For linguistics, NLP
  • in sentences
• Use function words

2.2 Filtering: CLEANEVAL
• “Text cleaning”
• Lots to be done, not glamorous
• Many kinds of dirt needing many kinds of filter
• Open competition/shared task
• Who can produce the cleanest text?!
• Input: arbitrary web pages
• “gold standard”
  • paragraph-marked plain text
  • Prepared by people
• Workshop Sept 2007. Do join us!
• http://cleaneval.sigwac.org.uk

3. Linguistic processing
• Lemmatise, POS-tag, parse
• Find leading NLP group for each language
• Be nice to them
• Use their tools

Database, interface
• Solved problem (at least for 1.5 billion words)
• Sketch Engine

“Despite all the disadvantages, it’s still so much bigger”

How much bigger?
• Method
• Sample words
  • 30
  • Mid-to-high freq
  • Not common words in other major languages
  • Min 5 chars
• Compare freqs, Google vs ItWaC/DeWaC

Google results (Italian)
• Arbitrariness
  • Repeat identical searches
  • 9/30: > 10% difference
  • 6/30: > 100% difference
  • API: typically 1/18th ‘manual’ figure
• Language filter
  • mista bomba clima
  • mostly non-Italian pages
  • use MAX and MIN of 6 language-filtered results

• Clima =
  • Computational logic in multi-agent systems
  • Centre for Legumes in Mediterranean Agriculture
• (5-char limit too short)

Ratios, Google:DeWaC

WORD          MAX     MIN     RAW      CLEAN
--------------------------------------------
besuchte      10.5    3.8     81840    18228
stirn         3.38    0.62    32320    11137
gerufen       7.14    3.72    66720    27187
verringert    6.86    3.46    52160    15987
bislang       24.4    11.6    239000   90098
brach         4.36    2.26    44520    19824
--------------------------------------------
MAX/MIN: max/min of 6 Google values (millions)
RAW: DeWaC document frequency before filters, dedupe
CLEAN: DeWaC document frequency after filters, dedupe

ItWaC:Google ratio, best estimate
• For each of 30 words
  • Calculate ratio, max:raw
  • Calculate ratio, min:raw
  • Take mid-point and average: 1:33 or 3%
• Calculate raw:vert
  • Average = 4.4
  • half (for conservativeness/uncertainty) = 2.2
• 3% x 2.2 = 6.6%
• ItWaC:Google = 6.6%

Italian web size
• ItWaC = 1.67 bn words
• Google indexes 1.67/.066 = 25 bn words sentential non-dupe Italian

German web size
• Analysis as for Italian
• DeWaC: 3% Google
• DeWaC = 1.41 bn words
• Google indexes 1.41/.03 = 44 bn words sentential non-dupe German

Effort
• ItWaC, DeWaC
  • Less than 6 person months
  • Developing the method
• (EnWaC: in progress)

Plan
• ACL adopts it (like ACL Anthology) (LDC?)
• Say: 3 core staff, 3 years
• Goals could be:
  • English: 2% G-scale (still biggest part)
  • 6 other major languages: 30% G-scale
  • 30 other languages: 10% G-scale
• Online for
  • Searching as in SkE
  • Specifying, downloading subcorpora for intensive NLP
  • “corpora on demand”
• Don’t quote me

Logjams
• Cleaning
  • See CLEANEVAL
• Text type
  • “what kind of page is it?”
  • Critical but under-researched
  • WebDoc proposal (with Serge Sharoff, Tony Hartley)
  • (a different talk)

Moral
• Google, CSEs
  • are wonderful
  • Start today
  • but bad science
• Not
  • Good science, reliable counts
  • We (the NLP community) have the skills
  • With collective effort, mid-sized project, Google-scale is achievable

Thank you
http://www.sketchengine.co.uk

Scale and speed, LSE
• Commercial search engines
  • banks of computers
  • highly optimised code
  • but this is for performance
    • no downtime
    • instant responses to millions of queries
• This proposal
  • crawling: once a year
  • downtime: acceptable
  • not so many users

…but it’s not representative
• The web is not representative
  • but nor is anything else
• Text type variation
  • under-researched, lacking in theory
  • Atkins, Clear, Ostler 1993 on design brief for BNC; Biber 1988, Baayen 2001, Kilgarriff 2001
• Text type is an issue across NLP
• Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there

Oxford English Corpus
• Method as above
• Whole domains chosen and harvested
  • control over text type
• 1 billion words
• Public launch April 2006
• Loaded into Sketch Engine

Oxford English Corpus (two further slides; no text recovered)

Examples
• DeWaC, ItWaC
  • Baroni and Kilgarriff, EACL 2006
• Serge Sharoff, Leeds Univ UK
  • English Chinese Russian English French Spanish, all searchable online
• Oxford English Corpus

Options for academics
• Give up
  • Niche markets, obscure languages
  • Leave the mainstream to the big guys
• Work out how to work on that scale
  • Web is free, data availability not a problem
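
Worked check of the web-size arithmetic
The figures on the "ItWaC:Google ratio, best estimate", "Italian web size" and "German web size" slides can be re-derived in a few lines of Python. This is only a sketch of the slide arithmetic, not the authors' code: the function name estimate_indexed_words and its parameter names are hypothetical, chosen here for illustration. The German call uses the rounded 3% shown on the slide, so it yields about 47 bn words rather than the slide's 44 bn; the slide figure presumably rests on an unrounded ratio, which is an assumption.

```python
def estimate_indexed_words(corpus_bn_words, corpus_to_google_ratio, raw_to_vert_factor=1.0):
    """Estimate Google's indexed sentential, non-dupe text in billions of words.

    corpus_bn_words        -- corpus size in billions of words (from the slides)
    corpus_to_google_ratio -- corpus:Google ratio before correction (e.g. 0.03)
    raw_to_vert_factor     -- correction factor (the halved raw:vert average);
                              1.0 means no correction is applied
    """
    coverage = corpus_to_google_ratio * raw_to_vert_factor
    return corpus_bn_words / coverage


# Italian: 3% ratio, raw:vert average 4.4 halved to 2.2, ItWaC = 1.67 bn words
italian_coverage = 0.03 * (4.4 / 2)
italian_bn = estimate_indexed_words(1.67, 0.03, 4.4 / 2)

# German: the slide gives DeWaC as 3% of Google directly, DeWaC = 1.41 bn words
german_bn = estimate_indexed_words(1.41, 0.03)

print(f"ItWaC:Google = {italian_coverage:.1%}")          # 6.6%
print(f"Italian estimate: {italian_bn:.1f} bn words")    # 25.3 bn words
print(f"German estimate:  {german_bn:.1f} bn words")     # 47.0 bn words
```

Keeping the correction factor as a separate parameter makes it easy to see how sensitive the estimate is to the "half, for conservativeness" step: using the full 4.4 instead of 2.2 would halve the Italian estimate.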