Annex 4.4.2
Board meeting 2/12-13, March 13th, 2013

MEMORANDUM

To: IFRRO Board
From: IFRRO General Counsel
Re: Text and Data Mining
Date: 12 February 2013

A. BACKGROUND

New areas of application outside the traditional life sciences and drug discovery are emerging in the social sciences, humanities, business and marketing. Text and data mining (TDM) appears to occupy an important niche in the text analytics field, and licensing of “data” has become an increasingly important issue in science.[1] So far, IFRRO members CCC and STM, together with PLS and the Publishers Association (PA) in the UK, are active in the area of TDM. These activities are outlined in more detail below.

a. Definitions

“Text and Data Mining” is used mostly as a collective term covering both text mining and data mining. However, there is no universally agreed definition, partly because the term is used by different communities for different purposes. At the outset, it is helpful to distinguish between text mining as the extraction of semantic logic from text, and data mining as the discovery of new insights.

(i) Data Mining

Data mining is an analytical process that looks for trends and patterns in datasets that reveal new insights: implicit, previously unknown and potentially useful pieces of information. It is the extraction of trends and patterns from data.[2]

(ii) Text Mining

Text mining, on the other hand, is the extraction of meaning from a body of text. Generally, text mining is seen as the indexing of content.[3] It has also been defined as “analysis of data contained in natural language text”[4], or described as: “Text mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text.
High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning.”[5]

[1] http://pantonprinciples.org/ and http://www.isitopendata.org/
[2] Jonathan Clark, Text Mining and Scholarly Publishing, Publishing Research Consortium 2012, page 19.
[3] Jonathan Clark, Text Mining and Scholarly Publishing, Publishing Research Consortium 2012, page 5.
[4] Definition provided by Roy Kaufmann (CCC) during the CCC/ALPSP TDM webinar on 11 December 2012.
[5] http://en.wikipedia.org/wiki/Text_mining

(iii) Text and Data Mining

The difference between text mining and data mining is somewhat blurred when statistical analysis is used to extract meaning from the text. One could argue that, from a computer’s point of view, text mining and data mining are very similar. In the agreement between STM, PDR and ALPSP, the definition includes both: “Text and Data Mining (TDM): download, extract and index information from the Publisher’s Content to which the Subscriber has access (…).”[6]

b. Examples of Text Mining

Some examples of text mining users are the following websites:

- CiteXplore – EBI/UKPMC [7]
- ChemSpider [8]
- SureChem (http://www.surechem.com) [9]
- BrainMap.org [10]
- Relay Technology Management Inc. (http://relaytm.com) [11]

Text mining of scholarly content: See figure 2, from: http://www.jisc.ac.uk/media/documents/publications/reports/2012/value-text-mining.pdf

B. LEGAL LANDSCAPE

There are legal uncertainties around text mining, and there is no consensus on how to best deal with them. Some perspectives from the UK, US and EU are outlined below.

a. UK

Hargreaves Report [12]
- Recommended TDM exception to copyright
- Is it copyright?
- Technology, access, security, privacy

UK Parliamentary Business Innovation and Skills Committee Report, June 2012 [13]
- Encourages licenses
- Encourages publishers to develop business models

[6] http://www.stm-assoc.org/2012_09_12_PDR_ALPSP_STM_Text_Mining_Press_Release.pdf
[7] http://www.ebi.ac.uk/literature/trainees/citexplore.html
[8] http://www.chemspider.com/
[9] SureChem is a search engine for patents that allows chemists to search by chemical structure, chemical name, keyword or patent field. It is looking to add other sources of data, for instance journal articles, and to extend into biology, and perhaps further (“Take My Content Please!”, Nicko Goncharoff, http://river-valley.tv/take-my-content-please-the-service-based-business-model-of-surechem/).
[10] BrainMap is a database of published functional and structural neuroimaging experiments. The database can be analysed to study human brain function and structure.
[11] Relay Technology Management Inc. is a company that uses text mining to create information products for pharmaceutical and biotech companies.
[12] http://www.ipo.gov.uk/ipreview-finalreport.pdf
[13] http://www.publications.parliament.uk/pa/cm201213/cmselect/cmbis/367/367.pdf and http://www.publications.parliament.uk/pa/cm201213/cmselect/cmbis/367/367vw.pdf

JISC report [14]
- Limited uptake of TDM within UK universities
- A lack of skilled staff
- High transaction and entry costs
- Recommended working with publishers, technology service providers and other key stakeholders
- Explore the technical requirements for optimal provision of text mining infrastructure services
- Focus on interoperability and metadata standards

The UK Hargreaves report [15] recommended that text and data mining be excepted from UK copyright.
However, it is questionable whether an exception would indeed remove the legal uncertainties, as claimed in the Hargreaves report.[16] The UK Government’s White Paper, Modernising Copyright, published on 20 December 2012, states that the Government will amend the law “(…) so that it is not an infringement of copyright for a person who already has a right to access a work (whether under a licence or otherwise) to copy the work as part of a technological process of analysis and synthesis of the content of the work for the sole purpose of non-commercial research. This will enable key research without undermining publishers’ control over IT systems or commercial exploitation. A licence governing access to a work will not be able to prevent or restrict use of the work in accordance with this exception, but it may impose conditions of access to the licensor’s computer system or to third party systems on which the work is accessed. Therefore the exception will not prevent a publisher from applying technological measures on networks required in order to maintain security or stability, or from licensing higher volumes of access to research outputs at an additional cost. To the extent that technological measures prevent a researcher benefiting from this exception, they will be able to appeal to the Secretary of State. This measure will not provide a “right to data mine” works to which the researcher does not already have a right of access, and will not cover data mining for commercial purposes.
This is consistent with the principles of the Finch Review of Open Access to publicly funded research, which concluded earlier this year”.[17]

[14] http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx
[15] Digital Opportunity, A Review of Intellectual Property and Growth, An Independent Report, Prof. Ian Hargreaves, May 2011, http://www.ipo.gov.uk/ipreview-finalreport.pdf.
[16] The Value and Benefits of Text Mining, JISC, http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx.
[17] http://www.ipo.gov.uk/response-2011-copyright-final.pdf

OVERVIEW: UK Copyright and Text Mining

Hargreaves report, May 2011:
“According to the Wellcome Trust, 87 per cent of the material housed in UK’s main medical research database (UK PubMed Central) is unavailable for legal text and data mining.”
http://www.ipo.gov.uk/ipreview-finalreport.pdf

The Guardian, 23 May 2012: “What do publishers have against this hi-tech research tool?”
“[c]ountless ... academics are prevented from using the most modern research techniques because the big publishing companies such as Macmillan, Wiley and Elsevier, which control the distribution of most of the world's academic literature, by default do not allow text mining of the content that sits behind their expensive paywalls.”
http://www.guardian.co.uk/science/2012/may/23/text-mining-research-tool-forbidden

JISC text mining report, March 2012:
“Legal uncertainty, inaccessible information silos, lack of information and lack of a critical mass are barriers to text mining within UKFHE.”
“The UKFHE sector collaborates with content publishers and service providers to explore potential new business models and innovative text mining services that meet the sector’s requirement”.
[Recommendation 2, p.5]
http://www.jisc.ac.uk/media/documents/publications/reports/2012/value-text-mining.pdf

The Business Innovation and Skills Committee, in the report of its inquiry The Hargreaves Review of Intellectual Property: Where next?, recommends:
“We believe that publishers should seek rapidly to offer models in which licences are readily available at realistic rates to all bona fide licensees and we encourage the Department to promote early development of such models.” (#65)

The UK IPO in its Consultation on Copyright stated that (#7.96):
“The Government proposes to make it possible for whole works to be copied for the purpose of data mining for non-commercial research.”

However, the BIS Committee concluded that “we believe that content mining should be opened up by way of managed but nevertheless accessible licensing processes.” (#64)

Dame Janet Finch, in her Report on how to expand access to research publications, makes this appeal to publishers:
“Subject to any legislative changes following the Hargreaves review, all publishers will have to consider what arrangements they will put in place to make their content available for text and data mining.” [#9.26, p.106]

b. US

HathiTrust Litigation [18]
- “The search capabilities have already given rise to new methods of academic inquiry such as text mining”
- “Plaintiffs also argue that non-consumptive research such as text mining causes harm (…) because authors [sic] might one day pay for licences.”
- Argument deemed speculative
- Court concludes “no CCC licence”

c. EU

- European Commission launched a stakeholder dialogue on TDM in early 2013 [19]
- CFC (Sandra Chastanet) and PLS (Sarah Faulder) as participants, and the IFRRO Secretariat (Olav Stokkmo, replaced by James Boyd at the first meeting) as observer, are represented in Working Group 4 (Text and data mining for scientific and research purposes), launched in Brussels on 4 February 2013

[18] http://www.publishersweekly.com/pw/by-topic/digital/copyright/article/54321-in-hathitrust-ruling-judge-says-google-scanning-is-fair-use.html
[19] http://europa.eu/rapid/press-release_MEMO-12-950_en.htm#PR_metaPressRelease_bottom

C. THE CCC PILOT: ADVANTAGES, DISADVANTAGES AND SOME OTHER CONSIDERATIONS

Following CCC’s pilot project, advantages, disadvantages and other considerations with respect to TDM were outlined in the webinar “Content Data and Text Mining: From Containers to Enhanced Research Tools” (11 December 2012). Below are some aspects from the presentation by CCC and ALPSP (Association of Learned and Professional Society Publishers) and the related discussions at that webinar:

a. Opinion of publishers

Scholarly publishers have been aware for some time of the rising market demand for text mining of their publications. The industry is working to streamline and enable the means better to meet that demand. In her report for the Publishing Research Consortium, Journal Article Mining, Eefke Smit summarised practices, policies and plans at the time of publication in May 2011. Some of her findings are highlighted below:

“Publishers are relatively liberal in granting permission: over 90% grant research-focused mining requests, 60% in most or all cases, 33% for some cases. 32% allow any kind of mining without permissions needed. 68% of publishers consider mining requests on a case by case basis.
More than 80% require information on intent and purpose.”[20]

A total of 32% of publisher respondents allow any and all kinds of mining without permission being needed, including the 28% who have an Open Access policy for this. 69% of publisher respondents consider mining requests on a case by case basis, 14% have a formal policy that is publicly stated, 28% have no general policy, and 21% are formulating a policy. When permission is requested, 35% of publisher respondents generally allow mining in all or the majority of cases, and another 53% in some cases. More than 80% require information on intent and purpose for all or most cases. 53% of publisher respondents will decline mining requests if the results can replace or compete with their own products and services.[21]

According to Jonathan Clark[22], a great challenge for publishers also seems to be the creation of an infrastructure that makes their content more machine-accessible and that also supports everything text-miners or computational linguists might want to do with the content.

[20] http://www.publishingresearch.net/documents/PRCSmitJAMreport20June2011VersionofRecord.pdf
[21] http://www.publishingresearch.net/documents/PRCSmitJAMreport20June2011VersionofRecord.pdf
[22] Jonathan Clark, Text Mining and Scholarly Publishing, Publishing Research Consortium 2012.

b. Obstacles

According to CCC, the main obstacles holding back TDM can be grouped into three categories:

- Technical issues: lack of common formats and interoperability, and also lack of agreement on access/authentication arrangements.
- Licensing arrangements: current requirements for users to negotiate separately with multiple publishers, together with some legal uncertainties; lack of cross-publisher cooperation, incl. on technical formats.
- Business models and market development: no clear view of the value of TDM or how to measure it; the lack of (established) business models or pricing models for TDM; uncertainty and fear (primarily on the part of publishers); limited awareness and use of TDM outside pharma.

c. Prospects of success

In CCC’s view, the main items that would accelerate the development of TDM are:

- A central broker, with two distinct roles:
  - a rights clearance or licensing role;
  - a technical function that might include normalising the data or (more ambitiously) creating a central TDM marketplace and database, hosting standardized XML content and providing access for mining.
- Fixed-term licenses (allowing unlimited access for mining while in force), which might simplify things for the user while allowing publishers to revert to alternative arrangements if necessary.
- Agreement on standard content formats.
- Maintaining momentum and urgency.

d. Further considerations

For licensed content, pharma and academic users argue that the right to mine content to which they have paid-for licensed access (or which is freely available on the web, e.g. in repositories) should be included as standard within the license agreement. Publishers generally want individual discussions because business models have not yet been formalised and the technical framework still needs to be established. New services bring new costs. Unlicensed content is an opportunity for both sides.

e. Solution offered by CCC

In CCC’s opinion, an intermediary organisation between users and publishers could provide a viable platform for addressing the aforementioned TDM issues and could facilitate a conversation between both parties to seek a middle ground.
Against this background, Reed Elsevier is setting up a pilot automated licensing system: researchers at institutions involved in the pilot will have access to a self-service process that gives access to their institution’s subscribed Elsevier content through APIs (Application Programming Interfaces)[23]. It will not be necessary to consider requests on a case-by-case basis. The bulk of requests will be considered pre-approved, an automatic licence generated, and access provided through the automated system.

[23] The Elsevier Article API facilitates search and access to scientific journals and scientific articles. The API provides web services for searching for journals, journal volumes, specific issues, articles, and article images. The Article and Article Image specific API interactions provide access to the full-text article XML (and the associated images) and enable a mash-up developer to render the returned article in customizable formats.

The CCC Pilot includes both commercial and non-commercial uses for the following users and publishers:

- Users:
  - Bio Medical
  - Chemical
  - Marketing
- Publishers:
  - Social Sciences
  - Bio Medical
  - Physical Sciences
  - Plant Sciences

f. What users want

In CCC’s view, users want:

- Flexibility:
  - Obtain a license to access the database
  - Secure that license without interrupting their workflow
- Confidentiality:
  - Users’ queries for licensing content should be kept confidential
  - The minimum amount of information should be obtained for licensing purposes
- Control:
  - Define the TDM algorithms and services involved
  - Define the objective of the mining
  - Retain access to the outcomes of the TDM activities

g. Main advantages for users

CCC assumes that the main advantages for users are:

- A single centralised point of licensing across rightholders
- A check of existing license coverage and the ability to purchase new licenses when needed
- A standardised format for all related content regardless of the content’s origin
- A set of discovery tools and metadata descriptors for all content across publishers
- An API and seamless access to all content to be mined

h. Main advantages for publishers

From CCC’s perspective, the main advantages for publishers are as follows:

- Elimination of the need for rightholders to standardise their format
- Flexible licensing of content for text and data mining
- Flexible pricing of content
- Visibility into the aggregated data related to data mining
- Ability to avoid individual negotiations

CCC PILOT – User benefits:
- Flexible licensing – timely access
- Confidential access (essential for pharma)
- Single point of content access and delivery
- Standardised content format across publishers

CCC PILOT – Outcomes:
- No need for bilateral licensing and individual negotiations
- Develop new business models for content access (e.g. unsubscribed content)
- Potential for extension into the academic research space – solves further significant access issues

D. TDM SERVICES OFFERED (JOINTLY) BY UK PA, PLS, CCC, CROSSREF AND STM

Other IFRRO members are also (jointly) developing model licences for publishers to meet the text mining needs of researchers. Inter alia, STM has prepared a sample STM-PDR model licence for pharma[24], PLS is offering its services as a clearing-house for requests (single point of contact per project; commonly agreed terms), and the UK PA is co-ordinating a cross-industry effort to provide users with a click-through licence.

On 19 December 2012, EMMA, ENPA, EPC, FEP and STM co-hosted a “Mini-Seminar on Text and Data Mining” at the European Commission’s premises.
The seminar was well-attended, inter alia by Commission representatives from DG MARKT, DG CULTURE and DG CONNECT. Presentations were given by, inter alia, Jonathan Clark (author of a guide on TDM), Maximilian Haeussler (researcher at UC Santa Cruz), Eefke Smit (STM), Andrew Hughes (NLA/PDLN), Sarah Faulder (PLS), and representatives from Springer, Elsevier and Wiley-Blackwell.

a. PLS

Work has begun to establish a Clearing House for TDM permissions at the PLS, based on an enhancement of its existing rights database, PLSe. According to PLS, this could act as an entry point for researchers wanting to mine journal content. Appropriate rightholders would be identified on behalf of the researcher and the necessary permissions facilitated. Once the content to be mined has been specified and the rightholders to that content have been identified, then, subject to licences, protocols are needed to verify the permissions that enable mining tools to be applied to full text articles on the publishers’ platforms.

PLS plans to develop licences that would support smaller publishers not in a position to negotiate their own licences directly. PLS recommends that the first step for a publisher who wishes to make content available for text mining is to decide the terms and conditions under which they will do so. This will be governed by whether the purpose is commercial or non-commercial. It is not always clear, however, who the rightholder is, nor how to contact them to seek permission. Several organisations, including PLS, CCC and CrossRef, are working to enable services in this area.

[24] http://www.stm-assoc.org/text-and-data-mining-stm-statement-sample-licence/

As a rightholder, the publisher must give permission for text mining. This can be done in a number of ways. Permission can be included in an access licence agreement with, for instance, an institution. STM has produced a model clause for this purpose[25].
Some publishers have established a process for individual researchers to obtain permission to text mine with some restrictions[26], while others do not support text mining yet. Some organisations, such as PubMed, allow unrestricted text mining without permission. The Pharma-Documentation-Ring (P-D-R) recently updated its sample licence to grant text and data-mining rights for the content to which each of the P-D-R members subscribes.[27]

Researchers may need to track down and contact ‘potentially hundreds’ of publishers for permission to mine their text (permissions are not required to mine data per se). Connecting researchers to rightholders could be a task for RROs. The solution envisaged by PLS, to be fully functional by mid-2013, is a single discovery portal (in order to find the appropriate publishers and route permission requests to the relevant person in the publishing house). With the PLS database, PLS is developing a clearing house for researchers and a licensing service for the long tail of publishers.

b. UK PA

The UK PA is aiming to convene a cross-sector working group comprising researchers, funders, technology providers, and publishers to develop a set of principles for a standard ‘click-through’ licence that meets the needs of both researchers wanting to use mining tools and publishers willing to grant user rights to their content. Developing such a licence requires a mutual understanding of needs and an active dialogue between the two communities, researchers and publishers.

To streamline permissions transactions even further, especially across multiple smaller publishers, a collective licence might be developed, potentially in collaboration with PLS and CLA. A collective licence could also be of value for publishers with less text to license, who may find it a more cost effective solution than managing their own permissions bilaterally.

c. CrossRef, CCC and STM

Having set up a Clearing House permissions service via PLS, and a group to develop a ‘click-through’ licence for the application of mining tools, publishers are currently exploring the means to enable text mining itself by using enhancements to existing technology. Several publishers and organisations are looking at this or planning working pilots, including CCC and CrossRef, an independent membership organisation aiming to promote the development and cooperative use of new and innovative technologies to speed and facilitate scholarly research.

[25] STM Statement on Text and Data Mining and Sample Licence, http://www.stm-assoc.org/text-and-data-mining-stm-statement-sample-licence/
[26] See, for example, Elsevier, http://www.elsevier.com/editors/open-access/open-access-policies/content-mining-policies; Springer: http://www.springeropen.com/about/datamining/.
[27] http://www.p-d-r.com/content/press_releases/archive/2012/

CrossRef is potentially well-positioned to provide solutions to most of the logistical and technical problems that have been identified by both publishers and researchers. By leveraging existing CrossRef and publisher infrastructure, with modest development efforts, it should be possible to establish an automated, centralised and efficient mechanism to allow researchers and publishers to agree to the terms of a standard text mining licence and to enable a standard cross-publisher mechanism for identifying and retrieving the full text of journal articles for text mining purposes.

CCC, in cooperation with STM, brought together users, publishers, and technology companies to explore the state of text and data mining for scientific publications and journals. The participants explored the key drivers and hindrances for TDM. Following that event, CCC convened a group of publishers and users from the US, UK, and Europe, in order to create a working pilot.
CCC’s TDM system is being built for the purpose of facilitating proper discovery of, and efficient access to, high quality articles while respecting the rights of publishers who create and manage content and databases. The key goals are:

i. to eliminate the burden of multiple formats for users and relieve publishers of the responsibility of normalization of content and data,
ii. to provide one-stop clearing of rights and/or access to content for each TDM project, by providing appropriate licenses on behalf of many different publishers, and
iii. to generate royalties for rightholders whose content will be used for the purposes of TDM.

Providing users with access to both subscribed and unsubscribed content for mining purposes is a key deliverable of the CCC project, and one which has been broadly accepted by both publishers and users.

To this end, the Publishing Research Consortium, a collaboration of publisher associations that supports research into scholarly communication in order to enable evidence-based discussion, has commissioned a Guide to Text and Data Mining (apparently not yet published) in order to provide practical guidance on the aims, methods, outputs, and rationale for text mining and also some insight into the technical implications and surrounding issues affecting publishers and their readership.

The Pharma-Documentation-Ring (P-D-R) sample license has been updated to grant text and data-mining rights to use the content to which each of the P-D-R members subscribes. The P-D-R sample license serves as a benchmark used by P-D-R’s members to negotiate individual subscription agreements with publishers and other content suppliers.[28] The text of the clause reads:

“Text and Data Mining (TDM): download, extract and index information from the Publisher’s Content to which the Subscriber has access under this Subscription Agreement.
Where required, mount, load and integrate the results on a server used for the Subscriber’s text mining system and evaluate and interpret the TDM Output for access and use by Authorised Users. The Subscriber shall ensure compliance with Publisher's Usage policies, including security and technical access requirements. Text and data mining may be undertaken on either locally loaded Publisher Content or as mutually agreed.”

[28] The agreement was reached between P-D-R, an association of twenty-one pharmaceutical companies, ALPSP and STM. See also: http://www.stm-assoc.org/2012_12_11_STM_Report_2012.pdf

E. AUTHOR INVOLVEMENT

So far, to our knowledge, the rightholder category involved in (project-related) work on text and data mining is (mainly) publishers. Authors do not seem to have been involved yet, but we assume that text mining also concerns authors, for instance as regards unpublished material/manuscripts.

The character of the use, as large scale subsidiary usages of multiple works by multiple rightholders, combined with the potential involvement of both authors and publishers, makes it appropriate to consider collective management of rights. It is therefore relevant for RROs to contemplate whether to offer their services to rightholders in relation to TDM.

F. OPPORTUNITIES FOR RRO INVOLVEMENT

a. Understanding the needs of users/researchers

Broadly speaking, there are (so far) four main reasons for users to embark on text mining: to enrich the content in some way; to enable systematic review of literature; for discovery; or for computational linguistics research.[29]

Against this background, RROs could consider offering services to authors and publishers in relation to TDM. Their involvement could help remove potential friction between TDM users and rightholders by handling payments and offering single licences, in particular given the demand for a central broker and fixed-term licenses.
Managed licensed access can deliver benefits for researchers irrespective of any legislation, which will not in itself resolve the significant technical issues involved. Therefore, streamlining the means to enable text mining will be essential. To achieve this, a deep understanding of the needs of researchers and content miners will be required, and collaboration will be needed between stakeholders across the sector.[30]

b. Making text mining work on commercial platforms

- Text/data mining applications, including previous examples, often are research project- or research-specific and not always attractive to commercial publishing platforms and their customers
- Value to the non-expert can be limited
- “Articles of the future”[31] and “Adventures in semantic publishing”[32] not widely implemented yet
- A solution for medical case reports in journals?[33]

[29] Jonathan Clark, Text Mining and Scholarly Publishing, Publishing Research Consortium 2012, page 7.
[30] http://text.soe.ucsc.edu/progress.html
[31] http://www.articleofthefuture.com/
[32] http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000361
[33] See also: www.casesdatabase.com

c. The need for a standardised ‘click-through’ licence

Once the content to be mined has been sufficiently specified so that rightholders of that content can be identified and approached, and once the necessary permissions have been sought and granted, those permissions still need to be consolidated into some form of licence. Work to develop model clauses that multiple publishers can use and adapt for their own purposes in individual licences has been in hand for some time, but so far this work has been principally applied to negotiations between the major STM publishers and the commercial pharmaceutical industry. The licence used there may not be appropriate for non-commercial transactions between academic researchers and the broader range of journal publishers.
Ideally, some form of standardised ‘click-through’ licence is needed to link up the process of verifying and granting permissions and the process of enabling access for the application of mining tools to full text published articles on the publishers’ (or authors’) platforms.

One way would be a clearing house for permissions that provides a single point of contact for researchers and rightholders. Ideally, this would operate through a standard ‘click-through’ licence. A machine-readable licence would enable every article with a defined identifier, for instance a DOI, to have the licence associated with it, which would greatly simplify the whole process. A researcher would accept the licence and receive a certificate that would work across all content.

Permission would be granted under defined terms and conditions of use that are usually detailed in the licence. This could be a standard licence or one designed specifically for a particular purpose. The period of time that a licence could cover would depend on the text mining needs. For computational linguistics research, a one-time access will often be sufficient. For systematic literature reviews and data mining, however, access will be needed over an extended period, as new content is added all the time.

Content may be delivered as a single delivery (“data dump”) or online access may be granted. Rightholders may choose to allow robot crawling of their digital content, possibly with restrictions. The use of a name identifier, such as ISNI (the ISO approved International Standard Name Identifier), would be useful to uniquely identify researchers and other contributors.

G. SOME LEGAL ISSUES

Text mining may frequently result in the creation of databases of facts or raw data extracted from the sources mined. From a legal perspective, it is not clear whether any resultant database is protected separately as a derivative work. This would need to be assessed on a case-by-case basis.
(On the other hand, some licences cover derivative works, but require attribution of the source, which might be challenging from a practical perspective.)

If data are understood as numerical representations of facts, they are generally not copyrightable, but there are:

- Many levels of data/derived digital data[34]
- Jurisdictional differences (e.g. US vs. Australian law; EU database rights)

The result is ambiguity about the legal status of content.

[34] Public consultation on implementing CC0 for data published in open access journals Sept-Nov 2012, http://blogs.biomedcentral.com/bmcblog/2012/09/10/put-the-open-in-open-data/; see also: Hrynaszkiewicz I, Cockerill MJ: Open by default: a proposed copyright license and waiver agreement for open access research and data in peer-reviewed journals. BMC Research Notes 2012, 5:494 http://www.biomedcentral.com/1756-0500/5/494

H. ELEMENTS IN A STANDARD TDM LICENCE

The Terms and Conditions of a standard TDM Licence could include – very briefly:

1. Definitions
2. Grant of licence
   RRO conditions for non-exclusive licence – permitted uses, incl. text (and data) mining, e.g.:
   - downloading, extracting and indexing information from licensed website/online sources;
   - mounting, loading and integrating results on a server used for text (and data) mining;
   - evaluating and interpreting the text (and data) mining output for access and use;
   - copying from digital publications and storing of digital copies: making/distributing and/or permitting making/distributing of paper copies; making available and/or permitting making available of digital copies;
   - (scanning material to) produce digital copies.
3. Conditions applying to the creation and use of licensed copies; further conditions applying to scanning and use of digital material (incl. security and technical access requirements)
4. Commercial uses (if applicable)
5. Duration
6. Payment
7. Notification / Notification to licensee’s staff
8. Data collection
9. Indemnity
10. Breach and termination
11. General: notices, variation of terms, assignments, jurisdiction/disputes/governing law, etc.

I. SOME FURTHER READING

- Witten, I.H. (2005), “Text mining”, in: Practical handbook of internet computing, edited by M.P. Singh, pp. 14-1 - 14-22. Chapman & Hall/CRC Press, Boca Raton, Florida; http://www.cs.waikato.ac.nz/~ihw/papers/04-IHW-Textmining.pdf
- National Centre for Text Mining (NaCTeM), http://www.nactem.ac.uk
- The Arrowsmith Project, http://arrowsmith.psych.uic.edu/arrowsmith_uic/index.html