Tuesday, March 20, 2012

Copyright for expression of ideas; patent law for ideas

This post is a second reply to a post David Prosser wrote on the GOAL list in response to my post on the RCUK consultation, highlighting the intellectual property issues. This post is a mixture of answers, my perspectives, and questions. In my opinion, David Prosser's brief example raises a number of issues which can help us to move forward with understanding libre open access. In brief, I argue that facilitating data and text mining, and the works that result from it, is not a matter of copyright at all (crawling text and data is simply normative in the context of the World Wide Web, for example). What it requires is making works openly available, in a format that permits text and data mining.

On 18-Mar-12, at 5:07 AM, David Prosser wrote:

Say I wanted to data mine 10,000 articles.  I'm at a university, but I am co-funded by a pharmaceutical company and there is a possibility that the research that I'm doing may result in a new drug discovery, which that company will want to take to market.  The 10,000 articles are all 'open access', but they are under CC-BY-NC-SA licenses.  What mechanism is there by which I can contact all 10,000 authors and gain permission for my research?


First, before I comment on intellectual property issues, I would like to point out that the concept of "intellectual property" is a relatively recent invention, and one that arguably should be challenged. For details, see the second chapter of my draft thesis; from here, search for: The invention of “intellectual property”: enclosure of knowledge. Also, a disclaimer that I am a scholar whose work intersects with intellectual property issues, but not a copyright lawyer or expert.  Given that the arguably fictional "intellectual property" is legally nonfiction throughout most of the world, following are some reflections arising from David's example.

Copyright covers the expression of ideas, not the ideas themselves. If a researcher employed by a pharmaceutical firm were to read 10,000 articles and this research resulted in an idea for a new drug, the pharmaceutical firm would not need to seek permission from any of the authors of the articles in order to apply for a patent. Text mining is merely an automated form of reading, so again, there is no need to seek permission from authors to apply for a patent. The World Intellectual Property Organization (WIPO) provides a brief overview of intellectual property which explains the various forms well. In brief, there are about five forms of intellectual property, many of which actually have opposing expectations. Patent law is a public declaration of rights to use an idea or procedure, and openness is appropriate. Patent law is designed to protect rights to private profit. Trade secret law is also designed to protect private property; in this case, however, the protection is achieved through secret, private means rather than a public, open process.

The question of whether copyright permissions are, or should be, necessary for data or text mining is an important issue to address when considering libre open access (including broader re-use rights in contrast to the free-to-read gratis open access). I argue that no special copyright related permissions are necessary. As evidence, here is a quick illustration:

Try a Google search for: "To pursue, within the limits of the STM Association's aims and objectives, the highest possible level of international protection of copyright works and of the services of publishers in making these works available" and it should be quite easy to find the Introduction to Copyright & Legal Affairs of the International Association of Scientific, Technical and Medical Publishers (STM):  http://www.stm-assoc.org/copyright-legal-introduction/ There is nothing on the STM website to indicate that special rights have been granted for text mining. STM is certainly not naive or neutral about intellectual property rights; the founding reason for the existence of STM is protection of IP. Yet clearly Google, a commercial company, is crawling this site and returning results. There is nothing the slightest bit exceptional about this example. This is how the World Wide Web works! If anyone wants to post things on the web but not make them available for crawling, it is up to the website owner to opt out by indicating that they do not want their site crawled.
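The usual way for a website owner to indicate this is a robots.txt file, which well-behaved crawlers check before fetching pages. As a rough sketch only, the following uses Python's standard urllib.robotparser to show how a crawler might decide whether it may fetch a page. The robots.txt contents, the crawler names (ExampleMiner, OtherBot) and the example.org URLs are all invented for illustration and do not describe any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is kept out of /private/,
# and one named crawler (an invented name) is kept out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleMiner
Disallow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("OtherBot", "https://example.org/articles/1"),
        ("OtherBot", "https://example.org/private/draft"),
        ("ExampleMiner", "https://example.org/articles/1"),
    ]:
        print(agent, url, may_crawl(ROBOTS_TXT, agent, url))
```

The point of the sketch is that the default is permissive: a page is crawlable unless the site owner has said otherwise.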

Some subscription-based scholarly publishers do not allow text or data mining of their databases. It seems likely that they are interpreting the multiple downloads often involved as pirating of their copyrighted content. That is, the basis for refusing to allow text or data mining is an interpretation of the activity as a violation of copyright - or a fear that the publisher cannot allow text or data mining while simultaneously preventing copyright violation - rather than any actual violation of copyright by text or data mining. If publishers' products contain DRM preventing text or data mining, that is a different matter. Legal protection for the publishers in this instance involves DMCA-style laws and contract law - not copyright law. Within the context of library subscriptions, data and text mining can be included in contracts. Here is the relevant text from the BC Electronic Library Network model license: 3.1.11 "DATA and TEXT MINING. Members and Authorized Users may conduct research employing data or text mining of the Licensed Materials". This language is not original to BC ELN, but was developed from research on other model licenses, including those of JISC, CRKN, and OCUL. In the real world, copying this kind of work with informal permission but without attribution is actually the norm, as we all want to work towards standards and avoid re-inventing the wheel.

What is needed to provide for data and text mining, I argue, is not changes to copyright but rather content that is made openly available over the World Wide Web in formats that are easily crawled for these purposes, such as XHTML rather than locked-down PDFs.
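To illustrate why markup formats are easy to mine, here is a minimal sketch that pulls the text out of a small XHTML fragment and counts word frequencies, using only Python's standard library. The sample document and the names TextExtractor and term_counts are invented for this example; real text mining would involve much more than counting words.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an (X)HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def term_counts(xhtml: str) -> Counter:
    """Return how often each lower-cased word occurs in the document text."""
    extractor = TextExtractor()
    extractor.feed(xhtml)
    extractor.close()
    text = " ".join(extractor.chunks).lower()
    return Counter(re.findall(r"[a-z]+", text))


# Invented sample article, standing in for an openly available XHTML page.
SAMPLE = "<html><body><h1>Aspirin</h1><p>Aspirin reduces fever.</p></body></html>"

if __name__ == "__main__":
    print(term_counts(SAMPLE).most_common(3))
```

Nothing comparable is possible with a locked-down PDF without first circumventing the lock, which is the practical obstacle described above.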

I understand that Europe (as a whole, or just some countries) may have some odd laws that would prohibit text and data mining. This may help to explain why people are trying to use copyright law as a means of ensuring permissions for text and data mining. I would like to know more about this; if anyone can provide details, links, etc., that would be most helpful for all of us to really understand the issues.

My first response to David Prosser's question, challenging the underlying assumption that increasing corporatization of the university is acceptable, can be found here.

Discussion is welcome.


  1. I agree with you that reading should not trigger copyright. However, in practice it soon might if efforts to apply copyright to ephemeral copies in computer memory are enacted. (I think this proposal may be part of the Trans-Pacific Partnership; it's certainly not the first time copyright maximalists have pushed for it.) The rest of my comment is a narrow response to specific points.

    The European Union has a sui generis database right covering the content of databases. It applies if the database owner has made "a substantial investment in preventing unauthorized use of information" (May & Sell, Intellectual Property Rights: A Critical History, 2006, Lynne Rienner, Boulder CO, p. 149). The right can be renewed indefinitely. Needless to say, there are continuing efforts to enact similar rights elsewhere.

    I recall efforts to create an open index of postal codes in the U.K. as an alternative to the database of published codes, which is covered by copyright.

    Even without such a right, a similar effect can be achieved. For example, in the U.S., references to legal precedent refer to page numbers in publications owned by particular publishers. The legal decisions themselves are public domain - but the page numbers are not. (At least this used to be the case; I think there have been efforts to improve the situation, though I don't know how.)

    As to the idea/expression doctrine, at this point it seems to me to be more a rhetorical justification than a genuine legal norm. May and Sell write, "The courts have stretched copyright law to cover such things as algorithms and aggregates of facts in ways that eradicate the ideas-expression dichotomy at the heart of copyright and extend new protections to facts per se" (p. 151). They give an example (pp. 151-2):

    In 1977 the Ninth Circuit in San Francisco heard a case brought by Sid and Marty Krofft against McDonald's fast food company. The Kroffts had created a children's television program, H. R. Pufnstuf, which portrayed a live-action fantasyland of talking trees and magical creatures. McDonald's approached the Kroffts about basing some television advertisements on H. R. Pufnstuf. They did not agree to terms, but McDonald's went ahead and developed a series of commercials based in "McDonaldland" (complete with talking trees and magical creatures). Despite the fact that McDonald's had differentiated the expression of the characters from H. R. Pufnstuf in its rendition, the court ruled against McDonald's and in favor of the Kroffts. In so doing, the court "extended to the realm of visual and narrative entertainment a new principle of idea protection: 'total concept and feel'" (Vaidhyanathan 2001).

  2. Thanks, Geof; these are useful comments and citations. The fight for fair copyright has many fronts, and needs lots of copyfighters on each of them.

    Regarding your comment, "The European Union has a sui generis database right covering the content of databases. It applies if the database owner has made "a substantial investment in preventing unauthorized use of information" (May & Sell, Intellectual Property Rights: A Critical History, 2006, Lynne Rienner, Boulder CO, p. 149)." This is certainly concerning, and could have implications for open access.

    Let's consider the implications for the major point of this discussion between Prosser and me, that is, whether CC-BY is optimal for open access. Consider the case of the faculty member who paid to publish in BioMed Central, then was angry to find that someone was selling her article (behind a paywall) for $3: https://groups.google.com/a/arl.org/group/sparc-oaforum/browse_thread/thread/fc977cabd0d59bcc
    This reseller has obviously made "a substantial investment in preventing unauthorized use of information" by putting this and similar articles behind a paywall. In this case, CC-BY, by granting commercial rights, creates a situation where a third party that contributed nothing at all to the publishing of scholarly articles can gain rights. If the author had published under CC-BY-NC instead, then what this reseller is doing would clearly be in violation of the CC license. This is another example of why I think that specifying noncommercial is important to protect OA downstream. For more on this topic, see my response to the RCUK consultation on their draft new OA policy.

    The postal codes example falls under the push for open data / open government. The civicaccess.ca list has had discussions about this very topic in Canada. This is an important issue because publicly available postal codes would be very useful as a democratic tool (i.e. assisting people to contact their representatives).

