Thursday, July 12, 2018

The trouble with scientific faith, in this case, in AI

This post was originally posted to the Global Open Access List (GOAL) on July 12, 2018 with the following title:  Why translating all scholarly knowledge for non-specialists using AI is complicated.
To view the full conversation, go to the GOAL archives for July 2018. 
On July 10 Jason Priem wrote about the AI-powered systems "that help explain and contextualize articles, providing concept maps, automated plain-language translations"... that are part of his project's plan to develop a scholarly search engine aimed at a nonspecialist audience. The full post is available here:

We share the goal of making all of the world's knowledge available to everyone without restriction, and I agree that reducing the conceptual barrier for the reader is a laudable goal. However, I think it is important to avoid underestimating the size of this challenge and the potential for serious problems to arise. Two factors to consider: the current state of AI, and the conceptual challenges of assessing the validity of automated plain-language translations of scholarly works.

Current state of AI - a few recent examples of the current status of AI:

Vincent, J. (2016). Twitter taught Microsoft's AI chatbot to be a racist asshole in less than a day. The Verge.

Wong, J. (2018). Amazon working to fix Alexa after users report bursts of 'creepy' laughter. The Guardian.

Meyer, M. (2018). Google should have thought about Duplex's ethical issues before showing it off. Fortune.

Quote from Meyer: 
As prominent sociologist Zeynep Tufekci put it: “Google Assistant making calls pretending to be human not only without disclosing that it’s a bot, but adding ‘ummm’ and ‘aaah’ to deceive the human on the other end with the room cheering it… horrifying. Silicon Valley is ethically lost, rudderless and has not learned a thing.”

These early instances of AI applications involve the automation of relatively simple, repetitive tasks. According to Amazon, "Echo and other Alexa devices let you instantly connect to Alexa to play music, control your smart home, get information, news, weather, and more using just your voice". This is voice to text translation software that lets users speak to their computers instead of using keystrokes. Google's Duplex demonstration is a robot dialing a restaurant to make a dinner reservation. 

Translating scholarly knowledge into simple plain text so that everyone can understand it is a lot more complicated, with the degree of complexity depending on the area of research. Some research in education or public policy might be relatively easy to translate. In other areas, articles are written for an expert audience that is assumed to have spent decades acquiring a basic knowledge in a discipline. It is not clear to me that it is even possible to explain advanced concepts to a non-specialist audience without first developing a conceptual progression. 

Assessing the accuracy and appropriateness of a plain-text translation of a scholarly work intended for a non-specialist audience requires expert understanding of the work and thoughtful understanding of the potential for misunderstandings that could arise. For example, I have never studied physics. If I looked at an automated plain-language translation of a physics text I would have no means of assessing whether the translation was accurate or not. I do understand enough medical terminology, scientific and medical research methods to read medical articles and would have some idea if a plain-text translation was accurate. However, I have never worked as a health care practitioner or health care translation researcher, so would not be qualified to assess the work from the perspective of whether the translation could be mis-read by patients (or some patients).

In summary, Jason and I share the goal of making all of our scholarly knowledge accessible to everyone, specialists and non-specialists alike. However, in the process of developing tools to accomplish this it is important to understand the size and nature of the challenge and the potential for serious unforeseen consequences. AI is in very early stages. Machines are beginning to learn on their own, but what they are learning is not necessarily what we expected or wanted them to learn, and the impact on humans has been described using words like 'creepy', 'horrifying', and 'unethical'. The task of translating complex scholarly knowledge for a non-specialist audience and assessing the validity and appropriateness of the translations is a huge challenge. If this is not understood, and plans are not made to conduct rigorous research on the validity of such translations, the result could be widespread dissemination of incorrect translations.


Heather Morrison
Associate Professor, School of Information Studies, University of Ottawa
Professeur Agrégé, École des Sciences de l'Information, Université d'Ottawa

Thursday, July 05, 2018

Ceased and transferred publications and archiving: best practices and room for improvement

In the process of gathering APC data this spring, I noticed some good and some problematic practices with respect to journals that have ceased or transferred publisher.

There is no reason to be concerned about OA journals that do not last forever. Some scholarly journals publish continuously for an extended period of time, decades or even centuries. Others publish for a while and then stop. This is normal. A journal that is published largely due to the work of one or two editors may cease to publish when the editor(s) retire. Research fields evolve; not every specialized journal is needed as a publication venue in perpetuity. Journals transfer from one publisher to another for a variety of reasons. Now that there are over 11,000 fully open access journals (as listed in DOAJ), and some open access journals and publishers have been publishing for years or even decades, it is not surprising that some open access journals have ceased to publish new material.

The purpose of this post is to highlight some good practices when journals cease, some situations to avoid, and room for improvement in current practice. In brief, my advice is that when you cease to publish a journal, it is a good practice to continue to list the journal on your website, continue to provide access to content (archived on your website or another such as CLOCKSS, a LOCKSS network, or other archiving services such as national libraries that may be available to you), and link the reader interested in the journal to where the content can be found.

This is an area where even the best practices to date leave some room for improvement. CLOCKSS archiving is a great example of the state of the art, but CLOCKSS' statements and practice indicate some common misunderstandings about copyright and Creative Commons licenses. In brief, author copyright and CC licenses and journal-level CC licensing are not compatible. Third parties such as CLOCKSS should not add CC licenses as these are waivers of copyright. CC licenses may be useful tools for archives; however, archiving requires archives, and the licenses on their own are not sufficient for this purpose.

I have presented some solutions and suggestions to move forward below, and peer review and further suggestions are welcome.

Details and examples

Dove Medical Press is a model of good practice in this respect. For example, if you click on the title link for Dove's Clinical Oncology in Adolescents and Young Adults a pop-up springs up with the following information:
"Clinical Oncology in Adolescents and Young Adults ceased publishing in January 2017. All new submissions can be made to Adolescent Health, Medicine and Therapeutics. All articles that have been published in Clinical Oncology in Adolescents and Young Adults will continue to be available on the Dove Press site, and will be securely archived with CLOCKSS".
Because the content is still available via Dove's website, the journal is not included on CLOCKSS' list of triggered content. This is because CLOCKSS releases archived content when it is no longer available from the publisher's own website.

CLOCKSS Creative Commons licensing statement and practice critique

One critique of CLOCKSS concerns this statement from its home page: "CLOCKSS is for the entire world's benefit. Content no longer available from any publisher ("triggered content") is available for free. CLOCKSS uniquely assigns this abandoned and orphaned content a Creative Commons license to ensure it remains available forever".

This reflects some common misperceptions with respect to Creative Commons licenses. As stated on the Creative Commons "share your work" website:  [your emphasis added] "Use Creative Commons tools to help share your work. Our free, easy-to-use copyright licenses provide a simple, standardized way to give you permission to share and use your creative work— on conditions of your choice".

The CLOCKSS statement  "CLOCKSS uniquely assigns this abandoned and orphaned content a Creative Commons license to ensure it remains available forever" is problematic for two reasons.
1. This does not actually reflect CLOCKSS' practice. The Creative Commons statements associated with triggered content indicate publisher rather than CLOCKSS' CC licenses. For example, the license statement for the Journal of Pharmacy Teaching on the CLOCKSS website states: "The JournalPharmacyTeaching content is copyright Taylor and Francis and licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License".

2. This would be even more problematic if it did reflect CLOCKSS' practice. This is because CLOCKSS is not an author or publisher of the scholarly journals and articles included in CLOCKSS. Creative Commons provides a means for copyright owners to indicate willingness to share their work. When a third party such as CLOCKSS uses CC licenses, they are explicitly or implicitly claiming copyright in order to waive their rights under copyright. This reflects an expansion rather than limitation of copyright that may lead to the opposite of what is intended. For example, if one third party is a copyright owner that wishes to claim copyright in order to grant broad-based downstream rights, another third party could use the copyright claim to support their right to claim copyright in order to lock down others' works. A third party that is a copyright owner providing free access today could use this copyright claim in future as a rationale for toll access. This could come into play if in future toll access seems more desirable from a business perspective.

The CLOCKSS practice of publisher-level copyright (see 1. above) is problematic because Creative Commons' first release of CC licenses was in December 2002. Scholarly journal publishing predates 2002 (the first scholarly journals were published in 1665), and not every journal uses CC licenses even today. Retroactive journal-level CC licensing would require re-licensing of every article that was published prior to the journal's first use of CC licensing.

For example, the copyright statements of volume 1 dated 1990 on the PDFs of the CLOCKSS-triggered Journal of Pharmacy Teaching read: "Journal of Pharmacy Teaching, Vol. l(1)1990 (C) 1990 by The Haworth Press, Inc. All rights reserved". This suggests that all authors in this journal at this point in time assigned full copyright to The Haworth Press, although actual practice was probably more complex. For example, if any authors were working for the U.S. federal government at the time, their work would have been public domain by U.S. government policy. Any portions of third party works included would likely have had separate copyright. Even assuming the simplest scenario, all authors had and transferred all rights under copyright to Haworth Press, the authors would retain moral rights, hence it would be necessary to contact all of the authors to obtain their permission to re-license the works under Creative Commons licenses.

The idea of journal-level CC licensing is at odds with the idea of author copyright. This confusion is common. For example, the Licensing FAQ on the website of the Open Access Scholarly Publishers Association states: "one of the criteria for membership is that a publisher must use a liberal license that encourages the reuse and distribution of content" and later "Instead of transferring rights exclusively to publishers (the approach usually followed in subscription publishing), authors grant a non-exclusive license to the publisher to distribute the work, and all users and readers are granted rights to reuse the work". If copyright and CC licenses really do belong to the authors, then journal-level Creative Commons license statements are incorrect.

Even more room for improvement

The above, while leaving some room for improvement, appears to reflect best practices at the present time. Other approaches leave even more room for improvement. For example, in 2016 Sage acquired open access publisher Libertas Academica. The titles that Sage has continued can now be found on the Sage website. The Libertas Academica titles that Sage no longer publishes can be found as triggered content on the CLOCKSS website. However, the original Libertas Academica website no longer exists and there is no indication of where to find these titles from the Sage website.
Titles that were formerly published by BioMedCentral are simply no longer listed on the BMC list of journals. For example, if you would like to know where to find Gigascience, formerly published by BMC, you can find information at the site of the current publisher, Oxford. A note on the SpringerLink page indicates that BMC maintains an archive of content on its website. However, if you look for Gigascience on the BMC journal list, it simply is not listed. It would be an improvement to follow the practice of Dove and include the title, link to the archived content, and provide a link to the current publisher.

Solutions? Some suggestions

If journals and publishers were encouraged to return copyright to the authors when a journal is no longer published, or a book is no longer being actively marketed (in addition to using their existing rights to archive and make works freely available), then authors, if they chose to do so, could release new versions of their works. For example, a work currently available in PDF could be re-released in XML to facilitate text and data-mining, or perhaps updated versions, and authors could, if desired, release new versions with more liberal licenses than journal-level licenses that must of necessity fit the lowest common denominator (the author least willing or able to share).

Education is needed, both within the existing open access community and beyond. First, we need to understand the perhaps unavoidable micro-level nature of at least some elements of copyright under conditions of re-use of material. For example, if a CC-BY licensed image by one photographer or artist is included in a scholarly article written by a different person that is also CC-BY licensed, the moral rights, including attribution, are different for the copyright holder of the image and that of the author of the article. In academia, attribution and moral rights are essential to our careers.

The intersection of plagiarism and copyright is different in academia. If one musical composer copies another's work, copyright law is likely the go-to remedy. If a student presents someone else's work as their own, academic procedures for dealing with plagiarism will apply, regardless of the copyright status of the work. For example, the musician using a public domain work need not worry about copyright but the student using a public domain work without attribution is guilty of plagiarism and likely to face serious consequences. Evolving norms for other types of creators (amateur or professional photographers, video game developers) may not work for academia.

For CLOCKSS, a statement that all triggered content is made freely available to the public, and that additional rights may be available for some works, with advice to look at the work in question to understand re-use rights, would be an improvement.

Your comments and suggestions? 

This is an area where even today's best practices are wanting, and the solutions / suggestions listed above are intended as an invitation to open a conversation on potential emerging practices that may take some time to fully figure out. Peer review and suggestions are welcome, via the comments section or e-mail. If you are using e-mail, please let me know if I may transfer the content to this post and if so whether you would like to be attributed or not.

This post is cross-posted to the Sustaining the Knowledge Commons research blog and forms part of the Creative Commons and Open Access Critique series. Comments and suggestions are welcome on either blog.

Wednesday, July 04, 2018

Dramatic Growth of Open Access June 2018

Congratulations to DOAJ for recently surpassing a milestone of over 3 million articles searchable at the article level!

The outstanding growth story by percentage for the second quarter of 2018 was bioRxiv. From March 31 - June 30, bioRxiv grew by 5,290 articles for a total of 28,070 articles, a growth rate of 23% for this quarter and 129% (more than doubled) over the past year.
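As a rough check on the bioRxiv figures, the quarterly rate can be recomputed from the two numbers quoted. This is only a sketch: the function name `quarterly_growth_rate` is illustrative, and it assumes the rate is measured against the total at the start of the quarter (the current total minus the articles added).

```python
def quarterly_growth_rate(current_total: int, added: int) -> float:
    """Return growth over the period as a fraction of the starting total.

    Assumption: the starting total is current_total - added.
    """
    starting_total = current_total - added
    if starting_total <= 0:
        raise ValueError("starting total must be positive")
    return added / starting_total


# bioRxiv figures quoted above: 5,290 articles added, 28,070 in total.
rate = quarterly_growth_rate(current_total=28_070, added=5_290)
print(f"{rate:.0%}")  # prints 23%
```

The implied starting total is 22,780 articles, which matches the bioRxiv total reported for March 31, 2018.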

38 of the limited set of indicators that I track had growth rates this quarter of 2% or more, equivalent to 8% annual growth, more than double the base rate of growth of scholarly journals and articles of 3 - 3.5% (de Solla Price, 1963; Mabe & Amin, 2001).
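The conversion from a quarterly to an annual rate used here is simple multiplication by four. The sketch below shows that conversion next to a compounded version for comparison. The compounded figure is an added assumption, not the method used in this post, and both function names are illustrative.

```python
def annualize_simple(quarterly_rate: float) -> float:
    """Annual equivalent by simple multiplication (the convention in this post)."""
    return quarterly_rate * 4


def annualize_compound(quarterly_rate: float) -> float:
    """Annual equivalent if each quarter's growth compounds on the last.

    Assumption for comparison only; not the method used in this post.
    """
    return (1 + quarterly_rate) ** 4 - 1


quarterly = 0.02  # the 2% quarterly threshold
print(f"simple:   {annualize_simple(quarterly):.1%}")    # prints simple:   8.0%
print(f"compound: {annualize_compound(quarterly):.1%}")  # prints compound: 8.2%
```

At rates this small the two conventions differ only slightly, and either way the result is more than double the 3 - 3.5% background rate.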

My best guesstimates of "how much open access there is" are based on the meta-search tool BASE (the Bielefeld Academic Search Engine). BASE harvests metadata from repositories and open access journals using OAI-PMH. BASE now contains over 130 million documents from 6,444 sources. About 60% are open access; collectively, the OA movement now makes available about 78 million open access documents. This quarter, BASE grew by over 13 million documents for a quarterly growth rate of 11%.

The Internet Archive as usual showed robust growth in a number of services - software components grew by 11% this quarter for a total of just over 230,000; audio recordings grew by 8% and are now over 8.8 million; collections also grew by 8% and are now over 325,000; close to a million texts were added this quarter for a growth rate of 6% and a total of over 16.5 million texts; there are close to 200,000 more videos for a growth rate of 5%; webpages and television each grew by 3%. There was a decrease in the number of images this quarter, down 18% or close to 700,000 images (does anyone know why? - if so please comment), in contrast with the annual growth for images from last year of 115% (more than double).

For OA publishing, this quarter SCOAP3 grew by 1,772 documents or 9%. The Directory of Open Access Books added 826 books and 17 publishers, 7% growth this quarter for both indicators. RePEc added over 2,000 books for a quarterly growth rate of 6% (journal articles and total downloadable items each grew by 2%). DOAJ added about 7 new titles per day this quarter for a total net growth of 624 journals, a growth rate of 6%; DOAJ also grew by 6% in the number of journals and articles searchable at the article level, and as noted above, DOAJ surpassed a milestone of over 3 million articles searchable at the article level. DOAJ also added 4 countries this quarter.
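The "about 7 new titles per day" figure for DOAJ can be checked by dividing the net growth by the number of days in the quarter. This is a minimal sketch, assuming the quarter runs from March 31 to June 30, 2018 (the dates given for bioRxiv above); the function name `net_additions_per_day` is illustrative.

```python
from datetime import date


def net_additions_per_day(net_added: int, start: date, end: date) -> float:
    """Average net additions per calendar day between two dates."""
    days = (end - start).days
    if days <= 0:
        raise ValueError("end must be after start")
    return net_added / days


# Assumed quarter boundaries: March 31 to June 30, 2018 (91 days).
per_day = net_additions_per_day(624, date(2018, 3, 31), date(2018, 6, 30))
print(f"{per_day:.1f} journals per day")  # prints 6.9 journals per day
```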

A PubMed keyword search for "cancer" limited to the last year returned 5% more free fulltext this quarter. However, the same search with no date limit resulted in a slight (1%) decrease in free fulltext (does anyone know why? If so please comment). The same search with date limits of 5 years or 2 years resulted in a 2% increase in free fulltext. The number of items in PubMed Central grew by 4% this quarter, adding 200,000 items for a total of 4.9 million (watch for the 5 million milestone coming soon). PMC journal participation grew by 2% this quarter on several indicators: the number of journals actively participating in PMC, the number of journals providing immediate free access, the number of journals depositing all content in PMC, and the number of journals that deposit some content in PMC.

arXiv grew by 3%; ROARMAP OA policy listings by 2%, as did the total number of journals that can be read free of charge listed by the Electronic Journals Library.

Congratulations and thank you to every one of the thousands of journals, repositories, publishers, and related services, and the millions of authors choosing to make your work open access.  Please accept my apologies for not tracking everyone, due to my human limitations. I encourage everyone to applaud and celebrate your own, and your neighbour's, accomplishments and milestones - and share them with everyone in the OA movement by joining the OATP tag team.

To download the data go to the DGOA dataverse.

This post is part of the Dramatic Growth of Open Access series.


Mabe, M., & Amin, M. (2001). Growth dynamics of scholarly and scientific journals. Scientometrics, 51(1), 147-162.
Price, D. J. d. S. (1963). Little science, big science. New York: Columbia University Press.

Monday, April 02, 2018

Dramatic Growth of Open Access March 2018

As usual, open access is showing strong growth in many directions: more open access archives, documents, journals, articles, and books. This quarter focuses on the large number of indicators of growth beyond the usual background growth of scholarly journals and articles of 3 - 3.5% per year. Newcomer bioRxiv, with 21% growth this quarter (equivalent to 84% annual growth), is far above this background growth. This quarter, DOAJ added a net total of 378 journals, or more than 4 journals per day, for a total of 11,105 journals. The number of journals searchable at the article level has increased by 236 for a total of 8,045 journals. The number of articles searchable at the article level is just under 3 million. The number of documents searchable through BASE grew by 3.5 million for a total of just under 124 million (about 60% of these, over 74 million, are open access). BASE added 121 content providers for a total of over 6,000 content providers. The percentage of PubMed records for a search for "cancer" that retrieve full-text is 27% overall, with a high of 45% for records published in the last 5 years. The percentage of full-text retrieval is rising at a steady rate.

The overall growth rate for scholarly articles and journals has been fairly steady over the past few centuries, in the range of 3 - 3.5% growth annually (Price, 1963; Mabe & Amin, 2001). As noted in the following chart, in the past quarter alone there have been 43 indicators of growth above that level, at least 1% in the quarter (equivalent of 4% annually). 
Quarterly growth (%) | Item | Total as of 03/31/18 | Quarterly growth (number)
21% | bioRxiv articles | 22,780 | 3,958
13% | DOAB books | 11,685 | 1,370
10% | SCOAP3 articles | 19,778 | 1,736
9% | Internet Archive Video | 4,128,556 | 328,556
8% | Internet Archive Collections | 338,578 | 25,578
8% | Internet Archive Recordings | 4,094,506 | 294,506
7% | Internet Archive Television | 1,607,000 | 107,000
7% | DOAJ # of articles searchable at article level | 2,984,612 | 192,911
6% | DOAB # publishers | 261 | 14
5% | PubMed keyword search: cancer - last year - free fulltext | 59,695 | 3,083
5% | Internet Archive Texts | 15,760,271 | 760,271
5% | RePEc chapters | 49,294 | 2,376
5% | Internet Archive Webpages (billions) | 325 | 15
4% | Internet Archive Images | 3,865,878 | 165,878
4% | RePEc journal articles | 1,659,120 | 67,779
4% | PubMed keyword search: cancer - last 5 years - free fulltext | 367,509 | 13,048
4% | Internet Archive Software | 206,098 | 7,098
4% | DOAJ # journals | 11,105 | 378
3% | PubMed keyword search: cancer - last 2 years - free fulltext | 142,572 | 4,723
3% | RePEc downloadable articles | 2,354,480 | 75,341
3% | ROARMAP # OA policies | 916 | 27
3% | DOAJ # journals searchable at article level | 8,045 | 236
3% | PubMed keyword search: cancer - free fulltext | 980,174 | 28,288
3% | BASE # documents | 123,932,954 | 3,549,531
3% | PMC journals with some articles open access | 682 | 18
2% | DOAJ # countries | 124 | 3
2% | arXiv articles | 1,375,438 | 32,713
2% | PMC select deposit journals | 4,588 | 94
2% | BASE # content providers | 6,159 | 121
2% | RePEc books | 35,263 | 626
2% | PubMed keyword search: cancer - last 5 years - all results | 810,024 | 13,629
2% | RePEc working papers | 807,624 | 13,499
2% | OpenDOAR # repositories | 3,517 | 53
2% | Elektronische Zeitschriftenbibliothek - # journals that can be read free of charge | 60,129 | 889
1% | chapters (OECD ilibrary) | 60,300 | 840
1% | PubMed keyword search: cancer - all results | 3,639,629 | 47,117
1% | PMC journals with immediate free access | 1,852 | 20
1% | ROAR # repositories | 4,643 | 46
1% | RePEc software components | 4,068 | 40
1% | OECD ilibrary tables and graphs | 175,500 | 1,650
1% | PMC actively participating journals | 2,466 | 20
1% | OECD ilibrary working papers | 5,600 | 40
1% | PMC journals that submit all articles | 2,108 | 15
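
The percentages in the chart can be reproduced from its last two columns. The sketch below does this for a handful of rows copied from the chart. It assumes each percentage is the quarterly increase divided by the total at the start of the quarter, rounded to the nearest whole percent; the function name `growth_percent` is illustrative.

```python
# (item, total as of 03/31/18, quarterly growth in number), copied from the chart.
rows = [
    ("bioRxiv articles", 22_780, 3_958),
    ("DOAB books", 11_685, 1_370),
    ("SCOAP3 articles", 19_778, 1_736),
    ("DOAJ # journals", 11_105, 378),
    ("BASE # documents", 123_932_954, 3_549_531),
]


def growth_percent(total: int, added: int) -> int:
    """Quarterly growth as a whole-number percentage of the starting total.

    Assumption: starting total = end-of-quarter total - quarterly increase.
    """
    return round(100 * added / (total - added))


for item, total, added in rows:
    print(f"{growth_percent(total, added)}% {item}")
```

For these five rows the computed values are 21%, 13%, 10%, 4%, and 3%, in agreement with the first column of the chart.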


Mabe, M., & Amin, M. (2001). Growth dynamics of scholarly and scientific journals. Scientometrics, 51(1), 147-162.

Price, D. J. d. S. (1963). Little science, big science. New York: Columbia University Press.

This post is part of the Dramatic Growth of Open Access series.  Full data can be downloaded from here.