I am (it seems) almost constantly reading, evaluating, and passing judgment on material written by others: not just when I'm synthesizing material for my own papers or blog essays, but also as a peer reviewing manuscripts and grants written by colleagues, or as a teacher grading student papers. It comes with the territory of being a professor, of course. As it happens, it's that time of year again when I brace myself for a surge in this activity, because I'm deluged by a variety of student writings, mostly term papers from my majors classes. My students typically have to write essays synthesizing material from the peer-reviewed literature - which often begins with learning what is and isn't a truly peer-reviewed source! Once the papers come in (hopefully with the appropriate number of peer-reviewed citations), you might still hear me muttering some of the usual complaints we grouchy professors share about poor writing styles, lack of structure, grammatical errors, general incoherence, etc. But another growing worry now, in these globally networked days, is about plagiarism. And my worries on both counts are heightened right now because of a recent spate (hopefully not a big one) of reports about a variety of problems in the scientific publishing business: from reputable publishing houses putting out fake journals with a veneer of peer-review in exchange for $$ from big pharma, to individual scientists faking research to produce a bunch of papers, to apparently widespread plagiarism among papers archived in the Medline database!
Much has already been written in the science blogosphere (including by me) about the still-unfolding case of Elsevier publishing those fake biomedical journals for payment from Merck and other unknown clients. As we are still digesting that, this week's Nature has an article about scientific misconduct, including deliberate fraud and plagiarism, which is apparently quite widespread! That last bit comes from another article published in Science a couple of months ago, based on a promising new approach to detecting plagiarism - so let me start by quoting from that very paper:
The peer-review process is the best mechanism to ensure the high quality of scientific publications. However, recent studies have demonstrated that the lack of well-defined publication standards, compounded by publication process failures (1), has resulted in the inadvertent publication of several duplicated and plagiarized articles.
The increasing availability of scientific literature on the World Wide Web has proven to be a double-edged sword, allowing plagiarism to be more easily committed, while simultaneously enabling its simple detection through the use of automated software. Unsurprisingly, various publishing groups are now taking steps to reinforce their publication policies to counter the fraudulent acts of a few (2). There are now dozens of commercial and free tools available for the detection of plagiarism. Perhaps the most popular programs are iParadigm's "Ithenticate" (http://ithenticate.com/) and TurnItIn's originality checking (http://turnitin.com/), which recently partnered with CrossRef (http://www.crossref.org/) to create CrossCheck, a new service for verifying the originality of scholarly content. However, the content searched by this program spans only a small sampling of journals indexed by MEDLINE. Others include EVE2, OrCheck, CopyCheck, and WordCHECK, to name a few.
So what Long and colleagues have done is create a new automated process to search through the entire Medline database and find similar citations! While mostly automated, it is still a complex process: they use several publicly available computational/database tools (eTBLAST, and Déjà vu, which they created) to search for high levels of citation similarity. Their search algorithm works not just on keywords and references, but on entire sentences and larger chunks of words. Once papers are flagged as being highly similar, they are subjected to full-text analysis, which includes examination and interpretation by a human observer! And the results:
As of 20 February 2009, there were 9120 entries in Déjà vu with high levels of citation similarity and no overlapping authors. Thus far, full-text analysis has led to the identification of 212 pairs of articles with signs of potential plagiarism. The average text similarity between an original article and its duplicate was 86.2%, and the average number of shared references was 73.1%. However, only 47 (22.2%) duplicates cited the original article as a reference. Further, 71.4% of the manuscript pairs shared at least one highly similar or identical table or figure. Of the 212 duplicates, 42% also contained incorrect calculations, data inconsistencies, and reproduced or manipulated photographs.
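To make the idea of flagging "highly similar" texts a bit more concrete, here is a toy sketch in Python. To be clear, this is not eTBLAST or the Déjà vu pipeline, whose internals I am not reproducing here; it is a generic word-trigram ("shingle") comparison, and the function names, the threshold of 0.5, and the example sentences are all my own assumptions for illustration.

```python
import re

def shingles(text, n=3):
    """Return the set of overlapping n-word sequences in a text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Shared shingles divided by all shingles (0.0 = disjoint, 1.0 = identical)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

def flag_pairs(docs, threshold=0.5, n=3):
    """Compare every pair in {name: text}; return pairs scoring at or above threshold.

    The threshold is an arbitrary illustrative value, not one taken from the paper.
    """
    names = sorted(docs)
    flagged = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            score = jaccard(docs[x], docs[y], n)
            if score >= threshold:
                flagged.append((x, y, score))
    # Highest-scoring pairs first, so a human reviewer starts with the worst cases
    return sorted(flagged, key=lambda t: -t[2])

# Hypothetical example texts
docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "peer review is the best mechanism we have",
}
print(flag_pairs(docs))
```

Anything such a filter flags is only a candidate: as in the study quoted above, a person still has to read the pair side by side before calling it plagiarism.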
Long et al. then confronted the authors of the original and the duplicate papers, as well as editors of the journals where they were published, with a questionnaire, and cataloged a fascinating range of reactions:
Before receiving the questionnaire, 93% of the original authors were not aware of the duplicate's existence. The majority of these responses were appreciative in nature. The responses from duplicate authors were more varied; of the 60 replies, 28% denied any wrongdoing, 35% admitted to having borrowed previously published material (and were generally apologetic for having done so), and 22% were from coauthors claiming no involvement in the writing of the manuscript. An additional 17% claimed they were unaware that their names appeared on the article in question. The journal editors primarily confirmed receipt and addressed issues involving policies and potential actions.
They offer a sampling of the responses in the paper, and more in supplementary material available on the Science website. And I, for once, am glad that that litany of excuses and mea culpas is behind Science's pay firewall, because I don't want to add to the list already in use by our students! And speaking of them, let me share my own recent experience:
Last semester, I had my first significant encounter with plagiarism in my evolution class: two of the term papers submitted turned out to be copies of older work. And that happened despite our campus' use of Turnitin, a commercial and widely used plagiarism detection service, which failed in both cases for different reasons. The way it works is: students submit their papers for a given class/assignment through turnitin.com, where their software runs a similarity analysis based on their own (proprietary, I think) database, and produces an originality report with an overall similarity score (%) and a list of matches with prior sources from their databases. So how did this fail?
In one (more straightforward) case, Turnitin thought the paper submitted was fine, with a low similarity score (4%) - but the way the paper was written set my sensors off immediately because it was unlike anything this student had turned out in the class until then! Puzzled by the low Turnitin score, and strongly suspicious of the contents of the paper, I turned to Google, with little luck at first - until I decided to copy and paste almost an entire (small) paragraph from the paper into the search box. And lo, the paper turned out to be an almost exact copy of a paper published in a relatively obscure journal in 1975 - a paper apparently missing from Turnitin's database! I have no idea what the student was thinking when submitting an exact copy of a published paper, but clearly, the online filtering system our campus pays for had failed.
The second case was somewhat more complex, and again required some vigilance on my own part to catch it without (or rather, in spite of) Turnitin's help. This paper was submitted through Turnitin several days before the deadline, and I had allowed the system to let students see their own originality reports. This one lit up the board with a >90% similarity score. The next day (deadline) the paper had been pulled and another document uploaded - with a new similarity score <20%! All looked well and good, and the system, including my inclination to give students a second chance to fix their own mistakes (incl. potential plagiarism), seemed to have worked. Yet a suspicion lurked in my mind as I graded that paper, forcing me to dig up the original source (also a term paper from an earlier semester!) which had led to the initial >90% similarity score. It was when I manually compared the two papers side by side that comprehension slowly dawned, and my jaw dropped: the new paper was essentially the old one cleverly disguised! For instance, all the sentences were different - yet every paragraph had the same content and the entire paper was organized in the same way! So the re-edit had focused on altering almost every sentence to evade, successfully, Turnitin's algorithms - but the broader semantic structure remained identical! Go figure!!
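For the curious, here is a toy Python sketch of why that kind of disguise can work. I don't know how Turnitin's (proprietary) matching actually operates, so this is purely an assumption-laden illustration: one score is sensitive to exact phrasing (word trigrams), the other ignores word order and compares only the content words of paragraphs taken in the same order. The stopword list, function names, and example paragraphs are all made up for the purpose.

```python
import re

# A tiny, hypothetical stopword list; a real one would be much longer
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are",
             "by", "that", "this", "it", "for", "on", "with", "as"}

def words(text):
    """Lowercased alphabetic words in a text."""
    return re.findall(r"[a-z]+", text.lower())

def jaccard(sa, sb):
    """Shared items divided by all items; 0.0 when both sets are empty."""
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def trigram_overlap(a, b):
    """Similarity of three-word sequences: drops sharply when sentences are reworded."""
    wa, wb = words(a), words(b)
    ta = {tuple(wa[i:i + 3]) for i in range(len(wa) - 2)}
    tb = {tuple(wb[i:i + 3]) for i in range(len(wb) - 2)}
    return jaccard(ta, tb)

def content_overlap(a, b):
    """Similarity of content-word sets: ignores word order entirely."""
    return jaccard(set(words(a)) - STOPWORDS, set(words(b)) - STOPWORDS)

def compare_by_paragraph(old, new):
    """Pair up paragraphs in order; return (trigram, content) scores for each pair."""
    old_paras = [p for p in old.split("\n\n") if p.strip()]
    new_paras = [p for p in new.split("\n\n") if p.strip()]
    return [(trigram_overlap(o, n), content_overlap(o, n))
            for o, n in zip(old_paras, new_paras)]

# Hypothetical two-paragraph "papers": same content, every sentence reworded
OLD = "Island finches evolved different beaks.\n\nDrought changed the seeds available."
NEW = "Different beaks evolved in island finches.\n\nThe available seeds changed with drought."

for trigram, content in compare_by_paragraph(OLD, NEW):
    print(f"phrase-level: {trigram:.2f}   content-level: {content:.2f}")
```

In this contrived example the phrase-level score is zero for both paragraphs while the content-level score is perfect, which is roughly what my side-by-side reading revealed. A cleverer rewrite using synonyms would fool this check too, so it is no substitute for actually reading the papers.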
So where does all this leave us? Long et al, again:
While there will always be a need for authoritative oversight, the responsibility for research integrity ultimately lies in the hands of the scientific community. Educators and advisors must ensure that the students they mentor understand the importance of scientific integrity. Authors must all commit to both the novelty and accuracy of the work they report. Volunteers who agree to provide peer review must accept the responsibility of an informed, thorough, and conscientious review. Finally, journal editors, many of whom are distinguished scientists themselves, must not merely trust in, but also verify the originality of the manuscripts they publish.
We try to teach our students (many pre-med types) how to filter peer-reviewed research from stuff that's not been so reviewed - only to have a major scientific publisher get in bed with a pharma multinational to pull the rug out from under us! The internet has made plagiarism much easier for the lazy - and for the lazy wealthy who can pay others to plagiarize on their behalf, as reported on NPR recently. At the same time, the growing volume of scientific publications, not to mention student essays and blogs, etc., makes it much harder for us to keep up with all the potential sources of plagiarism. On the positive side, as Errami, Long, and colleagues show, one can turn the other side of that double-edged sword to our advantage: develop good search algorithms which will help us catch plagiarism too. However, their approach is still quite painstaking - as they said, at the time they published their analysis, they still had >9000 flagged potentially duplicate papers awaiting human inspection, suggesting the magnitude of the problem was only likely to grow! The onus, indeed, is on all of us in the scientific community. But do I have the time to go through such an intensive process with the next manuscript sent to me for peer-review? Do journals - especially the more trustworthy ones published by scientific societies - have the resources to commit to this high level of scrutiny in-house? Or do we need to explore newer models of publishing science, leveraging some of the newer elements of the Web 2.0 world, e.g., crowdsourcing some of this review process? Can the PLoS One model, perhaps, help us shift some of the onus of detecting plagiarism (and conflicts of interest, ethical concerns, etc.) away from the shoulders of (unpaid) peer-reviewers and editors? While we figure a way out, remember: these are dark, dangerous, and exciting times for scientific publishing (as indeed for all publishing). In the meantime we must exercise: Constant Vigilance!!
Dove, A. (2009). Regulators confront blind spots in research oversight. Nature Medicine, 15 (5), 469-469. DOI: 10.1038/nm0509-469a
Errami, M., Sun, Z., Long, T., George, A., & Garner, H. (2009). Deja vu: a database of highly similar citations in the scientific literature. Nucleic Acids Research, 37 (Database). DOI: 10.1093/nar/gkn546
Long, T., Errami, M., George, A., Sun, Z., & Garner, H. (2009). Scientific integrity: Responding to possible plagiarism. Science, 323 (5919), 1293-1294. DOI: 10.1126/science.1167408