Algorithms for Open Article Scoring

Open Science - What is it?

Open Science is a lively movement that aims to make the culture and industry that is scientific research open, accountable, transparent, and, well, more scientific. One could fill multiple volumes with in-depth descriptions of how a notion of ‘open’ can be applied to scientific research, the scientific method, the world/game of publishing, and so on, but here are a few pointers:

  • Open Publishing - the aim here is to get away from paywalling research publications. Research is publicly funded in almost every case, and as such Open Science holds that access to research should be open too.
  • Support of ‘non-desirable’ research - a lot of science consists of doing the same thing as other people in order to replicate their results. This is neither grand nor glamorous research, and it is unjustifiably seen as somehow boring, mediocre, or a waste of talent.
  • Open Methods and Open Data - the common belief of Open Science is that all data and tools used to generate the results of a study should be openly available, including the precise versions of software used to conduct the analysis and the precise algorithms that have interacted with the data, as well as Open Protocols detailing exactly how any other scientist could replicate the work. Yes, there are ethical concerns around genetic/psychological/other sensitive data, but we believe these are manageable.

An important takeaway is that most people involved in Open Science are (understandably, perhaps) realists. As such, the mantra of ‘don’t let perfect get in the way of better’ is a common theme in the discussions and work around Open Science.

Open Algorithms - What’s the issue?

In short, Big Data and ‘Weapons of Math Destruction’ (a phrase I have stolen from Cathy O’Neil, whose book of the same title is well worth a read).

Many of the algorithms in use in our daily lives are opaque, used without any kind of feedback or ‘common sense’ assessment, and can be hugely impactful on people’s lives and indeed their careers. Academic publishing has not escaped the clutches of this societal problem; algorithms with great opacity are in use to ‘grade’ or ‘score’ academic articles, and these scores are used to determine a researcher’s ‘value’ (for want of a better word) to a department or research group.

This is clearly wrong. We cannot see science and judge our peers through a scorecard darkly.

Background to the Proposal

What we want to do here is introduce a version of an algorithm that is currently in widespread use in the field of Information Security, the Common Vulnerability Scoring System (CVSS).

But that’s a terrible metric!

Yes. And no. Yes, it does have its flaws. Here we are potentially talking about taking on the gargantuan task of designing a scoring system that can be used across multiple scientific disciplines, and can score them with some level of sensitivity - what nonsense!

And yes, we would agree - until it is pointed out that this is already the case. This is already the state of affairs. But the difference is this: you can argue with me and my reasoning, call me an idiot, and ideally demonstrate how it should be done.

This cannot be done with the publishers, whose scores can make or break careers.

What we want to do is open up this black spot on academic publishing, and offer a tangible solution.

What it is

This is a description and justification of an algorithm based on a group workshop that took place at the 33rd Chaos Computer Congress (#33C3) in Dec. 2016. Should such a scoring system ever come to exist, it would be surprising to see any of the elements here survive into the final algorithm, but we need to open up the debate - and the easiest way to get people involved is to no longer simply say ‘someone should do this’, but instead to actually offer up a solution.

It is my firm belief that this system is incomplete, or highly localised to my field (mathematics) and locale (UK), given my use of the Research Excellence Framework (REF) criteria as a measurement of ‘Impact’, which is what our universities are mandated to use to justify their funding. These are issues I openly admit from the start.

But it is an open algorithm that works, based on an algorithm that is in widespread use and has been well documented and tested (CVSS is used for all PCI DSS-related scoring; it is that trusted).

What it is not

The be-all and end-all in this discussion.

Requirements for a Standard

I’ve laid out my motivation, and hopefully have managed some expectations. Here is what I want to offer up:

  • Description of the algorithm with justifications
  • Pseudocode description of the scoring
  • Actual working portable code (in Python/JavaScript/GoLang/whatever I can write at 3am)

It is by giving all these things that I think we will actually get some movement. My aim is that we no longer just talk about ‘how nice it would be if we had an open scoring system for academic articles’, but actually make one, provide example (possibly badly written) code that works, and use that output as an input for future discussions.

NB - I would actually be happy if people hated this algo and publicly ridiculed the ‘silly Scouse quasi-mathematician’. Why? Well, it would mean:

  1. they read it
  2. they understood it
  3. there is a better way, and they’re likely to show the world what that better way is

But this discussion, this disagreement, is only possible in an Open Science environment. At the moment, we as researchers are at the mercy of a silent and darkly judgemental algorithm that is written by people. The mathematics an algorithm is written in is a language like any other - you can express beauty, poetry, and, yes, opinions in it. It is not infallible.

Maybe we need an algorithm per Research Council/funding body? Maybe we need an algorithm per individual science or sub-science? (Perhaps what works for materials physicists doesn’t work for high-energy physicists?) It is my belief that we don’t know until we have something to argue about, disagree with, and pull apart analytically in order to come to some conclusions about these questions.

TL;DR - Now we can be scientific about how we rate/score scientific articles.

Proposed Algorithm - the Common Article Scoring System (CASS)

We propose a Common Article Scoring System (CASS) that is designed to be a generic way of scoring articles based on some loose but important criteria. To do this, all we have done is rename parts of the Common Vulnerability Scoring System and tweak some of the weightings so that they fit the scoring more generically.

CVSS Overview

CVSS is used extensively in information security to give a ‘marks out of ten’ (a very human thing) for a vulnerability. In short, there are two sides to a score - the number, and the vector. If you take the vector, you can see how someone assessed the severity of a vulnerability; my clients could perform their own risk management based on being able to see how and why I considered some vulnerability a major issue or not.

CVSS is, however, quite old, and in some ways doesn’t really fit many vulnerabilities we see in the wild (I’ll spare the descriptions of how for an infosec blog post), but it is continually maintained and recently a new version was released.

CVSS sees two main areas of risk - Exploitability (how easy is it to take advantage of the vulnerability) and Impact (when taking advantage, how much control of the target system does the vulnerability give me). The breakdown is as follows:

  • Exploitability
    • Access Vector - how do you exploit this (over a network, sat at the computer, etc.)
    • Access Complexity - How difficult is it to exploit? (Do I have to hang from the ceiling Tom Cruise style, or can I do it in my PJ’s?)
    • Authentication - do I have to be logged in once? Twice? Not at all?
  • Impact - all metrics are measured as one of None/No Effect, Partial effect, Complete effect.
    • Confidentiality - can I expose sensitive data?
    • Integrity - can I modify any data on a system?
    • Availability - can I make the system/its data unavailable in any way? (Can I turn it off?)

Clearly, this has nothing to do with science, but the framework is, I think, the correct way to think about it. Especially the following requirement - every score must have its vector in order to be valid. This is very important for openness.
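
To make the ‘score plus vector’ idea concrete, a typical CVSS v2 assessment looks something like this:

AV:N/AC:L/Au:N/C:P/I:P/A:P  ->  7.5

Read left to right: exploitable over the (N)etwork, with (L)ow access complexity and (N)o authentication required, giving a (P)artial impact on each of confidentiality, integrity and availability. Anyone can recompute the 7.5 from those six letters using the published weights, and - more importantly - anyone can see why the assessor arrived at it.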

The aim is to have a score that is transparent. Suppose a peer reviewer does not want their comments to be made public - not an unreasonable request. A score with this structure would mean that the reviewer’s assessment of a paper can still be open without compromising that request.

Algorithm Description

The aim is to have an algorithm that is representative of the amount of work that was put into the research. No, this is not an easy thing to do, but we aim to try in the following ways:

  • Honour the scope of a study
  • Determine the impact of the study in the sciences (is it fairly bland? Or is it world-leading?)
  • Appreciate the scope for reproducibility and data provision - do we have copies of the data, the software tools developed, and the precise versions of analytical tools brought into play, so that the study can be reproduced?

Here is the algorithm that I have in mind, based on the above criteria (you’ll see a lot is similar in the mathematics, too). The aim is to split the assessment of an article into two areas:

  1. Availability - what is the ‘reach’ of the article?
    • Localisation - is the article applicable to a small or large area of study, or is it interdisciplinary?
    • Poss Values: [(S)mall area, (L)arge area, (M)ultidisciplinary]
    • Data Provision - is the data and source, with enough detail for reproduction, provided?
    • Poss Values: [(N)o data/source provided, (M)ost data/source provided, (F)ull provision of significant data/source provided]
    • Methodology - is it provided with no detail, an overview, or the full protocol?
    • Poss Values: [(N)o detail, (O)verview of methodology, (F)ull details of methodology]
  2. Impact - how much does the research impact science/the world? For this, I will use the UK REF criteria as follows:
    • Rigour - Low/Med/High rigour in the study
    • Poss Values: [(L)ow, (M)edium, (H)igh]
    • Significance - Low/Med/High significance of the study
    • Poss Values: [(L)ow, (M)edium, (H)igh]
    • Originality - Low/Med/High level of originality
    • Poss Values: [(L)ow, (M)edium, (H)igh]
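
To make that concrete, here is one (entirely unofficial) way an assessment could be recorded and turned into a CVSS-style vector. The letter codes are the ‘Poss Values’ above; the field abbreviations and the notation itself are just a sketch of mine, in Python:

# Purely illustrative: recording a CASS assessment and printing a CVSS-style vector.
# The letter codes are the 'Poss Values' listed above; the field abbreviations
# (L, DP, ME, R, S, O) are my own, hypothetical choices.
assessment = {
    "L": "M",    # Localisation: (M)ultidisciplinary
    "DP": "F",   # Data Provision: (F)ull provision of data/source
    "ME": "O",   # Methodology: (O)verview of methodology
    "R": "M",    # Rigour: (M)edium
    "S": "H",    # Significance: (H)igh
    "O": "M",    # Originality: (M)edium
}
vector = "/".join(f"{field}:{value}" for field, value in assessment.items())
print(vector)  # L:M/DP:F/ME:O/R:M/S:H/O:M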

The score weightings map one-to-one onto the CVSS metrics described above - Localisation, Data Provision and Methodology stand in for Access Vector, Access Complexity and Authentication, and the three REF criteria stand in for Confidentiality/Integrity/Availability - which we will make clear in the next section (cf. the description of the CVSS score calculation).

Here is a more in depth description of each area:

The Availability of a paper comments on the ways in which another researcher can interact with it. Could someone else reproduce the same results from the same data? Is the article relevant to a large audience across multiple disciplines, or is it localised to 3 blokes in a shed just outside Leeds? Here’s how I broke this down:

  • Localisation - Is the study relevant to a small audience? Or a large one? Improving upon this, is it a fully multidisciplinary study?
  • Data Provision - Do we have the full set/relevant chunks of data? Do we have enough data to reproduce a study? Can we accurately say what the raw data looked like from the submission by the authors?
  • Methodology - Do we have access to full descriptions of methods and protocols used? Do we know what statistical methods were used to assess the data?

The Impact section of the score is used to assess the way in which the study in question can be said to have ‘impact’. This is a very subjective score, but it is at least well defined in the Research Excellence Framework used in the UK. To that end, I’ve mapped those criteria to the risk assessment criteria of CVSS as low/med/high (in place of the none/partial/complete in CVSS) to fit the calculation as it stands:

  • Rigour - how rigorous was the study? Was the sample set large enough? Were the controls suitable? Were pitfalls suitably identified and mitigated/assessed in the study?
  • Significance - what is the significance of the study? Does it break new ground or spark a new area of study? Or does it only confirm basic results?
  • Originality - how original is the study? Did they find a new, super, amazing, wonderful way of interpreting the dataset? Or did they follow only established protocols and paradigms in the field?

As is hopefully clear, a highly original, multidisciplinary study that provides full data/source code and gives detailed methodologies gets a significantly higher score than one that simply reproduces well-known results without providing anything of real interest.

Pseudo-code Overview

Output_Score = round_1dp(((0.6*Impact)+(0.4*Availability)-1.5)*f(Impact))

Impact = 10.41*(1-((1-Rigour)*(1-Significance)*(1-Originality)))

Availability = 20*Localisation*DataProvision*Methodology

f(Impact) = [1 if Impact=0, 1.176 otherwise]

Localisation = one of the following:
	0.395 if small area
	0.646 if large area
	1.0 if multidisciplinary

DataProvision = one of the following:
	0.35 if little to no data provided
	0.62 if most data and/or some source code provided
	0.71 if full data and source code provision

Methodology = one of the following:
	0.45 if methodology not provided
	0.56 if most methodology provided
	0.704 if full methodology provided

Rigour = one of the following:
	0.0 if Low rigour is set
	0.275 if Med rigour is set
	0.660 if High rigour is set

Significance = one of the following:
        0.0 if Low significance is set
        0.275 if Med significance is set
        0.660 if High significance is set

Originality = one of the following:
        0.0 if Low originality is set
        0.275 if Med originality is set
        0.660 if High originality is set

NB - this is a shameless copy/paste effort atm. Going to do some more maths to determine whether f(Impact) is needed, for example. Probably not, is the answer.

Overview of changes from CVSS

  1. f(Impact) returns 1 if Impact is zero (where CVSS returns 0), and the 1.176 multiplier if Impact > 0 - this is because a paper can score low/low/low on the impact criteria and still contribute knowledge and data.
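
To see the difference this makes (using the weights from the pseudocode): a paper rated Low/Low/Low on the REF criteria but Multidisciplinary, with Full data and Full methodology, has Impact = 0 and Availability = 20*1.0*0.71*0.704 ≈ 10.0. Under the original CVSS rule (f(Impact) = 0) its score collapses to 0.0; with f(Impact) = 1 it keeps round_1dp((0.4*Availability) - 1.5) = 2.5.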

Example Implementations

To follow… I’ll make a github repo.
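
In the meantime, here is a rough first pass in Python - a straight translation of the pseudocode above. Treat it as a sketch to argue with rather than a reference implementation; the function and dictionary names are just my own choices.

# cass.py - a direct translation of the CASS pseudocode above.
# Weights are the (tweaked) CVSS v2 values listed in the pseudocode section.

LOCALISATION = {"S": 0.395, "L": 0.646, "M": 1.0}     # small area / large area / multidisciplinary
DATA_PROVISION = {"N": 0.35, "M": 0.62, "F": 0.71}    # none / most / full data+source
METHODOLOGY = {"N": 0.45, "O": 0.56, "F": 0.704}      # no detail / overview / full protocol
REF_CRITERION = {"L": 0.0, "M": 0.275, "H": 0.660}    # low / medium / high (rigour, significance, originality)

def cass_score(localisation, data_provision, methodology, rigour, significance, originality):
    """Return the CASS score, rounded to one decimal place."""
    availability = 20 * LOCALISATION[localisation] * DATA_PROVISION[data_provision] * METHODOLOGY[methodology]

    r, s, o = REF_CRITERION[rigour], REF_CRITERION[significance], REF_CRITERION[originality]
    impact = 10.41 * (1 - (1 - r) * (1 - s) * (1 - o))

    # CASS change from CVSS: f(Impact) is 1 rather than 0 when Impact is zero, so a
    # low/low/low paper that still provides its data and methods keeps a non-zero score.
    f_impact = 1 if impact == 0 else 1.176

    return round(((0.6 * impact) + (0.4 * availability) - 1.5) * f_impact, 1)

if __name__ == "__main__":
    # Highly original, multidisciplinary, fully open study: top of the scale.
    print(cass_score("M", "F", "F", "H", "H", "H"))  # 10.0
    # Low/low/low on the REF criteria but fully open: still scores, thanks to the f(Impact) change.
    print(cass_score("M", "F", "F", "L", "L", "L"))  # 2.5

Whatever the eventual implementation looks like, the score should always be published alongside its vector - the six values passed in above - so that anyone can recompute it and, more importantly, dispute it.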