ORCID requirements summary
Submitted by Gudmundur Thorisson on November 9, 2010 - 1:00am
A summary report on ORCID core system requirements and current status of development
Author: Gudmundur A. Thorisson <gt50@le.ac.uk>, University of Leicester, United Kingdom
Date: 9 November 2010
Table of Contents
1. Introduction
2. The ORCID identity system
2.1 Background
2.2 Requirements summary
2.3 Structure of the identifier and ISNI
2.4 Profile creation and management
2.5 Control and authority
2.6 The privacy model
2.7 Profile matching and de-duplication
2.8 Publication claims
3. Open data and services vs. sustainability
4. Current status and next steps
4.1 The ORCID alpha prototype
4.2 Unresolved issues - summary and next steps
1. Introduction
The central goal of ORCID is to solve the long-standing name ambiguity problem in scholarly communication. Accurate attribution is a fundamental pillar of the scholarly record. Global identification infrastructure exists for content but not for the producers of that content, which makes it difficult to establish the identity of authors and other contributors and to link them reliably to their published works. With the number of individuals and the number and types of publications they contribute to constantly growing, these challenges are becoming increasingly intractable. The core mission of ORCID is to rectify this by creating “a central registry of unique identifiers for individual researchers and an open and transparent linking mechanism between ORCID and other current author ID schemes” [1].
The primary purpose of this document is to present the overall functional requirements for the core ORCID informatics infrastructure. More specifically, it attempts to identify and summarize key functional aspects of the core ORCID system on which there is broad agreement amongst stakeholders and, in particular, to highlight areas where this is not the case and/or where requirements are still unclear. As such, the document does not provide a fully comprehensive requirements analysis and technical specification, but is intended as a high-level guide to further development work in the ORCID Technical Working Group (TWG).
The analysis presented here is based on presentations, meeting notes and a variety of other materials published on the ORCID wiki website [2] and elsewhere. Informal interviews were also conducted with several ORCID stakeholders in the TWG and Business Working Group (BWG), and their valuable contributions are hereby acknowledged.
2. The ORCID identity system
2.1 Background
A useful definition of “identity” is “the collective aspect of the set of characteristics by which a thing is definitively recognizable or known” [3]. ORCID is concerned with two main types of characteristics of people:
i) Information describing the contributors themselves (names, affiliations)
ii) Information describing the relationships between contributors and their publications
The proposed ORCID registry can be described as a centralized identity management system for collecting and managing this information about contributors, also known as identity claims.
Hybrid identity - previous requirements gathering work by CrossRef [4] indicated that it would not be feasible to build a contributor ID system based solely on self-asserted identity claims, that is, facts about contributors provided by themselves via self-registration. It would likely take too long to reach critical mass, and the system would by definition not include inactive or deceased authors who are unable or unwilling to register. Thus, “priming the pump” by gathering organization-asserted identity claims from other name authorities is generally regarded as necessary for ORCID to be successful, albeit with the acknowledgement that this strategy by itself will not be sufficient either. The administrative effort required to deal with errors that will inevitably arise from imperfect automated disambiguation would be substantial. Moreover, as some stakeholders have pointed out, saddling ORCID with the responsibility of maintaining accuracy of registry records would be inadvisable.
The overall conclusion from this earlier, pre-ORCID work was that the most feasible strategy is a hybrid identity system, based on a combination of self-asserted and organization-asserted identities. This general approach underpins much of the work undertaken in the TWG to date.
The next subsection outlines key high-level requirements for the ORCID hybrid identity system, to be implemented as a database-driven web application with a front end user interface (the central UI) and a web service API for interacting with external systems. Most of these have already been presented and discussed in recent meetings, in the context of the current Beta proposal [5].
[1] http://www.orcid.org
[2] https://sites.google.com/site/openrid/
[3] http://www.thefreedictionary.com/identity
2.2 Requirements summary
In the aforementioned work by CrossRef, use cases for globally-unique contributor identifiers were found to fall into two main categories. Briefly, “knowledge discovery” scenarios included answering questions such as:
- Who authored/reviewed/edited document X?
- Which documents were authored/reviewed by the person identified by ID Y?
- Which IDs are related to ID Z and what is the nature of that relationship (e.g. Z co-authored paper with Y, Z edited/reviewed paper by Y)?
- What (subject to privacy settings) is the profile information for ID Z (e.g. institutional affiliation, email address, etc.)?
The other main type of use case involves the contributor identifying him/herself over the network in various settings, including:
- Single sign-on (SSO) for manuscript tracking systems (MTSs) and sharing contact information with editorial offices, marketing departments, royalty payments systems etc.
- Automatic updating of addresses for TOC alerts and other automated email communications.
- Automated tools for detecting potential reviewers, including tools for detecting potential conflicts of interest.
- Synchronization with publisher web site user profiles and ID validation/assertion amongst all external profiles linking to the ID.
- Granting researchers customized, privileged access to content.
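The first three knowledge discovery questions above amount to lookups over contributor-to-document links. The following is only a minimal in-memory sketch of that idea, not a description of the planned ORCID data model or Query API: the `Contribution` and `ContributionIndex` names, the role strings and the placeholder IDs ("Y", "Z", "doi:X") are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    contributor_id: str  # placeholder such as "Y"; not a real ORCID identifier
    document_id: str     # e.g. a DOI
    role: str            # assumed vocabulary: "author", "reviewer", "editor"


class ContributionIndex:
    """Hypothetical in-memory index over contributor-to-document links."""

    def __init__(self):
        self._by_document = defaultdict(set)
        self._by_contributor = defaultdict(set)

    def add(self, contribution):
        self._by_document[contribution.document_id].add(contribution)
        self._by_contributor[contribution.contributor_id].add(contribution)

    def contributors_of(self, document_id, role=None):
        """Who authored/reviewed/edited document X?"""
        found = self._by_document.get(document_id, ())
        return sorted({c.contributor_id for c in found if role in (None, c.role)})

    def documents_of(self, contributor_id, role=None):
        """Which documents were authored/reviewed by the person with ID Y?"""
        found = self._by_contributor.get(contributor_id, ())
        return sorted({c.document_id for c in found if role in (None, c.role)})

    def related_ids(self, contributor_id):
        """Which IDs are related to ID Z, through which document and roles?"""
        related = set()
        for own in self._by_contributor.get(contributor_id, ()):
            for other in self._by_document.get(own.document_id, ()):
                if other.contributor_id != contributor_id:
                    related.add((other.contributor_id, own.document_id,
                                 own.role, other.role))
        return sorted(related)


index = ContributionIndex()
for row in [("Y", "doi:X", "author"), ("Z", "doi:X", "author"),
            ("Z", "doi:W", "reviewer"), ("Y", "doi:W", "author")]:
    index.add(Contribution(*row))
```

Here `index.related_ids("Z")` reports that Z co-authored "doi:X" with Y and reviewed "doi:W", which Y authored. The fourth question (profile information subject to privacy settings) depends on the privacy model and is not covered by this sketch.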
[4] Contributor ID Strategy Update - report to the PILA board meeting, July 21-22, 2009. https://sites.google.com/site/openrid/archive/crossref-contributor-idbackground/contrib_id_update_jul09.pdf
[5] https://sites.google.com/site/openrid/technical-working-group/technical-working-groupmeetings/28-june-2010-twg-to-bwg-presentation/BetaProposal28June2010-a-b.pptx
With contributions from institutional repositories, funders and other stakeholders, the TWG has in the past year refined and expanded these into a broader set of overall requirements for the ORCID identity system, as summarized below.
End users - authors, contributors, departmental administrators and other end users of the system should be able to do the following via a web-based UI:
- Register in the system to create a primary contributor profile, and subsequently edit and maintain this self-claim profile.
- Search the registry for 3rd party deposited profiles representing them in other systems, and "self-disambiguate" by claiming one or more of these profiles.
- Control privacy settings for their profile, including flagging individual profile fields as “public”, “private” (not shared, only used internally) or “protected” (shared with certain external parties).
- Search the CrossRef bibliographical database to find, or manually enter, scholarly publications identified with a DOI that they have contributed to, and claim them.
- Search for and claim research datasets identified with a DOI in the upcoming DataCite service [6].
- Search for and claim other scholarly or quasi-scholarly work not catalogued in CrossRef, including but not limited to: OCLC [7] for monographs and trade publications; quasi-scholarly work published by governmental organizations, NGOs and others; articles, reports and working papers in arXiv and institutional repositories; patent offices for patents; the Concept Web Alliance (CWA) [8] for triples or “facts”; blogging networks (e.g. PLoS Blogs [9], Nature Network [10], ScienceBlogs [11]) for postings; Wikipedia [12] for articles.
- Report erroneous/fraudulent profile and publication claims.
- Given special permissions, perform various "back office" administrative tasks, such as editing/deleting other users' profiles and retiring accounts.
Partner systems - key requirements for external partner systems include enabling the following via web UI or API:
- Universities and other organizations to deposit and retrieve profile data and publication claims for their personnel, and to enable their researchers to register and easily fill in an ORCID profile based on the deposited information.
- Journals and other scholarly publishers to deposit verified publication claims for authors who have published with them.
- Societies, universities and other institutions/organizations to verify claims of affiliation.
- Journal MTSs and others to look up contributor profile information by ID and request/retrieve protected profile data (e.g. E-mail address and phone number) from contributors who have published with them, and have changes to this information automatically propagated to them.
- Journal MTSs, institutional repositories and other systems to mediate interaction of contributors with the central ORCID system, for example by encouraging authors to acquire an ORCID ID as part of the partner site registration/login process.
[6] http://datacite.org
[7] http://www.oclc.org
[8] http://www.nbic.nl/about-nbic/affiliated-organisations/cwa/
[9] http://blogs.plos.org
[10] http://network.nature.com
[11] http://scienceblogs.com
[12] http://www.wikipedia.org
There is also general interest from all stakeholders in a variety of knowledge discovery use cases, for accessing the information gathered through the above interactions with end users and partner systems. In addition to the examples already listed above, important use cases include:
- Funding agencies retrieving ORCID IDs and publications relating to funded projects.
- Institutions retrieving ORCID IDs and publications for their faculty (e.g. for research assessment).
Key properties and capabilities of the core system:
- The ORCID identifier itself will be an opaque, numeric, non-sequentially assigned string compatible with the ISNI naming scheme. ISNI/ORCID collaboration is under consideration (see 2.3).
- Contributors who actively join the scheme will be assigned an ID as they register to create a self-claim profile. There are significant unresolved issues concerning control and authority over profile information sourced from 3rd party profiles (see 2.5).
- Identifiers for other contributors will likely be handled by automated generation of computed primary profiles seeded by 3rd party profiles, but details of how this will work are unclear (see 2.4).
- The ORCID profile should have a minimal set of fields (E-mail, name, affiliation, etc.), possibly to be expanded later. Using identifiers for institutions where available (e.g. Ringgold [14], ARIW [15]) has been suggested. There is a debate over whether date of birth should be included or not (see 2.6).
- The core system will have basic profile matching capability to support self-disambiguation by registered users. Opinions are divided regarding how sophisticated the automated matching and de-duplication capability built into the core system should be, and how critical this is for the initial beta production system (see 2.7).
- Provenance will be tracked, by way of capturing metadata describing source and other properties of profile records, profile claims and publication claims (see 2.8).
- Tools for batch loading records deposited in bulk from partner systems.
- A security layer for authenticating/authorizing i) users interacting with the front end UI and ii) MTSs and other external applications using the API to access protected profile data or non-open services via, for example, OAuth [16]-based workflows.
- A conflict resolution mechanism and process to help with resolving user-to-user and user-to-organization disputes which will inevitably arise.
The following were also identified as important to consider for later iterations of the system:
- Exporting ORCID profile data in alternative formats (Atom/RSS feeds, JSON, Linked Data) to enable lightweight, Web 2.0-style integration “mashups” and embedding in other websites (see e.g. the arXiv Author Identifier service [17] and a recent paper by Simeon Warner [18]).
- Interoperability with emerging standards in the social networking community, such as the Universal Widget API (UWA) [19] and OpenSocial [20].
- Support for federated authentication (e.g. OpenID, Shibboleth) and for ORCID to possibly operate as an identity provider.
[14] http://www.ringgold.com
[15] http://ariw.org
[16] http://oauth.net
[17] http://arxiv.org/help/author_identifiers
[18] Warner, S. Author Identifiers in Scholarly Repositories. arXiv. March 6 (2010). http://arxiv.org/abs/1003.1345
[19] http://dev.netvibes.com/doc/uwa
[20] http://www.opensocial.org
2.3 Structure of the identifier and ISNI
Various issues relating to the format of the primary ORCID identifier have previously been outlined in Geoff Bilder's whitepaper [21] and discussed in the TWG. There is overall agreement amongst stakeholders on the following key points:
- The identifier should be semantically opaque to avoid the 'brittleness' that comes with semantic overloading. A myriad of complications are likely to arise if semantic information is incorporated into the identifier (directly or inadvertently) or if opportunity is created for people to "project" semantic meaning onto the identifier.
- The identifier should be numeric, issued out of sequence, include checksum digits (for error detection) and ideally be compatible with ISNI (see below).
- "Memorizability" of the identifier is not critical, but it should at least be feasible for a human to transcribe and to recognize an ORCID ID. Shortening, prefixing and various other usability aspects of the identifier are under discussion in the TWG.
ISNI collaboration - a number of ORCID partners have advocated a collaboration with the International Standard Name Identifier (ISNI) ISO 27729 standard initiative [22]. ISNI aims to create identifiers for "identities used publicly by parties involved throughout the media content industries in the creation, production, management, and content distribution chains", and so has a broader scope than ORCID. There is general support for adopting the numeric 16-digit ISNI format for ORCID IDs, a key advantage being straightforward ISNI interoperability in the event of an ORCID/ISNI collaboration. However, some stakeholders have raised concerns over possible confusion if ORCID issues ISNI-like IDs which are not actual ISNIs, and have proposed instead a completely different scheme for the identifiers (e.g. UUIDs), with ORCID-to-ISNI ID mappings established at a later date if required.
Various options are currently being considered for ISNI collaboration, including close cooperation with ORCID becoming an ISNI registration agency, as well as a more loosely-coupled arrangement where ORCID would operate “at arm’s length” from ISNI through an intermediary authority such as VIAF (http://viaf.org). If ongoing discussions with ISNI prove productive, collaboration could start as early as December. Some stakeholders have pointed out that involvement in a standards project would inevitably bring overhead and potentially slow down ORCID development, but ISNI proponents stress that the long-term benefits of collaboration would outweigh such overheads.
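To make the agreed identifier properties concrete (opaque, issued out of sequence, carrying a check digit, 16 characters in the ISNI style), here is a small sketch. It assumes the ISO 7064 MOD 11-2 check character, which is the scheme ISNI specifies; whether ORCID would adopt the same scheme is an assumption, and the function names are invented for this example. Note that MOD 11-2 can yield "X" as the final character, so an identifier built this way is not strictly numeric in every case.

```python
import secrets


def check_character(base_digits):
    """ISO 7064 MOD 11-2 check character for a string of decimal digits."""
    if not (base_digits.isascii() and base_digits.isdigit()):
        raise ValueError("base must contain decimal digits only")
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def new_identifier():
    """Draw 15 random digits (non-sequential, no embedded meaning) and
    append the check character to give a 16-character identifier."""
    base = f"{secrets.randbelow(10 ** 15):015d}"
    return base + check_character(base)


def is_valid(identifier):
    """Detect transcription errors by recomputing the check character."""
    if len(identifier) != 16:
        return False
    base, check = identifier[:15], identifier[15]
    if not (base.isascii() and base.isdigit()):
        return False
    return check_character(base) == check.upper()
```

A registry would additionally need to reject a freshly drawn identifier that is already assigned; that uniqueness check is left out here.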
[21] https://sites.google.com/site/openrid/technical-working-group/structure-of-the-orcididentifier/ORCID_Identifiers_whitepaper_v3.pdf
[22] http://www.isni.org
2.4 Profile creation and management
It is generally agreed that the central ORCID registry will hold two main types of information describing contributors:
i) 3rd party profiles deposited by ORCID members (organization-asserted identity claims).
ii) Primary ORCID profiles created through registration (self-asserted identity claims).
Federation vs. centralization - Centralized storage of 3rd party profile data in the ORCID system itself is regarded as key to enabling profile matching (see below) and other features which many stakeholders have listed as vital functions of the core system. An alternative, federated model has also been suggested, in which the central registry would store only basic ORCID profiles containing links to 3rd party profiles in external identity systems. In this model, the central registry would be one of several components in a loosely-coupled, distributed system, with other component systems (possibly operated by other ORCID member organizations) handling publication claiming workflows and data storage, profile matching and other key tasks.
The core argument for a distributed model is that the central registry itself could be made much simpler and less expensive to build and operate. But others have pointed out that the overall, fully functional system would be much more challenging to implement and operate in a distributed fashion, compared to a self-contained, centralized system. The point was also made that centralization looks like the only feasible strategy to i) ensure uniform long-term persistence of ORCID data (a critical requirement of the scholarly record) and ii) address certain other key issues such as data licensing/reuse. For this and other reasons, TWG work has hitherto focused on building a centralized system.
A recent suggestion is a mixed model incorporating some of the features of a distributed system. The user could point to a certain record referring to him/herself in an external partner system, and profile data would then be retrieved (via API call or by "screen scraping" a web page) and copied into the central ORCID system.
Self-claim profiles - A long-standing source of confusion in the project concerns what the ORCID ID identifies.
For users who register directly, it is relatively uncontroversial that the identifier should be assigned to the newly-created primary profile. If the user subsequently claims 3rd party profiles and links them to his/her primary profile, the ORCID ID effectively identifies an aggregate of organization-asserted and self-asserted identities for the person in question (this is the essence of the hybrid identity model).
Computed profiles - A more controversial issue is how to deal with individuals represented in the system by one or more 3rd party profiles, but who are unwilling or unable to register. Many stakeholders were adamant that assigning ORCID IDs to this class of non-registrant contributors is central to the ORCID mission, and that without this the project cannot succeed. In other words, there are strong arguments for the core system to support the generation of "computed" primary profiles from bulk-deposited records, whether for active contributors (who may later register and claim their profile) or for deceased or otherwise inactive contributors.
Some aspects of this are relatively clear - for example, the common case of universities creating profiles for their faculty (see HKU model discussed below) - whilst others are less so. Notably, several stakeholders pointed out that, unlike user-driven self-disambiguation, the generation and maintenance of computed profiles is dependent on automated batch matching capabilities in the core system (see more below). They acknowledge that dealing with non-registrants is a critical requirement for ORCID, but stress that this is a much harder problem to solve, and that the focus for the initial production system should be on self-registration by active researchers.
2.5 Control and authority
Stakeholders appear to have quite divergent opinions concerning authority and control over personal information to be held in the central ORCID registry. Opinions on this topic are broadly divided into two main schools of thought.
The library view - On the one hand is the view held by ORCID partners in the digital library community, who will bulk-deposit profiles into the central system to create IDs for their faculty, with the expectation that library or other administrative staff will subsequently manage most or all of these profiles on researchers' behalf. The assumption here is that the majority of researchers will simply not bother to sign up for "yet another profile", thus necessitating an active "push" approach to drive ORCID adoption. This pessimism regarding community buy-in reflects past experiences of several partners who have attempted, with little success, to persuade researchers to submit information to institutional repositories. Another factor is the belief that institution-driven management is critical to ensuring accuracy of personal data held in ORCID. This, too, is based on past experiences of institutions such as MIT Libraries [23], who have traditionally managed and been the authoritative source of this information for their people and expect to continue to do so. Under this model, a researcher who is registered in ORCID through his/her institution would be “assigned” a primary profile and would not have authority to edit it to contradict the institution-supplied record (e.g. change affiliation from "assistant professor" to "professor").
The identity/privacy view - The above has been heavily criticized by various other stakeholders, who argue that this model will inevitably result in situations that would be very difficult to untangle, for example when the institutional record is incorrect.
Moreover, they feel strongly that control of (and responsibility for) a claimed ORCID identity should ultimately be in the hands of the researcher him/herself, and that this is key to community acceptance. This view is founded on modern notions of online privacy and user-centric identity on the Internet [24], which have recently attracted a great deal of interest in the wider online community. This interest in online identity and privacy is spilling over into a small but growing sub-community of Web-savvy researchers who care a great deal about how their “digital self” is projected onto the Internet. Although this sub-community comprises a relatively small fraction of the overall pool of potential ORCID users, they are likely to be highly over-represented amongst early adopters of the system. If ORCID is seen not to respect online privacy and identity principles, a severe anti-ORCID backlash from a vocal minority is likely to follow, which would be detrimental to the overall project.
Some stakeholders in the identity/privacy camp feel that this apparent clash of ideals can be resolved by applying the hybrid identity model. A useful property of the model, as already noted, is that self-asserted and organization-asserted claims would both be held in the ORCID system with attached metadata. Contradicting information and the history of profile changes could thus be, for example, visually flagged on the public profile, or in some other way made accessible to consumers of profile data, enabling them to draw their own conclusions regarding the validity of the information.
[23] http://libraries.mit.edu
[24] Maler, E. The design of everyday identity. Online Information Review 33, 443–457 (2009). doi:10.1108/14684520910969899
Case study: HKU/ResearcherID integration - An example of institution-driven profile generation is the integration of the University of Hong Kong (HKU) institutional repository with the Thomson Reuters (TR) ResearcherID service [25] (see David Palmer’s comment on the wiki [26] and materials on the TR website [27]). Primary ORCID profiles are created initially from batch uploads of HKU records and bibliographical data. Subsequently, members of HKU faculty receive an E-mail notification encouraging them to register, claim their newly-created scholarly identities and start maintaining them themselves (they are also given the choice of opting out of the scheme).
As HKU does not subsequently manage deposited records, this is not the kind of institution-managed profile model proposed by the library camp. However, two suggested enhancements to the HKU model would perhaps go some way towards resolving the stand-off outlined above. First, clearly marking unclaimed, organization-managed ORCID profiles as such would avoid conflating them with claimed profiles which are actively maintained and curated. Second, a delegation mechanism can be built into the profile system. In this scheme, a registered user would have the option of giving another user or users permission to edit and manage some or all aspects of his/her profile on his/her behalf. This would enable, for example, a departmental or library administrator (him/herself holding an account in the ORCID system) to manage claimed primary profiles for an entire faculty. Also, if the list of permitted delegated actions included deactivation of the profile, the same facility could provide institutional administrators with the ability to “lock” the profiles of deceased or retired faculty (which one stakeholder identified as an important feature).
More generally, enabling users registered in the system to act on behalf of organizations through the web UI may be important to broader adoption of ORCID.
For example, institutions lacking financial resources and/or IT expertise to implement full API-based integration could participate to some extent in this way (especially for creating profiles for their people and verifying affiliations, as noted by one stakeholder).
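The delegation mechanism suggested in this section can be pictured as a per-profile table of delegates and the actions each is allowed to perform. The sketch below is one possible reading only: the `Profile` class, the action names in `DELEGABLE_ACTIONS` and the user names are hypothetical and are not taken from any ORCID design document.

```python
from dataclasses import dataclass, field

# Assumed action vocabulary; the source only mentions editing, managing
# and deactivating ("locking") a profile.
DELEGABLE_ACTIONS = {"edit_profile", "manage_claims", "deactivate"}


@dataclass
class Profile:
    owner: str
    active: bool = True
    delegations: dict = field(default_factory=dict)  # delegate -> set of actions

    def grant(self, granted_by, delegate, actions):
        """Only the profile owner may hand out permissions."""
        if granted_by != self.owner:
            raise PermissionError("only the profile owner can delegate")
        requested = set(actions)
        unknown = requested - DELEGABLE_ACTIONS
        if unknown:
            raise ValueError(f"unknown actions: {sorted(unknown)}")
        self.delegations.setdefault(delegate, set()).update(requested)

    def may(self, user, action):
        """The owner may do anything; delegates only what was granted."""
        return user == self.owner or action in self.delegations.get(user, set())

    def deactivate(self, user):
        """'Lock' the profile, e.g. for retired or deceased faculty."""
        if not self.may(user, "deactivate"):
            raise PermissionError(f"{user} may not deactivate this profile")
        self.active = False


profile = Profile(owner="researcher")
profile.grant("researcher", "library_admin", {"edit_profile", "deactivate"})
profile.deactivate("library_admin")
```

Because the grant always originates from the owner, this arrangement keeps ultimate control with the researcher while still allowing institution-side management. How locking would work for a deceased owner who never granted permission is not addressed by the sketch.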
[25] http://www.researcherid.com
[26] https://sites.google.com/site/openrid/technical-working-group/use-cases
[27] http://wokinfo.com/wok/media/pdf/univ_hk_cust_profile_palmer.pdf
2.6 The privacy model
Several stakeholders mentioned that a “maximal” ORCID profile, capable of capturing a wide range of public/private data elements, would be very useful (e.g. for disambiguation purposes: the more information, the better). Another benefit mentioned was that a richer profile would be valuable to users who may prefer to use ORCID as their main online “home”, e.g. if they do not have a profile in other systems. However, most agree that this would be substantially more difficult to model and implement than a "minimal" profile model, not least because of security implications. It has also been pointed out that gathering extensive personal data that could potentially be used for discrimination would likely violate EU privacy laws (the BWG legal subgroup is currently investigating this). TWG profile modelling work has therefore focused on capturing public facts, i.e. data elements which are typically already in the public domain by way of scholarly publications and which are most useful for disambiguation.
The date-of-birth debate - A critical issue that needs to be resolved is whether or not the minimal ORCID profile should include the person’s date of birth. Several stakeholders have emphasized the importance of having this information at hand (even just the date, without the year) for disambiguating otherwise identical author records. One stakeholder proposed that the date of birth be made a mandatory field in the registration form, but not be made public beyond the ORCID system. The opposing privacy/identity view is that an attempt to go beyond collecting what is strictly necessary for disambiguation, and is already in the public domain, risks ORCID being perceived as a sort of “mini-Facebook” enterprise, intent on collecting personal data for its own purposes.
Privacy controls - There is also some debate over the extent to which users should be able to control profile field visibility. Several stakeholders have argued that a subset of the fields, including the person’s name, should always be publicly visible, whereas some others have stressed that the user should be able to hide all fields except for the identifier itself. The latter would in effect enable authors to operate in true pseudo-anonymous mode, instead of registering in ORCID merely to create a “throwaway” alter-ego profile (or simply not using the ORCID registry at all).
A final point to note is that various issues concerning ownership of profile data are currently unclear. For example, if a contributor copies information from a 3rd party record into his/her newly-created self-claim profile, who owns the copied information? Data ownership and data licensing are important issues to consider for the future, but will not be further discussed here.
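The per-field privacy settings described earlier ("public", "protected", "private") and the pseudo-anonymous option discussed in this section can be illustrated with a small filter that decides which fields a given viewer sees. This is a sketch under stated assumptions: the `Visibility` enum, the `visible_profile` function, the field names and the placeholder ID are invented for the example, and unflagged fields are assumed to default to private, which the source does not specify.

```python
from enum import Enum


class Visibility(Enum):
    PUBLIC = "public"        # visible to everyone
    PROTECTED = "protected"  # shared with certain external parties
    PRIVATE = "private"      # only used internally


def visible_profile(profile, settings, viewer_is_authorized=False):
    """Return the subset of profile fields a viewer may see.

    The identifier is always included; every other field is shown only
    if its visibility setting permits it (assumed default: private).
    """
    allowed = {Visibility.PUBLIC}
    if viewer_is_authorized:
        allowed.add(Visibility.PROTECTED)
    view = {"id": profile["id"]}
    for name, value in profile.items():
        if name != "id" and settings.get(name, Visibility.PRIVATE) in allowed:
            view[name] = value
    return view


profile = {"id": "placeholder-id", "name": "J. Smith",
           "email": "j.smith@example.org", "date_of_birth": "1 January"}
settings = {"name": Visibility.PUBLIC, "email": Visibility.PROTECTED}

public_view = visible_profile(profile, settings)
partner_view = visible_profile(profile, settings, viewer_is_authorized=True)
pseudonymous_view = visible_profile(profile, {})  # only the identifier
```

The alternative position, that a subset of fields such as the name must always be public, would correspond to forcing those fields to `Visibility.PUBLIC` regardless of the user's settings.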
2.7 Profile matching and de-duplication
TWG discussions have focused on two levels of matching capability in the core system, for two distinct purposes. The first concerns the profile claiming step in the proposed ORCID self-registration procedure, as documented by a set of wireframes created by the TWG [28]. The user is presented with a list of possible matches to his/her newly-created profile found in deposited profile collections, and is then asked to accept or reject each of these profiles. Such self-disambiguation would require relatively unsophisticated, "loose" matching - little more than basic field-based querying (e.g. by name and affiliation). One stakeholder also mentioned conditional acceptance (e.g. if deposited profile data require correction) and the ability to keep profile claims private as potentially useful options [30].
The other main application of matching relates to what several stakeholders see as the main purpose of ORCID: to create a single identifier to serve as a bridge between deposited records representing the same person in different ID systems, irrespective of whether the person has actively registered or not. A critical aspect of this, mentioned by several stakeholders, is to avoid as much as possible the creation of duplicate ORCID profiles (and therefore multiple IDs) for the same person. Unless the core system has a mechanism for detecting and avoiding name collisions, this duplication will inevitably occur; for example, if an institution attempts to create an ORCID profile for a person who has already registered. Some duplication is likely to occur even with such safeguards in place, and so provision will need to be made for merging duplicate primary profiles later found to refer to the same person (and, to a lesser extent, splitting up incorrectly merged profiles).
Two key issues are unresolved in this regard. First, whether computed profiles (and therefore de-duplication) should be a priority for the initial production system, or whether this functionality can come later.
Second, how extensive should this capability be: should ORCID limit itself to deduplicating and linking already-disambiguated records from other systems, or should the aim be to do fully-fledged batch disambiguation of all deposited records? Some stakeholders expressed concern about ORCID over-extending in this space, especially given that other projects and companies specialize in disambiguation. There were some suggestions that ORCID could minimize overhead and cost by "outsourcing" batch disambiguation to VIAF or another partner organization which already has the necessary expertise and computing infrastructure. However, one stakeholder was against this idea and argued that the core system should be self-sufficient and not dependent on an external party for this crucial functionality.
[28] https://sites.google.com/site/openrid/technical-working-group/wireframes
[30] http://openlib.org/home/krichel/proposals/chibit.html
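The "loose", field-based matching described above for the profile claiming step could look roughly like the following. This is a minimal sketch only: the `Profile` fields, the `STOPWORDS` list and the matching rule (same family name, same first initial, at least one shared non-trivial affiliation word) are assumptions made for illustration and are not taken from the TWG wireframes.

```python
import unicodedata
from dataclasses import dataclass
from typing import List

# Words too common in affiliations to count as evidence (assumed list).
STOPWORDS = {"university", "univ", "of", "the", "and", "institute", "department"}


@dataclass(frozen=True)
class Profile:
    """Hypothetical minimal profile record; field names are illustrative."""
    profile_id: str
    given_name: str
    family_name: str
    affiliation: str


def normalize(text: str) -> str:
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(cleaned.split())


def loose_match(new: Profile, candidate: Profile) -> bool:
    """Deliberately permissive: the user, not the system, makes the final call."""
    if normalize(new.family_name) != normalize(candidate.family_name):
        return False
    new_given = normalize(new.given_name)
    cand_given = normalize(candidate.given_name)
    # Only compare first initials, so "J." and "John" are compatible.
    if new_given and cand_given and new_given[0] != cand_given[0]:
        return False
    shared = set(normalize(new.affiliation).split()) & set(
        normalize(candidate.affiliation).split()
    )
    return bool(shared - STOPWORDS)


def possible_matches(new: Profile, deposited: List[Profile]) -> List[Profile]:
    """Candidates to show the user for acceptance or rejection."""
    return [p for p in deposited if loose_match(new, p)]
```

Because the rule is loose, a new profile for "John Smith" at Leicester would also surface a deposited "Jane Smith" at the same institution; in the self-registration workflow that false positive is simply rejected by the user.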
2.8 Publication claims
The aforementioned pre-ORCID requirements gathering work by CrossRef identified two types of publication-related identity assertions of key importance to a contributor ID system:
- Self-claims supplied by authors themselves (e.g. "J. Smith claims J. Smith wrote paper X").
- Verified claims from publishers (e.g. "Nature Genetics claims J. Smith wrote paper X").
The wireframes referred to above also describe publication claiming scenarios developed by the TWG, inspired by previous work to create document claiming workflows in the AuthorClaim system [31]. Gathering publication claims into the ORCID system will enable a variety of additional, secondary assertions to be inferred (e.g. verified "proxy" claims by co-authors on a given paper) and automated detection of inconsistencies or errors (e.g. multiple authors with the same name claiming the same paper, which could trigger disambiguation to validate the authors’ profiles). Further work will be required to determine which types of derived claims are most useful, and how best to present this information via the central front end UI. In the meantime, there is general recognition amongst stakeholders that the initial production system needs to be capable of storing the primary claim assertion data with associated provenance metadata from day one.
Model refinement - Various details relating to publication claims have yet to be worked out, including the data model and which provenance attributes would be useful to capture. One suggestion calls for a minimal set including timestamp, source of the claim (identifier for the person or organization) and method (e.g. automated, supervised etc.). Another key consideration is that ORCID has a broader scope than just authors and traditional publications. For example, several members are interested in capturing links between ORCID identifiers and research datasets published online (identified by DOI, some other type of persistent identifier, or a URL).
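To make the discussion above more concrete, the sketch below shows one possible shape for a primary claim record carrying the suggested minimal provenance set (timestamp, source, method), together with the same-name inconsistency check mentioned earlier. The data model has not been agreed, so every class, field and function name here is hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, Iterable, List, Tuple


class ClaimType(Enum):
    SELF = "self"          # asserted by the contributor
    VERIFIED = "verified"  # asserted by a publisher


@dataclass(frozen=True)
class PublicationClaim:
    """Hypothetical claim record; not an agreed ORCID data model."""
    contributor_id: str    # identifier of the person the claim is about
    contributor_name: str  # name as it appears on the work
    work_id: str           # DOI, other persistent identifier, or URL
    claim_type: ClaimType
    source: str            # provenance: person or organization making the claim
    method: str            # provenance: e.g. "automated", "supervised"
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )                      # provenance: when the claim was recorded


def same_name_conflicts(
    claims: Iterable[PublicationClaim],
) -> Dict[Tuple[str, str], List[str]]:
    """Find works claimed by more than one contributor under the same name.

    Returns {(work_id, lower-cased name): [contributor ids]} for each case
    that could be flagged for disambiguation.
    """
    by_key = defaultdict(set)
    for claim in claims:
        key = (claim.work_id, claim.contributor_name.casefold())
        by_key[key].add(claim.contributor_id)
    return {key: sorted(ids) for key, ids in by_key.items() if len(ids) > 1}
```

A self-claim and a publisher-verified claim for the same contributor and work collapse to one contributor ID and are not reported; only distinct contributors sharing a name on the same work are.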
The claims model therefore needs to be flexible enough to handle different types of publications. One stakeholder also wished for ORCID to handle different types of contributor roles. Modern "Big Science" projects often generate peer-reviewed papers with large numbers of authors, each with different kinds of roles ("wrote paper", "designed and takes overall responsibility of study", "analyzed data", etc.). The traditional interpretation of a person-publication link is "author", but conventions to indicate proportional credit via first/second/last position on the author list tend to become meaningless when the number of contributors reaches the thousands [32]. As one stakeholder remarked, it is not ORCID's job to solve this problem now. However, many journals now provide an "author contributions" statement in the main article text, and it seems reasonable to expect this information to become available in a machine-readable format in the near future, at which point it could be included in deposits into ORCID.
[31] http://authorclaim.org
[32] A recent particle physics paper on analysis of LHC data lists over 2,000 authors from 170 institutions around the world: http://dx.doi.org/10.1007/JHEP02(2010)041
3. Open data and services vs. sustainability
In addition to the specific issues already discussed, a key challenge for ORCID is to balance openness with long-term financial sustainability of the project. Some stakeholders have stated that they believe in the principle of sharing, and that the prospect of broad data reuse is a major motivating factor for them to contribute data to ORCID. Several others have expressed the opinion that ORCID’s notion of “open” needs to be compatible with emerging principles for open data and knowledge, or else ORCID risks alienating the community it intends to serve. Specific references have been made to grassroots movements such as the Open Knowledge Foundation (OKF) [33], which has proposed the Open Knowledge Definition (OKD) [34].
Adopting open data principles implies making data available for free at a certain level. Meanwhile, it is also clear that in order to become self-sustainable, ORCID must somehow generate income, and stakeholders generally agree that services for accessing ORCID identifiers and profile data cannot be completely free for all to use. One obstacle is that, as with some other related initiatives, it has been a matter of some debate within the project what exactly being "open" entails for ORCID. Various financial, legal and other non-technical aspects of this are being explored in depth in the BWG [35, 36]. The BWG is also drafting an "ORCID Principles" document which will clarify some of these issues in a broader context. At present, there seems to be general agreement amongst stakeholders on the following key points which are of most relevance to ORCID technical development:
Contributor participation - Keeping the barrier for entry low will be crucial to broad uptake in the user community. It should therefore be open and free for all to register, create and manage an ORCID ID. One stakeholder specifically advised against attempting to exclude “non-contributors”, given that this implies setting up some sort of contributor vetting process.
Organization participation - Similarly, to keep the barrier for entry low and more quickly reach critical mass, organizations should not be charged for initial deposition of profile data into ORCID. However, organizations would need to become ORCID participants in order to be able to deposit (i.e. deposition would be free, but not open).
Data access and reuse - Most stakeholders concur that long-term success of the project depends on broad adoption of ORCID identifiers in other systems. To this end, identifiers and profile information in the ORCID system should be accessible via web pages and web service APIs at no charge for light, non-commercial use. More demanding, commercial and/or sophisticated usage (e.g. from manuscript tracking systems) would be catered for with "for a fee" guaranteed quality-of-service provision. Several interviewed stakeholders also mentioned that bulk data dumps could be made available without restrictions annually or biannually, possibly under a liberal license (e.g. Creative Commons CC0 waiver [37]) to comply with the OKD. Bulk data access will be needed in any case, as some ORCID members will want to perform processing and integration of ORCID profile data mirrored on their own systems. This would be one of the benefits of membership, so bulk access to up-to-date ORCID data would need to be restricted.
[33] http://okfn.org
[34] http://www.opendefinition.org/okd/
[35] https://sites.google.com/site/openrid/business-policy-group/funding-subgroup-meeting-notes/28june-agenda/VisionofORCIDandservicefeemodel062210final.docx
[36] https://sites.google.com/site/openrid/business-policy-group/funding-subgroup-meeting-notes/28june-agenda/ORCIDFundingModel1b6-25-10.doc
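The points of agreement above amount to a small access policy, which can be written down as a decision function. The sketch below is illustrative only: the action names, the `Requester` fields and in particular the `LIGHT_USE_LIMIT` threshold are invented here (the stakeholders did not define "light" use), and "participant" and "member" are treated as the same status for simplicity.

```python
from dataclasses import dataclass
from enum import Enum
from typing import NamedTuple

LIGHT_USE_LIMIT = 1000  # requests per day; assumed value, not from the report


class Action(Enum):
    REGISTER = "register and manage an ORCID ID"
    DEPOSIT = "deposit profile data"
    API_READ = "read identifiers and profiles via web pages or API"
    PERIODIC_DUMP = "download the periodic bulk data dump"
    CURRENT_BULK = "bulk access to up-to-date data"


@dataclass(frozen=True)
class Requester:
    is_member: bool = False      # organization participating in ORCID
    is_commercial: bool = False
    requests_per_day: int = 0


class Decision(NamedTuple):
    allowed: bool
    fee_required: bool  # per-use fee; membership dues are not modelled


def decide(action: Action, who: Requester) -> Decision:
    """Map the agreed principles onto an allow/fee decision."""
    if action in (Action.REGISTER, Action.PERIODIC_DUMP):
        return Decision(True, False)           # open and free for all
    if action in (Action.DEPOSIT, Action.CURRENT_BULK):
        return Decision(who.is_member, False)  # free, but not open
    if action is Action.API_READ:
        heavy = who.is_commercial or who.requests_per_day > LIGHT_USE_LIMIT
        return Decision(True, heavy)           # fee only for demanding use
    raise ValueError(f"unknown action: {action!r}")
```

Writing the policy this way makes the one deliberately closed path visible: deposition and up-to-date bulk access are the only actions that depend on membership.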
4. Current status and next steps
While there appear to be genuine disagreements amongst stakeholders in certain areas, there is in fact a great deal of agreement over the majority of high-level technical requirements for the core ORCID system. Further work is needed to clarify requirements in several places, but key aspects of the two core subsystems - the biographical system (personal profile information) and the bibliographical system (publication claims) - are relatively well established. These include: interactions with contributors and other end users; integration with journal MTSs and other partner systems; a minimal rather than maximal profile; a numeric, opaque identifier format; and the need for tracking provenance for profiles, profile claims and publication claims.
4.1 The ORCID alpha prototype
Software prototyping work undertaken by the TWG in the past year has focused on building a first-iteration, proof-of-principle system which meets a subset of ORCID requirements. At the launch of the project, TR had agreed to contribute the code for their ResearcherID service to help jump-start development of the ORCID ID system. The TWG has cloned and extended this platform to create the non-public ORCID alpha prototype [38]. This system supports a subset of critical use cases that the ResearcherID system does not, such as interaction with MTSs and the HKU scenario discussed above. The overall architecture of the alpha system is depicted in Figure 1 and its main features are outlined below (adapted from the TWG Beta Proposal presentation):
- Easy registration process - Researchers fill out a registration form or have it pre-populated with data from an ORCID partner system (currently AuthorClaim, RePEc Author Service [40] and Scopus [41]).
- User-controlled privacy settings - Researchers control how much or how little information about themselves is made publicly available.
- Local-language support - The database supports the UTF-8 character set. Searching by Unicode characters is also supported.
- Search facility - The system supports search of public profiles by first/last name, institution, keyword and ORCID number. In addition, the system allows for browsing by keyword and supports auto-suggest for keyword and institution.
- Publication claiming - Researchers can perform a DOI search against CrossRef to add publications to their profile, or upload publications as RIS records.
- Services for integration with ORCID partner systems - Includes the ability for partners to search ORCID and to upload and download profile and publication information.
[37] http://wiki.creativecommons.org/CC0_FAQ
[38] http://www.orcidsandbox.org
[40] http://authors.repec.org
[41] http://www.scopus.com
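As an illustration of the publication claiming feature listed above, the following sketch parses uploaded RIS records into simple dictionaries. It is not the alpha prototype's code, which is not public; it assumes only the usual RIS line layout (a two-character tag, two spaces, a hyphen, a space, then the value, with `ER` closing each record), and the function name `parse_ris` is made up for this example.

```python
import re
from typing import Dict, List

# Tag, two spaces, hyphen, optional space, value (e.g. "AU  - Smith, J.").
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text: str) -> List[Dict[str, List[str]]]:
    """Split RIS text into records mapping each tag to its list of values."""
    records: List[Dict[str, List[str]]] = []
    current: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        match = RIS_LINE.match(raw.rstrip())
        if not match:
            continue  # skip blank or malformed lines
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":  # end of record
            if current:
                records.append(current)
            current = {}
        else:
            # Tags such as AU (author) can repeat, so keep every value.
            current.setdefault(tag, []).append(value)
    if current:  # tolerate a final record with no ER line
        records.append(current)
    return records
```

For each parsed record, the `DO` tag (where present) carries the DOI, which could then be used as the work identifier in a publication claim.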
Figure 1: Overview of the ResearcherID-based ORCID alpha prototype. The core biographical and bibliographical subsystems are self-contained, whereas the authentication/authorization subsystem, web service API and some other components are tied to the WoK framework [42] (not bundled with the TR donation).
A substantial number of enhancements and additional components will need to be implemented to advance from the current alpha to a fully-functional production system which meets stakeholders’ requirements. The TWG has been working on identifying this “delta” and trying to determine the best strategy to progress to the next major milestone: a public beta system suitable for user testing in a live environment.
Choice of platform - There have been growing doubts within the TWG about the suitability of the ResearcherID platform, largely the result of i) lack of information about the ResearcherID architecture and source code and ii) apparent slowness in adding new features to the alpha system. Some in the group have argued against continued use of the platform as the basis for ORCID development and suggested alternative strategies, including implementing a new system from scratch. To address this issue, two ORCID stakeholder organizations, OCLC and CrossRef, have recently undertaken in-depth, independent architecture and code reviews of the ResearcherID platform. Both reviews conclude that the codebase and overall architecture are of good quality and technologically solid, and would be a good foundation on which to build the ORCID production system. Based on this, the newly-formed ORCID board has decided to continue with the original plan, and the only remaining obstacles are practical issues such as code licensing.
[42] http://wokinfo.com
4.2 Unresolved issues - summary and next steps
The main areas where there appear to be conflicts amongst stakeholders regarding requirements include the privacy model, control over profile data and the extent of profile matching support in the core system. Some of these seem to reflect genuine disagreements in critical areas, in particular the debate over the extent of personal information to include in the minimal ORCID profile, which has potentially far-reaching privacy/security implications. Also important in the near term is the need for clarification of requirements in some areas, including the provenance data model. It is also unclear how ORCID IDs will be created for inactive authors who will not be “pushed” into the system by institutions: should ORCID itself generate primary profiles and IDs for those authors?
Regarding profile matching, the apparent disagreements seem to reflect mostly differences in stakeholder preferences regarding prioritization in development and/or level of functionality. It is agreed that some level of batch profile matching is needed to minimize duplication when computed profiles are created for non-registrants, and for handling profile merging/splitting when duplication does occur. Thus, the debate is not over whether this capability is needed in the fully-functional system, but rather over the following key questions:
i) Is batch matching needed in the public beta system, or should the first iteration be aimed at supporting self-registration and creation of self-claim profiles only?
ii) What should be the extent of batch matching and automated disambiguation capability that is ultimately built into the core ORCID production system?
Outcomes from the first phase of work in the TWG Profile Exchange subgroup [43] should help with answering these questions.
The apparent conflict concerning control over profile data (i.e. whether contributors should be allowed to edit their self-claim profiles to contradict organization-supplied information) is a crucial one, and more discussion is needed to move forward on this issue. Encouragingly, this dilemma seems eminently solvable by adding delegation or similar functionality to the core system.
Other open issues for future consideration include:
- Enabling identifiers for institutions to be used for contributor affiliation.
- Allowing users to interact with the system on behalf of organizations and perform a subset of the functions available via the API.
- Data ownership, e.g. who owns data copied from a 3rd party profile into a self-claim profile?
- Enabling contributors to “pull” profile information from external web pages or APIs into ORCID.
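On the agreed need for profile merging and splitting noted above, one property worth illustrating is that a merge should be reversible, so that an incorrect merge can later be split without losing track of which deposited records belonged to which profile. The toy registry below is a sketch under that assumption; the class, its methods and the identifier formats are hypothetical and assume each deposited record belongs to exactly one primary profile.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


class ProfileRegistry:
    """Toy registry: each primary profile groups deposited record IDs."""

    def __init__(self) -> None:
        self.records: Dict[str, Set[str]] = {}  # primary id -> record ids
        self.redirects: Dict[str, str] = {}     # retired id -> surviving id
        self._merges: List[Tuple[str, str, FrozenSet[str]]] = []

    def create(self, primary_id: str, record_ids: Iterable[str]) -> None:
        self.records[primary_id] = set(record_ids)

    def merge(self, keep: str, retire: str) -> None:
        """Fold `retire` into `keep`, logging what moved so it can be undone."""
        moved = frozenset(self.records.pop(retire))
        self.records[keep] |= moved
        self.redirects[retire] = keep  # the retired ID stays resolvable
        self._merges.append((keep, retire, moved))

    def undo_last_merge(self) -> None:
        """Split the most recent merge back into two primary profiles."""
        keep, retire, moved = self._merges.pop()
        self.records[keep] -= moved
        self.records[retire] = set(moved)
        del self.redirects[retire]

    def resolve(self, primary_id: str) -> str:
        """Follow redirects from a retired ID to the surviving profile."""
        while primary_id in self.redirects:
            primary_id = self.redirects[primary_id]
        return primary_id
```

Keeping a redirect for the retired identifier means that any system which already stored that ID continues to resolve to the surviving profile after a merge.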
To conclude, the above are all issues that need to be tackled for a working beta system to be specified, built, launched and tested with a slice of the community, and to subsequently progress towards a fully-functional production ORCID system.
[43] https://sites.google.com/site/openrid/technical-working-group/technical-subg/scenario-4-profileexchange