(C) PLOS One

This story was originally published by PLOS One and is unaltered.

Ten quick tips for editing Wikidata [1]

Thomas Shafee (Swinburne University of Technology, Melbourne); Daniel Mietchen (Ronin Institute, Montclair, New Jersey, United States of America; Institute for Globally Distributed Open Research and Education)

Date: 2023-08

Copyright: © 2023 Shafee et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

This article acts as a successor to the 10 simple rules for editing Wikipedia from a decade ago [1]. It addresses Wikipedia’s machine-readable cousin: Wikidata—a project potentially even more relevant from the point of view of Computational Biology.

Wikidata is a free collaborative knowledgebase [2] providing structured data to every Wikipedia page and beyond. It relies on the same peer production principle as Wikipedia: anyone can contribute. Open, collaborative models often surprise in how productively they work in practice, given how unlikely they might be expected to work in theory. Nevertheless, they can still be met with a lot of resistance and suspicion in academic circles [3,4].

Since its launch in 2012, Wikidata has rapidly grown into a cross-disciplinary open knowledgebase with items ranging from genes to cell types to researchers [2,5–7]. It has wide-ranging applications, such as validating statistical information about disease outbreaks [8], aligning resources on human coronaviruses [9], or assessing biodiversity [10,11]. It can be thought of as a vast network graph (Fig 1A), wherein the items act as nodes (now over 100 million) linked to one another by over a billion statements, and further linked out to the wider web by many billions more. We’ll link to example Wikidata items and properties by using italics throughout the text as we refer to them (Fig 1).

Fig 1. Structure of an example Wikidata item. Wikidata items are linked to one another and to outside databases via properties that describe the relationships between them. (A) Some example links to and from the item human retinoic acid receptor alpha (Q254943). Items can have outgoing links, e.g., to the concept of a protein (Q8054); incoming links, e.g., from the human RAR–SRC1 complex (Q107514806); or both, e.g., to and from the human RARA gene (Q18031040). There can be multiple links out with the same property (e.g., multiple molecular functions) and links out to external websites and identifiers; e.g., it has the MeSH ID (P486) of D011506. The links formed by properties can be further annotated with qualifiers; e.g., its physical interaction with (P129) tretinoin (Q29417) is with the role of (P2868) being an agonist (Q389934). Now imagine this for a hundred million node items and many billions of property edges. (B) The human-readable interface for this item is organised into the label, description, and aliases, followed by a list of statements with their qualifiers and references, with a final section listing any Wikipedia (and other Wikimedia) pages for the item. (C) Example labels, descriptions, and aliases for virus (Q808) from the 410 currently supported languages. These screenshots contain only text and data released under a CC0 licence. https://doi.org/10.1371/journal.pcbi.1011235.g001

The online interface makes the items themselves somewhat human-readable (Fig 1B), but their structured nature makes it possible to query and combine the information in ways that can’t be achieved for information sources written entirely in prose. This versatility makes its applications in computational biology, arguably, even more universal and flexible than just relying on Wikipedia alone [12]. Queries on Wikidata can vary from which gene variants predict a positive prognosis in colorectal cancer to taxa by number of streets in the Netherlands that bear their name. We’ll try to use examples relevant to computational biology, but bear in mind that the same can be done with almost everything from a map of mediaeval witch executions in Scotland to emergency phone numbers by population using them to paintings depicting frogs.
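As a minimal sketch of what such a query can look like, the snippet below builds a SPARQL query for items whose molecular function (P680) is retinoic acid binding (Q14901431), using identifiers from Fig 1, and encodes it as a request URL. The endpoint address and the helper names build_query and build_request_url are assumptions that do not come from the article, and the code only constructs the request without sending it.

```python
from urllib.parse import urlencode

# Assumed public endpoint of the Wikidata Query Service (not stated in the article).
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(property_id: str, value_qid: str, limit: int = 10) -> str:
    """Return a SPARQL query listing items that have the given property/value pair."""
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:{property_id} wd:{value_qid} .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(query: str) -> str:
    """Encode the query as a GET request URL that asks for JSON results."""
    return f"{WDQS_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"


# Items with molecular function (P680) retinoic acid binding (Q14901431).
query = build_query("P680", "Q14901431")
url = build_request_url(query)
print(query)
print(url)
```

Swapping in a different property and value gives the same kind of lookup for any other relationship, which is what makes the structured form more flexible than prose.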

Since it’s under a CC0 copyright waiver, Wikidata’s structured content is essentially released into the public domain to be used on other projects [13]. You’ll probably have already seen its structured data at the top of search engine results but it’s also used behind the scenes on thousands of sites, becoming the backbone infrastructure for using, sharing, and collaboratively curating structured reference knowledge.

Tip 1: Learn by doing

If you’re thinking of editing Wikidata, you can start right away, perhaps by exploring and experimenting with one of its sandbox items like Q4115189, or by taking some of the introductory tours. While it is possible to edit without an account, it is best to register one. Wikidata uses the same user account as Wikipedia or Wikimedia Commons. Registering enables you to build a reputation within the editor community as you contribute, makes it easier for other editors to contact and collaborate with you, and enables you to use some additional tools (see Tip 9). Paradoxically, it can also protect your anonymity better: you edit under a username of your choice instead of your edits being tagged with your IP address.

Once you’ve created your account, it’s useful to click on your username in the top right of the screen to add some basic information to your userpage, particularly your topics of interest and your areas of expertise. It is increasingly common, although not required, for researchers on Wikidata to also link out to their real-world identity (faculty profile, professional social media, personal website, etc.) or simply to the Wikidata entry about them.

Whereas Wikipedia strictly prohibits editing a page about yourself (if you have one), in Wikidata, it is acceptable to add uncontroversial statements to the Wikidata item about you if you can reference them to publicly available sources (see Tip 7). It can therefore be useful to search for yourself in Wikidata and add statements, for example, your ORCID (P496), GitHub account (P2037), or Wikimedia username (P4174). Also note that while it is technically possible to add phone numbers or email addresses, be extremely cautious about adding any information, to any item, that may violate privacy (the policy about living people provides guidance here).

Tip 2: Think of knowledge as structured statements

Information in Wikidata is organised into statements. A basic statement is a triple containing a subject, a predicate, and an object. Although the subject of a statement is always a Wikidata item, the object can be either another Wikidata entity or another data type such as strings, URLs, quantities, or external identifiers. For example, Human retinoic acid receptor alpha (Q254943) has the molecular function (P680) of retinoic acid binding (Q14901431) (Fig 1). The identifiers beginning with Q are items and indicate objects, concepts, or events. Identifiers beginning with P are the properties that define relationships.

This model of statements is common to linked data repositories aligned to the Semantic Web [14–16], and Wikidata extends it with qualifiers and references that enable capturing specific detail and provenance (see Tip 7). For example, the statement Retinoic acid receptor alpha (Q254943) physically interacts with (P129) tretinoin (Q29417), with the role (P2868) of agonist (Q389934), cites as a reference that it is stated in (P248) the IUPHAR/BPS database (Q17091219).

Besides Ps and Qs, some other identifiers with a leading letter are important in the Wikidata ecosystem. For example, identifiers starting with L are for lexemes that indicate linguistic properties of a word or phrase, e.g., the Swedish noun “modell” (L47542) has multiple meanings, only one of which is a simplified representation of reality (Q1979154). Similarly, Wikidata identifiers starting with E are for entity schemas, which are particularly useful for defining and validating items (see Tip 9).

Wikidata is based on the knowledge graph management software Wikibase. Since the software is open-source, it is also used in a range of other specialist applications to host data as structured statements. Learning this way of thinking about information therefore enables participation beyond Wikidata. The main other example within the Wikimedia ecosystem is the annotation of the Wikimedia Commons media-sharing platform. Wikibase is also being implemented in projects outside of Wikimedia that range from ontologies for botanical collections [17] to a semantic map of the trade of enslaved people [18] to general research data management applications [19].
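To make the statement model concrete, here is a minimal sketch that represents the tretinoin example above as a triple with a qualifier and a reference. The Statement class and its field names are illustrative assumptions, not Wikibase's actual internal data model.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One Wikidata-style statement: a triple plus optional detail."""
    subject: str     # QID of the item the statement is about
    predicate: str   # PID of the property
    obj: str         # QID of another item, or a literal value
    qualifiers: dict[str, str] = field(default_factory=dict)
    references: list[dict[str, str]] = field(default_factory=list)

    def triple(self) -> tuple[str, str, str]:
        """Return the bare subject-predicate-object triple."""
        return (self.subject, self.predicate, self.obj)


interaction = Statement(
    subject="Q254943",                    # human retinoic acid receptor alpha
    predicate="P129",                     # physically interacts with
    obj="Q29417",                         # tretinoin
    qualifiers={"P2868": "Q389934"},      # role: agonist
    references=[{"P248": "Q17091219"}],   # stated in: IUPHAR/BPS database
)
print(interaction.triple())
```

The qualifier and the reference hang off the statement as a whole rather than off the subject or the object, which is why they are stored next to the triple instead of inside it.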

Tip 5: Improve existing data

The easiest first edit to make is to add a new statement to an existing item. Just use the “add statement” button and Wikidata will attempt to autocomplete and suggest potential properties and items as you type. A good way to get started with editing is to check out the external identifiers section on an item’s page and perhaps add some missing identifiers for the concept from a database you are familiar with. For example, if you are on an item about a taxon, you could check whether it correctly states the corresponding GBIF taxon ID (P846), NCBI taxonomy ID (P685), MycoBank taxon name ID (P962), IPNI plant ID (P961), WoRMS-ID for taxa (P850), etc. These sorts of links out to external identifiers make Wikidata a valuable tool for easily cross-referencing items between different resources for each concept.

Another good way to get started is to explore items about research articles and review, and possibly add, statements for main subject (P921). A way of annotating such articles that is distinctive to Wikidata is adding statements for describes a project that uses (P4510) to record important tools, techniques, or materials that the article highlights in its methods section. You can add a lot of extra richness to a statement by including qualifiers via the “add qualifier” link (Fig 1). The web interface can be customised with a range of extra tools and gadgets via your preferences to align its capabilities with what is most useful to you.

You can also edit an item’s short description using the “edit” button at the top. Even though descriptions aren’t machine-readable, the text is useful for humans to disambiguate between items at a glance (for example, the word “translation” might indicate “the creation of proteins using information from nucleic acids” or “a function that moves every point a constant distance in a specified direction in euclidean geometry” or “transfer of meaning from one language into another”).
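The identifier check described above can be sketched in a few lines. The example assumes an entity record shaped like Wikidata's JSON export, with a "claims" mapping keyed by property ID; the record itself is a made-up placeholder and missing_identifiers is a hypothetical helper name.

```python
# Taxon identifier properties named in the text above.
TAXON_ID_PROPERTIES = {
    "P846": "GBIF taxon ID",
    "P685": "NCBI taxonomy ID",
    "P962": "MycoBank taxon name ID",
    "P961": "IPNI plant ID",
    "P850": "WoRMS-ID for taxa",
}


def missing_identifiers(entity: dict) -> list[str]:
    """Return labels of the listed identifier properties that have no statement."""
    claims = entity.get("claims", {})
    return [label for pid, label in TAXON_ID_PROPERTIES.items() if not claims.get(pid)]


# Hypothetical, heavily trimmed entity record with placeholder values.
example_entity = {
    "id": "Q0",  # placeholder, not a real item
    "claims": {
        "P846": [{"mainsnak": {"property": "P846",
                               "datavalue": {"value": "0000000", "type": "string"}}}],
        "P685": [{"mainsnak": {"property": "P685",
                               "datavalue": {"value": "0000", "type": "string"}}}],
    },
}
print(missing_identifiers(example_entity))
# → ['MycoBank taxon name ID', 'IPNI plant ID', 'WoRMS-ID for taxa']
```

Not every identifier applies to every taxon (a MycoBank ID only makes sense for fungi, for instance), so the output is a list of candidates to check by hand rather than a list of errors.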

Tip 6: Be bold, but not reckless

Like editing Wikipedia [1], the apparent complexity of Wikidata can make getting started seem intimidating. The trick is to start small. Try looking up Wikidata items on some key papers in your field of research (or this list of PLOS Comp Biol articles) and see if you can add their keywords as main subject (P921) or their methods as describes a project that uses (P4510). Such annotation can get pretty detailed and granular, as you can see in this example.

To work out how best to model new data you want to integrate, you can check out the showcases that many Wikiprojects maintain (see Tip 4) to see how similar item types should be organised for consistency. If your planned additions extend beyond current examples, involving those experienced contributor communities in the data modelling decisions can ensure that new content is modelled consistently with existing statements.

Remember, you can easily revert edits if you’ve made a mistake: go to the history tab at the top and click “undo.” If doing mass edits or additions (see Tip 9), remember to validate the updated data to make sure you’ve made the changes you intended to [8,23].

Tip 7: Add references (cite, cite, cite)

Just like Wikipedia, Wikidata is primarily a secondary resource and acts as a hub or proxy to other resources, ideally in a way that facilitates verifiability. All statements should therefore, whenever possible, cite their provenance to existing knowledge in external reliable sources. References are added via the “add reference” link. To cite research articles, books, and other common reference types, you can reference their Wikidata QID (Fig 3A). If the source you want to use as a reference doesn’t have a Wikidata item yet, you can add it using tools such as Scholia. It is also possible to reference entries in external databases (Fig 3B) or webpages (Fig 3C). For sources that might change over time, like databases and webpages, it is best to include the date retrieved or even an archived URL. Lastly, especially when a concrete reference isn’t possible, it is useful to provide the heuristic used (Fig 3D; list). It’s worth including citations for even seemingly trivial statements if a reference is available, for example, the statement that an intron (Q207551) is part of (P361) a primary transcript (Q7243183) references 2 papers (Fig 3A).

Fig 3. Example reference types to support statements. Examples of (A) referencing to Wikidata items for journal review articles, (B) referencing to a database entry, (C) referencing to a website, or (D) using a heuristic estimate to justify a statement. These screenshots contain only text and data released under a CC0 licence. https://doi.org/10.1371/journal.pcbi.1011235.g003
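The reference types in Fig 3 can be sketched as small property-to-value mappings. Only stated in (P248) is named in the article; the property IDs used here for reference URL (P854), retrieved (P813), and archive URL (P1065) are assumptions added for illustration, as are the helper names and the example.org addresses.

```python
from datetime import date
from typing import Optional

STATED_IN = "P248"       # named in the article
REFERENCE_URL = "P854"   # assumed property ID, not given in the article
RETRIEVED = "P813"       # assumed property ID, not given in the article
ARCHIVE_URL = "P1065"    # assumed property ID, not given in the article


def item_reference(source_qid: str) -> dict[str, str]:
    """Reference to a source that already has a Wikidata item (as in Fig 3A)."""
    return {STATED_IN: source_qid}


def web_reference(url: str, retrieved: date,
                  archive_url: Optional[str] = None) -> dict[str, str]:
    """Reference to a webpage, recording when it was retrieved (as in Fig 3C)."""
    ref = {REFERENCE_URL: url, RETRIEVED: retrieved.isoformat()}
    if archive_url is not None:
        ref[ARCHIVE_URL] = archive_url
    return ref


print(item_reference("Q17091219"))  # IUPHAR/BPS database
print(web_reference("https://example.org/entry", date(2023, 8, 1)))
```

Making the retrieval date a required argument for web sources, while leaving the archived URL optional, mirrors the advice above that changeable sources should at least record when they were consulted.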

Tip 8: Create new entities

Don’t be afraid to create new items. In general, each item should describe a single concept. For example, there are separate items for the ɑ-defensin protein domain (Q4063641), ɑ-defensin propeptide domain (Q24727071), ɑ-defensin gene family (Q81639709), ɑ-defensin 1 mouse gene (Q18248700), ɑ-defensin 1 mouse protein (Q21421153), etc. It is trivially simple to create a new item: the “create new item” link on the left will allow you to define an item, assign a short description, and add any aliases that it might also be known by. Newly created items always need to be given an instance of (P31) or subclass of (P279) statement to link them into the wider knowledgebase, but otherwise there are no compulsory fields. An easy way to identify additional statements to add is by checking items of a similar type. The interface will also attempt to suggest potential properties as you add statements (Fig 4). Although it’s best to avoid duplicates, merging items later is easy if it turns out there’s more than one for the same thing. You can also use Cradle, where you can populate new items via a lightweight form that prompts you to include the most common fields.

Fig 4. Property auto-suggestion. Once you start to add statements to an item (especially instance of/subclass of), the interface will begin to suggest common properties to add that other similar items include. Depicted are the suggestions given for a protein domain. For some properties, it will then also suggest common values for that statement. This screenshot contains only text and data released under a CC0 licence. https://doi.org/10.1371/journal.pcbi.1011235.g004

While all scientific concepts fit on Wikidata in principle, there are notability guidelines that advise on which things should or should not have items. For example, valid taxa, type specimens, or reference genomes are essentially automatically notable. In contrast, not all humans are sufficiently notable, though researchers who have published peer-reviewed articles usually are.

Proposing new properties that can be used to link items is trickier. Compared to the >100M items, there are only 8K properties, so these play more of the role of a controlled vocabulary. To propose a new property, simply list it and some example use cases at Wikidata:Property_proposal, and experienced contributors will check whether it makes sense to implement as proposed or with some changes, or whether an already existing property can be adapted.
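The one requirement for new items mentioned in this tip, an instance of (P31) or subclass of (P279) statement, is easy to check before anything is submitted. The sketch below uses a hypothetical ItemDraft structure and problems helper; neither is part of Wikidata or any of its tools.

```python
from dataclasses import dataclass, field

LINKING_PROPERTIES = ("P31", "P279")  # instance of, subclass of


@dataclass
class ItemDraft:
    """Hypothetical local draft of a new item, prepared before submission."""
    label: str
    description: str = ""
    aliases: list[str] = field(default_factory=list)
    statements: dict[str, list[str]] = field(default_factory=dict)  # PID -> values


def problems(draft: ItemDraft) -> list[str]:
    """List reasons why the draft is not yet ready to be created."""
    issues = []
    if not draft.label.strip():
        issues.append("missing label")
    if not any(draft.statements.get(pid) for pid in LINKING_PROPERTIES):
        issues.append("needs an instance of (P31) or subclass of (P279) statement")
    return issues


draft = ItemDraft(label="example protein", description="hypothetical draft item")
print(problems(draft))               # one issue: no linking statement yet

draft.statements["P31"] = ["Q8054"]  # instance of: protein
print(problems(draft))               # → []
```

A check like this only covers the minimum; looking at items of a similar type, as suggested above, remains the better guide to which further statements are worth adding.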

Tip 9: Edit information in bulk

Once you’ve learnt how to add single statements and create single items, you’ll likely want to scale this up to edit information in bulk. Databases released under CC0 are becoming more common and can be integrated into Wikidata in full (e.g., CIViC, WikiPathways, Disease Ontology, and the Evidence and Conclusion Ontology). Other datasets (e.g., UniProt, CC BY 4.0 licence) can still be integrated by linking out to them via external identifiers (example) or have their data integrated as statements with proper referencing to attribute it (example). When getting into larger scale editing, it is generally best to scale up through test sets so that any issues surface early: do a batch of 10 or a hundred edits before trying a thousand or a million.

There is a range of ways to edit in bulk. Wikidata Tools are available that cover a range of common situations. Editing tools can generally only be used after a minimum number of manual edits (typically 50) or a minimum age of the account (typically 4 days). OpenRefine and Ontotext Refine take a spreadsheet of statements to be added and reconcile text strings in that spreadsheet to their most likely Wikidata items, flagging required manual intervention for ambiguous matches [24]. Ontotext Refine also contains an “RDF mapper,” which can help integrate Wikidata into external databases by generating a separate RDF that uses Wikidata’s identifiers but can be used outside of Wikidata. QuickStatements is a similar Wikidata editing tool, though it does not include the reconciliation functions, so you’ll need to know any Wikidata QIDs to be included in statements beforehand [25].

Libraries are available in a range of languages (Table 1) to interface with Wikidata via its dynamic API and the query service. For example, the WikidataIntegrator library can update items based on external resources and then confirm data consistency via SPARQL queries. It is used by multiple Python bots to keep biology topics up to date, such as genes, diseases, and drugs (ProteinBoxBot) [14], or cell lines (CellosaurusBot) [26].
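The advice to start with small batches can be sketched as follows. The example formats statements as tab-separated lines in the style of QuickStatements' version 1 syntax (an assumption here is that reference properties are written with an S prefix, e.g., S248 for stated in), then yields them in batches of growing size. The function names are hypothetical and the rows are placeholders that are not meant to be submitted.

```python
from itertools import islice
from typing import Iterable, Iterator


def to_command(subject: str, prop: str, value: str, stated_in: str) -> str:
    """Format one item-valued statement as a tab-separated command line."""
    # Assumed convention: the reference property is written as S248, not P248.
    return "\t".join([subject, prop, value, "S248", stated_in])


def growing_batches(rows: Iterable[str],
                    sizes: tuple[int, ...] = (10, 100, 1000)) -> Iterator[list[str]]:
    """Yield batches of increasing size; the last size repeats until rows run out."""
    it = iter(rows)
    index = 0
    while True:
        size = sizes[min(index, len(sizes) - 1)]
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
        index += 1


# Placeholder rows on the sandbox item Q4115189; the values are dummies.
rows = [to_command("Q4115189", "P921", f"Q{n}", "Q17091219") for n in range(1, 251)]
print([len(batch) for batch in growing_batches(rows)])
# → [10, 100, 140]
```

Checking the result of each batch before releasing the next keeps any modelling mistake confined to a handful of edits that are quick to undo.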

Table 1. Example Wikidata packages and libraries (extended list). https://doi.org/10.1371/journal.pcbi.1011235.t001

Since Wikidata is expressed as RDF, it comes with an EntitySchema extension [27] that enables describing the schema of captured knowledge as Shape Expressions (ShEx), a formal language to describe data on the Semantic Web [28]. EntitySchemas have been created for a range of item classes (list), for example, the Protein Reactome Schema (E39) or clinical trial schema (E189). They act as documentation for the data deposited by data donors, and they also serve to describe the expectations of users [8,28].

---
[1] Url: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011235

Published and (C) by PLOS One
Content appears here under this condition or license: Creative Commons - Attribution BY 4.0.