Wikidata:Requests for permissions/Bot/LargeDatasetBot
- The following discussion is closed. Please do not modify it. Subsequent comments should be made in a new section. A summary of the conclusions reached follows.
- Approved --Lymantria (talk) 07:10, 30 August 2019 (UTC)[reply]
LargeDatasetBot (talk • contribs • new items • new lexemes • SUL • Block log • User rights log • User rights • xtools)
Operator: GZWDer (talk • contribs • logs)
Task/s: Creating and maintaining items for scholarly articles.
Code: Via fatameh, WikidataIntegrator, Pywikibot, QuickStatements, SourceMD and other tools. This task does not have a single source code base.
Function details: This is a fully automatic bot. As the first task, the bot will import the remaining 12 million entries from PubMed (see examples from 2017). This will take several months to complete.
After this, the bot will import entries from other sources (ArXiv, Mathematical Reviews, etc.), ~6 million entries in total.
The bot will probably also add descriptions and citation data (mirroring Citationgraph bot) to existing and new items about scholarly articles, and also create items for the authors of the articles (sources that are not deduplicated will not be imported; the bot will only use sources like ORCID).
All items created will have at least one identifier; more identifiers from other databases may be added. Duplicate entries will be merged.
In the long-term future, all entries in Microsoft Academic (200 million) will be imported to Wikidata (this is not something that will be completed in the foreseeable future, and I do not plan to do it currently either).
Note that I do not plan to start even the first task immediately (it can be started at any time), so you may freely discuss this task and reach a consensus before it is started.--GZWDer (talk) 22:04, 29 July 2019 (UTC)[reply]
- Please do a test for each dataset you plan to import (maybe 100 items each). --- Jura 22:16, 29 July 2019 (UTC)[reply]
- For PubMed (which is the first dataset) you may see the 2017 edits. As the SourceMD tool is currently broken, I'm not able to import from DOIs until the tool is fixed or I set up WikidataIntegrator on Toolforge; there are plenty of past imports too, though.--GZWDer (talk) 22:37, 29 July 2019 (UTC)[reply]
- Can you fix the unicode bugs? --- Jura 22:45, 29 July 2019 (UTC)[reply]
- This may be fixed by having a script go through titles and author name strings and find any non-ASCII characters therein. I hope phab:T46581 will be solved so that this can be done locally.--GZWDer (talk) 22:52, 29 July 2019 (UTC)[reply]
- I'd rather not have someone run a bot who "hopes" that known bugs get fixed later by someone else. --- Jura 22:55, 29 July 2019 (UTC)[reply]
- Having a script review all creations, or analysing incremental dumps, are also possible, though not pretty, ways to fix the issue.--GZWDer (talk) 23:06, 29 July 2019 (UTC)[reply]
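A minimal Python sketch of such a review script (the character ranges and mojibake patterns below are illustrative assumptions, not an agreed specification):

    import re

    # Anything outside printable ASCII; a crude proxy for "needs review".
    NON_ASCII = re.compile(r"[^\x20-\x7E]")
    # Common UTF-8-decoded-as-Latin-1 mojibake prefixes ("Ã©", "Î²", "â€™", ...).
    MOJIBAKE = re.compile(r"[ÃÂÎÏ][\u0080-\u00FF]|â€")

    def flag_strings(strings):
        """Yield (value, reason) pairs for titles or author name strings
        that should be queued for manual or scripted review."""
        for s in strings:
            if MOJIBAKE.search(s):
                yield s, "possible mojibake"
            elif NON_ASCII.search(s):
                yield s, "non-ASCII (may be legitimate, review)"

    sample = [
        "Structure of Î²-trimyristin",       # mojibake for "β-trimyristin"
        "Modelagem molecular aplicada",      # plain ASCII, not flagged
        "β-lactamase inhibitors in sepsis",  # legitimate non-ASCII, flagged for review only
    ]
    for value, reason in flag_strings(sample):
        print(f"{reason}: {value}")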
- Plenty of paper titles do contain MathML. Given that Wikidata doesn't support MathML, those claims should be converted into the corresponding Unicode characters. I agree with Jura that data cleaning should be done before the import. If good data cleaning is done, I Support the import. ChristianKl ❪✉❫ 07:50, 30 July 2019 (UTC)[reply]
- We probably should discuss a few things on this, if it is to be a mass import in some consistent fashion... Here are a few issues from me! First, PubMed tends to munge titles and author names. If the paper has a DOI, I think it would be preferable for the bot to query the Crossref API for detailed metadata, rather than relying on what PubMed supplies. Second, I've merged about 30,000 duplicate articles that had been imported, I believe via SourceMD - they had duplicate DOIs, identical titles, etc. How will the bot ensure it is not importing a duplicate? I think this has to be done via a query to the wikibase API rather than using the query service, due to synchronization issues there. Third, note that DOIs are technically case-insensitive - SourceMD has been upper-casing them before adding them. However, the original publisher-supplied DOI may have mixed case - is this something we should try to preserve, or continue with the SourceMD approach? ArthurPSmith (talk) 14:16, 30 July 2019 (UTC)[reply]
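A rough Python sketch of the Crossref-plus-duplicate-check step described above (the DOI shown is a placeholder, and using the haswbstatement search keyword as the lookup is an assumption, not settled practice):

    import requests

    WIKIDATA_API = "https://www.wikidata.org/w/api.php"
    CROSSREF_API = "https://api.crossref.org/works/"

    def normalize_doi(doi: str) -> str:
        # DOIs are case-insensitive; SourceMD upper-cases them, so do the same
        # before comparing or storing, to keep existing items findable.
        return doi.strip().upper()

    def crossref_metadata(doi: str) -> dict:
        # Prefer publisher-supplied metadata over possibly munged PubMed strings.
        r = requests.get(CROSSREF_API + doi, timeout=30)
        r.raise_for_status()
        return r.json()["message"]

    def existing_item_for_doi(doi: str):
        # Look for an existing item carrying this DOI (P356) via the search API
        # rather than the query service, as suggested above.
        params = {
            "action": "query",
            "list": "search",
            "srsearch": f"haswbstatement:P356={normalize_doi(doi)}",
            "format": "json",
        }
        r = requests.get(WIKIDATA_API, params=params, timeout=30)
        r.raise_for_status()
        hits = r.json()["query"]["search"]
        return hits[0]["title"] if hits else None  # e.g. "Q29029898", or None

    # Usage (with a real DOI): skip creation if existing_item_for_doi(doi)
    # returns a Q-id; otherwise build the new item from crossref_metadata(doi).

Since the search index itself lags a little behind edits, a final check against the live item before saving may still be worthwhile.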
- Because provenance of the knowledge in Wikidata (and Wikipedia) is essential, primary research literature is essential. However, there is currently no public, community-driven resource capturing scholarly literature with the annotation it needs. This annotation includes how articles cite each other (*all* scholarly articles depend on other articles for essential details), what facts they describe ("stated in", etc.), what the articles' main topics are, retraction info, why they cite each other (think Citation Typing Ontology), who authored them, and at what affiliation. All this information is required to understand the provenance of knowledge (the facts we wish to capture in Wikidata). Today this is more important than ever. Wikidata cannot be a decent encyclopedia if we do not track provenance well (something long recognized in Wikipedia: "citation needed"). I also want to stress that I strongly favor Wikidata taking care of this ourselves: provenance is too important to decouple from the facts (so, no separate Wikibase please). Now, different from Wikipedia, Wikidata is a database, and not having the articles in the database does not make sense to me. I'm fully aware the amount of literature is massive, but if an encyclopedia wants to cover knowledge, it *is* going to be massive. Therefore, Support in general. Now, it's important to have focus and to be accurate and precise in the details, as well as in what we capture about literature. There are some aspects that we must talk about. For example, do we want a preprint in Wikidata if the published version is there too? Generally not, I guess (one way would be to remove the preprint item and move the preprint DOI (..) to the full article). But this is a grey area, and practices vary between research/knowledge domains. Another aspect is where we source from. If we automate it, the source should be public domain or CCZero. Moreover, I found the quality of Microsoft Academic far too low for myself, and do not support it as a data source (and I also doubt it is CC0/PD). Finally, I have full trust that the technical issues that complicate automatically entering all this data will be overcome (I know people are working on identifying the real technical reasons, e.g. editing an item with many properties is much more troublesome than an edit on a small item, something I run into when editing chemical structures too). It has to be overcome, anyway, if Wikidata ever wants to be an encyclopedia supporting the "sum of all knowledge". I would like to remind people that 100M items is not a lot, and if 100M articles is a problem, then Wikidata has effectively failed as a project. --Egon Willighagen (talk) 07:05, 31 July 2019 (UTC)[reply]
- @GZWDer: How is it ensured that the correct language is attributed to the scholarly articles? Having language = English in P1476 in an item like Modelagem molecular aplicada ao desenvolvimento de moléculas com atividade antioxidante visando ao uso cosmético (Q57017676) is obviously wrong. Steak (talk) 09:26, 1 August 2019 (UTC)[reply]
- @Steak: Usually the language of the article title (unless it is translated to English) is the language of the journal the article is published in. The language of BJPS: Brazilian Journal of Pharmaceutical Sciences (Q15765904) is wrong too.--GZWDer (talk) 09:42, 1 August 2019 (UTC)[reply]
- An item that isn't complete isn't automatically wrong, it's just incomplete. I would expect that there are plenty of minor journals where the language information is incomplete and you can't assume that the journal is monolingual just because it lists a single language in language of work or name (P407). ChristianKl ❪✉❫ 11:52, 1 August 2019 (UTC)[reply]
- Many thanks to GZWDer for starting this discussion. I wish that approach had been used from the start - we would have a much more consistent dataset.
I am going to Oppose this because I think the infrastructure and the community are not ready for an import of this size, but I am happy to support it if someone from WMDE can confirm that they think the infrastructure can handle such an import. − Pintoch (talk) 13:34, 1 August 2019 (UTC)[reply]
- Can we close this discussion and ask for a separate request for each source? --- Jura 17:20, 2 August 2019 (UTC)[reply]
- If the consensus of this discussion is that we are okay with a certain source but not okay with the whole of Microsoft Scholar, then this conversation can be ended with giving only approval for those things where we have consensus. Given that there are a few general questions, I think it's helpful not to spread the discussion of the general issues over multiple requests. ChristianKl ❪✉❫ 14:27, 5 August 2019 (UTC)[reply]
- I think this big-picture discussion (e.g. across multiple sources or Wikidata namespaces) is useful, but a bot permission page might not be the best place for it, since bots will need to have some specificity in terms of the source(s) they use, which should be considered as part of the approval process for each individual bot (or task).
- For such individual bot discussions, and especially with regard to completeness, I think we could strive to engage more with things like expected completeness (P2429) statements and to start referencing relevant community discussions there. I have started Property talk:P932 to test the waters for this.
- In terms of what would be a healthy scale for Wikidata, I think it would be good if we could point some of the energy from such discussions into sketching out the Limits of Wikidata in more detail (and update them on a regular basis), which can then inform further discussion, decisions, workflows etc.
- --Daniel Mietchen (talk) 02:39, 4 August 2019 (UTC)[reply]
- @Daniel Mietchen: I think ultimately it is the developers' job to tell us what the infrastructure can handle. It is not clear to me that a WikiProject is the right place for such discussions - that generally tends to happen more on Phabricator. Bot requests are the right place to discuss large imports. − Pintoch (talk) 10:02, 4 August 2019 (UTC)[reply]
- @Pintoch: I am sad for you that you feel the need to tell others what you think the developers "have" to tell us. The WMF is a rich organisation. It has as its mission to share in the sum of all knowledge. When we do not form notions of what we can achieve and how and why, how are we going to reach for those stars? There was a discussion on the mailing list where developers were told that other query tools are more powerful. The weak return was about budgets, not about technology. Just consider what is possible and what your ambitions are for Wikidata. Thanks, GerardM (talk) 15:20, 4 August 2019 (UTC)[reply]
- In theory we do have a community manager (@Lea_Lacroix_(WMDE):) whose role it is to mediate between the needs of the community and the WMDE developers. ChristianKl ❪✉❫ 14:27, 5 August 2019 (UTC)[reply]
- I thought devs are here to support volunteer contributors in building an encyclopedia and a database? What need should there be the other way round? --- Jura 14:54, 5 August 2019 (UTC)[reply]
- I would call concerns about how much the database can handle needs from the developer side, but that's just semantics. My main point is that if there's an objection from their side, they should come here to voice it and there's no need for us to go to phabricator. ChristianKl ❪✉❫ 15:04, 5 August 2019 (UTC)[reply]
- Oh surely. I think item creation capacity is capped anyways to prevent us from including all the internet ;) --- Jura 15:23, 5 August 2019 (UTC)[reply]
- @Pintoch: I think developers should use the channels most appropriate for them, and combine this with documentation and outreach as necessary. This mostly works, in my opinion, but not necessarily when it comes to the limits of Wikidata. So Wikidata:WikiProject Limits of Wikidata is meant to collect pointers to existing (or missing) communications about such limits (many people can help with that), and to translate them into something not just developers can understand (this may well require rarer skills). --Daniel Mietchen (talk) 15:45, 5 August 2019 (UTC)[reply]
- @Daniel Mietchen: I guess my main issue with using a dedicated WikiProject for these discussions is that it seems hard to come up with definite limits on, say, the maximum rate of item creations, and then judge imports by that standard. Item sizes vary, the way they are linked varies too, the way they are going to impact the search results and the WDQS depends on the data, and so on. So I am not sure how useful it is to debate these limits in the abstract. If WMDE is able to come up with such absolute limits, then that is great. (But is that really something that ought to be documented in a WikiProject and not a more official-looking page? WikiProjects are great for coordinating editing in specific areas, but not so great for most other things, I would say.) As far as I can tell, these sorts of decisions need to be taken on an individual basis by looking at the shape of the import (so, in pages like this one). I understand that WMDE might not want to get involved in "editorial decisions" of this sort, but for large-scale imports I think it would still make sense to request some feedback.
- Now that being said if there is something particular (and precise!) you would want me to do in that WikiProject, let me know. − Pintoch (talk) 16:05, 5 August 2019 (UTC)[reply]
Hello all,
We cannot provide fixed numbers, as this is not only about technical restrictions but about plenty of other parameters (for example: will the community be able to handle a huge and quick growth of content?).
However, let me point you to the summary of a very interesting discussion that happened during Wikimania 2018: Wikidata:WikiCite/Roadmap. The WikiCite people together with the development team created an overview of the different options and scenarios that we could consider heading towards, comparing risks and benefits. The table at the end of the page is particularly interesting.
I guess that if this discussion goes on, whatever the discussion page you choose, people active in the WikiCite project should be involved. Lea Lacroix (WMDE) (talk) 09:36, 8 August 2019 (UTC)[reply]
- Yes, it makes sense to ping the people involved in WikiCite:
The Source MetaData WikiProject does not exist. Please correct the name. ChristianKl ❪✉❫ 12:09, 8 August 2019 (UTC)[reply]
- Support I am a participant in the WikiCite project. I support the development and testing of the bot that can accomplish this task. I support academic citations being the pilot data set. I also want to recognize that @Pintoch: has routinely raised valid criticism that Wikidata as a project has reached its limits with regard to the number of items and statements it can hold and its ability to query them. Even as we reach limits now, I know that Wikidata will expand its capacity not only for academic citations but for the next field of items. The current limit of Wikidata seems to be about 100 million items. If we knew that Wikidata could hold 200 million items with about as many statements, edits, and queries as we have now, then it would be very easy to support this import right now. It is important to have a pilot project, and the WikiCite project is where the largest group of people are testing and engaged with a single networked data collection. Among all the other data collections various people will import to Wikidata, WikiCite is only the first big well-structured dataset to gain broad interest, but it will not be the last. I wish that we could predict when we will have the technical capacity to manage 500 million and 1 billion items at current demand and its current rate of growth. Blue Rasberry (talk) 15:37, 8 August 2019 (UTC)[reply]
- Support per Egonw. Identify O(1000)-entry subsets and import those one subset at a time. Sj (talk) 22:29, 8 August 2019 (UTC)[reply]
- Changing to Support since WMDE does not seem to oppose this. I still think it might be disruptive for the community and the infrastructure, but I think the millions of publication items that are already out there would be much more useful if complete datasets such as this one are imported. If there is an overwhelming consensus in the community for this sort of import and no clear message from WMDE against it, let's just do it and see what happens… − Pintoch (talk) 09:23, 13 August 2019 (UTC)[reply]
- @GZWDer: Can you comment on the discussion? Would you be happy if we conclude this request with "Permission to import the remaining 12 million entries from PubMed is granted for the bot, provided the dataset is first cleaned (everything is in proper Unicode) and the imports are done one subset at a time as Sj proposed. Please create a new request after the PubMed imports are done for the next data set in question"? ChristianKl ❪✉❫ 10:54, 13 August 2019 (UTC)[reply]
- So this means 12 million entries from PubMed will be imported first, and new requests for permissions should be created for other imports. I do not have any issue with this.--GZWDer (talk) 10:58, 13 August 2019 (UTC)[reply]
- Yes. ChristianKl ❪✉❫ 11:23, 13 August 2019 (UTC)[reply]
I am ready to approve the request as worded by ChristianKl in a couple of days, provided that no new objections are raised (considering these as test edits). Lymantria (talk) 13:22, 13 August 2019 (UTC)[reply]
- I think we should see more recent edits than that. Especially since a series of problems with the tool being used have been identified since. 500 new edits should be easy to do. --- Jura 13:33, 13 August 2019 (UTC)[reply]
- I am not happy with a bot without a flag making 599 edits. Can we grant a flag, let the bot make 500 or 1000 edits, and then pause so that the community can evaluate them?--Ymblanter (talk) 05:44, 14 August 2019 (UTC)[reply]
- That sounds like a good way forward. ChristianKl ❪✉❫ 08:47, 14 August 2019 (UTC)[reply]
- I have granted a bot flag for one week in order to make (500-1000) test edits. Lymantria (talk) 05:59, 15 August 2019 (UTC)[reply]
The quality of source datasets is very important, so let's have some discussion about that. @ArthurPSmith, Egonw: could you please give examples about low quality in PubMed and MAG?
- IMHO PubMed is one of the better structured datasets out there, especially the MESH classification, which should be imported (we now have all 3 requisite props: MESH subject, concept, term). But it has no author IDs.
- MAG has author ids although many (most?) authors are not deduplicated (I had 15 records before I "claimed" them).
- CrossRef has the same quality problems as the aggregated datasets.
- Re titles that include MathML, the suggestion of @ChristianKl: to convert to Unicode is not sufficient, since a lot of MathML cannot be converted to characters. We had a discussion about adding new title props with special markup; the same probably holds for LaTeX markup and chemical formulas. @GZWDer: I hope the initial import will include such complex cases so we can examine them. --Vladimir Alexiev (talk) 08:21, 18 August 2019 (UTC)[reply]
- @Vladimir Alexiev: There are many different issues going on. One is that PubMed routinely does some sort of title translation, for example of articles with Greek letters or mathematical notation in the title. For example, in Structure of β-trimyristin and β-tristearin from high-resolution X-ray powder diffraction data (Q29029898), the β characters of the original title were converted by PubMed to the string "beta"; both versions of this title were imported into Wikidata despite the identical DOI, and I had to merge them. PubMed also removed the period character following the initials of the author names and reversed them so that the last name was first, so in PubMed (and as imported) the names were like "Peschar R" rather than the "R. Peschar" of the original publication. A completely different case is Hearing loss in a cultural-historical context. Part 1 (Q50484936). The original article title is in German - "Gehörlosigkeit im kulturgeschichtlichen Kontext" - which is not listed at PubMed or yet at Wikidata; instead a translation is given for the title, enclosed in square brackets. Worse, the same article is listed TWICE (identical journal and page number; identical DOI) with two different PMIDs and slightly different titles. Again this was entered twice in Wikidata. There are many duplicates like this with slight variations in the title - often the issue is whether or not a subtitle (or in this case a "part" label) is included. Another variant of this problem seems to be common for older articles, for example Angiomatous malformations of the brain; successful extirpation in three cases (Q51372558). Again there are two PMIDs with titles differing in capitalization and punctuation. I've run across some cases where one version had a word clearly misspelled - somebody had obviously typed it in incorrectly. I am guessing these must derive from versions of the titles provided by citing articles; if two citing authors typed the title differently, you end up with two different PubMed entries, maybe? All that said, I think the relative percentage of these problems is not huge - surely less than 1%, maybe about 0.1%? But when we're talking about tens of millions of items, that leaves a lot of problem cases for people to deal with somehow... ArthurPSmith (talk) 14:39, 19 August 2019 (UTC)[reply]
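One way to catch most of the near-duplicates described above before creating an item would be to reduce titles to a normalized comparison key. The Python sketch below is only illustrative; the transliteration table and the exact normalization steps are assumptions, not an exhaustive mapping:

    import re
    import unicodedata

    # Reverse a few transliterations PubMed applies; illustrative, not exhaustive.
    GREEK = {"alpha": "α", "beta": "β", "gamma": "γ", "delta": "δ"}

    def title_key(title: str) -> str:
        """Reduce a title to a crude key so that variants such as
        'beta-Trimyristin ...' and 'β-trimyristin ....' collide."""
        t = unicodedata.normalize("NFKC", title).lower().strip()
        for name, letter in GREEK.items():
            t = re.sub(rf"\b{name}\b", letter, t)
        t = t.rstrip(". ")                     # PubMed-style trailing full stop
        t = re.sub(r"[^\w\s]", " ", t)         # ignore punctuation differences
        return re.sub(r"\s+", " ", t).strip()  # collapse whitespace

    a = "Structure of β-trimyristin and β-tristearin from high-resolution X-ray powder diffraction data"
    b = "Structure of beta-trimyristin and beta-tristearin from high-resolution X-ray powder diffraction data."
    assert title_key(a) == title_key(b)

Items whose keys collide would still need a human check (or at least a DOI comparison) before merging, since subtitles and "Part N" labels can legitimately differ.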
- I have a couple of concerns about the (potential) duplicate rate of these imports. If you have a look at the history of User:Andrew Gray/Royal Society biographies (all articles in one, quite small, journal), there's a steady trickle of new PubMed entries that duplicate existing articles - they end up being manually merged, which is a bit of a headache. On an individual scale, this isn't a problem, but I estimate something like 1% of the papers in that journal have been reimported over the past year or so.
- One factor contributing to this is that PubMed seems to do odd things to article titles. Leaving aside the issues of MathML and the like, it seems to routinely add full stops at the end of article titles (see e.g. Escherichia coli DNA glycosylase Mug: a growth-regulated enzyme required for mutation avoidance in stationary-phase cells (Q34091159); compare PubMed, the original paper, and Crossref). This presumably complicates duplicate-checking (since the two titles are no longer identical). It would be good if these could be stripped out during the import - GZWDer, would this be possible as part of your data-cleaning?
- In general, though, I'm happy to support this if it's clear we have the capacity to support it and it won't cause technical issues. Andrew Gray (talk) 10:25, 18 August 2019 (UTC)[reply]
- Support per Egon. Cheers, Nomen ad hoc (talk) 13:46, 18 August 2019 (UTC).[reply]
- Oppose until issues with autosuggest are resolved. Journal articles are now crowding out regular items for autosuggest, both on Wikidata and Commons. We need a way to filter them out in certain contexts, for example, when choosing what an image depicts on Commons. FYI, this issue has also been raised on the Wikidata project chat. Kaldari (talk) 01:23, 19 August 2019 (UTC)[reply]
- Comment I had a look at the first tests: [1] Looks ok so far. Notably no new mojibake. --- Jura 19:46, 22 August 2019 (UTC)[reply]
- I have done some imports from PMID 1-999999999 and 10000000-10000999.
- @Andrew Gray: If there is DOI information in the Europe PMC output, and the DOI exists in Wikidata, the bot will amend the existing item instead of creating a new one.
- @Jura1: Currently I do not check for bad Unicode, and I have not found examples of bad Unicode in the imported items. The intended way to check them is to record all items created with exotic characters and fix them.
- @Vladimir Alexiev: non-deduplicated sources will never be imported.
- I also imported some entries from Crossref. As the next step I'm planning to make a recursive ORCIDator, but there are some issues in the Crossref data (encoding errors, HTML in titles, this article is not in English and the API output does not provide the language of the article) that need to be fixed.
--GZWDer (talk) 23:16, 22 August 2019 (UTC)[reply]
- The title of Collaborative knowledge discovery & marshalling for intelligence & security applications (Q66711373) contains ugly code entities (&#x00026;). This should not happen. Steak (talk) 20:49, 25 August 2019 (UTC)[reply]
- This entry was imported from CrossRef, and CrossRef as a collective source has a number of issues in its data. For HTML entities and tags in titles we already have some AbuseFilters to track them. Europe PMC does not seem to have this problem.--GZWDer (talk) 12:11, 26 August 2019 (UTC)[reply]
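For reference, a minimal Python sketch of the kind of clean-up that could be applied to Crossref-supplied titles before import (the tag list below is an assumption and would need checking against real Crossref records):

    import html
    import re

    # Simple inline markup tags that Crossref sometimes leaves in titles.
    TAG = re.compile(r"</?(?:i|b|sub|sup|em|strong)\s*>", re.IGNORECASE)

    def clean_crossref_title(title: str) -> str:
        t = html.unescape(title)   # "&#x00026;" -> "&", "&amp;" -> "&"
        t = html.unescape(t)       # also handle double-escaped entities
        t = TAG.sub("", t)         # drop the tags, keep their text content
        return re.sub(r"\s+", " ", t).strip()

    print(clean_crossref_title(
        "Collaborative knowledge discovery &#x00026; marshalling "
        "for intelligence &#x00026; security applications"))
    # -> Collaborative knowledge discovery & marshalling for intelligence & security applications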
- Looks good so far. I queried 4000 and checked a few. Nice idea to add articles about Wikipedia as a priority.
Maybe a minor thing: as these items tend to get large and we generally don't require a reference for P31, it could be skipped.
Besides that, as far as I'm concerned, feel free to go ahead with this source. --- Jura 14:02, 26 August 2019 (UTC)[reply]
I think the discussion is leading to a conclusion. I am ready to approve the request in a couple of days, provided that no further objections are raised. Lymantria (talk) 06:04, 28 August 2019 (UTC)[reply]