URL-encoding of external-id values in Wikidata frontend breaks (some) links
Open, Medium, Public

Description

Values of external-id properties link to the external database entry from the Wikidata frontend. In some cases the external link does not work because special characters within the external identifier, such as %, &, =, and possibly others, get URL-encoded. An example is https://www.wikidata.org/wiki/Q325887#P3520 with the correct identifier W%D6LLEKLA01. The extra URL-encoding turns it into W%25D6LLEKLA01, which does not work.
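To illustrate the double-encoding described above, here is a minimal Python sketch. It assumes the frontend behaves roughly like a generic percent-encoder applied to the whole identifier; `urllib.parse.quote` only stands in for whatever the frontend actually uses.

```python
from urllib.parse import quote

# Identifier from the report; it already contains a percent-encoded byte (%D6).
identifier = "W%D6LLEKLA01"

# Assumption: the frontend percent-encodes the whole value, including the
# existing "%", before substituting it into the formatter URL.
double_encoded = quote(identifier, safe="")

print(double_encoded)  # W%25D6LLEKLA01
```

The "%" that was already part of a valid escape sequence becomes "%25", so the external site receives a different identifier than the one stored in Wikidata.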

I originally requested an update at wikidata:MediaWiki talk:Gadget-AuthorityControl.js, but I learnt that this gadget is no longer responsible for the linking.

Event Timeline

Spaces can also cause problems: apparently a space " " gets encoded as a plus "+". This breaks the links generated for Iconclass notation (P1256). For example, http://iconclass.org/11H(COSMAS+&+DAMIAN) is generated from "11H(COSMAS & DAMIAN)" instead of http://iconclass.org/rkd/11H(COSMAS%20&%20DAMIAN)/ (from this item).

@Marsupium Have you tried using the wmflabs tool wikidata-externalid-url? I used it for the formatter URL for Twitch game ID (P4467) in September because of the space issue, and it worked without any changes to the actual Toolforge code.

Yes, I found that workaround in the end and then forgot to report here. Thanks for mentioning it!

Does the frontend need to URL-encode external IDs? If they are meant to be resolvable, there should be no need to do so in the first place, right?

Turns out this is also a problem within WDQS. When it comes to IDs and URIs, is there a reason for encoding this type of information in the first place instead of leaving that to applications?

https://www.wikidata.org/wiki/Property:P2000 is another one that's affected by the incorrect encoding of spaces to '+'. Given that these values are strings like "Cantigas de Santa Maria", which generates a failing link of http://www1.cpdl.org/wiki/index.php/Cantigas+de+Santa+Maria, it probably fails more often than it works. It should be encoded as %20, which would work everywhere.
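The difference between the two space encodings can be shown with Python's standard library. This is only a sketch of the general behaviour, not the code Wikibase runs: `quote_plus` uses the form-style convention ("+" for space), while `quote` uses percent-encoding ("%20").

```python
from urllib.parse import quote, quote_plus

value = "Cantigas de Santa Maria"

# Form-style encoding: spaces become "+", which is only meaningful in a
# query string and is taken literally in a path.
form_style = quote_plus(value)

# Percent-encoding: spaces become "%20", which is valid in any URL component.
percent_style = quote(value)

print(form_style)     # Cantigas+de+Santa+Maria
print(percent_style)  # Cantigas%20de%20Santa%20Maria
```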

Lydia_Pintscher raised the priority of this task from Low to Medium. Feb 12 2021, 5:12 PM
Lydia_Pintscher subscribed.

No. Given the number of reports, let's raise it.

Lydia_Pintscher claimed this task.

I believe the changes in T271126 fixed this. If you still find cases that don't work please reopen.

Nikki subscribed.

It is still encoding #, which has led to https://www.wikidata.org/w/index.php?diff=1513530153. I assume Epidosis's example is also still broken for the same reason, although I can't test it because that one was also replaced by a link to Toolforge.

Another encoding problem: https://www.wikidata.org/wiki/Property:P2549 contains "&" in its IDs, and because the "&" is encoded, all the links are broken.

Example:

Could this be solved? Thanks!

The fundamental problem here is that the correct encoding of the parameter depends on which part of the URL you are inserting it into. This is why we have so many functions to handle encoding in wiki pages: https://www.mediawiki.org/wiki/Help:Magic_words#URL_data

Right now a formatter does not specify which type of encoding should be applied when escaping the values, so it is currently not possible to do this without mistakes. A solution could be to add a property to the URL formatter that specifies which part of the URL the value will be substituted into, and to use that to determine the correct encoding when escaping the values for the URL formatter.
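As a rough sketch of that idea, the following Python shows how a per-formatter "component" setting could select the escaping rules. Everything here is hypothetical: the `SAFE_BY_COMPONENT` table, the `format_url` helper, the component names, and the example.org formatter URLs are illustrations, not part of Wikibase.

```python
from urllib.parse import quote

# Hypothetical table: characters left unescaped for each URL component.
# None means "insert the value verbatim" (for IDs that are already encoded).
SAFE_BY_COMPONENT = {
    "path": "/:@!$&'()*+,;=",
    "query": "",
    "fragment": "/?:@!$&'()*+,;=",
    "raw": None,
}

def format_url(formatter: str, value: str, component: str) -> str:
    """Substitute value for $1, escaping it according to the target component."""
    safe = SAFE_BY_COMPONENT[component]
    encoded = value if safe is None else quote(value, safe=safe)
    return formatter.replace("$1", encoded)

# Path segment: spaces become %20, while "&" and parentheses stay as they are.
print(format_url("https://example.org/rkd/$1/", "11H(COSMAS & DAMIAN)", "path"))
# Query value: "&" must be escaped so it is not read as a parameter separator.
print(format_url("https://example.org/search?q=$1", "A & B", "query"))
# Already-encoded identifier: inserted unchanged.
print(format_url("https://example.org/entry/$1", "W%D6LLEKLA01", "raw"))
```

The same value is escaped differently depending on where it lands, which is why a single global encoding rule cannot be correct for every formatter.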

@TheDJ sums up the situation well. Changing the way the encoding works at the moment would probably break more than it fixes.