Wikipedia:Bots/Requests for approval/ScannerBot
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at Wikipedia:Bots/Noticeboard. The result of the discussion was Approved.
Operator: 0xDeadbeef
Time filed: 01:48, Thursday, May 5, 2022 (UTC)
Function overview: Removes tracker tags in Twitter links.
Automatic, Supervised, or Manual: Automatic
Programming language(s): Python
Source code available: gist
Links to relevant discussions (where appropriate):
Edit period(s): One time run
Estimated number of pages affected: <3000 per this query
Namespace(s): Mainspace
Exclusion compliant (Yes/No): Yes
Function details: Finds twitter.com URLs and removes the query parameters named s, t, or cxt.
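As a rough illustration of the function details above, the parameter removal could look like the following sketch. It is not the code from the linked gist; the names TRACKING_PARAMS and strip_tracking are made up for this example, and only the use of urllib and the parameter names s, t and cxt come from this request.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter names taken from the function details; the constant name is hypothetical.
TRACKING_PARAMS = {"s", "t", "cxt"}


def strip_tracking(url: str) -> str:
    """Return url with the s, t and cxt query parameters removed."""
    parts = urlsplit(url)
    # Keep every query pair whose name is not a tracking parameter.
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in TRACKING_PARAMS
    ]
    # urlunsplit omits the "?" entirely when the remaining query is empty.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking("https://twitter.com/example/status/123?s=20&t=abc"))
```

Note that urlencode re-encodes the surviving parameters, so a real run would probably want to check that untouched parameters round-trip unchanged before saving a page.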
Discussion
Comments before task change
- Comment: if a bot account is needed, I will probably use ScannerBot. 0xDEADBEEF (T C) 01:51, 5 May 2022 (UTC)
- Note: The functionality and the scope of the bot were made more specific. See page history for more details. 0xDeadbeef (T C) 06:28, 14 May 2022 (UTC)
- Regex? Primefac (talk) 15:13, 14 May 2022 (UTC)
- @Primefac: You can look at the gist I linked. https://twitter\.com/\w+/status/\d+\?[^\s}<|]+ is used to match the URL, and then urllib is used to parse, and then remove the parameters. 0xDeadbeef (T C) 15:19, 14 May 2022 (UTC)
- You'll likely want https:\/\/twitter\.com\/\w+\/status\/\d+\?[^\s}<|]+ for regex, to escape the / characters. (Same for below). Headbomb {t · c · p · b} 01:13, 17 May 2022 (UTC)
- I embedded the regex as a Python raw string which does not need to escape forward slashes. 0xDeadbeef (T C) 01:17, 17 May 2022 (UTC)
- But dots still need escaping? Headbomb {t · c · p · b} 01:56, 17 May 2022 (UTC)
- Yes because . and \. have different meanings in regex. 0xDeadbeef (T C) 02:30, 17 May 2022 (UTC)
- I know. Just surprised one needs escaping and the other doesn't. Not important, if the code works, it works. Headbomb {t · c · p · b} 10:24, 17 May 2022 (UTC)
- @Headbomb, for what it's worth, I believe it's because some non-python RegEx is enclosed in / . . . /, so / needs to be escaped, but in python RegEx is just given as a string ' . . . ' Qwerfjkl (talk) 14:22, 29 May 2022 (UTC)
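The escaping question in the thread above can be checked with a small sketch. The three pattern names are illustrative only; the first two patterns are the ones quoted in the discussion, and the third deliberately leaves the dots unescaped to show the difference.

```python
import re

# Raw string as described by the operator: forward slashes need no escaping.
PLAIN_SLASHES = re.compile(r"https://twitter\.com/\w+/status/\d+\?[^\s}<|]+")
# The suggested variant with escaped slashes; Python accepts it and it matches the same text.
ESCAPED_SLASHES = re.compile(r"https:\/\/twitter\.com\/\w+\/status\/\d+\?[^\s}<|]+")
# Unescaped dot: "." matches any character, so look-alike hosts also match.
UNESCAPED_DOT = re.compile(r"https://twitter.com/\w+/status/\d+\?[^\s}<|]+")

text = (
    "See https://twitter.com/example/status/123?s=20 "
    "and https://twitterXcom/example/status/456?s=20"
)

print(PLAIN_SLASHES.findall(text))
print(ESCAPED_SLASHES.findall(text))
print(UNESCAPED_DOT.findall(text))
```

The slash only needs escaping in languages where it doubles as the regex delimiter, whereas the dot is a metacharacter of the regex syntax itself, which is why the two behave differently here.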
- You'll want to detect primary URLs, or skip archive URLs, changing those will break them. Archive URLs can be 20+ types, it's probably easiest to detect if the twitter URL starts with "/" (example in Brandon Clarke). -- GreenC 16:15, 14 May 2022 (UTC)
- Yeah, I should probably match [^/] or [\s=>] for it to be primary. 0xDeadbeef (T C) 02:07, 15 May 2022 (UTC)
- Great, thanks. Also WebCite like https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com .. couple others use ?url= vs. "/" as the break point. -- GreenC 03:12, 15 May 2022 (UTC)
- @GreenC: Hmm, then it would be hard to distinguish a template parameter from a URL parameter in an URL... {{Foo|1=https://twitter.com}} and https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com 0xDeadbeef (T C) 04:03, 15 May 2022 (UTC)
- Right, I can't say what the regex would be. One method is match every string "/https?://twitter" and convert to "__hidestring__" (same with "?url=") - and when done convert those hidden strings back before saving the article. The "__hidestring__" might be "__hidestring-fs-http__" or "__hidestring-fs-https__" so you know how to revert back. Or really best, save the literal string in a table and the hidden string is the table identifier so it can be restored. That way it can match on "/https?://(([a-zA-Z0-9]|[a-zA-Z0-9][a-zA-Z0-9\\-]*[a-zA-Z0-9])[.])*twitter" which will capture all hostname(s) such as "/http://beta.twitter" -- GreenC 17:33, 15 May 2022 (UTC)
- Okay I used a negative lookbehind and you can look at the tests here: https://regexr.com/6lmgl 0xDeadbeef (T C) 23:18, 15 May 2022 (UTC)
- (?<!\?url=|/|cache:)https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+ 0xDeadbeef (T C) 04:25, 16 May 2022 (UTC)
- Nice. There is also sometimes very rarely protocol relative (WP:PRURL) eg. {{cite web |url=//twitter.com}}. They are so uncommon and can be tricky it would probably be OK to skip or log them if it doesn't fit with the regex. -- GreenC 05:21, 16 May 2022 (UTC)
- a quick search seems to show that it is fine. I've fixed all three that appeared from that search. 0xDeadbeef (T C) 06:52, 16 May 2022 (UTC)
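To make the negative-lookbehind idea from this thread concrete, here is a hedged sketch for Python's built-in re module. One caveat is an observation about that module and not something stated in the discussion: re requires fixed-width lookbehinds, so the single alternation tested on regexr does not compile there as written. Splitting it into three separate lookbehinds is one way around that; the gist may handle it differently. PRIMARY_TWEET_URL, find_primary and compiles are names invented for this example.

```python
import re

# Same exclusions as the regexr pattern, written as three fixed-width lookbehinds.
PRIMARY_TWEET_URL = re.compile(
    r"(?<!\?url=)(?<!/)(?<!cache:)"
    r"https://twitter\.com/\w+/status/\d+/?\?[^\s}<|]+"
)


def find_primary(text: str) -> list:
    """Return tweet URLs that are not embedded inside an archive or cache URL."""
    return PRIMARY_TWEET_URL.findall(text)


def compiles(pattern: str) -> bool:
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


citation = "{{cite web |url=https://twitter.com/a/status/1?s=20 |title=x}}"
wayback = "https://web.archive.org/web/2022/https://twitter.com/a/status/1?s=20"
webcite = "https://www.webcitation.org/6d0sXMyOT?url=https://twitter.com/a/status/1?s=20"

print(find_primary(citation))
print(find_primary(wayback))
print(find_primary(webcite))
# The variable-width alternation form is rejected by re.
print(compiles(r"(?<!\?url=|/|cache:)https://twitter\.com"))
```

Protocol-relative links such as //twitter.com are not matched by this pattern at all, which is consistent with the suggestion above to skip or log them.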
- Note: number of pages affected has been lowered following a quick search with "insource:". 0xDeadbeef (T C) 04:23, 21 May 2022 (UTC)
- {{BAG assistance needed}} Requesting BAG assistance due to stale BRFA. 0xDeadbeef (T C) 05:08, 27 May 2022 (UTC)
- To be clear: This BRFA has been inactive for some time. Primefac told me that they wanted input from other BAG members first. I would like to know if this is declined or approved for trial. Thanks. 0xDeadbeef 07:43, 28 May 2022 (UTC)
- Looks fine to me for trial. All issues raised above appear addressed anyway. -- GreenC 19:05, 29 May 2022 (UTC)
- Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Let's give it a try. The Earwig (talk) 21:18, 30 May 2022 (UTC)
- Trial complete. [1] 0xDeadbeef 04:57, 31 May 2022 (UTC)
- Deadbeef, checked one edit and noticed the Wayback link actually works with the tracker removed. Who knew. After all that above :) Wayback magic. But can't say this holds true for every link, it's the kind of thing would have to verify with a header check on the Wayback link with tracking removed. It would be like an added feature to the bot, only if you wanted to try. - GreenC 06:18, 31 May 2022 (UTC)
- So I tried querying the wayback machine api to fix archive.org URLs: [2] Looking at the preview of the bot's edits, it looks fine. Perhaps it needs an extended trial? 0xDeadbeef 08:01, 31 May 2022 (UTC)
- (@The Earwig) 0xDeadbeef 11:52, 4 June 2022 (UTC)
- That's great, as it checks there is a copy in the API, it should be good to go. - GreenC 15:35, 4 June 2022 (UTC)
- {{BAG assistance needed}} 0xDeadbeef 05:28, 12 June 2022 (UTC)
- Approved. @0xDeadbeef: Thanks for your patience. Edits look good. I am fine with the expanded functionality for Wayback links and don't see a need for an extra trial provided you monitor these changes. The Earwig (talk) 02:35, 13 June 2022 (UTC)
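The Wayback check mentioned near the end of the discussion might be sketched as follows. The thread does not say which API the bot queries, so this assumes the public Wayback Machine availability endpoint at archive.org/wayback/available; the function names are hypothetical, and the sample payload only mirrors the general shape of that endpoint's JSON response.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the public Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str, timestamp: Optional[str] = None) -> str:
    """Build the availability request for url, optionally near a timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the closest snapshot URL from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Query the API over the network; defined here but not called in this sketch."""
    with urlopen(availability_query(url, timestamp), timeout=10) as response:
        return closest_snapshot(json.load(response))


# Hand-written sample payload for illustration, not real API output.
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20220531000000/https://twitter.com/a/status/1",
            "timestamp": "20220531000000",
            "status": "200",
        }
    }
}
print(closest_snapshot(sample))
print(closest_snapshot({"archived_snapshots": {}}))
```

Under this approach an archive link would only be rewritten when a snapshot of the cleaned URL is reported as available, in line with the comment above that the bot checks for a copy in the API.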
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at Wikipedia:Bots/Noticeboard.