Page MenuHomePhabricator

RFC: Make some aspects of Tidy's whitespace stripping behavior part of wikitext parsing "spec"
Closed, ResolvedPublic

Description

Revised RFC with open questions

In HTML5, whitespace is considered significant since CSS rules can affect rendering. This affects wikitext parsing because the parser needs to figure out how to handle whitespace in source wikitext. As of today, the PHP parser doesn't do anything special with whitespace in most cases (with caveats that I'll skip here). But the parser tests do mostly treat whitespace as being significant. However, since the core PHP parser is backed by Tidy, we end up having to look at the combined effect of (PHP parser + Tidy) . Tidy strips whitespace in *all* HTML tags. While this is buggy behavior from a HTML5 point of view, editors have come to rely on some aspects of this whitespace stripping. Most notably, CSS for horizontal lists (hlist CSS) used in isolation and in navboxes rely on whitespace in nested lists to have been stripped out.

In a post-Tidy world of Remex and Parsoid that are based on HTML5, there is no automatic whitespace stripping anymore. This affects rendering of hlists and navboxes as a result. There are at least 3 possibilities in front of us

  1. get editors to fix their wikitext by eliminating whitespace where it is significant -- this proposal has the drawback of being too invasive and requires regular editors to be aware of how some other editors might style that content, especially when the nested lists come through template parameters. This is acceptable if the impact is limited, but at first blush, it appears that this is used widely.
  2. rely on javascript to strip whitespace for these inline-styled lists -- this is potentially acceptable but doesn't work if javascript is turned off
  3. make the core wikitext spec whitespace-insensitive for all or for some subset of the native wikitext constructs.

At a first cut, it would be useful for an RFC discussion to evaluate the merits of these proposals -- in reality proposal 2 vs proposal 3. I am equally in favour of both 2 and 3.

Assuming proposal 3 is the favoured approach, here are further decisions that need to be made. There are 3 sub-proposals:

  • 3a. All wikitext constructs AND html tags will be whitespace insensitive (caveat: except existing whitespace behavior in parser functions) -- I personally do not favour this proposal because this introduces an egregious inconsistency with the HTML5 spec .. i.e. html tags in wikitext documents behave differently compared to everywhere else, and also can introduce subtle problems when html content from elsewhere is pasted.
  • 3b. Only native wikitext constructs (non-html tags) will be whitespace insensitive - so * # : ; {| |} | ! || !! '' ''' [[..]]
  • 3c. Whitespace within these native wikitext constructs will be trimmed - so * # : ; {| |} | ! || !! = |+ . Effectively, this means that table cells, table captions, table headings, list items, and headings will have their whitespace trimmed.

I think only proposals 3b and 3c above are really viable. I personally prefer proposal 3c.

Summary

It would be useful for the TechCom to weigh in on the two open questions both by itself, and via an IRC discussion. We want to find out any concerns and unresolved issues with these proposals.

Original Description (with context)

In T155634: Tidy strips whitespace after HTML tags AND adds newlines between HTML tags, we discovered that both Parsoid and Tidy replacements introduce rendering differences with the existing rendering because of Tidy's HTML4-era behavior of stripping white-space from HTML tags. i.e. <some-html-tag> foo </some-html-tag> renders as <some-html-tag>foo</some-html-tag> after it goes through Tidy. In the HTML5 landscape, this is buggy / broken behavior because CSS rules like display:inline or white-space:nowrap will render the two forms differently.

Our initial inclination was to break this Tidy behavior and require editors to fix pages that use the white-space sensitive CSS rules so that they get the desired rendering that Tidy provides. While sensible, courtesy @Izno and @Quiddity, two competing concerns that have surfaced. (1) There are some heavily used templates like hlist and navbox that would require fixing up a lot of pages across all wikis, and automated fixups can be a little bit more tricky. Even if done, there is the following concern which is: (2) Editors seem to prefer adding white-space in their native wikitext markup for readability reasons.

Given this situation, one option would be to treat native *block* wikitext constructs (lists, headings, tables) as being insensitive to leading/trailing whitespace. So, this whitespace will be stripped as part of parsing and will not make it to the final serialized HTML output. However, *inline* wikitext constructs (italics, bold, links) will continue to treat whitespace as sensitive. Separately, we will continue to treat HTML tags as whitespace sensitive and whitespace in there will not be stripped. That would be reaching too far.

This does not affect how editors write / use markup. This is a transparent step to ameliorate rendering differences with Parsoid and a Tidy replacement. We think this can be done in both Parsoid and PHP parser relatively easily.

However, I am filing this ticket to gather any concerns / problems that we might be overlooking. This task will not block the pilot deployment of the Tidy replacement. But, a final disabling of Tidy will be blocked on resolving this ticket one way or another.

Related Tasks: T155634, T183173, T47803, T53004

Related Objects

Mentioned In
T375319: newlines not trimmed from elements in default parser, breaking CSS selector :empty
T257651: Reply tool messes up replying to users with signatures followed by html comments
T237241: Implement basic book referencing
T195486: html2wt of existing list items, headings, table cells
T195414: Whitespace normalized outside of modified tree
T194658: Empty table cells aren't serialized correctly
T194763: stripUnnecessaryIndentPreNowikis in wikitext serializer is over-aggressive and can introduce semantic diffs
T195174: Edge case tokenizing difference
T195317: Comment prevents category from being migrated out of list
T194777: Category links are swallowed into definition list output even when they are outside the list
T155634: Tidy strips whitespace after HTML tags AND adds newlines between HTML tags
T190534: When trimming whitespace, trim unicode whitespace as well.
T183173: VE's whitespace normalization code in headings is being tripped up by html5 fallback ids in Parsoid's output for headings
Mentioned Here
T195414: Whitespace normalized outside of modified tree
rGPARdccfeafde935: Protect indented table syntax from indent pre parsing
T194658: Empty table cells aren't serialized correctly
T194763: stripUnnecessaryIndentPreNowikis in wikitext serializer is over-aggressive and can introduce semantic diffs
T194777: Category links are swallowed into definition list output even when they are outside the list
T195174: Edge case tokenizing difference
T195317: Comment prevents category from being migrated out of list
T177102: Please consider inserting a blank line before and after lists, because it's pretty
T153265: Language converter source text and language names cannot use <nowiki> escaping.
T53004: Exclude outer whitespace from headings and list items
T183173: VE's whitespace normalization code in headings is being tripped up by html5 fallback ids in Parsoid's output for headings
T47803: Whitespace between == == headers should be insignificant
T155634: Tidy strips whitespace after HTML tags AND adds newlines between HTML tags

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

The space between the heading and the "[edit]" link would be larger.

image.png (41×434 px, 4 KB)
image.png (41×434 px, 4 KB)

I see ... because the Edit links are part of the <H*> tag.

Based on @ssastry's comments in T183173, it seems something related to this is causing spaces to appear before section headers in the visual editor when syntax of the form == Header == is used. This is a big change in functionality which is unexpected from the user's perspective, and needs to be changed or fixed.

Please forgive me if I have misunderstood anything. I still learning how the technical stack for the visual editor works. Please do correct me and explain it to me like I am five! :-)

Recapping some discussion in IRC:

<subbu> TL:DR; (a) parsoid doesn't strip whitespace in its output (b) VE is stripping that whitespace maybe (c) parsoid added html5 ids with fallback ids in some cases (d) the fallback id is interfering with (b) and cause leading whitespace being displayed.
<cscott> yeah, i think that's a good summary, esp since the visual effect only seems to happen in sections with diacritics.

<cscott> https://gerrit.wikimedia.org/r/395831 is the VE side patch to do the stripping
<subbu> so, that explains why no one noticed those leading spaces that Prsoid leaves behind ... because looks like VE is already stripping that whitespace in its code :)
<cscott> https://gerrit.wikimedia.org/r/#/c/395831/5/modules/ve-mw/init/ve.init.mw.Target.js is where we strip the fallback IDs

<subbu> anyway two different solution paths: (a) immediate term: fix the regression in VE with the introduction of fallback ids in parsoid output (T183173) (b) long term: address T157418 in Parsoid & PHP Parser
<subbu> @Deskana, i recommend (a) for now :) .. because (b) is not happening right away.

<cscott> i'm generally in favor always of stripping out exhaustive data-parsoid representations of meaningless whitespace and just stripping/normalizing it, then using selser to preserve it when unedited.
<subbu> yes, if we do this whitespace stripping part (in headings, list items, table cells) part of the spec, we won't represent it in data-parsoid.

<subbu> anyway, parsoid doesn't have that same rendering wrt whitespace in headings because we weren't sure if it was a bug or a feature in the core output. but, looks like we are treating it as a feature at this point. :)
<subbu> and our position is that if we do it for one construct, we do it for all native block-wikitext constructs.
<cscott> "with the caveat that intentional whitespace in these constructs should be preserved with <nowiki> (or other escaping)"
<cscott> i'm assuming that nowiki escaping of whitespace works for all of these constructs (except language converter, where i know it doesn't: T153265)
<cscott> but if we find other constructs where you can't represent leading/trailing whitespace *even if you wanted to* i think that case should be studied harder
(in general I think it should be possible at a spec level to represent any HTML with some wikitext string, even if we scrub some of these "that the editors couldn't possibly have meant" with our scrubWikitext mode)

<subbu> so, VE folks are asking for whitespace stripping in headings ot be part of the spec
<subbu> izno is asking that for list items.
<subbu> so, that leaves table cells .. i think we should just make it the default behaviour there as well.
<subbu> are there any other wikitext constructs left?
<cscott> template arguments have weird whitespace stripping
<subbu> right .. and there is this weird diff between templates and parser functions.
<cscott> anyway, i think we agree that templates/parser functions are related but probably not as easy a decision
<cscott> [[ File:Foo.jpg | caption = blah ]]
<cscott> same minor issue there where you can't represent leading newlines in a media caption, but i don't think that's been complained about
<cscott> language converter too: -{ en-gb : foo ; en-uk: bar }-
<cscott> all that whitespace is stripped I believe (i should double-check)

ssastry renamed this task from Make some aspects of Tidy's whitespace stripping behavior part of wikitext parsing "spec" to RFC: Make some aspects of Tidy's whitespace stripping behavior part of wikitext parsing "spec".Feb 16 2018, 1:05 AM
ssastry raised the priority of this task from Medium to High.
ssastry added a project: TechCom-RFC.

Note to TechCom: we need to tackle this soon -- see https://www.mediawiki.org/wiki/Topic:U7ovc7uujsvcwb8f. This might benefit from a relatively early TechCom discussion and potentially IRC office hour to figure out what the spec ought to be. If the TechCom is agreeable, I can update this task that summarizes the proposal along with some open questions that need to be resolved so that we can implement them in the PHP parser and Parsoid.

TL;DR Strip all whitespace inserted for source readability in the long run.

  • Whitespace inserted for source readability is whitespace inserted for source readability and nothing else.
  • Intermediate solutions might become necessary, Linter detection of dubious cases might help.

German Wikipedia is inserting spaces around almost every headline you might find anywhere, and rarely you find list items with no space after * and table syntax is widely grouped:

== Example ==
# First item
| col1 || col2 ||colspan="2"| col3+4

We won’t expect such things being preserved and influencing anything.

(BTW, it would be very kind if written back by VE, perhaps per site config)

If I want to keep whitespace it is my business to preserve it:

(Something{{#if: {{{other|}}} | &#32;different }})

Note that inside <pre> elements whitespace is supposed to be preserved, which also could be emulated by style="white-space:pre" and even worse class="preformatted".

In general,

xxx<tag>  inside  </tag>yyyy

should be treated as

xxx <tag>inside</tag> yyyy

(I think that has been the Tidy approach) with one or two exceptions (actually more as block rather than inline):

<syntaxhighlight>
<pre>

We are not writing in Whitespace, aren’t we?

In general,

xxx<tag>  inside  </tag>yyyy

should be treated as

xxx <tag>inside</tag> yyyy

No, moving whitespace around (rather than preserving it in the wikitext source) causes a large number of bugs with Tidy and would continue to do so without Tidy. It's also rather clearly outside the scope of this RFC, which speaks only to stripping.

That aside, I don't know that I agree w.r.t. anything other than (certain) HTML tags--and their wikitext equivalent.

TL;DR Strip all whitespace inserted for source readability in the long run.

  • Whitespace inserted for source readability is whitespace inserted for source readability and nothing else.
  • Intermediate solutions might become necessary, Linter detection of dubious cases might help.

In the description, I make a distinction between (a) native wikitext constructs used for "block" content (b) native wikitext constructs used for "inline" content (c) HTML tags.

In HTML5, whitespace is significant and can be affected by CSS. In fact, ironically, the reason this ticket exists in the first place is because of this behavior. Editors relied on Tidy stripping whitespace for styling hlists, and since a HTML5-compliant parser like Remex stops stripping whitespace, that became a problem. So, given that background, it does not make sense to treat HTML tags differently, i.e. have wikitext spec treat whitespace in HTML tags as insignificant. I think that is confusing if someone adds whitespace to HTML tags and write CSS to take advantage of that - there is no reason to break the html5 + css model.

That said, a good compromise would be to treat native wikitext constructs as whitespace insensitive. The decision to make is whether this should be restricted to block constructs like (headings, lists, tables) or carry over to inline constructs like quotes and links.

Transclusions and parser functions have their own inconsistent whitespace rules which we are not going to touch in this proposal.

(BTW, it would be very kind if written back by VE, perhaps per site config)

In many cases, Parsoid has been adding whitespace around new content for lists, headings, tables. There is T177102: Please consider inserting a blank line before and after lists, because it's pretty which will be addressed one of these days. If there are other areas where Parsoid is not doing the right thing, maybe file a separate phab task? That would be helpful.

First respone, before answering in detail: There is one more scenario.

  • There are native Wiki tags, and they are for a dozen of years used to be stripped, and no HTML 4/5 law applies.

Consider

<references>
<ref name="this">
Something, whaffle, whaffle, and lentgthy quotation.
</ref>
<ref name="that">
Another thing, and very long as well, some 1000 characters.
</ref>
</references>

Traditionally all things were stripped. Migration to HTML5 might cause a rupture in Tidy Wikis waking up in a Remex Wiki, which should be spoiled by Linter (there is already analysis for nowrap issues where it became obvious).

Inline issues are mostly invisible:

Some '' italic '' thing.

However, some are not in HTML5 inline and will disturb:

A  <code>code </code> example.
Line <u> me </u> below.

At least for wiki elements the traditional stripping should be kept:

([http://www.example.org text ])
([[ Link]])

While we might follow HTML5 rules for HTML elements, and correct via Linter where undesired visible effects occur, for our own syntax we should maintain our grandparents policy which always generated stripped elements rather than replacing '' by <i> or </i> exactly in the same place.

Note also that VisualEditor does not move selected whitespace at the end of a word outside an enclosed area, but selection of a word by doubleclick is including trailing space before next word in many environments. This might cause undesired behaviour, too.

First respone, before answering in detail: There is one more scenario.

  • There are native Wiki tags, and they are for a dozen of years used to be stripped, and no HTML 4/5 law applies.

For extension tags, none of this applies -- extensions are responsible for interpreting their content, including whitespace sensitivity. So, as long as Cite extension doesn't change, behavior of <ref>, <references> will not be affected.

Inline issues are mostly invisible:

Some '' italic '' thing.

However, some are not in HTML5 inline and will disturb:

A  <code>code </code> example.
Line <u> me </u> below.

At least for wiki elements the traditional stripping should be kept:

([http://www.example.org text ])
([[ Link]])

So, you are effectively arguing for proposal 3b. from the description of this task. Noted.

yeah.

To be more precise – “stripping” does mean: Throw the whitespace from inside to outside, as Tidy did/does.

some ''italic ''thing

results in the same as

some ''italic'' thing

(at least required for inline flow).

For block elements (tables, lists) I do not see any expectation that inside whitespace would change behaviour outside. There “stripping” may be taken literally.

Or TL;DR

  • Maintain 2001/Tidy appearance for Wikisyntax.
  • For HTML elements, start Linter procedures for visible changes as of <code> and <u>, later keep whitespace by HTML5 rules.

yeah.

To be more precise – “stripping” does mean: Throw the whitespace from inside to outside, as Tidy did/does.

some ''italic ''thing

results in the same as

some ''italic'' thing

(at least required for inline flow).

We will not support this "bug/feature" of moving around whitespace. This is https://www.mediawiki.org/wiki/Help:Extension:Linter/tidy-whitespace-bug linter category.

However, that category only identifies pages where that whitespace shifting makes a difference in rendering, i.e. when there are whitespace CSS rules in effect (the linter flagging is a heuristics based effort since it looks for that CSS rule / class name). In the common case where whitespace CSS rules don't apply, this doesn't affect rendering which you can verify by looking at this HTML in a browser.

<div>a <i>x</i> b</div>
<div>a<i> x</i> b</div>
<div>a <i>x </i>b</div>
<div>a<i> x </i>b</div>

So, between the linter category and the normal html5 rendering rules, this Tidy whitespace shifting behaviour is covered and doesn't need to be supported going forward.

Or TL;DR

  • Maintain 2001/Tidy appearance for Wikisyntax.
  • For HTML elements, start Linter procedures for visible changes as of <code> and <u>, later keep whitespace by HTML5 rules.

Noted for consideration in today's IRC discussion.

It has been for precision. The wording “stripping whitespace in inline wikisyntax” would literally mean, that

some ''italic ''thing
some [[#target|internally linked ]]thing

would be stripped as

some ''italic''thing
some [[#target|internally linked]]thing

and at least the second one could require memorizing the trailing whitespace, I presume. Depends on the link generation procedure, but is not simply replacing '' by end tag.

It has been for precision. The wording “stripping whitespace in inline wikisyntax” would literally mean, that

some ''italic ''thing
some [[#target|internally linked ]]thing

would be stripped as

some ''italic''thing
some [[#target|internally linked]]thing

and at least the second one could require memorizing the trailing whitespace, I presume. Depends on the link generation procedure, but is not simply replacing '' by end tag.

Oh right ... I see what you are saying now. So, that is actually a very good reason for why proposal 3c is the better solution. :-)

Change 415761 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] WIP: Proof of Concept implementation of RFC proposed in T157418

https://gerrit.wikimedia.org/r/415761

Change 415770 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/core@master] WIP: Proof of Concept implementation of RFC proposed in T157418

https://gerrit.wikimedia.org/r/415770

This is being moved to Last Call with option 3c being the proposed solution.

3c. Whitespace within these native wikitext constructs will be trimmed - so * # : ; {| |} | ! || !! = |+ . Effectively, this means that table cells, table captions, table headings, list items, and headings will have their whitespace trimmed.

Last Call will close next week on 2018-03-14 at 3pm PST (21:00 UTC, 22:00 CET)

https://gerrit.wikimedia.org/r/#/c/415770/ is the PHP parser patch and is now ready for initial review. I'll update the Parsoid patch tomorrow.

Change 415770 merged by jenkins-bot:
[mediawiki/core@master] RFC T157418: Trim whitespace in table cells, list items, headings

https://gerrit.wikimedia.org/r/415770

Is this now Resolved?

Need to fix that in Parsoid as well. Plus, publish the change on wiki -- need to figure out what the right place is .. perhaps mw:Specs/Wikitext. Once that is done, this can be resolved.

Change 415761 merged by jenkins-bot:
[mediawiki/services/parsoid@master] RFC T157418: Trim whitespace in wikitext headings, lists, tables

https://gerrit.wikimedia.org/r/415761

ssastry claimed this task.

Now implemented in both PHP parser and Parsoid (Parsoid code is yet to be deployed).

The change is documented at https://www.mediawiki.org/w/index.php?title=Specs/wikitext/1.0.0&type=revision&diff=2777139&oldid=2209985&diffmode=source

Change 433004 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Revert "RFC T157418: Trim whitespace in wikitext headings, lists, tables"

https://gerrit.wikimedia.org/r/433004

Change 433004 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Revert "RFC T157418: Trim whitespace in wikitext headings, lists, tables"

https://gerrit.wikimedia.org/r/433004

Our round trip testing infrastructure compares roundtripped wikitext with original wikitext. In order to figure out which diffs are syntactic and which are semantic, it does diffing with a bunch of normalizations and other tricks. But, those tricks are proving insufficient with the kind of normalization that is happening with this whitespace trimming. So, we need to update our round-trip diffing strategy. In order to not block other Parsoid deployments (since all deployments have to go through round trip testing and not have any unexplainable semantic diffs), I am reverting this patch till then.

Change 433004 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Revert "RFC T157418: Trim whitespace in wikitext headings, lists, tables"

https://gerrit.wikimedia.org/r/433004

Our round trip testing infrastructure compares roundtripped wikitext with original wikitext. In order to figure out which diffs are syntactic and which are semantic, it does diffing with a bunch of normalizations and other tricks. But, those tricks are proving insufficient with the kind of normalization that is happening with this whitespace trimming. So, we need to update our round-trip diffing strategy. In order to not block other Parsoid deployments (since all deployments have to go through round trip testing and not have any unexplainable semantic diffs), I am reverting this patch till then.

Is there a task for that separate work (add as blocker to this task)?

Our round trip testing infrastructure compares roundtripped wikitext with original wikitext. In order to figure out which diffs are syntactic and which are semantic, it does diffing with a bunch of normalizations and other tricks. But, those tricks are proving insufficient with the kind of normalization that is happening with this whitespace trimming. So, we need to update our round-trip diffing strategy. In order to not block other Parsoid deployments (since all deployments have to go through round trip testing and not have any unexplainable semantic diffs), I am reverting this patch till then.

Is there a task for that separate work (add as blocker to this task)?

No .. we are likely going to get it fixed in a day or so. So, didn't bother.

Change 433010 had a related patch set uploaded (by Subramanya Sastry; owner: Subramanya Sastry):
[mediawiki/services/parsoid@master] Revert "Revert "RFC T157418: Trim whitespace in wikitext headings, lists, tables""

https://gerrit.wikimedia.org/r/433010

Change 433010 merged by jenkins-bot:
[mediawiki/services/parsoid@master] Revert "Revert "RFC T157418: Trim whitespace in wikitext headings, lists, tables""

https://gerrit.wikimedia.org/r/433010

Does this task have anything to do with the fact that

== Innehåll ==

is changed to

==Innehåll==

in https://sv.wikipedia.org/w/index.php?diff=42944724?

Okay! It took over a month to work out all the kinks in the Parsoid ecosystem arising from this change to wikitext.

After a bunch of false starts on the Parsoid end (two back to back reverts .. the last and only previous Parsoid revert was in 2013, I think) trying to prevent dirty diffs from VE edits from this whitespace trimming, I think we finally nailed this. We also bumped Parsoid's HTML version number to 1.7.0 to properly deal with whitespace differences in what VE edits and what Parsoid now handles.

This is live on the Parsoid cluster as of y'day.

Change 437516 had a related patch set uploaded (by Jforrester; owner: Jforrester):
[mediawiki/services/mobileapps@master] tests: Update expected HTML for T157418 whitespace changes

https://gerrit.wikimedia.org/r/437516

Change 437516 merged by jenkins-bot:
[mediawiki/services/mobileapps@master] tests: Update expected HTML for T157418 whitespace changes (v1.7.0)

https://gerrit.wikimedia.org/r/437516