-
Notifications
You must be signed in to change notification settings - Fork 808
Customize Spliterator implementations for better parallelism #290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| /** | ||
| * Get this collection's spliterator. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add @since POI 5.2.0 on any new public methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
poi-ooxml/src/main/java/org/apache/poi/xssf/streaming/SXSSFWorkbook.java
Show resolved
Hide resolved
| import java.util.Map; | ||
| import java.util.NoSuchElementException; | ||
| import java.util.Set; | ||
| import java.util.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you keep the explicit imports and remove the * import?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, this was automatically formatted by my IDE, will remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| * | ||
| * @return a spliterator of the sheets. | ||
| */ | ||
| default Spliterator<Sheet> sheetSpliterator() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe, we only need spliterator() - I know we have 2 iterator methods but that is some legacy stuff - I don't think new stuff needs to add 2 separate methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will remove it, I agree its better to have 1 method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
d6ccaba to
5640530
Compare
|
@daniel-shuy generally looks good - but could you add some test coverage for some of the new methods? We don't necessarily need full coverage but some regression tests would be a pre-req for a merge. |
5640530 to
6130a24
Compare
| */ | ||
| @Override | ||
| public Spliterator<Sheet> spliterator() { | ||
| return new SheetSpliterator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this better than Spliterators.spliterator(sheets, Spliterator.ORDERED)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its actually the same, because SheetSpliterator is delegating to sheets.spliterator(), which is actually calling Spliterators.spliterator(this, Spliterator.ORDERED).
IMO its still better to delegate to the backing Collection's spliterator(), so that if the backing Collection changes, there is no risk of the Spliterator's characteristics going out of sync.
I added SheetSpliterator to be consistent with sheetIterator(), which creates an instance of SheetIterator to delegate sheets.iterator(). But now that I think about it, its probably better to simply do a cast, reducing an object instantiation, eg.
@Override
@SuppressWarnings("unchecked")
public Spliterator<Sheet> spliterator() {
return (Spliterator<Sheet>)(Spliterator<? extends Sheet>) sheets.spliterator();
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you'll need to cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately sheets.spliterator() returns Spliterator<XSSFSheet>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you try omitting the cast and see if it compiles? I just added spliterators in another project and was suprised that I didn't need to cast but it worked without the cast - pjfanning/excel-streaming-reader@4ac5fb8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried, it doesn't work. Using Spliterators.spliterator(this, Spliterator.ORDERED) avoids having to cast because Spliterators.spliterator(Collection, int) takes in Collection<? extends T> and returns Spliterator<T>. If only Iterable<T>#spliterator() returned Spliterator<? extends T> instead of Spliterator<T> 🤦♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair enough - use the cast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Iterable#spliterator() default implementation has poor splitting capabilities, is unsized, and does not report any spliterator characteristics
6130a24 to
2077ac1
Compare
2077ac1 to
e367af3
Compare
Why
There are many classes in Apache POI that implements Iterable, but only implements Iterable#iterator() without overriding Iterable#spliterator().
The default implementation of Iterable#spliterator() is:
Spliterator#spliteratorUnknownSize(Iterator, int) returns a
Spliteratorwith no initial size estimate, which has poor splitting capabilities. The default implementation has to do this because not allIterables have a fixed size.This is important when trying to convert the
Iterableto a parallelStream. For example, when trying to process Rows in a XSSFSheet (XSSFSheetimplementsIterable<Row>) in parallel, the performance is terrible:The
XSSFSheet'sIteratoris backed by a TreeMap#values():poi/poi-ooxml/src/main/java/org/apache/poi/xssf/usermodel/XSSFSheet.java
Line 95 in cf1354d
poi/poi-ooxml/src/main/java/org/apache/poi/xssf/usermodel/XSSFSheet.java
Lines 2049 to 2067 in cf1354d
TreeMap#values()returns a Collection, and thus has a fixed size. ItsSpliteratoris hence sized and has good splitting capabilities, allowing it to be properly parallelized. Therefore an easy way to customizeXSSFSheet#spliterator()is to simply delegate to the underlyingTreeMap#values(), eg.How
Iteratorfactory method with a backingCollection/Iterable, I simply delegated theSpliteratorfactory method to the underlyingCollection/Iterable.Iteratorfactory method but do not have a backingCollection/Iterable, but have a fixed size, I customized theSpliteratorfactory method to return a sizedSpliteratorusing Spliterators#spliterator(Iterator, long, int).Iteratorfactory method but do not have a fixed size remain unchanged.I also took the liberty to perform the following refactors:
iterator()implementation in child classes toWorkbook/Sheet/Rowinterface usingdefaultmethods (theiterator()method is an alias ofsheetIterator()/rowIterator()/cellIterator(), and is currently being duplicated in all child classes)Iterableinterface toIntMapperandXDDFTextParagraph(so that they can be iterated using an enhancedforloop)None of the changes break source/binary compatibility.
For anyone interested, the current workaround is create a sized
Spliteratorusing Sheet#getPhysicalNumberOfRows(), eg.This however, is still less optimal than using the backing
Collection'sSpliterator.