
Conversation


@kylasa kylasa commented Mar 13, 2023

Description

While benchmarking the felarge, webgraph, and searchctr graphs, we observed that the memory bottleneck is DGL graph creation in the convert_partition.py module. Currently the felarge graph cannot be split into fewer than 12 partitions, and webgraph cannot be split into fewer than 32 partitions.

To overcome this issue and increase the size of the graph partition per node, this PR makes the following code changes:

  1. Refactor the code in convert_partition::create_dgl_object into smaller functions, for the reasons below.
  2. NumPy buffers are more reliably freed by Python's garbage collector when their allocation is wrapped in a function: on function exit the collector reclaims them and the memory becomes available to the system.
  3. The large function that creates the DGL object is split into three smaller functions. Each one processes its share of the data and returns results that are later used to create the DGL graph object. After each function returns, its intermediate data can be deleted so the memory is freed back to the system.
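The staged pattern described above can be sketched as follows (illustrative function and variable names only, not the actual convert_partition.py code):

```python
import gc

import numpy as np


def load_edges(num_edges):
    # Stage 1: build raw edge arrays.  Temporaries allocated here are
    # reclaimed when the function returns.
    raw = np.random.randint(0, 1000, size=(num_edges, 2))
    return raw[:, 0].copy(), raw[:, 1].copy()


def shuffle_ids(src, dst, offset):
    # Stage 2: remap local ids to shuffled global ids.
    return src + offset, dst + offset


def build_graph_inputs(num_edges, offset):
    src, dst = load_edges(num_edges)
    shuffled_src, shuffled_dst = shuffle_ids(src, dst, offset)
    # Stage 3: drop the intermediates before the memory-hungry graph
    # construction step so their buffers return to the system.
    del src, dst
    gc.collect()
    return shuffled_src, shuffled_dst
```

Each stage returns only what the next stage needs, so the peak resident set is bounded by one stage's working set rather than the whole pipeline's.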

No additional unit tests are added in this PR because the existing end-to-end test cases in the unit-test framework are sufficient to cover this functional refactoring of the code.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small, read the Google eng practice (CL equals to PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change small (examples, tests and documentation may be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@kylasa kylasa requested a review from Rhett-Ying March 13, 2023 23:02
@kylasa kylasa self-assigned this Mar 13, 2023

dgl-bot commented Mar 13, 2023

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master


dgl-bot commented Mar 13, 2023

Commit ID: 8cb0bb4

Build ID: 1

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link


dgl-bot commented Mar 13, 2023

Commit ID: ebbcc03

Build ID: 2

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@frozenbugs frozenbugs requested review from frozenbugs and removed request for Rhett-Ying March 14, 2023 07:44
@frozenbugs

Is there a way to verify the following statement in the unit test?

NumPy buffers are more reliably freed by Python's garbage collector when their allocation is wrapped in a function: on function exit the collector reclaims them and the memory becomes available to the system.
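One way to check this in a unit test is with a weak reference to the intermediate buffer: once the wrapping function returns and the collector runs, the weak reference should be dead. A minimal sketch (the helper name is hypothetical, not from convert_partition.py):

```python
import gc
import weakref

import numpy as np


def build_small_result(n):
    # Large intermediate buffer allocated inside the function scope.
    big = np.zeros(n, dtype=np.int64)
    small = big[:10].copy()  # a copy, not a view, so `big` can die
    return small, weakref.ref(big)


def test_intermediate_buffer_is_freed():
    small, big_ref = build_small_result(1_000_000)
    gc.collect()
    # The weak reference is dead once the buffer has been reclaimed.
    assert big_ref() is None
    assert small.shape == (10,)
```

Returning a copy rather than a view is what lets the large buffer die; a view would keep the whole buffer alive.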

offset to be used when assigning edge global ids in the current partition
return_orig_ids : bool, optional
Indicates whether to return original node/edge IDs.
schema : json dictionary

Can you follow the DGL convention, e.g.

labels: dict[str, Tensor]
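Applied to this docstring, the suggested convention might look like the sketch below (the parameter set and return values are illustrative, not the full signature):

```python
def create_dgl_object(schema, part_id, return_orig_ids=False):
    """Create the DGL graph object for one partition (docstring sketch).

    Parameters
    ----------
    schema : dict
        Dictionary created by reading the metadata.json file for the
        current dataset.
    part_id : int
        Id of the partition being processed.
    return_orig_ids : bool, optional
        Indicates whether to return original node/edge IDs.

    Returns
    -------
    etypes_map : dict[str, int]
        Map between edge type (str) and edge_type_id (int).
    orig_nids : dict[str, Tensor]
        Present only when ``return_orig_ids=True``; keys are node types.
    """
```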


Same for others.


and other types, such as numpy array -> numpy.array

schema : json dictionary
dictionary object created by reading the metadata.json file for the
current dataset
part_id : integer

int

return_orig_ids : bool, optional
Indicates whether to return original node/edge IDs.
schema : json dictionary
dictionary object created by reading the metadata.json file for the

Capitalize the first char.
Same for others.

numpy array :
shuffle_global_nids, assigned after the data-shuffling phase, for the
nodes in the current partition
tuple :

tuple of what? follow the convention: tuple[blablabla]

gc.collect()
logging.info(
f"There are {len(shuffle_global_src_id)} edges in partition {part_id}"
f"[Rank: {part_id}] There are {len(shuffle_global_src_id)} edges in partition {part_id}"

too long, over 80 chars.
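A common way to stay under the limit is implicit concatenation of adjacent f-strings (placeholder data below, just to show the wrapping):

```python
import logging

part_id = 0
shuffle_global_src_id = [10, 11, 12]  # placeholder edge ids

# Adjacent f-strings are concatenated, keeping each source line
# under 80 characters.
msg = (
    f"[Rank: {part_id}] There are "
    f"{len(shuffle_global_src_id)} edges in partition {part_id}"
)
logging.getLogger(__name__).info(msg)
```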

shuffle_global_dst_id = None
global_src_id = None
global_dst_id = None
# global_edge_id = None

ditto.

return_orig_eids=False,
):
"""

unnecessary blank line.

Returns:
--------
dgl object

which object?


please fix the types above and below according to the comment.

dictionary
map between edge type(string) and edge_type_id(int)
dict of tensors
If `return_orig_nids=True`, return a dict of 1D tensors whose key is the node type

over 80 chars.

dstids = dstids - nids[0] + offset

# return the values
return uniques, idxes, srcids, dstids

[Optional] Since you changed almost 80% of the file, I'd suggest also updating the other pieces to follow the convention.

)
memory_snapshot("ShuffleGlobalID_Lookup_Complete: ", rank)

def prepare_local_data(src_data, local_part_id):

Add a unit test for this? Make sure the src_data entries are properly popped?
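A test along these lines could look like the sketch below (the stand-in prepare_local_data only mimics the pop-on-consume behaviour being asked about; the real function's signature may differ):

```python
def prepare_local_data(src_data, local_part_id):
    # Hypothetical stand-in: consume and remove this partition's entry
    # so its memory can be reclaimed.
    return src_data.pop(local_part_id)


def test_prepare_local_data_pops_source():
    src_data = {0: "part0-features", 1: "part1-features"}
    out = prepare_local_data(src_data, 0)
    assert out == "part0-features"
    assert 0 not in src_data  # the consumed entry was removed
```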

REV_DATA_TYPE_ID,
)

DATA_TYPE_ID = {

What's the difference between this DATA_TYPE_ID and utils.DATA_TYPE_ID? Why do you need to create a new one?

... as part of improving the unit-test coverage of the data pre-processing pipeline.
