
Conversation


@kylasa kylasa commented Mar 13, 2023

Description

While benchmarking the felarge, webgraph, and searchctr graphs, we observed that the memory bottleneck is DGL graph creation in the convert_partition.py module. Currently the felarge graph cannot be split into fewer than 12 partitions, and webgraph cannot be split into fewer than 32 partitions.

To overcome this issue and increase the size of the graph partition per node, this PR makes the following code changes:

  1. Refactor the code in convert_partition::create_dgl_object into smaller functions, for the reasons below.
  2. NumPy buffers are more reliably freed by Python's garbage collector when their allocation is wrapped in a function: on function exit the collector reclaims them and the memory becomes available to the system.
  3. The large function that creates the DGL object is split into three smaller functions. Each one processes its share of the data and returns results that are later used to create the DGL graph object. After each function returns, its intermediate data can be deleted so the memory is freed back to the system.
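The staged pattern described above can be sketched as follows (illustrative function and variable names only, not the actual convert_partition.py code):

```python
import gc

import numpy as np


def load_edges(num_edges):
    # Stage 1: build raw edge arrays.  Temporaries allocated here are
    # reclaimed when the function returns.
    raw = np.random.randint(0, 1000, size=(num_edges, 2))
    return raw[:, 0].copy(), raw[:, 1].copy()


def shuffle_ids(src, dst, offset):
    # Stage 2: remap local ids to shuffled global ids.
    return src + offset, dst + offset


def build_graph_inputs(num_edges, offset):
    src, dst = load_edges(num_edges)
    shuffled_src, shuffled_dst = shuffle_ids(src, dst, offset)
    # Stage 3: drop the intermediates before the memory-hungry graph
    # construction step so their buffers return to the system.
    del src, dst
    gc.collect()
    return shuffled_src, shuffled_dst
```

Each stage returns only what the next stage needs, so the peak resident set is bounded by one stage's working set rather than the whole pipeline's.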

No additional unit tests are added in this PR because the existing end-to-end test cases in the unit-test framework are sufficient to cover this functional refactoring of the code.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small, read the Google eng practice (CL equals to PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change small (examples, tests and documentation may be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@kylasa kylasa requested a review from Rhett-Ying March 13, 2023 23:02
@kylasa kylasa self-assigned this Mar 13, 2023

dgl-bot commented Mar 13, 2023

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master


dgl-bot commented Mar 13, 2023

Commit ID: 8cb0bb4

Build ID: 1

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link


dgl-bot commented Mar 13, 2023

Commit ID: ebbcc03

Build ID: 2

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@frozenbugs frozenbugs requested review from frozenbugs and removed request for Rhett-Ying March 14, 2023 07:44
@frozenbugs

Is there a way to verify the following statement in the unit test?

NumPy buffers are more reliably freed by Python's garbage collector when their allocation is wrapped in a function: on function exit the collector reclaims them and the memory becomes available to the system.
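One way to check this in a unit test is with a weak reference to the intermediate buffer: once the wrapping function returns and the collector runs, the weak reference should be dead. A minimal sketch (the helper name is hypothetical, not from convert_partition.py):

```python
import gc
import weakref

import numpy as np


def build_small_result(n):
    # Large intermediate buffer allocated inside the function scope.
    big = np.zeros(n, dtype=np.int64)
    small = big[:10].copy()  # a copy, not a view, so `big` can die
    return small, weakref.ref(big)


def test_intermediate_buffer_is_freed():
    small, big_ref = build_small_result(1_000_000)
    gc.collect()
    # The weak reference is dead once the buffer has been reclaimed.
    assert big_ref() is None
    assert small.shape == (10,)
```

Returning a copy rather than a view is what lets the large buffer die; a view would keep the whole buffer alive.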

offset to be used when assigning edge global ids in the current partition
return_orig_ids : bool, optional
Indicates whether to return original node/edge IDs.
schema : json dictionary

Can you follow the DGL convention, e.g.

labels: dict[str, Tensor]
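Applied to this docstring, the suggested convention might look like the sketch below (the parameter set and return values are illustrative, not the full signature):

```python
def create_dgl_object(schema, part_id, return_orig_ids=False):
    """Create the DGL graph object for one partition (docstring sketch).

    Parameters
    ----------
    schema : dict
        Dictionary created by reading the metadata.json file for the
        current dataset.
    part_id : int
        Id of the partition being processed.
    return_orig_ids : bool, optional
        Indicates whether to return original node/edge IDs.

    Returns
    -------
    etypes_map : dict[str, int]
        Map between edge type (str) and edge_type_id (int).
    orig_nids : dict[str, Tensor]
        Present only when ``return_orig_ids=True``; keys are node types.
    """
```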


Same for others.


and other types, such as numpy array -> numpy.array

schema : json dictionary
dictionary object created by reading the metadata.json file for the
current dataset
part_id : integer

int

return_orig_ids : bool, optional
Indicates whether to return original node/edge IDs.
schema : json dictionary
dictionary object created by reading the metadata.json file for the

Capitalize the first char.
Same for others.

numpy array :
shuffle_global_nids, assigned after the data-shuffling phase, for the
nodes in the current partition
tuple :

tuple of what? follow the convention: tuple[blablabla]

gc.collect()
logging.info(
f"There are {len(shuffle_global_src_id)} edges in partition {part_id}"
f"[Rank: {part_id}] There are {len(shuffle_global_src_id)} edges in partition {part_id}"

too long, over 80 chars.
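A common way to stay under the limit is implicit concatenation of adjacent f-strings (placeholder data below, just to show the wrapping):

```python
import logging

part_id = 0
shuffle_global_src_id = [10, 11, 12]  # placeholder edge ids

# Adjacent f-strings are concatenated, keeping each source line
# under 80 characters.
msg = (
    f"[Rank: {part_id}] There are "
    f"{len(shuffle_global_src_id)} edges in partition {part_id}"
)
logging.getLogger(__name__).info(msg)
```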

shuffle_global_dst_id = None
global_src_id = None
global_dst_id = None
# global_edge_id = None

ditto.

return_orig_eids=False,
):
"""

unnecessary blank line.

Returns:
--------
dgl object

which object?


please fix the types above and below according to the comment.

dictionary
map between edge type(string) and edge_type_id(int)
dict of tensors
If `return_orig_nids=True`, return a dict of 1D tensors whose key is the node type

over 80 chars.

dstids = dstids - nids[0] + offset

# return the values
return uniques, idxes, srcids, dstids

[Optional] Since you changed almost 80% of the file, I'd suggest also updating the other pieces to follow the convention.

)
memory_snapshot("ShuffleGlobalID_Lookup_Complete: ", rank)

def prepare_local_data(src_data, local_part_id):

Add a unit test for this? Make sure the src_data entries are properly popped?
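A test along these lines could look like the sketch below (the stand-in prepare_local_data only mimics the pop-on-consume behaviour being asked about; the real function's signature may differ):

```python
def prepare_local_data(src_data, local_part_id):
    # Hypothetical stand-in: consume and remove this partition's entry
    # so its memory can be reclaimed.
    return src_data.pop(local_part_id)


def test_prepare_local_data_pops_source():
    src_data = {0: "part0-features", 1: "part1-features"}
    out = prepare_local_data(src_data, 0)
    assert out == "part0-features"
    assert 0 not in src_data  # the consumed entry was removed
```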

REV_DATA_TYPE_ID,
)

DATA_TYPE_ID = {

What's the difference between this DATA_TYPE_ID and utils.DATA_TYPE_ID? Why do you need to create a new one?

... as part of improving the unit-test coverage of the data pre-processing pipeline.
