[DistDGL][Robustness]Use appropriate delimiter when reading edge files. #5447
base: master
Conversation
To trigger regression tests:
1. Revert back to the original pytest_utils.py. This will remove the random delimiter used when creating edge files.
thvasilo left a comment
I've mentioned this before: the amount of setup required for these tests is very large, and they seem to duplicate the implementation. As a result, the tests are brittle, and refactoring the code will definitely require changing the tests as well.
I'm also concerned that the values noted in the comments do not seem to follow an expected pattern, yet the tests pass. Are we sure we are testing the right things here?
Please take another look at the comments, fix what is needed, and we can take a second look.
Let's ask the question: what is the minimum amount of setup needed to test the functionality that this code change adds?
assert np.all(exp_etype_ids == edge_dict[constants.ETYPE_ID])

# validate edge_tids here.
assert edge_tids["n1:e1:n1"][0] == (
For constants like "n1:e1:n1", it's better to define them as a file-level constant, e.g. EDGE_TYPE = "n1:e1:n1", and use the variable throughout the code.
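A minimal sketch of the suggestion (the constant name is illustrative):

EDGE_TYPE = "n1:e1:n1"  # module-level constant for the single edge type used in these tests

# The assertion above then becomes:
expected_range = ...  # whatever the test currently hard-codes for this edge type
assert edge_tids[EDGE_TYPE][0] == expected_range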
def _validate_edges(
    rank, world_size, num_chunks, edge_dict, edge_tids, schema_map
edge_dict is not in the docstring
edge_feats.append(data)

if len(edge_feats) == 0:
    actual_results = edge_features["n1:e1:n1/edge_feat_1/0"]
Same here: define constants and use something like f"{EDGE_TYPE}/{FEATURE_NAME}/0".
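For example (FEATURE_NAME is an illustrative constant name mirroring the "edge_feat_1" literal):

EDGE_TYPE = "n1:e1:n1"
FEATURE_NAME = "edge_feat_1"

actual_results = edge_features[f"{EDGE_TYPE}/{FEATURE_NAME}/0"]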
else:
    edge_feats = np.concatenate(edge_feats)

# assert
Remove this comment.
edge_feats = np.concatenate(edge_feats)

# assert
assert np.all(
Use numpy.testing.assert_array_equal
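For example, replacing the plain assert (the operand names below are placeholders for the arrays compared in the test):

import numpy as np

# assert_array_equal reports the mismatching indices and values on failure,
# which is far more useful than a bare "assert np.all(expected == actual)".
np.testing.assert_array_equal(expected, actual)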
schema["edge_type"] = ["n1:e1:n1"]
schema["node_type"] = ["n1"]

edges = {}
edges["n1:e1:n1"] = {}
edges["n1:e1:n1"]["format"] = {}
edges["n1:e1:n1"]["format"]["name"] = edge_fmt
edges["n1:e1:n1"]["format"]["delimiter"] = edge_fmt_del
Use constants
# Make sure that the spawned process, mimicing ranks/workers, did
# not generate any errors or assertion failures
assert len(sh_dict) == 0, f"Spawned processes reported some errors !!!"
How does the user know which errors were encountered?
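One option would be to include the collected errors in the assertion message, e.g. (assuming sh_dict maps each rank to the error it captured):

assert len(sh_dict) == 0, (
    f"Spawned processes reported errors: {dict(sh_dict)}"
)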
@pytest.mark.parametrize(
    "world_size, num_chunks, num_parts", [[1, 1, 4], [4, 1, 4]]
Why do we keep num_chunks at 1? Can we add one case for [4, 4, 4]?
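For example, a multi-chunk case could be added to the parametrization (assuming the rest of the fixture supports num_chunks > 1):

@pytest.mark.parametrize(
    "world_size, num_chunks, num_parts",
    [[1, 1, 4], [4, 1, 4], [4, 4, 4]],
)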
data = np.arange(10, dtype=np.int32)
for _ in range(9):
    data = np.vstack((data, np.arange(10, dtype=np.int32) + 10 * idx))
Is this correct?
idx = 0
In [40]: data
Out[40]:
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]], dtype=int32)
idx = 1
In [43]: data
Out[43]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]], dtype=int32)
In [44]: idx = 2
In [46]: data
Out[46]:
array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29]], dtype=int32)
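Because the loop variable is unused and idx is fixed, every iteration stacks the same row, which matches the output above. If each of the 10 rows was meant to be distinct, a possible sketch (the intended pattern is an assumption, not taken from the PR) would advance the offset per row:

# hypothetical: row r of chunk idx holds values 10 * (10 * idx + r) .. 10 * (10 * idx + r) + 9
data = np.vstack(
    [np.arange(10, dtype=np.int32) + 10 * (10 * idx + row) for row in range(10)]
)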
data = np.arange(100, 110, dtype=np.int64)
for _ in range(9):
    data = np.vstack(
        (data, np.arange(100, 110, dtype=np.int64) + 100 * idx)
    )
node_feats.append(data)
Is this correct?
idx = 0
In [48]: data = np.arange(100, 110, dtype=np.int64)
...: for _ in range(9):
...: data = np.vstack(
...: (data, np.arange(100, 110, dtype=np.int64) + 100 * idx)
...: )
...: data
Out[48]:
array([[100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
[100, 101, 102, 103, 104, 105, 106, 107, 108, 109]])
In [49]: idx = 1
In [50]: data = np.arange(100, 110, dtype=np.int64)
...: for _ in range(9):
...: data = np.vstack(
...: (data, np.arange(100, 110, dtype=np.int64) + 100 * idx)
...: )
...: data
Out[50]:
array([[100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
[200, 201, 202, 203, 204, 205, 206, 207, 208, 209],
[200, 201, 202, 203, 204, 205, 206, 207, 208, 209],
[200, 201, 202, 203, 204, 205, 206, 207, 208, 209],
[200, 201, 202, 203, 204, 205, 206, 207, 208, 209],
[200, 201, 202, 203, 204, 205, 206, 207, 208, 209],
[200, 201, 202, 203, 204, 205, 206, 207, 208, 209],
[200, 201, 202, 203, 204, 205, 206, 207, 208, 209],
[200, 201, 202, 203, 204, 205, 206, 207, 208, 209],
[200, 201, 202, 203, 204, 205, 206, 207, 208, 209]])
In [51]: idx = 2
In [52]: data = np.arange(100, 110, dtype=np.int64)
...: for _ in range(9):
...: data = np.vstack(
...: (data, np.arange(100, 110, dtype=np.int64) + 100 * idx)
...: )
...: data
Out[52]:
array([[100, 101, 102, 103, 104, 105, 106, 107, 108, 109],
[300, 301, 302, 303, 304, 305, 306, 307, 308, 309],
[300, 301, 302, 303, 304, 305, 306, 307, 308, 309],
[300, 301, 302, 303, 304, 305, 306, 307, 308, 309],
[300, 301, 302, 303, 304, 305, 306, 307, 308, 309],
[300, 301, 302, 303, 304, 305, 306, 307, 308, 309],
[300, 301, 302, 303, 304, 305, 306, 307, 308, 309],
[300, 301, 302, 303, 304, 305, 306, 307, 308, 309],
[300, 301, 302, 303, 304, 305, 306, 307, 308, 309],
[300, 301, 302, 303, 304, 305, 306, 307, 308, 309]])
Description
Webgraph dataset edge files use '\t' as the delimiter, but the pipeline assumes ' ' as the delimiter by default and therefore fails to read webgraph files.
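A minimal sketch of the idea, assuming a pandas-based reader; read_edges and the column names are illustrative, not the actual pipeline API:

import pandas as pd

def read_edges(path, delimiter=" "):
    # Use the delimiter declared in the dataset metadata instead of
    # hard-coding a single space; webgraph edge files use "\t".
    return pd.read_csv(path, sep=delimiter, header=None, names=["src_id", "dst_id"])

edges = read_edges("edges_chunk_0.txt", delimiter="\t")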