-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Description
Jira Link: DB-9190
Description
The code for MasterPathHandlers::RenderLoadBalancerViewPanel causes the TServer to crash under some conditions.
In particular, the following line can throw:
~/code/yugabyte-db/src/yb/master/master-path-handlers.cc:3450:
const auto& tserver_table_to_replicas_mapping = tserver_tree.at(tserver_id);
This is because we are looking up the TServer ID for a TServer returned by:
~/code/yugabyte-db/src/yb/master/master-path-handlers.cc:3101:
ts_manager->GetAllDescriptors(&descs);
that returns all the TServers that have ever registered since the start of this process except that TServers that were replaced by a new TServer at the same RPC address.
However, the map we are looking it up in only includes TServers for tablets belonging to running tables:
Status MasterPathHandlers::CalculateTServerTree(TServerTree* tserver_tree, int max_table_count) {
auto tables = master_->catalog_manager()->GetTables(GetTablesMode::kRunning);
If we have moved everything off of a node and then decommissioned it, then there will be no tables containing tablets on that node but it will still be in the list of registered TServers. This will cause the at to throw, which currently crashes the TServer because it's not caught.
The documentation for the GetAllDescriptors call is misleading:
/nfusr/dev-server/mlillibridge/code/yugabyte-db/src/yb/master/ts_manager.h:108:
// Return all of the currently registered TS descriptors into the provided list.
void GetAllDescriptors(TSDescriptorVector* descs) const;
This would seem to imply that only active TServers are on the list but that's not true.
It takes some digging, but you can discover things like:
/nfusr/dev-server/mlillibridge/code/yugabyte-db/src/yb/master/ts_manager.h:73:
// Note that TSDescriptors are never deleted, even if the TS crashes
// and has not heartbeated in quite a while. This makes it simpler to
// keep references to TSDescriptors elsewhere in the master without
// fear of lifecycle problems. Dead servers are "dead, but not forgotten"
// (they live on in the heart of the master).
The list of registered TServers is kept in servers_by_id_, with TServers never being removed from this list. Sometimes TServers on the list are marked as removed but that only happens when a TServer replaces an old one with the same RPC address.
Consider if we should update the comments to make future mistakes like this less likely.
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
- I confirm this issue does not contain any sensitive information.