Forum Discussion

Issac
Expert
2 months ago
Solved

Automatic CUG Addition/Removal

Hi All,

Is it possible to automatically remove collectors from a CUG if CPU/memory utilization has gone above 90%?

Device failover should be driven by Data Collector (DC) load and performance as well, not only by a collector being down.

12 Replies

  • Hello Isaac,

    Currently, aligning and removing collectors from a collector group is a manual process. Removing a collector when its CPU or memory utilization is over 90% is not advisable, because we do not know the root cause of the performance load. For example, if a single misconfigured device is causing CPU to spike above 90%, removing the collector will simply load-balance that device to a different collector, and that new collector's CPU will then spike over 90% as well. Not to mention the added load of redistributing the remaining devices across the surviving collectors.
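    As a toy illustration of that cascade (invented numbers, plain Python, not SL1 behavior):

    ```python
    # Each collector carries some CPU load; "removing" the hottest collector
    # spreads its load evenly across the survivors. Numbers are made up.
    collectors = {"dc1": 92.0, "dc2": 70.0, "dc3": 68.0}  # CPU % per collector
    THRESHOLD = 90.0

    hot = max(collectors, key=collectors.get)
    if collectors[hot] > THRESHOLD:
        load = collectors.pop(hot)      # remove dc1 from the group
        share = load / len(collectors)  # its devices land on the rest
        for dc in collectors:
            collectors[dc] += share

    print(collectors)  # {'dc2': 116.0, 'dc3': 114.0} -- now everyone is over 90%
    ```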

    If you do see a collector consistently at over 90%, adding a collector generally helps with spreading the load. If that still does not work, please submit a case to ScienceLogic Support and we can investigate the root cause of the issue and provide a solution.

    Antonio Andres

    Principal Technical Support Engineer | ScienceLogic

    • Issac
      Expert

      Hi Andres,

      At a high level, these factors must be considered for load balancing. It is not only a DC going down that can cause a service outage; a resource crunch also affects polling of end devices, leading to missed polls where data and alerts can be lost.

      • TonyAndres
        Moderator

        Hello Isaac,

        At the moment, the CUG load balancing does not have this capability. I recommend submitting this as a feature enhancement to the Ideas Hub here in the Nexus Community, so that our Product Management team is aware there is a desire for increased performance considerations in regard to CUG load balancing.

        Antonio Andres

        Principal Technical Support Engineer | ScienceLogic

  • If there were a feature that facilitated this, I would suggest it be triggered via configurable runbook actions, rather than something built directly into the CUG setup. For instance, in the use case given, if you oversubscribe a CUG and all of the collectors are running over a specific CPU utilization, you would not want all of your collectors removed from the group or put into a failed state. That would create complete data loss, and you would be much worse off than with some data gaps or delayed data. Lots of use cases would need to be explored to develop this kind of feature; a rough sketch of the kind of guard such an action would need follows this post.

    Jason.
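    A minimal sketch of that guard, assuming per-collector CPU figures are available to the runbook action (the names and thresholds here are hypothetical, not SL1 APIs):

    ```python
    # Guard for an auto-removal runbook action (hypothetical logic).
    # Refuse to pull a collector out of the CUG if too many of its peers
    # are also hot -- otherwise the whole group drains and all data is lost.

    CPU_THRESHOLD = 90.0    # percent, per the thread's example
    MAX_HOT_FRACTION = 0.5  # never act if half the group or more is already hot

    def should_remove(collector: str, cpu_by_collector: dict[str, float]) -> bool:
        hot = [c for c, cpu in cpu_by_collector.items() if cpu > CPU_THRESHOLD]
        if len(hot) / len(cpu_by_collector) >= MAX_HOT_FRACTION:
            return False  # group-wide overload: removal would only cascade
        return cpu_by_collector.get(collector, 0.0) > CPU_THRESHOLD

    # dc1 is hot but its peers have headroom, so removal is allowed:
    print(should_remove("dc1", {"dc1": 95.0, "dc2": 60.0, "dc3": 55.0}))  # True
    # two of three collectors are hot, so the guard refuses to act:
    print(should_remove("dc1", {"dc1": 95.0, "dc2": 92.0, "dc3": 55.0}))  # False
    ```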

    • Issac
      Expert

      Yes, I agree. Some logic would need to be defined for this use case as well.

  • Issac, if this is something you want in the product, we should come up with some reasonable use cases, and then you can write it up in depth in the Ideas Hub. But this would take a long time to implement and, in my opinion, has the potential to cause more issues than it solves. That said, it is going to be a tough sell to the product management team in charge of CUG management.

    • Issac
      Expert

      Jasonkeck-GDIT, assume you have a CUG running with 6 collectors and 700 devices each. If one of your collector's processes is hung or busy, polling of its 700 devices is impacted, even though collector availability still shows as up. In that case you get data gaps for 700 devices, and alerts are missed as well.

      If we had a smart way of actually checking core process load and performance, failover should happen automatically. For example, if MySQL is consuming more than 90% CPU, we should fail the devices over to other collectors and take the collector out of the CUG; after that it can recycle its own services.
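      A minimal sketch of that kind of check, assuming psutil is available on the collector; the failover hooks are made-up placeholders, not real SL1 calls:

      ```python
      import time
      import psutil  # assumed available on the collector

      CPU_THRESHOLD = 90.0  # percent, per the MySQL example above

      def process_cpu_percent(name: str) -> float:
          """Total CPU% across all processes named `name`, e.g. 'mysqld'."""
          procs = [p for p in psutil.process_iter(['name']) if p.info['name'] == name]
          for p in procs:
              p.cpu_percent(None)  # prime the per-process counters
          time.sleep(1.0)          # sample over a one-second window
          return sum(p.cpu_percent(None) for p in procs)

      def fail_over_and_recycle() -> None:
          # Placeholder steps -- SL1 exposes no public API for these.
          print("fail devices over to peer collectors")
          print("remove this collector from the CUG")
          print("recycle local services, then rejoin the CUG")

      if process_cpu_percent('mysqld') > CPU_THRESHOLD:
          fail_over_and_recycle()
      ```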

      • Ahhh, I see where you are going: not just removing the collector from the CUG, but following a workflow like this.

        If zero collectors are currently going through this process: run load balancing without the collector that triggered it > wait 2 minutes for current collections to finish (prevents data loss) > reboot the server and wait 5 minutes > validate that rows behind on the server = zero > mark the collector as working > trigger the load balancer with this collector back in the CUG to re-align devices to it for collection. If anything fails, leave the collector out and create a critical event for the SL1 administrators; if everything works as expected, create a minor event for the SL1 administrators to review the health of the CUG.
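        In rough Python (every helper here is a hypothetical stub, not a real SL1 call):

        ```python
        import time

        # All helpers are stand-in stubs -- SL1 exposes no public API by these names.
        def any_collector_in_remediation(cug): return False
        def rebalance(cug, exclude): print(f"rebalance CUG, excluding {exclude}")
        def reboot(dc): print(f"reboot {dc}")
        def rows_behind(dc): return 0
        def create_event(severity, msg): print(f"[{severity}] {msg}")

        def remediate(collector, cug):
            """Skeleton of the workflow described above."""
            if any_collector_in_remediation(cug):
                return                          # only one collector at a time
            rebalance(cug, exclude=collector)   # shed its devices to the peers
            time.sleep(120)                     # let in-flight collections finish
            reboot(collector)
            time.sleep(300)                     # give the DC time to come back
            if rows_behind(collector) == 0:     # data queue fully drained?
                rebalance(cug, exclude=None)    # re-align devices onto it
                create_event("minor", "collector recovered; review CUG health")
            else:
                create_event("critical", f"{collector} left out of CUG")

        remediate("dc1", ["dc1", "dc2", "dc3"])
        ```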

        Does a process like that make sense to you?

  • Sounds like a really cool use case that Skylar Analytics should be able to handle.