
High load improvements

Carina Antunes requested to merge debug-sentry-1190 into master

Debug sentry error: https://push-notifications-sentry.web.cern.ch/sentry/consumers/issues/1190

Problems identified:

1. Non-thread-safe code.

Python-megabus creates one thread per connection. Only the production setup has multiple connections/hosts per queue, so the problem was only detected in production.

The etcd client is shared as a global variable that all threads use. This leads to a race condition in which every thread triggers the re-authenticate call when the token expires.
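
One way to avoid the re-authentication stampede described above is to guard the refresh with a lock and a generation counter, so that only the first thread to notice the expired token refreshes it. This is a minimal sketch and not the code in this MR: FakeEtcdClient, SharedClient and the generation counter are hypothetical names, and the real etcd client API is not used here.

```python
import threading


class FakeEtcdClient:
    """Hypothetical stand-in for an etcd client; it only counts authenticate calls."""

    def __init__(self):
        self.auth_calls = 0
        self.token = None

    def authenticate(self):
        self.auth_calls += 1
        self.token = f"token-{self.auth_calls}"


class SharedClient:
    """Wraps a client shared by many threads so an expired token is refreshed once."""

    def __init__(self, client):
        self._client = client
        self._lock = threading.Lock()
        self._generation = 0  # incremented after every re-authentication

    @property
    def generation(self):
        return self._generation

    def reauthenticate(self, seen_generation):
        """Refresh the token unless another thread already did.

        seen_generation is the generation the caller observed when its
        request failed because the token had expired.
        """
        with self._lock:
            if seen_generation != self._generation:
                return False  # another thread refreshed the token in the meantime
            self._client.authenticate()
            self._generation += 1
            return True
```

If eight threads all observe generation 0 and call reauthenticate(0), only the first one to take the lock calls authenticate(); the others return False and can simply retry their request with the new token.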

2. Race condition on the python-megabus reconnect loop.

Python-megabus has a reconnect loop, currently running at the default interval of 5 minutes. This loop exists to deal with changes to the dynamic addresses of the queues, i.e. mb123.cern.ch, mb456.cern.ch, etc. So, every 5 minutes the loop disconnects all connections, fetches the dynamic hosts list and connects again.

Because we use the client-individual subscribe mode, this loop leads to a race condition between ack/nack calls and disconnects.
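
The race can be illustrated with a listener that is bound to exactly one connection and tolerates that connection being closed underneath it. This is only a sketch under stated assumptions: FakeConnection and PerConnectionListener are hypothetical stand-ins, not the stomp.py or python-megabus API, and the failure mode is simulated by raising BrokenPipeError from ack().

```python
class FakeConnection:
    """Hypothetical stand-in for a broker connection (not the stomp.py API)."""

    def __init__(self, name):
        self.name = name
        self.connected = True
        self.acked = []

    def ack(self, message_id):
        if not self.connected:
            # Simulates writing to a socket that the reconnect loop already closed.
            raise BrokenPipeError(f"{self.name} is disconnected")
        self.acked.append(message_id)

    def disconnect(self):
        self.connected = False


class PerConnectionListener:
    """Listener bound to one connection; it acks only on that connection."""

    def __init__(self, connection):
        self._connection = connection

    def on_message(self, message_id):
        try:
            self._connection.ack(message_id)
            return True
        except BrokenPipeError:
            # The connection was closed before the ack went out. The message
            # stays unacknowledged, so the broker would normally redeliver it.
            return False
```

With one listener per connection, a disconnect on one connection affects only the acks of that connection's listener; messages received on the other connections are still acknowledged where they arrived.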


Changes:

  • Add ActiveMQ Docker logging config for development
  • Add ClientIndividualConsumer and ClientIndividualListener. These classes:
    • create one listener per connection, so that no single listener instance is shared among all threads
    • let each listener call ack/nack on its own connection
    • clean up the connections list (it was growing forever, with new connections appended every 5 minutes)
  • Catch BrokenPipeError until stomp.py has a proper fix for the race condition. An MR has been provided.
  • Upgrade from CentOS Stream 8 to CentOS Stream 9
  • Bump Python 3.6 (EOL was 23 Dec 2021) to 3.9 (EOL 05 Oct 2025)
    • Replace pycrypto (unmaintained, obsolete, and containing security vulnerabilities) with pycryptodome
  • Log thread_id in all logs
  • Make Auditing a class instance, property of the processor instance
  • Share the SQLAlchemy engine and connection pool (set as a class property, protected by a lock) to avoid multiple calls to automap_base, which would raise a SQLAlchemy warning
  • Retry the etcd client on external disconnect or failure to create the client
  • Fix typo: Uppercase priority coming in email gateway
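
As an illustration of the "Log thread_id in all logs" item, the standard logging module can include the thread identifier through the %(thread)d record attribute. This is a minimal sketch; the format string and the build_logger helper are assumptions and not the configuration used in this MR.

```python
import logging
import sys
import threading

# %(thread)d is the standard LogRecord attribute holding the thread identifier.
LOG_FORMAT = "%(asctime)s %(levelname)s [thread_id=%(thread)d] %(name)s: %(message)s"


def build_logger(name, stream=sys.stderr):
    """Return a logger whose records carry the id of the emitting thread."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the example output in one place
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("consumer")
    worker = threading.Thread(target=lambda: log.info("processing message"))
    worker.start()
    worker.join()
```

With one thread per connection, the thread id in each line makes it possible to tell which connection a given ack, nack or disconnect log entry belongs to.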

TODO:

Edited by Carina Antunes
