High load improvements
Debug sentry error: https://push-notifications-sentry.web.cern.ch/sentry/consumers/issues/1190
Problems identified:
1. Non-thread-safe code.
Python-megabus creates a thread per connection. Only the production setup has multiple connections/hosts per queue, so the problem was only detected in production.
The etcd client is shared as a global variable and is used by all threads. This leads to a race condition in which all threads trigger the re-authenticate call when the token expires.
2. Race condition on the python-megabus reconnect loop.
Python-megabus has a reconnect loop, currently running at the default interval of 5 minutes. This loop exists to deal with changes to the dynamic addresses of the queues, e.g. mb123.cern.ch, mb456.cern.ch, etc. So, every 5 minutes the loop disconnects all connections, fetches the dynamic hosts list and connects again.
Because we are using the subscribe mode client-individual, this loop leads to a race condition between ack/nacks and disconnects.
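A minimal sketch of how problem 1 can be avoided, assuming a token-based client guarded by a lock. FakeAuthClient and SharedClient are hypothetical names used for illustration only; they are not the etcd client API nor the code in this MR.

```python
import threading
import time


class FakeAuthClient:
    """Hypothetical stand-in for a token-based client (e.g. an etcd client)."""

    def __init__(self):
        self.auth_calls = 0
        self.token_expires_at = 0.0  # token starts out expired

    def authenticate(self):
        self.auth_calls += 1
        self.token_expires_at = time.monotonic() + 60.0


class SharedClient:
    """Wraps a client so that only one thread refreshes an expired token."""

    def __init__(self, client):
        self._client = client
        self._lock = threading.Lock()

    def _token_expired(self):
        return time.monotonic() >= self._client.token_expires_at

    def ensure_authenticated(self):
        if not self._token_expired():
            return
        with self._lock:
            # Re-check inside the lock: another thread may have refreshed already.
            if self._token_expired():
                self._client.authenticate()


def run_demo(thread_count=8):
    """Start several threads at once and return the number of re-auth calls."""
    client = FakeAuthClient()
    shared = SharedClient(client)
    barrier = threading.Barrier(thread_count)

    def worker():
        barrier.wait()  # release all threads at the same moment
        shared.ensure_authenticated()

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return client.auth_calls
```

With the check repeated inside the lock, the threads that were waiting find a fresh token and skip the re-authenticate call.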
Changes:
- Add ActiveMQ Docker logging config for development
- Add ClientIndividualConsumer and ClientIndividualListener. These classes:
  - create a listener per connection, so that no single listener instance is shared among all threads
  - let each listener call ack/nack on its own connection
  - clean up the connections list (it was growing forever, with new connections appended every 5 minutes)
- Catch BrokenPipeError until stomp.py has a proper fix for the race condition (a MR has been provided).
- Upgrade from CentOS Stream 8 to CentOS Stream 9
- Bump Python from 3.6 (EOL 23 Dec 2021) to 3.9 (EOL 05 Oct 2025)
- Replace pycrypto (unmaintained, obsolete, and contains security vulnerabilities) with pycryptodome
- Log thread_id in all logs
- Make Auditing a class instance, property of the processor instance
- Share the SQLAlchemy engine and connection pool (set as a class property, protected by a lock) to avoid multiple calls to automap_base, which would throw a SQLAlchemy warning
- Retry the etcd client on external disconnect or on failure to create the client
- Fix typo: Uppercase priority coming in email gateway
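The per-connection listener idea from the changes above can be sketched roughly as follows. This is an illustration under assumptions, not the actual ClientIndividualConsumer/ClientIndividualListener code: FakeConnection, PerConnectionListener and Consumer are hypothetical names, and the connection is a stand-in rather than a stomp.py object.

```python
class FakeConnection:
    """Hypothetical stand-in for a broker connection."""

    def __init__(self, host):
        self.host = host
        self.connected = True
        self.acked = []
        self.nacked = []

    def ack(self, message_id):
        self.acked.append(message_id)

    def nack(self, message_id):
        self.nacked.append(message_id)

    def disconnect(self):
        self.connected = False


class PerConnectionListener:
    """Listener bound to exactly one connection; it acks/nacks only there."""

    def __init__(self, connection, handler):
        self._connection = connection
        self._handler = handler

    def on_message(self, message_id, body):
        try:
            self._handler(body)
        except Exception:
            self._connection.nack(message_id)
        else:
            self._connection.ack(message_id)


class Consumer:
    """Keeps one listener per connection and rebuilds both on reconnect."""

    def __init__(self, handler):
        self._handler = handler
        self.connections = []
        self.listeners = []

    def reconnect(self, hosts):
        for connection in self.connections:
            connection.disconnect()
        # Rebuild the lists instead of appending, so they do not grow forever.
        self.connections = [FakeConnection(host) for host in hosts]
        self.listeners = [
            PerConnectionListener(connection, self._handler)
            for connection in self.connections
        ]
```

Because each listener holds a reference to its own connection, an ack or nack never goes through a connection owned by another thread.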
TODO:
- Create a MR to python-megabus proposing the new classes, once the fix is merged in stomp.py: https://github.com/jasonrbriggs/stomp.py/issues/393
- Only reconnect if the host list changes (improvement for later)
- Look into adapting the current SQLAlchemy implementation to use DeferredReflection: https://docs.sqlalchemy.org/en/14/orm/extensions/declarative/#sqlalchemy.ext.declarative.DeferredReflection
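For the "only reconnect if the host list changes" item, one possible shape is sketched below. HostWatcher, resolve_hosts and reconnect are hypothetical names; how python-megabus actually resolves the dynamic hosts is not shown here.

```python
class HostWatcher:
    """Calls reconnect only when the resolved set of hosts differs."""

    def __init__(self, resolve_hosts, reconnect):
        self._resolve_hosts = resolve_hosts  # callable returning an iterable of hosts
        self._reconnect = reconnect          # callable taking a sorted list of hosts
        self._known_hosts = frozenset()

    def check(self):
        """Run once per loop iteration; return True if a reconnect happened."""
        resolved = frozenset(self._resolve_hosts())
        if resolved == self._known_hosts:
            return False  # same hosts (order ignored): keep connections open
        self._known_hosts = resolved
        self._reconnect(sorted(resolved))
        return True
```

Comparing sets rather than lists means a reordered DNS answer would not trigger a disconnect, which would shrink the window for the ack/nack race described above.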