[00:32:45] <wikibugs>	 (CR) Dzahn: "I was about to say the name needs to be added to the cert, but then i saw we already have etherpad.discovery.wmnet in files/ssl even." [dns] - https://gerrit.wikimedia.org/r/631557 (owner: Dzahn)
[00:36:30] <wikibugs>	 (CR) Dzahn: "~/puppet/files/ssl$ openssl x509 -in etherpad.discovery.wmnet.crt -noout -text | grep DNS" [dns] - https://gerrit.wikimedia.org/r/631557 (owner: Dzahn)
[00:37:17] <wikibugs>	 (CR) Dzahn: "[etherpad1002:/etc/envoy/listeners.d] $ grep etherpad 00-tls_terminator_7443.yaml" [puppet] - https://gerrit.wikimedia.org/r/631555 (owner: Dzahn)
[00:39:59] <wikibugs>	 (PS1) Dzahn: add etherpad-next.discovery, point to etherpad1003 [dns] - https://gerrit.wikimedia.org/r/631559
[00:41:25] <wikibugs>	 (PS1) Dzahn: site: add etherpad role to etherpad1003 [puppet] - https://gerrit.wikimedia.org/r/631560
[00:45:31] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:14] <wikibugs>	 Operations, DNS, Matrix, Traffic, Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (CDanis) Hey, sorry for the delay, we should be able to deploy this tomorrow.
[00:50:25] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:52] <icinga-wm>	 PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:42:08] <icinga-wm>	 PROBLEM - Check the last execution of search-drop-query-clicks on stat1007 is CRITICAL: CRITICAL: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[04:27:55] <wikibugs>	 (CR) Ammarpad: "The commit summary does not say much and maybe is not quite accurate. But I am assuming this only applies to links for CI test results jus" [puppet] - https://gerrit.wikimedia.org/r/631237 (owner: Paladox)
[04:34:56] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 59.37 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:36:34] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 72.44 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:56:02] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:57:42] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:08:26] <wikibugs>	 Operations, ops-eqiad, netops, User-Kormat, User-jijiki: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (Marostegui)
[05:27:50] <wikibugs>	 (PS1) Marostegui: dns: Remove es2011 dns entries [dns] - https://gerrit.wikimedia.org/r/631566 (https://phabricator.wikimedia.org/T264261)
[05:28:45] <wikibugs>	 (PS1) Marostegui: instances.yaml: Remove es2011 from dbctl [puppet] - https://gerrit.wikimedia.org/r/631567 (https://phabricator.wikimedia.org/T264261)
[05:29:30] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Remove es2011 from dbctl [puppet] - https://gerrit.wikimedia.org/r/631567 (https://phabricator.wikimedia.org/T264261) (owner: Marostegui)
[05:30:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es2011 from dbctl T264261', diff saved to https://phabricator.wikimedia.org/P12893 and previous config saved to /var/cache/conftool/dbconfig/20201002-053020-marostegui.json
[05:30:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:27] <stashbot>	 T264261: decommission es2011.codfw.wmnet - https://phabricator.wikimedia.org/T264261
[05:33:55] <wikibugs>	 (PS1) Marostegui: es2026: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/631568
[05:34:29] <wikibugs>	 (CR) Marostegui: [C: +2] es2026: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/631568 (owner: Marostegui)
[05:41:08] <wikibugs>	 (PS1) Marostegui: mariadb: Remove es2011 puppet references [puppet] - https://gerrit.wikimedia.org/r/631569 (https://phabricator.wikimedia.org/T264261)
[05:41:25] <wikibugs>	 (PS2) Marostegui: mariadb: Remove es2011 puppet references [puppet] - https://gerrit.wikimedia.org/r/631569 (https://phabricator.wikimedia.org/T264261)
[05:43:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:43:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:57] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Remove es2011 puppet references [puppet] - https://gerrit.wikimedia.org/r/631569 (https://phabricator.wikimedia.org/T264261) (owner: Marostegui)
[05:48:30] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[05:48:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:00] <wikibugs>	 (CR) Marostegui: [C: +2] dns: Remove es2011 dns entries [dns] - https://gerrit.wikimedia.org/r/631566 (https://phabricator.wikimedia.org/T264261) (owner: Marostegui)
[05:50:34] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission-hardware: decommission es2011.codfw.wmnet - https://phabricator.wikimedia.org/T264261 (Marostegui) Ready for #dc-ops
[05:57:35] <wikibugs>	 (CR) Elukey: [C: +2] Set debian buster for stat100[467] [puppet] - https://gerrit.wikimedia.org/r/631544 (https://phabricator.wikimedia.org/T255028) (owner: Elukey)
[05:57:41] <wikibugs>	 (PS2) Elukey: Set debian buster for stat100[467] [puppet] - https://gerrit.wikimedia.org/r/631544 (https://phabricator.wikimedia.org/T255028)
[05:59:12] <wikibugs>	 Operations, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: ASAP) rack/setup/install clouddb10[13-20] - https://phabricator.wikimedia.org/T260441 (Marostegui) Great news, thank you @Cmjohnson! The racking plan looks good to me, we don't have much requirements other than trying to spr...
[06:20:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:46:12] <wikibugs>	 (CR) Elukey: [C: +2] Add page_props and user_properties to analytics sqoop [puppet] - https://gerrit.wikimedia.org/r/629070 (https://phabricator.wikimedia.org/T258047) (owner: Joal)
[06:47:50] <wikibugs>	 (PS1) Elukey: Set an-worker110[0-2] as Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/631683 (https://phabricator.wikimedia.org/T255138)
[06:51:15] <wikibugs>	 (CR) Elukey: [C: +2] Set an-worker110[0-2] as Hadoop workers [puppet] - https://gerrit.wikimedia.org/r/631683 (https://phabricator.wikimedia.org/T255138) (owner: Elukey)
[06:51:16] <_joe_>	 !log restarting php-fpm on all appservers in eqiad, in batches of 10%, for testing the procedure suggested at T264362
[06:51:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:22] <stashbot>	 T264362: Scap feature: restart php-fpm on deployment - https://phabricator.wikimedia.org/T264362
[06:52:40] <icinga-wm>	 PROBLEM - Host ganeti5002 is DOWN: PING CRITICAL - Packet loss = 100%
[06:52:59] <elukey>	 this is downtime expired, it has bitten me last week --^
[06:53:01] <_joe_>	 uh that wasn't me :P
[06:53:06] <_joe_>	 hah ok
[06:53:12] <elukey>	 :D
[06:53:19] <_joe_>	 why do we just downtime for a short time hosts that are completely down?
[06:53:40] <elukey>	 I added a week and reported in the task, didn't know how much time it was needed, checking again
[06:53:58] <elukey>	 https://phabricator.wikimedia.org/T261130
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201002T0700)
[07:03:42] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/631438 (https://phabricator.wikimedia.org/T226703) (owner: Filippo Giunchedi)
[07:07:18] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:07:23] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/631421 (https://phabricator.wikimedia.org/T264221) (owner: Arturo Borrero Gonzalez)
[07:08:43] <wikibugs>	 (PS1) Giuseppe Lavagetto: Reduce reconnectTimeout for etcd to 0.1 seconds [debs/pybal] (1.15) - https://gerrit.wikimedia.org/r/631686 (https://phabricator.wikimedia.org/T264362)
[07:12:26] <wikibugs>	 Operations, Editing-team, MassMessage, WMF-JobQueue, Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Joe) >>! In T93049#6501395, @Legoktm wrote: > While MassMessage is how users see the problem (e.g. no one...
[07:14:26] <wikibugs>	 (CR) Hashar: "I had the same stance as Kosta: target=_blank is an antipattern since browsers do not offer a way to NOT open in a new page whereas one ca" [puppet] - https://gerrit.wikimedia.org/r/631237 (owner: Paladox)
[07:20:45] <wikibugs>	 (Abandoned) Gehel: [wip] adding some type annotations [software/cumin] - https://gerrit.wikimedia.org/r/630202 (owner: Gehel)
[07:23:57] <moritzm>	 !log installing libx11 security updates on buster
[07:24:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:08] <wikibugs>	 (PS1) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[07:29:43] <godog>	 !log prometheus codfw/k8s, add 50G to the LV
[07:29:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:45] <wikibugs>	 (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:31:59] <wikibugs>	 (PS2) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[07:34:41] <wikibugs>	 (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:35:17] <godog>	 !log swift codfw-prod bump weight for ms-be2057 - T261633
[07:35:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:22] <stashbot>	 T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633
[07:37:39] <wikibugs>	 (PS3) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[07:40:17] <wikibugs>	 (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:40:26] <wikibugs>	 Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (MoritzMuehlenhoff)
[07:40:32] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] profile: remove rollout flag for rsyslog queues [puppet] - https://gerrit.wikimedia.org/r/631438 (https://phabricator.wikimedia.org/T226703) (owner: Filippo Giunchedi)
[07:40:57] <wikibugs>	 (PS4) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[07:42:05] <moritzm>	 !log installing libcommons-compress-java security updates
[07:42:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:39] <wikibugs>	 (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:50:40] <wikibugs>	 Operations: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 (MoritzMuehlenhoff)
[07:52:40] <wikibugs>	 Operations, vm-requests: Site: 4 VM request for LDAP replicas - https://phabricator.wikimedia.org/T264390 (MoritzMuehlenhoff)
[07:52:49] <wikibugs>	 (PS5) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[07:52:59] <wikibugs>	 Operations, vm-requests: Site: 4 VM request for LDAP replicas - https://phabricator.wikimedia.org/T264390 (MoritzMuehlenhoff) p:Triage→Medium
[07:53:16] <wikibugs>	 Operations: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 (MoritzMuehlenhoff) p:Triage→Medium
[07:54:56] <wikibugs>	 (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:55:39] <wikibugs>	 Operations, SRE-Access-Requests: Update public key for production shell - https://phabricator.wikimedia.org/T264392 (DED)
[07:56:06] <wikibugs>	 (PS6) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[07:58:24] <wikibugs>	 (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[07:59:01] <wikibugs>	 Operations, SRE-Access-Requests: Update public key for production shell - https://phabricator.wikimedia.org/T264392 (ArielGlenn) Note to whoever handles this: after seeing accountcheck email, I let the user know via irc that there was an issue, and this task was filed as a result.
[08:01:51] <wikibugs>	 Operations, SRE-Access-Requests: Update public key for production shell - https://phabricator.wikimedia.org/T264392 (MoritzMuehlenhoff) a:herron This got added in T263692, also assigning to Keith.
[08:07:08] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Wrong charset on mailman HTML (Japanese) - https://phabricator.wikimedia.org/T264384 (Aklapper)
[08:07:19] <wikibugs>	 Operations, Wikimedia-Mailing-lists, I18n: Several unreadable mailing list descriptions (Mojibake) due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (Aklapper)
[08:10:49] <wikibugs>	 (CR) Ema: [C: +1] Reduce reconnectTimeout for etcd to 0.1 seconds [debs/pybal] (1.15) - https://gerrit.wikimedia.org/r/631686 (https://phabricator.wikimedia.org/T264362) (owner: Giuseppe Lavagetto)
[08:16:17] <logmsgbot>	 !log dcausse@deploy1001 Started deploy [wikimedia/discovery/analytics@5713fb0]: Fix lexeme dumps expected date
[08:16:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:52] <logmsgbot>	 !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@5713fb0]: Fix lexeme dumps expected date (duration: 01m 35s)
[08:17:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:59] <wikibugs>	 (PS7) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[08:18:19] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:19:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:20:15] <wikibugs>	 (CR) jerkins-bot: [V: -1] Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[08:21:36] <wikibugs>	 (PS8) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[08:24:23] <moritzm>	 !log installing nginx security updates on puppetdb*
[08:24:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:17] <wikibugs>	 (CR) Alexandros Kosiaris: "Just out of curiosity, why not just try to import 5.2 instead? It's not that big a deal (couple of changes and cross fleet PCCs) and it wi" [puppet] - https://gerrit.wikimedia.org/r/631522 (owner: Dzahn)
[08:26:24] <wikibugs>	 Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (MoritzMuehlenhoff)
[08:29:21] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "Heh, thanks! I 've never had the time to do this more consistently, this is nice!" [puppet] - https://gerrit.wikimedia.org/r/631555 (owner: Dzahn)
[08:29:44] <moritzm>	 !log installing pyzmq bugfix update from buster point release
[08:29:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:56] <wikibugs>	 (PS9) Gehel: Introduce an interface for progress bars. [software/cumin] - https://gerrit.wikimedia.org/r/631702 (https://phabricator.wikimedia.org/T212783)
[08:29:58] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] site: add etherpad role to etherpad1003 [puppet] - https://gerrit.wikimedia.org/r/631560 (owner: Dzahn)
[08:30:09] <logmsgbot>	 !log dcausse@deploy1001 Started deploy [wikimedia/discovery/analytics@5713fb0]: Fix lexeme dumps expected date
[08:30:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:42] <logmsgbot>	 !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@5713fb0]: Fix lexeme dumps expected date (duration: 00m 33s)
[08:30:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:31] <wikibugs>	 Operations, vm-requests: Site: 4 VM request for LDAP replicas - https://phabricator.wikimedia.org/T264390 (akosiaris) /me rubberstamping. Thanks for this!
[08:36:02] <wikibugs>	 Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (JMeybohm)
[08:36:13] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move wikifeeds to use TLS only - https://phabricator.wikimedia.org/T255878 (JMeybohm)
[08:41:26] <logmsgbot>	 !log dcausse@deploy1001 Started deploy [wikimedia/discovery/analytics@5713fb0]: Test stat1007 deploy
[08:41:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:01] <logmsgbot>	 !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@5713fb0]: Test stat1007 deploy (duration: 00m 34s)
[08:42:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:06] <wikibugs>	 Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (elukey) a:RobH→Cmjohnson @Cmjohnson an-worker1111 seems to be in the wrong rack: cloudsw1-c8-eqiad.mgmt.eqiad.wmnet https://librenms.wikimedia....
[08:43:09] <wikibugs>	 Operations, Performance-Team, Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles)
[08:46:26] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Migrate Gerrit to profile::java - https://phabricator.wikimedia.org/T264182 (hashar) A few notes:  Gerrit runs with Java 8 for now. The reason, I believe, is that wh...
[08:47:49] <icinga-wm>	 RECOVERY - cassandra-b service on restbase1029 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:48:39] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.64.16.181:7001 on restbase1029 is OK: SSL OK - Certificate restbase1029-b valid until 2022-09-29 10:16:48 +0000 (expires in 727 days) https://phabricator.wikimedia.org/T120662
[08:49:46] <wikibugs>	 Operations, Gerrit, Release-Engineering-Team (Development services), Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Migrate Gerrit to profile::java - https://phabricator.wikimedia.org/T264182 (MoritzMuehlenhoff) >>! In T264182#6511478, @hashar wrote: > We need the `dbg` package in...
[08:52:55] <wikibugs>	 Operations, Performance-Team, Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) The most logical suspect is the rollout of Varnish 6 {T263557}
[08:53:19] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] aptrepo: thirdparty/kubeadm-k8s-1-17: introduce helm3 package [puppet] - https://gerrit.wikimedia.org/r/631421 (https://phabricator.wikimedia.org/T264221) (owner: Arturo Borrero Gonzalez)
[08:54:04] <logmsgbot>	 !log dcausse@deploy1001 Started deploy [wikimedia/discovery/analytics@5713fb0]: Test stat1007 deploy
[08:54:07] <logmsgbot>	 !log dcausse@deploy1001 Finished deploy [wikimedia/discovery/analytics@5713fb0]: Test stat1007 deploy (duration: 00m 03s)
[08:54:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:02] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] add etherpad.discovery.wmnet, point to etherpad1002 [dns] - https://gerrit.wikimedia.org/r/631557 (owner: Dzahn)
[08:58:15] <wikibugs>	 Operations, Performance-Team, Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) [[ https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent?orgId=1 | It seems to affect North America before Europe ]], and the timing lines up with the ro...
[08:58:28] <wikibugs>	 Operations, Performance-Team, Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (Gilles) p:Triage→High
[08:59:54] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime
[08:59:55] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:59:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:59] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime
[09:00:00] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:00:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:05] <logmsgbot>	 !log root@cumin1001 START - Cookbook sre.hosts.downtime
[09:00:05] <logmsgbot>	 !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:00:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:20] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: aptrepo: add gpg key for baltocdn external repository [puppet] - https://gerrit.wikimedia.org/r/631711 (https://phabricator.wikimedia.org/T264221)
[09:01:40] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] aptrepo: add gpg key for baltocdn external repository [puppet] - https://gerrit.wikimedia.org/r/631711 (https://phabricator.wikimedia.org/T264221) (owner: Arturo Borrero Gonzalez)
[09:01:56] <wikibugs>	 (PS3) JMeybohm: lvs: Remove mathoid non-TLS endpoint 2/2 [puppet] - https://gerrit.wikimedia.org/r/629329 (https://phabricator.wikimedia.org/T255875)
[09:05:03] <hashar>	 !log gerrit: running garbage collector
[09:05:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:04] <hnowlan>	 !log bootstrapping restbase1029-b cassandra 
[09:07:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:00] <wikibugs>	 (CR) JMeybohm: [C: +2] lvs: Remove mathoid non-TLS endpoint 2/2 [puppet] - https://gerrit.wikimedia.org/r/629329 (https://phabricator.wikimedia.org/T255875) (owner: JMeybohm)
[09:08:21] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm
[09:08:21] <logmsgbot>	 !log jmm@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[09:08:21] <wikibugs>	 (PS3) JMeybohm: lvs: Remove zotero non-TLS endpoint 2/2 [puppet] - https://gerrit.wikimedia.org/r/629339 (https://phabricator.wikimedia.org/T255869)
[09:08:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:09:02] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm
[09:09:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:42] <wikibugs>	 (CR) JMeybohm: [C: +2] lvs: Remove zotero non-TLS endpoint 2/2 [puppet] - https://gerrit.wikimedia.org/r/629339 (https://phabricator.wikimedia.org/T255869) (owner: JMeybohm)
[09:10:44] <wikibugs>	 Operations, Machine Learning Platform, ORES, Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (Joe) Checking in to report that calls from OKAPI have stopped tonight. Thanks @RBrounley_WMF (and the team)!  So if we still see the starvatio...
[09:11:13] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: aptrepo: updates: fix baltocdn key id [puppet] - https://gerrit.wikimedia.org/r/631712 (https://phabricator.wikimedia.org/T264221)
[09:11:49] <arturo>	 !log added helm3 package to buster-wikimedia/thirdparty/kubeadm-k8s-1-17 (T264221)
[09:11:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:54] <stashbot>	 T264221: Upgrade the nginx ingress controller in Toolforge (and likely PAWS) - https://phabricator.wikimedia.org/T264221
[09:12:03] <wikibugs>	 (CR) Jbond: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/631522 (owner: Dzahn)
[09:12:23] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] aptrepo: updates: fix baltocdn key id [puppet] - https://gerrit.wikimedia.org/r/631712 (https://phabricator.wikimedia.org/T264221) (owner: Arturo Borrero Gonzalez)
[09:12:25] <jayme>	 !log running puppet on lvs servers - T255875 T255869
[09:12:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:12:31] <stashbot>	 T255875: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875
[09:12:32] <stashbot>	 T255869: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869
[09:14:41] <jayme>	 !log restarting pybal on lvs1016.eqiad.wmnet,lvs2010.codfw.wmnet - T255875 T255869
[09:14:44] <wikibugs>	 Operations, SRE-Access-Requests: Update public key for production shell for dedcode - https://phabricator.wikimedia.org/T264392 (Peachey88)
[09:14:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:52] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on cumin1001.eqiad.wmnet for hosts: ` ['an-worker1117.eqiad.wmnet'] ` The l...
[09:17:23] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.16:1969, 10.2.1.20:10042]) https://wikitech.wikimedia.org/wiki/PyBal
[09:17:32] <jayme>	 pybal is me
[09:17:49] <jayme>	 !log restarting pybal on lvs1015.eqiad.wmnet,lvs2009.codfw.wmnet - T255875 T255869
[09:17:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:55] <stashbot>	 T255875: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875
[09:17:56] <stashbot>	 T255869: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869
[09:18:14] <jayme>	 !log running ipvsadm -D -t 10.2.2.20:10042; ipvsadm -D -t 10.2.2.16:1969 on lvs1016.eqiad.wmnet,lvs1015.eqiad.wmnet - T255875 T255869
[09:18:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:01] <jayme>	 !log running ipvsadm -D -t 10.2.1.20:10042; ipvsadm -D -t 10.2.1.16:1969 on lvs2010.codfw.wmnet,lvs2009.codfw.wmnet - T255875 T255869
[09:19:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:19] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[09:20:28] <wikibugs>	 Operations, Wikimedia-Mailing-lists, I18n: Several unreadable mailing list descriptions (Mojibake) due to wrong charset encodings, should be Unicode - https://phabricator.wikimedia.org/T261031 (whym) In the case of [Japanese](https://lists.wikimedia.org/mailman/listinfo/wikija-l), the encoding set by...
[09:21:19] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -2] "Actually, this won't work. etherpad can't be run multi instance currently. The reason for that is that the ueberdb, the component that abs" [puppet] - https://gerrit.wikimedia.org/r/631560 (owner: Dzahn)
[09:22:13] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes, Patch-For-Review: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869 (JMeybohm)
[09:22:18] <wikibugs>	 Operations, Prod-Kubernetes, serviceops, Kubernetes: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 (JMeybohm)
[09:22:23] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -2] "Same reasoning as https://gerrit.wikimedia.org/r/c/operations/puppet/+/631560. We can't have 2 etherpad instances running in parallel, it " [dns] - https://gerrit.wikimedia.org/r/631559 (owner: Dzahn)
[09:22:38] <wikibugs>	 Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (JMeybohm)
[09:27:00] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[09:27:01] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:27:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:15] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2106 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12896 and previous config saved to /var/cache/conftool/dbconfig/20201002-092715-kormat.json
[09:27:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:21] <stashbot>	 T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[09:28:14] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime
[09:28:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:35] <wikibugs>	 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10JMeybohm)
[09:30:10] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[09:30:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:16] <wikibugs>	 (03PS1) 10Muehlenhoff: Add DNS entry ldap-replica1001.wikimedia.org based on what the makevm cookbook assigned [dns] - 10https://gerrit.wikimedia.org/r/631713 (https://phabricator.wikimedia.org/T264390)
[09:35:43] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add DNS entry ldap-replica1001.wikimedia.org based on what the makevm cookbook assigned [dns] - 10https://gerrit.wikimedia.org/r/631713 (https://phabricator.wikimedia.org/T264390) (owner: 10Muehlenhoff)
[09:47:25] <icinga-wm>	 PROBLEM - SSH on ms-be2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:48:07] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[09:48:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] am: ensure ends_at is sent with a timezone [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/631495 (owner: 10Filippo Giunchedi)
[09:48:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] am: ensure ends_at is sent with a timezone [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/631495 (owner: 10Filippo Giunchedi)
[09:48:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] am: catch and report ApiException [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/631499 (owner: 10Filippo Giunchedi)
[09:48:47] <icinga-wm>	 RECOVERY - SSH on ms-be2020 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:48:49] <icinga-wm>	 PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:51:29] <wikibugs>	 (03PS1) 10Kormat: dbutil: Look at the current user's ~/.my.cnf [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631717
[09:51:29] <icinga-wm>	 RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:52:23] <wikibugs>	 (03PS2) 10Kormat: dbutil: Look at the current user's ~/.my.cnf [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631717
[09:56:08] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers
[09:56:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:19] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey)
[09:58:00] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0)
[09:58:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:23] <wikibugs>	 10Operations: makevm cookbook fails get_vm() call - https://phabricator.wikimedia.org/T264409 (10MoritzMuehlenhoff)
[09:59:44] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey) an-worker1117 is fixed, it was preferring to PXE boot as opposed to boot from disk, so the loop was endless.
[10:06:47] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.ganeti.makevm
[10:06:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:14] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 33%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12897 and previous config saved to /var/cache/conftool/dbconfig/20201002-101313-kormat.json
[10:13:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:19] <stashbot>	 T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[10:13:20] <wikibugs>	 (03PS1) 10JMeybohm: deployment_server::helmfile: Allow default secrets per environment [puppet] - 10https://gerrit.wikimedia.org/r/631720 (https://phabricator.wikimedia.org/T260917)
[10:14:22] <wikibugs>	 (03CR) 10jenkins-bot: [V: 04-1] deployment_server::helmfile: Allow default secrets per environment [puppet] - 10https://gerrit.wikimedia.org/r/631720 (https://phabricator.wikimedia.org/T260917) (owner: 10JMeybohm)
[10:14:37] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] dbutil: Look at the current user's ~/.my.cnf [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631717 (owner: 10Kormat)
[10:15:41] <wikibugs>	 (03Merged) 10jenkins-bot: dbutil: Look at the current user's ~/.my.cnf [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631717 (owner: 10Kormat)
[10:16:22] <wikibugs>	 (03PS2) 10JMeybohm: deployment_server::helmfile: Allow default secrets per environment [puppet] - 10https://gerrit.wikimedia.org/r/631720 (https://phabricator.wikimedia.org/T260917)
[10:16:24] <wikibugs>	 (03PS1) 10Muehlenhoff: Add DNS entry ldap-replica1001.wikimedia.org based on what the makevm cookbook assigned [dns] - 10https://gerrit.wikimedia.org/r/631721 (https://phabricator.wikimedia.org/T264390)
[10:16:53] <wikibugs>	 (03PS2) 10Muehlenhoff: Add DNS entry ldap-replica1002.wikimedia.org based on what the makevm cookbook assigned [dns] - 10https://gerrit.wikimedia.org/r/631721 (https://phabricator.wikimedia.org/T264390)
[10:20:43] <wikibugs>	 (03PS1) 10Kormat: WMFReplication: Look at the current user's ~/.my.cnf [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631722
[10:21:57] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] WMFReplication: Look at the current user's ~/.my.cnf [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631722 (owner: 10Kormat)
[10:22:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add DNS entry ldap-replica1002.wikimedia.org based on what the makevm cookbook assigned [dns] - 10https://gerrit.wikimedia.org/r/631721 (https://phabricator.wikimedia.org/T264390) (owner: 10Muehlenhoff)
[10:22:58] <wikibugs>	 (03Merged) 10jenkins-bot: WMFReplication: Look at the current user's ~/.my.cnf [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/631722 (owner: 10Kormat)
[10:23:52] <wikibugs>	 (03PS1) 10JMeybohm: Add dummy default secrets [labs/private] - 10https://gerrit.wikimedia.org/r/631724 (https://phabricator.wikimedia.org/T260917)
[10:23:57] <icinga-wm>	 PROBLEM - SSH on ms-be2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:24:19] <wikibugs>	 (03PS1) 10Klausman: aptrepo: Add rocm 3.8 packages to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/631725 (https://phabricator.wikimedia.org/T264408)
[10:24:30] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add dummy default secrets [labs/private] - 10https://gerrit.wikimedia.org/r/631724 (https://phabricator.wikimedia.org/T260917) (owner: 10JMeybohm)
[10:26:05] <wikibugs>	 (03CR) 10Kosta Harlan: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/631237 (owner: 10Paladox)
[10:26:53] <icinga-wm>	 RECOVERY - SSH on ms-be2017 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:28:17] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 67%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12898 and previous config saved to /var/cache/conftool/dbconfig/20201002-102817-kormat.json
[10:28:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:23] <stashbot>	 T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[10:36:06] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[10:36:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:47] <wikibugs>	 (03PS3) 10JMeybohm: deployment_server::helmfile: Allow default secrets per environment [puppet] - 10https://gerrit.wikimedia.org/r/631720 (https://phabricator.wikimedia.org/T260917)
[10:38:57] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:40:25] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm
[10:40:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:59] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] mtail: upgrade mtail across the fleet to 3.0.0~rc35-3+wmf3 [puppet] - 10https://gerrit.wikimedia.org/r/631501 (https://phabricator.wikimedia.org/T263728) (owner: 10Cwhite)
[10:43:21] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2106 (re)pooling @ 100%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12899 and previous config saved to /var/cache/conftool/dbconfig/20201002-104320-kormat.json
[10:43:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:27] <stashbot>	 T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[10:44:43] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[10:44:44] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:44:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:54] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2110 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12900 and previous config saved to /var/cache/conftool/dbconfig/20201002-104453-kormat.json
[10:44:57] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (10Gilles) I've improved the per-DC/host dashboard: https://grafana.wikimedia.org/d/M7xQ_BeWk/response-time-by-host  The change is clearly visible on Esams on 2020-09-29 and on Eqsin...
[10:44:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:19] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (10Gilles) a:03ema
[10:46:16] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (10Gilles) @ema as discussed on IRC, it seems sensible to roll back the change on at least one host on Esams for a few days next week to verify that this is what's causing the issue.
[10:46:25] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:47:24] <logmsgbot>	 !log jmm@cumin2001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97)
[10:47:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:33] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm
[10:47:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:57:15] <logmsgbot>	 !log jmm@cumin2001 END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97)
[10:57:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:56] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm
[10:59:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:05:27] <wikibugs>	 (03PS1) 10Muehlenhoff: Add DNS entry ldap-replica2004.wikimedia.org based on what the makevm cookbook assigned [dns] - 10https://gerrit.wikimedia.org/r/631750 (https://phabricator.wikimedia.org/T264390)
[11:10:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add DNS entry ldap-replica2004.wikimedia.org based on what the makevm cookbook assigned [dns] - 10https://gerrit.wikimedia.org/r/631750 (https://phabricator.wikimedia.org/T264390) (owner: 10Muehlenhoff)
[11:12:13] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#6512038, @Gilles wrote: > @ema as discussed on IRC, it seems sensible to roll back the change on at least one host on Esams for a few days next week to verify t...
[11:22:25] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[11:22:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:54] <wikibugs>	 10Operations: makevm cookbook fails get_vm() call - https://phabricator.wikimedia.org/T264409 (10MoritzMuehlenhoff) So, I've created three VMs and this happened in one out of three cases only.
[11:33:33] <wikibugs>	 (03PS1) 10Muehlenhoff: Add DHCP entries for ldap-replica100[12], ldap-replica2004 [puppet] - 10https://gerrit.wikimedia.org/r/631755 (https://phabricator.wikimedia.org/T264390)
[11:35:07] <wikibugs>	 (03PS2) 10Muehlenhoff: Add DHCP entries for ldap-replica100[12], ldap-replica2004 [puppet] - 10https://gerrit.wikimedia.org/r/631755 (https://phabricator.wikimedia.org/T264390)
[11:42:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add DHCP entries for ldap-replica100[12], ldap-replica2004 [puppet] - 10https://gerrit.wikimedia.org/r/631755 (https://phabricator.wikimedia.org/T264390) (owner: 10Muehlenhoff)
[11:53:22] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 75%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12901 and previous config saved to /var/cache/conftool/dbconfig/20201002-115322-kormat.json
[11:53:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:45] <stashbot>	 T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[12:02:23] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.64.16.181:9042 on restbase1029 is OK: TCP OK - 0.000 second response time on 10.64.16.181 port 9042 https://phabricator.wikimedia.org/T93886
[12:03:34] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: move deployment-prep swift settings off Horizon [puppet] - 10https://gerrit.wikimedia.org/r/631758
[12:04:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Once this is merged I'll remove the settings from Horizon hiera project/prefix/instance (!) puppet" [puppet] - 10https://gerrit.wikimedia.org/r/631758 (owner: 10Filippo Giunchedi)
[12:05:19] <hnowlan>	 !log bootstrapping restbase1029-c 
[12:05:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:51] <icinga-wm>	 RECOVERY - cassandra-c service on restbase1029 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:06:35] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.64.16.182:7001 on restbase1029 is OK: SSL OK - Certificate restbase1029-c valid until 2022-09-29 10:16:51 +0000 (expires in 726 days) https://phabricator.wikimedia.org/T120662
[12:08:26] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2110 (re)pooling @ 100%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12902 and previous config saved to /var/cache/conftool/dbconfig/20201002-120825-kormat.json
[12:08:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:38] <stashbot>	 T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[12:14:06] <wikibugs>	 10Operations, 10SRE-swift-storage, 10User-fgiunchedi: ms-be2017 slower than the rest of the cluster while rebalancing - https://phabricator.wikimedia.org/T264270 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Tentatively resolving since a reboot seems to indeed bring speed back up, will reopen if/when...
[12:14:08] <wikibugs>	 10Operations, 10SRE-swift-storage, 10Patch-For-Review, 10User-fgiunchedi: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 (10fgiunchedi)
[12:18:19] <logmsgbot>	 !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[12:18:19] <logmsgbot>	 !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:18:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:31] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2140 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12903 and previous config saved to /var/cache/conftool/dbconfig/20201002-121830-kormat.json
[12:18:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:35] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 35487824 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:23:18] <stashbot>	 T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[12:25:57] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 101856 and 80 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:26:25] <effie>	 !log disable puppet on mwdebug1001
[12:26:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:21] <icinga-wm>	 PROBLEM - Memcached on mwdebug1001 is CRITICAL: connect to address 10.64.32.123 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[12:35:45] <icinga-wm>	 RECOVERY - Memcached on mwdebug1001 is OK: TCP OK - 0.010 second response time on 10.64.32.123 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[12:42:35] <icinga-wm>	 PROBLEM - Memcached on mwdebug1001 is CRITICAL: connect to address 10.64.32.123 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[12:44:50] <effie>	 that is me ^
[12:45:05] <effie>	 sorry 
[12:45:34] <effie>	 another alert is coming, sorry for the noise
[12:50:09] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:52:24] <wikibugs>	 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Pchelolo) In this case the queue is executing the job multiple times because these are different jobs. Th...
[12:52:49] <icinga-wm>	 RECOVERY - Memcached on mwdebug1001 is OK: TCP OK - 0.001 second response time on 10.64.32.123 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[12:54:38] <wikibugs>	 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Samwalton9) It seems like the Signpost article was a user error, per the discussion at https://en.wikiped...
[12:57:28] <wikibugs>	 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Pchelolo) > The Books & Bytes example I mentioned further up, however, was definitely only sent once.  Un...
[12:58:19] <wikibugs>	 (03PS1) 10Elukey: Set an-worker110[6-9] as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/631764 (https://phabricator.wikimedia.org/T255140)
[12:58:30] <wikibugs>	 (03CR) 10Muehlenhoff: "Something to fix before this gets enabled in general: This currently breaks the Icinga check for free disk space:" [puppet] - 10https://gerrit.wikimedia.org/r/630889 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond)
[13:00:12] <moritzm>	 !log installing Linux 4.19.146 on Buster updates (from latest Buster point release, at this point only installing the updates, no reboots (yet))
[13:00:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:40] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (10BBlack) Just throwing in some random points/counterpoints to ponder:  * It's possible it does take more than a day or three for the frontend caches to settle into an optimal patter...
[13:04:45] <icinga-wm>	 PROBLEM - Memcached on mwdebug1001 is CRITICAL: connect to address 10.64.32.123 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[13:06:03] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Set an-worker110[6-9] as Hadoop workers [puppet] - 10https://gerrit.wikimedia.org/r/631764 (https://phabricator.wikimedia.org/T255140) (owner: 10Elukey)
[13:06:25] <icinga-wm>	 RECOVERY - Memcached on mwdebug1001 is OK: TCP OK - 0.002 second response time on 10.64.32.123 port 11210 https://wikitech.wikimedia.org/wiki/Memcached
[13:09:07] <icinga-wm>	 PROBLEM - etherpad_up reduced availability on alert1001 is CRITICAL: 0 le 0.8 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:09:35] <icinga-wm>	 PROBLEM - etherpad.wikimedia.org HTTP on etherpad1002 is CRITICAL: connect to address 10.64.32.178 and port 9001: Connection refused https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[13:09:55] <wikibugs>	 (03PS1) 10Marostegui: labsdb: Change weights on labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/631768
[13:10:07] <icinga-wm>	 PROBLEM - etherpad_lite_process_running on etherpad1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[13:12:18] <wikibugs>	 10Operations, 10Performance-Team, 10Traffic: Elevated latency starting 2020-09-28 - https://phabricator.wikimedia.org/T264398 (10BBlack) Eh maybe a few more to think about too:  * The train this week caused a fair amount of churn with  the rollout + rollback of 1.36.0-wmf.11.  Is there any chance the train i...
[13:13:07] <wikibugs>	 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10jijiki) **What happens when onhost memcached is unavailable? ** https://phabr...
[13:13:47] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] labsdb: Change weights on labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/631768 (owner: 10Marostegui)
[13:14:45] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (10jijiki) **What happens when onhost memcached is unavailable? ** https://phabricator.wikimedia.org/T244340#6211682  @elukey  @aaron   With the con...
[13:15:35] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:20:42] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 75%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12904 and previous config saved to /var/cache/conftool/dbconfig/20201002-132042-kormat.json
[13:20:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:49] <stashbot>	 T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[13:25:54] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:27:38] <icinga-wm>	 PROBLEM - DPKG on etherpad1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[13:32:58] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:34:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:35:46] <logmsgbot>	 !log kormat@cumin1001 dbctl commit (dc=all): 'db2140 (re)pooling @ 100%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12905 and previous config saved to /var/cache/conftool/dbconfig/20201002-133545-kormat.json
[13:35:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:52] <stashbot>	 T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[13:36:17] <wikibugs>	 (03PS1) 10Muehlenhoff: Add Cumin aliases for Thanos [puppet] - 10https://gerrit.wikimedia.org/r/631773
[13:37:53] <Urbanecm>	 !log Create bot_passwords table at fishbowl wikis (T258356)
[13:37:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:59] <stashbot>	 T258356: Allow users at all private/fishbowl wikis to use botpasswords - https://phabricator.wikimedia.org/T258356
[13:38:44] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3062 is OK: HTTP OK: HTTP/1.0 200 OK - 23579 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:39:44] <wikibugs>	 (03PS1) 10Ppchelko: Force local short descriptions for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631775 (https://phabricator.wikimedia.org/T263493)
[13:46:55] <wikibugs>	 (03CR) 10Elukey: "pcc looks good https://puppet-compiler.wmflabs.org/compiler1003/25623/" [puppet] - 10https://gerrit.wikimedia.org/r/629409 (https://phabricator.wikimedia.org/T251777) (owner: 10Fdans)
[13:51:36] <icinga-wm>	 PROBLEM - SSH on ms-be2020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:52:30] <icinga-wm>	 RECOVERY - etherpad_lite_process_running on etherpad1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/node /usr/share/etherpad-lite/node_modules/ep_etherpad-lite/node/server.js https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[13:52:40] <icinga-wm>	 RECOVERY - etherpad.wikimedia.org HTTP on etherpad1002 is OK: HTTP OK: HTTP/1.1 200 OK - 8996 bytes in 0.696 second response time https://wikitech.wikimedia.org/wiki/Etherpad.wikimedia.org
[13:55:06] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:56:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:56:24] <icinga-wm>	 PROBLEM - Disk space on ping3001 is CRITICAL: DISK CRITICAL - free space: / 72 MB (2% inode=69%): /tmp 72 MB (2% inode=69%): /var/tmp 72 MB (2% inode=69%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping3001&var-datasource=esams+prometheus/ops
[13:56:35] <wikibugs>	 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10jijiki) >>! In T244340#6197415, @Krinkle wrote: > If the local-memcached's bl...
[13:57:44] <icinga-wm>	 RECOVERY - DPKG on etherpad1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[13:57:50] <icinga-wm>	 RECOVERY - SSH on ms-be2020 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:00:43] <wikibugs>	 10Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (10MoritzMuehlenhoff)
[14:03:14] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:04:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:04:42] <icinga-wm>	 PROBLEM - Disk space on ping2001 is CRITICAL: DISK CRITICAL - free space: / 64 MB (2% inode=68%): /tmp 64 MB (2% inode=68%): /var/tmp 64 MB (2% inode=68%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping2001&var-datasource=codfw+prometheus/ops
[14:07:45] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Add Cumin aliases for Thanos [puppet] - 10https://gerrit.wikimedia.org/r/631773 (owner: 10Muehlenhoff)
[14:08:13] <moritzm>	 !log purging some unused kernels on ping* (these only have 3GB "disks")
[14:08:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:25] <effie>	 !log enable puppet on mwdebug1001
[14:08:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:22] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/631782
[14:15:57] <wikibugs>	 (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/631783
[14:16:56] <icinga-wm>	 RECOVERY - Disk space on ping3001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping3001&var-datasource=esams+prometheus/ops
[14:17:14] <wikibugs>	 (03PS2) 10JMeybohm: envoy: New upstream version 1.15.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/631383 (https://phabricator.wikimedia.org/T264157)
[14:18:52] <icinga-wm>	 RECOVERY - Disk space on ping2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ping2001&var-datasource=codfw+prometheus/ops
[14:19:48] <moritzm>	 !log installing LLVM 7 bugfix updates from Buster point release
[14:19:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:06] <wikibugs>	 (03PS1) 10CDanis: /home/cdanis dotfiles updates [puppet] - 10https://gerrit.wikimedia.org/r/631785
[14:21:06] <icinga-wm>	 RECOVERY - etherpad_up reduced availability on alert1001 is OK: (C)0.8 le (W)0.9 le 1 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_exporters_%22up%22_metrics_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:22:01] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] /home/cdanis dotfiles updates [puppet] - 10https://gerrit.wikimedia.org/r/631785 (owner: 10CDanis)
[14:23:56] <icinga-wm>	 RECOVERY - Check the last execution of search-drop-query-clicks on stat1007 is OK: OK: Status of the systemd unit search-drop-query-clicks https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:24:26] <icinga-wm>	 RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:58] <wikibugs>	 (03PS2) 10Klausman: aptrepo: Add rocm 3.8 packages to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/631725 (https://phabricator.wikimedia.org/T264408)
[14:36:38] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] aptrepo: Add rocm 3.8 packages to reprepro [puppet] - 10https://gerrit.wikimedia.org/r/631725 (https://phabricator.wikimedia.org/T264408) (owner: 10Klausman)
[14:37:53] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] foundation.wikimedia.org: Add .well-known/matrix/server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631530 (https://phabricator.wikimedia.org/T261531) (owner: 10Krinkle)
[14:37:59] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] docroot: expand foundation.wikimedia.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631529 (https://phabricator.wikimedia.org/T261531) (owner: 10Krinkle)
[14:38:46] <wikibugs>	 (03Merged) 10jenkins-bot: docroot: expand foundation.wikimedia.org docroot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631529 (https://phabricator.wikimedia.org/T261531) (owner: 10Krinkle)
[14:38:50] <wikibugs>	 (03Merged) 10jenkins-bot: foundation.wikimedia.org: Add .well-known/matrix/server [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631530 (https://phabricator.wikimedia.org/T261531) (owner: 10Krinkle)
[14:43:12] <cdanis>	 ^ changes look good on mwdebug2002, didn't break foundation.wm.o, and the .well-known file is served
[14:45:18] <logmsgbot>	 !log cdanis@deploy1001 Synchronized docroot/wikimediafoundation.org: Separate foundation.wikimedia.org docroot & add .well-known/matrix/server T261531 4573776bd 2fb4c20ae (duration: 01m 01s)
[14:45:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:45:23] <stashbot>	 T261531: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531
[14:46:11] <wikibugs>	 (03PS1) 10Herron: admin: update ssh key for user dedcode [puppet] - 10https://gerrit.wikimedia.org/r/631788 (https://phabricator.wikimedia.org/T264392)
[14:46:34] <wikibugs>	 (03PS1) 10Hnowlan: WIP: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966)
[14:47:25] <wikibugs>	 10Operations, 10DNS, 10Matrix, 10Traffic, 10Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (10CDanis) This is live now @bcampbell -- have Element give it a shot and let us know?
[14:47:40] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan)
[14:49:15] <elukey>	 nice! --^
[14:51:15] <wikibugs>	 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10CDanis)
[14:51:30] <wikibugs>	 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10CDanis)
[14:53:43] <wikibugs>	 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF) p:05Triage→03High
[14:54:03] <wikibugs>	 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF) >>! In T264364#6510505, @Johan wrote: > Whoever ends up handling this on the community side:: I'll add this to next week's Tec...
[14:55:24] <icinga-wm>	 RECOVERY - cassandra-c CQL 10.64.16.182:9042 on restbase1029 is OK: TCP OK - 0.000 second response time on 10.64.16.182 port 9042 https://phabricator.wikimedia.org/T93886
[14:57:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_restbase_esams,swagger_check_wikifeeds_eqiad} site={eqiad,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:58:35] <wikibugs>	 (03PS1) 10Elukey: Import the action module from spicerack [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905)
[15:00:50] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:01:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Import the action module from spicerack [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey)
[15:04:14] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:05:15] <wikibugs>	 (03PS1) 10Clarakosi: changeprop: Add x-request-id header to jobqueue requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/631794
[15:05:54] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:06:01] <wikibugs>	 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF)
[15:08:04] <wikibugs>	 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF)
[15:12:12] <wikibugs>	 (03CR) 10Ppchelko: [C: 04-1] "You need to bump the chart version in Chart.yaml" [deployment-charts] - 10https://gerrit.wikimedia.org/r/631794 (owner: 10Clarakosi)
[15:13:02] <wikibugs>	 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10calbon) I just had to do another service restart.
[15:14:07] <wikibugs>	 (03PS2) 10Clarakosi: changeprop: Add x-request-id header to jobqueue requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/631794
[15:14:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:14:35] <wikibugs>	 (03CR) 10Ppchelko: [C: 03+1] "Let's deploy on Mon" [deployment-charts] - 10https://gerrit.wikimedia.org/r/631794 (owner: 10Clarakosi)
[15:16:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:19:39] <wikibugs>	 (03PS2) 10Hnowlan: WIP: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966)
[15:20:47] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: cassandra: use profile::java [puppet] - 10https://gerrit.wikimedia.org/r/631789 (https://phabricator.wikimedia.org/T261966) (owner: 10Hnowlan)
[15:21:07] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] bastionhost::pop: remove tftp from bastions [puppet] - 10https://gerrit.wikimedia.org/r/629496 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn)
[15:23:39] <wikibugs>	 (03CR) 10Dzahn: "This removed the ferm firewall holes for TFTP (but service was already stopped anyways)" [puppet] - 10https://gerrit.wikimedia.org/r/629496 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn)
[15:25:38] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=ores,name=eqiad
[15:25:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:40] <wikibugs>	 (03CR) 10Elukey: "@Volans: I am a bit confused about why prospector fails here and not for spicerack, so let me know if you have any ideas before I get too " [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey)
[15:28:08] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.64.48.234:7001 on restbase1030 is OK: SSL OK - Certificate restbase1030-a valid until 2022-09-29 10:16:53 +0000 (expires in 726 days) https://phabricator.wikimedia.org/T120662
[15:28:11] <hnowlan>	 !log bootstrapping restbase1030-a
[15:28:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:46] <icinga-wm>	 RECOVERY - cassandra-a service on restbase1030 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:32:22] <wikibugs>	 10Operations, 10DNS, 10Matrix, 10Traffic, 10Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (10bcampbell) Thanks all. It's working. https://federationtester.matrix.org/#foundation.wikimedia.org
[15:33:13] <wikibugs>	 10Operations, 10DNS, 10Matrix, 10Traffic, 10Patch-For-Review: Configure subdomain foundation.wikimedia.org to enable *:foundation.wikimedia.org Matrix user IDs - https://phabricator.wikimedia.org/T261531 (10CDanis) 05Open→03Resolved 🎉
[15:39:17] <_joe_>	 !log restarting redis on rdb2003, instance 6380
[15:39:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:25] <wikibugs>	 (03CR) 10Dzahn: "> Patch Set 1: Code-Review-2" [puppet] - 10https://gerrit.wikimedia.org/r/631560 (owner: 10Dzahn)
[15:42:50] <wikibugs>	 (03Abandoned) 10Dzahn: add etherpad-next.discovery, point to etherpad1003 [dns] - 10https://gerrit.wikimedia.org/r/631559 (owner: 10Dzahn)
[15:45:47] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "I'm fine with either and don't have an opinion on this. Let me know once there is consensus one way or the other for merging." [puppet] - 10https://gerrit.wikimedia.org/r/631237 (owner: 10Paladox)
[15:51:13] <wikibugs>	 (03PS1) 10JMeybohm: profile::envoy::builder: fix repos in chroot [puppet] - 10https://gerrit.wikimedia.org/r/631802
[15:54:25] <wikibugs>	 (03PS2) 10Elukey: Import the action module from spicerack [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905)
[15:54:36] <wikibugs>	 (03CR) 10Cwhite: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/629653 (https://phabricator.wikimedia.org/T263727) (owner: 10Giuseppe Lavagetto)
[15:55:17] <wikibugs>	 (03CR) 10Elukey: "Interesting, I swapped "setup" with "setup_method" and it worked." [software/pywmflib] - 10https://gerrit.wikimedia.org/r/631792 (https://phabricator.wikimedia.org/T257905) (owner: 10Elukey)
[15:57:33] <wikibugs>	 10Operations, 10Machine Learning Platform, 10ORES, 10Okapi, and 3 others: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10Joe) Small status update:  in order to grant everyone a quieter weekend (hopefully!), we've repooled eqiad and raised manually the max client...
[16:05:59] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] profile::envoy::builder: fix repos in chroot [puppet] - 10https://gerrit.wikimedia.org/r/631802 (owner: 10JMeybohm)
[16:10:34] <wikibugs>	 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Elitre) Thanks everyone for your help here.
[16:11:27] <wikibugs>	 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Elitre) Thanks for the shoutout in All Staff BTW!
[16:11:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:13:10] <wikibugs>	 (03PS2) 10JMeybohm: profile::envoy::builder: fix repos in chroot [puppet] - 10https://gerrit.wikimedia.org/r/631802
[16:13:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:15:13] <wikibugs>	 (03CR) 10Dzahn: "very good. imho nothing should be in Horizon and everything in the repo. for this reason." [puppet] - 10https://gerrit.wikimedia.org/r/631758 (owner: 10Filippo Giunchedi)
[16:23:32] <wikibugs>	 (03CR) 10Dzahn: "I am seeing more things like "swift::params::account_keys" in project puppet but they are not in this change yet?" [puppet] - 10https://gerrit.wikimedia.org/r/631758 (owner: 10Filippo Giunchedi)
[16:25:30] <wikibugs>	 (03CR) 10Dzahn: "But I also don't see a prefix puppet for swift hosts.. were these on individual instances?" [puppet] - 10https://gerrit.wikimedia.org/r/631758 (owner: 10Filippo Giunchedi)
[16:27:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] add etherpad.discovery.wmnet, point to etherpad1002 [dns] - 10https://gerrit.wikimedia.org/r/631557 (owner: 10Dzahn)
[16:28:02] <wikibugs>	 (03PS2) 10Dzahn: add etherpad.discovery.wmnet, point to etherpad1002 [dns] - 10https://gerrit.wikimedia.org/r/631557
[16:32:11] <wikibugs>	 (03PS2) 10Dzahn: ATS/Etherpad: replace backend host name with discovery record [puppet] - 10https://gerrit.wikimedia.org/r/631555
[16:39:33] <wikibugs>	 10Operations, 10SRE-tools: makevm cookbook fails get_vm() call - https://phabricator.wikimedia.org/T264409 (10herron) p:05Triage→03Medium
[16:42:30] <wikibugs>	 10Operations, 10SRE-tools: makevm cookbook fails get_vm() call - https://phabricator.wikimedia.org/T264409 (10Volans) 05Open→03Invalid Nothing to do here, that is not a failure, the cookbook is just polling to get the created VM. The `[1/20, retrying in 3.00s]` is the first call that failed, the second one...
[16:43:57] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Change urbanecm's SSH production key - https://phabricator.wikimedia.org/T264345 (10herron) p:05Triage→03Medium
[16:46:31] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] profile::envoy::builder: fix repos in chroot [puppet] - 10https://gerrit.wikimedia.org/r/631802 (owner: 10JMeybohm)
[16:49:03] <effie>	 !log disable puppet on mw2271 and briefly depool it
[16:49:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:51] <wikibugs>	 (03PS1) 10Ssingh: dnsdist: temporarily disable validate_cmd for dnsdist.conf [puppet] - 10https://gerrit.wikimedia.org/r/631827 (https://phabricator.wikimedia.org/T263789)
[16:56:14] <icinga-wm>	 PROBLEM - Memcached on mw2271 is CRITICAL: connect to address 10.192.48.93 and port 11210: Connection refused https://wikitech.wikimedia.org/wiki/Memcached
[16:57:14] <wikibugs>	 (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1003/25628/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/631827 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh)
[17:03:24] <icinga-wm>	 PROBLEM - Check systemd state on mw2271 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:10:14] <icinga-wm>	 RECOVERY - Check systemd state on mw2271 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:11:12] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:13:55] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] ATS/Etherpad: replace backend host name with discovery record [puppet] - 10https://gerrit.wikimedia.org/r/631555 (owner: 10Dzahn)
[17:15:13] <wikibugs>	 10Operations, 10Maps: Migrate maps to Buster - https://phabricator.wikimedia.org/T264292 (10herron) p:05Triage→03Medium
[17:15:45] <wikibugs>	 10Operations, 10Data-Persistence-Backup, 10Goal: Track all directly-owned SRE datasets into the new inventory system - https://phabricator.wikimedia.org/T264275 (10herron) p:05Triage→03Medium
[17:15:55] <mutante>	 !log submitted puppet refactoring change on maps servers
[17:15:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:16:02] <wikibugs>	 10Operations, 10Data-Persistence-Backup, 10Epic, 10Goal: Plan WMF infrastructure for 100% coverage of data recovery - https://phabricator.wikimedia.org/T264272 (10herron) p:05Triage→03Medium
[17:16:19] <wikibugs>	 10Operations, 10Wikidata, 10Wikidata-Query-Service: Migrate WDQS to profile::java - https://phabricator.wikimedia.org/T264181 (10herron) p:05Triage→03Medium
[17:16:47] <wikibugs>	 10Operations, 10Puppet: Switch puppetdb to profile::java - https://phabricator.wikimedia.org/T264178 (10herron) p:05Triage→03Medium
[17:17:18] <wikibugs>	 10Operations, 10Patch-For-Review: Switch cergen to profile::java - https://phabricator.wikimedia.org/T264177 (10herron) p:05Triage→03Medium
[17:17:36] <wikibugs>	 (03CR) 10Dzahn: "confirmed NOOP on maps2004" [puppet] - 10https://gerrit.wikimedia.org/r/629439 (owner: 10Dzahn)
[17:17:56] <wikibugs>	 10Operations, 10Analytics: Switch Zookeeper to profile::java - https://phabricator.wikimedia.org/T264176 (10herron) p:05Triage→03Medium
[17:18:21] <wikibugs>	 10Operations: Migrate remaining services using Java to profile::java - https://phabricator.wikimedia.org/T264174 (10herron) p:05Triage→03Medium
[17:18:36] <wikibugs>	 10Operations, 10Data-Persistence-Backup, 10Goal: Define a methodology to track WMF services backup requirements - https://phabricator.wikimedia.org/T264274 (10herron) p:05Triage→03Medium
[17:18:59] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Update public key for production shell for dedcode - https://phabricator.wikimedia.org/T264392 (10herron) p:05Triage→03High
[17:19:10] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:20:52] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:21:31] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10Sbailey) Ok, there must be some other way to verify security. My previous SSH key is gone, I need a new one installed so I can log in to scandium somehow.
[17:22:43] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] admin: update ssh key for user dedcode [puppet] - 10https://gerrit.wikimedia.org/r/631788 (https://phabricator.wikimedia.org/T264392) (owner: 10Herron)
[17:32:13] <wikibugs>	 (03CR) 10Herron: [C: 03+2] admin: update ssh key for user dedcode [puppet] - 10https://gerrit.wikimedia.org/r/631788 (https://phabricator.wikimedia.org/T264392) (owner: 10Herron)
[17:40:58] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10wiki_willy) After escalating to technical account rep, replacement CPUs are being shipped by Dell, and can wait to be replaced when Papaul is back from vacation.
[17:43:04] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:46:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:48:08] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10herron) Hi @Sbailey I've reached out to you via google chat and by email to verify.  Thanks!
[17:53:35] <wikibugs>	 (03CR) 10Dzahn: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/631522 (owner: 10Dzahn)
[18:04:02] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:07:51] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Update public key for production shell for dedcode - https://phabricator.wikimedia.org/T264392 (10herron) 05Open→03Resolved This has been done, I'll transition to resolved now
[18:11:03] <wikibugs>	 (03Abandoned) 10Dzahn: site: add etherpad role to etherpad1003 [puppet] - 10https://gerrit.wikimedia.org/r/631560 (owner: 10Dzahn)
[18:14:54] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@da6a098]: oozie: query_clicks_hourly needs to wait on codfw events
[18:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:16:55] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@da6a098]: oozie: query_clicks_hourly needs to wait on codfw events (duration: 02m 01s)
[18:16:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:20:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:22:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:27:31] <effie>	 !log enable puppet on mw2271
[18:27:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:53] <wikibugs>	 (03PS1) 10Dzahn: remove etherpad1003 from site and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/631842
[18:35:01] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission
[18:35:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:53:38] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:57:34] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.64.48.234:9042 on restbase1030 is OK: TCP OK - 0.000 second response time on 10.64.48.234 port 9042 https://phabricator.wikimedia.org/T93886
[18:59:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_wikifeeds_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:00:26] <wikibugs>	 (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/631848
[19:01:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:04:52] <wikibugs>	 (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/631848
[19:05:35] <wikibugs>	 (03PS1) 10Razzi: oozie: use admin groups to determine admin access [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660)
[19:08:36] <wikibugs>	 (03CR) 10Razzi: "Catalog compiler: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/25634/console" [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi)
[19:09:02] <wikibugs>	 (03CR) 10Dzahn: "@Razzi Hi, are you ok with this? would be nice if we can maybe merge this before the oozie changes you are working on, will probably requi" [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn)
[19:10:18] <wikibugs>	 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10Krinkle) Short summary of IRC convo: Per [doc](https://docs.google.com/docume...
[19:12:02] <wikibugs>	 (03CR) 10Dzahn: "this shows how it's no difference on an oozie::server https://puppet-compiler.wmflabs.org/compiler1002/25636/an-coord1001.eqiad.wmnet/inde" [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn)
[19:14:33] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[19:14:38] <wikibugs>	 10Operations, 10vm-requests: EQIAD: 1 VM request for etherpad - https://phabricator.wikimedia.org/T101492 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `etherpad1003.eqiad.wmnet` - etherpad1003.eqiad.wmnet (**PASS**)   - Downtimed host on Icinga   - Found Ganeti...
[19:14:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:42] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "and here is the same for some oozie::clients https://puppet-compiler.wmflabs.org/compiler1003/25637/" [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn)
[19:17:20] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] remove etherpad1003 from site and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/631842 (owner: 10Dzahn)
[19:17:57] <wikibugs>	 (03CR) 10Razzi: [C: 03+2] "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn)
[19:18:44] <icinga-wm>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:18:51] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "thank you:) submitting" [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn)
[19:20:17] <wikibugs>	 (03PS3) 10CDanis: VCL: A heavy hammer for dire circumstances. [puppet] - 10https://gerrit.wikimedia.org/r/631848
[19:22:05] <wikibugs>	 (03CR) 10Dzahn: "I ran puppet on an-coord1001 an-airflow1001 and there was no change." [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn)
[19:22:53] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:23:45] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:24:34] <wikibugs>	 (03CR) 10Dzahn: "great idea to avoid the additional list of admins in yaml !:)  Thanks for letting me merge the hiera->lookup change before this. This will " [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi)
[19:27:34] <wikibugs>	 (03PS2) 10Dzahn: oozie: use admin groups to determine admin access [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi)
[19:28:35] <wikibugs>	 (03CR) 10Dzahn: "PS2: manual rebase on top of I4d869ebfdcb9e1de  (there, fixed it so you don't have to because of my change)" [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi)
[19:30:29] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/25638/an-coord1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/631849 (https://phabricator.wikimedia.org/T262660) (owner: 10Razzi)
[19:31:05] <wikibugs>	 (03PS1) 10Revi: GrowthExperiments: Change Help Page URL for kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/631854 (https://phabricator.wikimedia.org/T254364)
[19:33:07] <wikibugs>	 (03PS1) 10Dzahn: remove etherpad1003 [dns] - 10https://gerrit.wikimedia.org/r/631855
[19:33:52] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] remove etherpad1003 [dns] - 10https://gerrit.wikimedia.org/r/631855 (owner: 10Dzahn)
[19:35:39] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] dnsdist: temporarily disable validate_cmd for dnsdist.conf [puppet] - 10https://gerrit.wikimedia.org/r/631827 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh)
[19:37:37] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh)
[19:38:12] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+1] "per https://phabricator.wikimedia.org/T264345#6510040" [puppet] - 10https://gerrit.wikimedia.org/r/631515 (https://phabricator.wikimedia.org/T264345) (owner: 10Urbanecm)
[19:40:07] <volans|off>	 chaomodus: could you have a look at the uncommitted dns check please?
[19:42:41] <wikibugs>	 (03PS1) 10Bstorm: Fixing typo in docker registry class [puppet] - 10https://gerrit.wikimedia.org/r/631856
[19:43:26] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] Fixing typo in docker registry class [puppet] - 10https://gerrit.wikimedia.org/r/631856 (owner: 10Bstorm)
[19:43:50] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "While I can't really confirm the firewall part (I get connected to something from bast3004 with ipmitool command provided), I am still all" [puppet] - 10https://gerrit.wikimedia.org/r/631430 (owner: 10Muehlenhoff)
[19:48:59] <icinga-wm>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:52:30] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] dnsdist: temporarily disable validate_cmd for dnsdist.conf [puppet] - 10https://gerrit.wikimedia.org/r/631827 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh)
[19:55:09] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:57:15] <wikibugs>	 (03PS2) 10Dzahn: Remove profile::ipmi::mgmt from role::bastionhost::pop [puppet] - 10https://gerrit.wikimedia.org/r/631430 (owner: 10Muehlenhoff)
[19:57:17] <wikibugs>	 (03PS1) 10Dzahn: consolidate bastionhost roles, remove module [puppet] - 10https://gerrit.wikimedia.org/r/631858
[19:57:57] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "after this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/631858" [puppet] - 10https://gerrit.wikimedia.org/r/631430 (owner: 10Muehlenhoff)
[20:03:28] <wikibugs>	 (03PS18) 10Ssingh: dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789)
[20:03:33] <wikibugs>	 (03PS2) 10Dzahn: consolidate bastionhost roles, remove module [puppet] - 10https://gerrit.wikimedia.org/r/631858
[20:04:40] <wikibugs>	 (03CR) 10Ssingh: "Rebased on top of production; no other changes: https://puppet-compiler.wmflabs.org/compiler1003/25641/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh)
[20:04:51] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] dnsdist: add acl parameter for webserver configuration [puppet] - 10https://gerrit.wikimedia.org/r/631508 (https://phabricator.wikimedia.org/T263789) (owner: 10Ssingh)
[20:06:37] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Wikidough: Upgrade to dnsdist 1.5.0 - https://phabricator.wikimedia.org/T263789 (10ssingh)
[20:06:41] <wikibugs>	 (03PS3) 10Dzahn: consolidate bastionhost roles, remove module [puppet] - 10https://gerrit.wikimedia.org/r/631858
[20:11:02] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/25642/" [puppet] - 10https://gerrit.wikimedia.org/r/631858 (owner: 10Dzahn)
[20:12:49] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:16:52] <wikibugs>	 (03CR) 10Herron: [C: 03+2] admin: Change urbanecm's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/631515 (https://phabricator.wikimedia.org/T264345) (owner: 10Urbanecm)
[20:16:59] <wikibugs>	 (03PS3) 10Herron: admin: Change urbanecm's SSH key [puppet] - 10https://gerrit.wikimedia.org/r/631515 (https://phabricator.wikimedia.org/T264345) (owner: 10Urbanecm)
[20:21:03] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:28:10] <Urbanecm>	 herron: thanks, the new key seems to work
[20:30:33] <wikibugs>	 (03PS2) 10Dzahn: trafficserver: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/631291
[20:30:54] <wikibugs>	 (03CR) 10Dzahn: trafficserver: replace hiera() with lookup() (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/631291 (owner: 10Dzahn)
[20:31:51] <herron>	 Urbanecm: np!
[20:34:23] <wikibugs>	 (03PS3) 10Dzahn: cassandra: add data types, remove validation code [puppet] - 10https://gerrit.wikimedia.org/r/630312
[20:34:43] <wikibugs>	 (03CR) 10Dzahn: cassandra: add data types, remove validation code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn)
[20:34:45] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cassandra: add data types, remove validation code [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn)
[20:52:59] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10leila) @gsingers Please review and approve if you are fine with it.
[20:56:41] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to researchers and analytics-privatedata-users for Leila Zia - https://phabricator.wikimedia.org/T264472 (10gsingers) Approved.
[21:07:55] <volans|off>	 mutante: the 'Uncommitted DNS changes in Netbox' icinga alert is because of changes related to etherpad1003 that were not committed apparently
[21:08:10] <wikibugs>	 (03CR) 10Dzahn: [V: 04-1] "https://puppet-compiler.wmflabs.org/compiler1003/25643/deploy1001.eqiad.wmnet/change.deploy1001.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/631306 (owner: 10Dzahn)
[21:09:27] <mutante>	 volans|off: oh? but there is nothing pending commit
[21:10:00] <mutante>	 I used decom cookbook and it told me to manually remove from DNS and then i did that?
[21:10:59] <volans|off>	 see the alert above (21:55 UTC and the related wikitech page)
[21:11:29] <volans|off>	 Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)
[21:11:31] <mutante>	 here is the change where i removed it https://gerrit.wikimedia.org/r/c/operations/dns/+/631855
[21:11:39] <volans|off>	 it's the automated one
[21:11:41] <volans|off>	 not the manual one
[21:11:46] <volans|off>	 see https://phabricator.wikimedia.org/T101492#6513678
[21:11:55] <mutante>	 it told me explicitly to manually remove it
[21:12:10] <volans|off>	 yes but the automated one failed too
[21:12:21] <volans|off>	 we need both automated and manual until we fully migrate all DCs
[21:12:26] <volans|off>	 3 out of 5 are done
[21:12:31] <volans|off>	 (the pops)
[21:12:53] <mutante>	 sorry, i don't know what you mean by "not committed" then
[21:13:53] <mutante>	 repeat running the decom cookbook after making the manual change?
[21:14:24] <volans|off>	 Fri 21:55:09   icinga-wm| PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:15:21] <chaomodus>	 it means that changes in netbox are not committed to the automated dns repository
[21:15:28] <chaomodus>	 i can help from here if need be
[21:15:42] <volans|off>	 I'll dig more next week on the transient failure
[21:22:55] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[21:22:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:26:31] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:26:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:32:15] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[21:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:30] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:35:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:36:37] <mutante>	 I ran sre.dns.netbox and it showed the diff.. committed.. rescheduled icinga check. did not fix it
[21:36:53] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:36:59] <mutante>	 ran cookbook a second time. no more diff
[21:37:02] <mutante>	 there it is, ok
[21:43:44] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10decommission-hardware: Decommission analytics10[28-31,33-41] - https://phabricator.wikimedia.org/T227485 (10Dzahn) The hosts here are showing up in a weird state. When running the DNS cookbook you get warnings that these hosts exist but are not "in devices...
[21:55:33] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:59:58] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,name=mw2271.codfw.wmnet
[22:00:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:19] <mutante>	 !log depooling mw2271 because Icinga alerts about memcached and SAL shows there were ongoing tests of some kind on it
[22:00:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:09] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[22:03:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [mostread] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:03:49] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/re
[22:04:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:04:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:04:17] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[22:04:39] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_4101: Servers kubernetes2010.codfw.wmnet, kubernetes2004.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:04:49] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:04:49] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET
[22:04:51] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:04:53] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_restbase_cluster_eqiad,swagger_check_wikifeeds_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:05:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:05:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:05:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:05:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:05:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:06:09] <mutante>	 ugh.. short spike in response time on appservers but already over
[22:06:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:06:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:06:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikifeeds_4101: Servers kubernetes2016.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2001.codfw.wmnet, kubernetes2008.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2003.codfw.wmnet, kubernetes2014.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:06:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:06:48] <icinga-wm>	 PROBLEM - LVS wikifeeds codfw port 4101/tcp - A node webservice supporting featured wiki content feeds. termbox.svc.eqiad.wmnet IPv4 #page on wikifeeds.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[22:06:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:07:23] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[22:07:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:07:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:07:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:07:27] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[22:07:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [mostread] https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:07:40] <rzl>	 good evening
[22:07:59] <mutante>	 there was a short spike in response time on appservers but it is already back to normal
[22:08:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:08:06] <mutante>	 then rescheduled some of the restbase checks
[22:08:10] <mutante>	 and seeing recoveries
[22:08:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:08:23] <rzl>	 mutante: ah cool
[22:08:24] <icinga-wm>	 RECOVERY - LVS wikifeeds codfw port 4101/tcp - A node webservice supporting featured wiki content feeds. termbox.svc.eqiad.wmnet IPv4 #page on wikifeeds.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1003 bytes in 1.178 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[22:08:26] <rzl>	 no idea where it came from?
[22:08:27] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:08:28] <mutante>	 looks like i can just ACK it in VO
[22:08:29] <mutante>	 no
[22:09:05] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[22:09:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:09:07] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:09:07] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:09:19] <mutante>	 was about to send the RESOLVED code in VO but it is already done too
[22:09:48] <mutante>	 rescheduling the rest of them
[22:10:03] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:10:16] <rzl>	 whatever it was, it correlates with a jump in s8 scrape time
[22:10:45] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:10:59] <cdanis>	 hi
[22:11:24] <rzl>	 cdanis: hi, latency spike recovered on its own, just curiosity at this point
[22:11:29] <mutante>	 dbtree does not show lag on s8
[22:11:49] <cdanis>	 not replication lag, mutante, but the time that it took for the prometheus exporter to run its load-checking queries
[22:12:04] <cdanis>	 which tracks reasonably well with either query queue depth or cpu saturation
[22:12:05] <mutante>	 cdanis: this spike  https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-datasource=codfw%20prometheus%2Fops&var-method=GET&viewPanel=9
[22:12:14] <cdanis>	 which, on s8, because it is hit in some way by ~every wikipedia
[22:12:20] <cdanis>	 often causes appserver latency spikes
[22:12:47] <cdanis>	 we had several incidents of this in eqiad due to cpu starvation on s8 until DBA added an extra host
[22:14:28] <cdanis>	 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2080&var-port=9104
[22:14:57] <cdanis>	 there was a pretty big read load spike on all s8 replicas
[22:15:55] <mutante>	 db2080 (part of s8) seems to be pretty busy and just started having traffic this morning
[22:16:38] <cdanis>	 on some of the hosts there was a smaller load spike at 21:00 (the top of the previous hour) -- and the latest one correlates with the top of this hour
[22:16:40] <cdanis>	 👀
[22:16:51] <cdanis>	 https://grafana.wikimedia.org/d/000000273/mysql?viewPanel=3&orgId=1&var-server=db2091&var-port=9104
[22:17:18] <cdanis>	 we didn't add any new crons or anything did we
[22:17:36] <rzl>	 2098 and 2099 also show CPU jumps from ~0 to ~30% starting at 21:30 and 22:04 respectively
[22:17:37] <mutante>	 good point it started at the top of the hour. same is true for response time on appservers
[22:17:42] <mutante>	 30 seconds after the hour
[22:18:05] <rzl>	 those are s2 and s4 though, might be unrelated
[22:18:32] <rzl>	 just stands out because the rest of them are flat, except for the CPU bump on the s8 hosts
[22:18:41] <rzl>	 https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=codfw&var-cluster=mysql&var-instance=All&var-datasource=thanos&from=now-1h&to=now is where I'm looking
[22:21:45] <mutante>	 17899 Oct  2 22:00:04 mwmaint1002 systemd[1]: Started MediaWiki periodic job mediawiki_tor_exit_node.
[22:21:47] <rzl>	 I'm inclined to leave it and go back to our evenings, I don't think there's anything much to do here
[22:21:48] <mutante>	 17900 Oct  2 22:00:04 mwmaint1002 systemd[1]: Started MediaWiki periodic job db_lag_stats_reporter.
[22:21:51] <mutante>	 17901 Oct  2 22:00:04 mwmaint1002 systemd[1]: Started MediaWiki periodic job wikibase_repo_prune2.
[22:21:54] <mutante>	 17902 Oct  2 22:00:04 mwmaint1002 systemd[1]: Started MediaWiki periodic job wikibase_repo_prune_test.
[22:21:57] <mutante>	 17903 Oct  2 22:00:04 mwmaint1002 systemd[1]: Started MediaWiki periodic job wikidata-updateQueryServiceLag.
[22:22:00] <mutante>	 17904 Oct  2 22:00:04 mwmaint1002 systemd[1]: Started MediaWiki periodic job update_flaggedrev_stats.
[22:22:03] <mutante>	 all those at that time , heh
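The journal lines pasted above all start at 22:00:04, which is what prompted the "top of the hour" theory. A minimal sketch of how such lines could be grouped by start time to spot jobs that fire together, assuming the line layout shown in the paste. The function names and the 60-second window are illustrative assumptions, not part of any Wikimedia tooling.

```python
import re
from collections import defaultdict

# Matches journal lines in the layout pasted above. The leading number is the
# line counter from the paste and is treated as optional.
LINE_RE = re.compile(
    r"^(?:\d+\s+)?(?P<mon>\w{3})\s+(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+systemd\[1\]:\s+"
    r"Started MediaWiki periodic job (?P<job>\S+?)\.?$"
)


def jobs_by_start_time(lines):
    """Group job names by (host, HH:MM:SS) start time."""
    groups = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            groups[(m["host"], m["time"])].append(m["job"])
    return dict(groups)


def near_top_of_hour(time_str, window_seconds=60):
    """True if HH:MM:SS falls within window_seconds after the hour (assumed window)."""
    _, mm, ss = (int(part) for part in time_str.split(":"))
    return mm * 60 + ss <= window_seconds
```

Any (host, time) group with several jobs and a time that passes `near_top_of_hour` is a candidate for the kind of simultaneous start being discussed. It only shows that jobs start together, not that they caused the load spike.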
[22:22:18] <rzl>	 well, our evenings in ET that is, excuse me mutante
[22:22:35] <rzl>	 I don't think I'd seen "mediawiki_tor_exit_node" before
[22:22:48] <cdanis>	 I assume that flags them as open proxies
[22:22:56] <cdanis>	 for blocking editing from them
[22:23:02] <rzl>	 ah sure yeah
[22:23:25] <mutante>	 yes, that's it, and you did not see it because we used that as the first cron to convert to a timer
[22:23:28] <mutante>	 and then you did the rest
[22:23:33] <mutante>	 or so
[22:23:37] <rzl>	 ha, got it
[22:24:26] <rzl>	 anyway it doesn't look like any of those are new
[22:24:42] <rzl>	 wait that's mwmaint1002
[22:24:47] <rzl>	 mwmaint2001 is the right host
[22:24:53] <mutante>	 the only part i am wondering is "weren't these more spaced out in time"
[22:25:00] <mutante>	 ah, of course
[22:25:26] <rzl>	 (the timers will run in eqiad, they'll just check etcd, see they're in the wrong place, and not do anything)
[22:25:37] <mutante>	 ok, you answered the next question
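The behaviour rzl describes (timers fire in both datacenters, but only the one that etcd marks as active does any work) can be sketched as a guard around the job. This is an illustration only: `get_active_dc` is a hypothetical callable standing in for the etcd lookup, and none of these names come from the actual MediaWiki maintenance wrappers.

```python
def should_run(local_dc, get_active_dc):
    """Return True only when this host's datacenter is the active one.

    get_active_dc is a hypothetical callable standing in for the etcd lookup.
    """
    return get_active_dc() == local_dc


def run_periodic_job(job, local_dc, get_active_dc):
    """Run job() in the active datacenter; otherwise return without doing work."""
    if not should_run(local_dc, get_active_dc):
        return "skipped"
    job()
    return "ran"
```

Under this sketch, a timer on an eqiad host while codfw is active would return "skipped", which matches the observation that the mwmaint1002 entries show the units starting without implying they did anything.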
[22:33:41] <rzl>	 yeah I got nothing 🤷 if it comes back we can take another look
[22:34:29] <mutante>	 yea, i checked mwmaint2001 logs around that time but the crons are not new and not matching the pattern to start at exactly the hour and stop 3 min later
[22:34:36] <mutante>	 agreed rzl
[22:37:52] <mutante>	 btw, replica lag on db2099 (s4) which you mentioned earlier has disabled notifications in Icinga. so that could mean it's known unless they were forgotten
[22:42:45] <wikibugs>	 (03CR) 10Dzahn: [V: 04-1] "Duh, I was wondering why this still says "require_encrypted_keys' expects a Boolean value, got String" but "yes" is not really Boolean :)" [puppet] - 10https://gerrit.wikimedia.org/r/631306 (owner: 10Dzahn)
[22:44:49] <wikibugs>	 (03PS3) 10Dzahn: keyholder: hiera->lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/631306
[22:58:23] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/25644/deploy1001.eqiad.wmnet/index.html https://puppet-compiler.wmflabs.org/compiler1002/" [puppet] - 10https://gerrit.wikimedia.org/r/631306 (owner: 10Dzahn)
[23:00:36] <wikibugs>	 (03CR) 10Dzahn: "and here is the reason I added the Optionals before:" [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn)
[23:04:29] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/25648/netflow3001.esams.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/631303 (owner: 10Dzahn)
[23:13:20] <wikibugs>	 10Operations, 10Analytics, 10Research-Backlog, 10WMF-Legal, 10User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10leila)
[23:16:31] <icinga-wm>	 PROBLEM - SSH on ms-be2056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:21:21] <icinga-wm>	 RECOVERY - SSH on ms-be2056 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[23:21:43] <wikibugs>	 (03CR) 10Dzahn: cache::ssl::unified: hiera->lookup, add data types (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/628460 (owner: 10Dzahn)
[23:21:54] <wikibugs>	 (03PS3) 10Dzahn: cache::ssl::unified: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/628460
[23:22:34] <wikibugs>	 (03CR) 10Dzahn: "yea, these should not even be inside the module but refactoring base is for another day" [puppet] - 10https://gerrit.wikimedia.org/r/631307 (owner: 10Dzahn)
[23:22:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cache::ssl::unified: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/628460 (owner: 10Dzahn)
[23:27:45] <wikibugs>	 10Operations, 10MediaWiki-Documentation, 10User-Dereckson, 10patch-welcome: Repair "svn.wikimedia.org/doc/" redirect for doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (10Dereckson) **Current status**  svn.wikimedia.org/ redirects to https://phabricator.wikimedia.org/diffusion/ svn.wikimedia...
[23:33:25] <wikibugs>	 10Operations, 10Analytics, 10Research-Backlog, 10WMF-Legal, 10User-Elukey: Enable layered data-access and sharing for a new form of collaboration - https://phabricator.wikimedia.org/T245833 (10Nuria) My opinion on this request is that having non throughly supervised contributors accessing data introduces...
[23:51:28] <wikibugs>	 (03PS2) 10Dzahn: thumbor: role->profile, hiera->lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/630694
[23:53:15] <wikibugs>	 (03PS4) 10Dzahn: cache::ssl::unified: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/628460
[23:54:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cache::ssl::unified: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/628460 (owner: 10Dzahn)
[23:56:00] <wikibugs>	 (03CR) 10Dzahn: wmcs::postgres: hiera->lookup and add data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/628459 (owner: 10Dzahn)
[23:56:05] <wikibugs>	 (03PS2) 10Dzahn: wmcs::postgres: hiera->lookup and add data types [puppet] - 10https://gerrit.wikimedia.org/r/628459
[23:57:51] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:58:51] <wikibugs>	 (03PS5) 10Dzahn: cache::ssl::unified: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/628460
[23:59:31] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets