[00:00:05] <jouncebot>	 twentyafterfour: May I have your attention please! Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T0000)
[00:03:41] <wikibugs>	 (Merged) jenkins-bot: Revert "Add a new type of database to the installer from extension" [core] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/615439 (https://phabricator.wikimedia.org/T258664) (owner: Legoktm)
[00:03:45] <wikibugs>	 (Merged) jenkins-bot: Revert "Add a new type of database to the installer from extension" [core] (wmf/1.35.0-wmf.41) - https://gerrit.wikimedia.org/r/615440 (https://phabricator.wikimedia.org/T258664) (owner: Legoktm)
[00:04:17] <legoktm>	 twentyafterfour: are you deploying now?
[00:07:27] <legoktm>	 syncing to mwdebug1001
[00:09:48] <icinga-wm>	 RECOVERY - Check systemd state on webperf1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:15] <legoktm>	 syncing...
[00:11:49] <logmsgbot>	 !log legoktm@deploy1001 scap failed: average error rate on 3/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org/goto/e474f13ffac6b8c3bf919c4aeafc8c9b for details)
[00:11:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:54] <legoktm>	 uhoh
[00:14:04] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:15:08] <legoktm>	 ok, it didn't sync properly
[00:15:37] <legoktm>	 >[{exception_id}] {exception_url} Error from line 459 of /srv/mediawiki/php-1.36.0-wmf.1/includes/libs/rdbms/database/Database.php: Class 'MediaWiki\Installer\Services\InstallerDBSupport' not found 
[00:15:42] <legoktm>	 of course not, because I'm trying to remove that
[00:15:51] <legoktm>	 trying once more...
[00:15:56] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:16:56] <logmsgbot>	 !log legoktm@deploy1001 Synchronized php-1.36.0-wmf.1/includes/: T258664: Revert "Add a new type of database to the installer from extension" (duration: 01m 09s)
[00:17:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:17:03] <stashbot>	 T258664: 25% latency regression July 2nd due to InstallerExtensionSelector service running in production - https://phabricator.wikimedia.org/T258664
[00:17:26] <legoktm>	 uh
[00:17:58] <legoktm>	 I think exceptions are spiking again
[00:19:25] <legoktm>	 ok, looks fine
[00:19:44] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:20:16] <logmsgbot>	 !log legoktm@deploy1001 Scap failed!: 9/9 canaries failed their endpoint checks(https://en.wikipedia.org)
[00:20:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:35] <legoktm>	 ok, I'll sync more atomically...
[00:21:43] <wikibugs>	 Operations, serviceops, Patch-For-Review, User-Elukey: Reimage one memcached shard to Buster - https://phabricator.wikimedia.org/T252391 (aaron) Given the libketama-style consistent hashing in twemproxy and that, AFAIK, CentralAuth sessions can regenerate (notwithstanding one-off CSRF token failu...
[00:21:45] <legoktm>	 sorry
[00:22:47] <logmsgbot>	 !log legoktm@deploy1001 Synchronized php-1.35.0-wmf.41/includes/libs/rdbms/database/Database.php: T258664: Revert "Add a new type of database to the installer from extension" (duration: 01m 05s)
[00:22:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:52] <stashbot>	 T258664: 25% latency regression July 2nd due to InstallerExtensionSelector service running in production - https://phabricator.wikimedia.org/T258664
[00:24:13] <logmsgbot>	 !log legoktm@deploy1001 Synchronized php-1.35.0-wmf.41/includes/: T258664: Revert "Add a new type of database to the installer from extension" (2/2) (duration: 01m 08s)
[00:24:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:30] <legoktm>	 there we go
[00:25:18] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:27:42] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19001592 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:31:26] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2024680 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:11:20] <wikibugs>	 (CR) Andrew Bogott: [C: +2] docker::registry: Allow param config to override defaults [puppet] - https://gerrit.wikimedia.org/r/615581 (owner: BryanDavis)
[02:16:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:18:36] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:00:57] <wikibugs>	 Operations, Graphoid, serviceops, Chinese-Sites, and 3 others: Undeploy graphoid for phase 2 wiki's - https://phabricator.wikimedia.org/T258463 (Shizhao)
[04:31:38] <wikibugs>	 Operations, Performance-Team, serviceops, Patch-For-Review, Sustainability (Incident Followup): Reduce read pressure on memcached servers by adding a machine-local Memcache instance - https://phabricator.wikimedia.org/T244340 (aaron) >>! In T244340#6211682, @elukey wrote: > Side note: if not...
[04:36:06] <icinga-wm>	 PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5331 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:37:58] <icinga-wm>	 RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 13 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:45:16] <icinga-wm>	 PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[04:47:02] <icinga-wm>	 RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[05:27:51] <wikibugs>	 (PS1) Marostegui: Revert "dbproxy1019: Reduce labsdb1009 weight" [puppet] - https://gerrit.wikimedia.org/r/615442
[05:28:57] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "dbproxy1019: Reduce labsdb1009 weight" [puppet] - https://gerrit.wikimedia.org/r/615442 (owner: Marostegui)
[05:29:28] <marostegui>	 !log Restore labsdb1009's original weight 
[05:29:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:45] <wikibugs>	 Operations, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users and nda groups for edtadros - https://phabricator.wikimedia.org/T256435 (Joe) @Jrbranaa ping again :)
[06:15:02] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 54 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:20:50] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[06:29:24] <icinga-wm>	 PROBLEM - ores on ores2009 is CRITICAL: connect to address 10.192.48.90 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:34:58] <icinga-wm>	 RECOVERY - ores on ores2009 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.878 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores
[06:55:21] <Majavah>	 good morning
[06:55:32] <Majavah>	 anyone with logstash access that could get the stack trace for https://phabricator.wikimedia.org/T258666?
[07:08:48] <_joe_>	 Majavah: done, turns out it's a duplicate of T212428
[07:08:49] <stashbot>	 T212428: includes/Revision/RevisionStore.php: Main slot of revision (number) not found in database! - https://phabricator.wikimedia.org/T212428
[07:09:25] <Majavah>	 _joe_: thank you!
[07:10:07] <_joe_>	 also I'm looking at if it got worse over the last few days
[07:10:18] <_joe_>	 but no
[07:10:37] <Majavah>	 well that ticket is claiming that it broke one extension (FileImporter) completely on this train
[07:10:55] <_joe_>	 Majavah: I don't see that from the error rate
[07:12:11] <_joe_>	 but ok, I'll let someone else change that priority
[07:22:05] <wikibugs>	 (PS1) Muehlenhoff: Also return uid from LDAP [puppet] - https://gerrit.wikimedia.org/r/615664
[07:22:07] <wikibugs>	 (PS1) Muehlenhoff: Also return uid from LDAP [puppet] - https://gerrit.wikimedia.org/r/615665
[07:24:01] <Elitre>	 hey, are there known Job Queue issues and/or who should I tag on Phab to flag them? Thanks.
[07:35:42] <_joe_>	 Elitre: no known issues besides some occasional overload that might lose some jobs
[07:35:52] <_joe_>	 please tag operations
[07:36:12] <Elitre>	 _joe_: thanks. I had the same mass message delivered twice. I almost wanna cry.
[07:36:20] <wikibugs>	 (PS4) Kormat: mariadb::monitor::prometheus: Remove unused parameters [puppet] - https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879)
[07:36:24] <_joe_>	 uh, interesting
[07:36:40] <wikibugs>	 (PS6) Kormat: mariadb: Refactor tendril+zarcillo [puppet] - https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566)
[07:36:41] <wikibugs>	 Operations, MassMessage: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Elitre)
[07:36:48] <_joe_>	 so the problem is not that it was delivered, but that it was delivered twice
[07:37:04] <_joe_>	 then sorry, probably we need to add more tags, I'll take care of it
[07:37:08] <Elitre>	 yeah.
[07:37:12] <_joe_>	 also that bug is a bit old :D
[07:37:24] <Elitre>	 I can add them no problem, I just don't know which those are
[07:37:28] <kormat>	 _joe_: i think the preferred term is "mature" ;)
[07:38:10] <_joe_>	 kormat: in this case, ripe
[07:38:14] <Elitre>	 yes, they found evidence of the mass message system along with some dynos which were excavated recently.
[07:38:34] <wikibugs>	 Operations, MassMessage, MediaWiki-JobQueue, Platform Engineering: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Joe)
[07:38:54] <RhinosF1>	 This isn't the first time this has happened. There's at least https://phabricator.wikimedia.org/T232379 as well.
[07:39:47] <kormat>	 _joe_: 👌
[07:40:08] <wikibugs>	 (PS2) Muehlenhoff: Disable backports on stretch for production [puppet] - https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877)
[07:40:10] <wikibugs>	 Operations, MassMessage, MediaWiki-JobQueue, Platform Engineering: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Joe) @Pchelolo @Ottomata do we have any way to verify how this happened from eventgate and changeprop logs?
[07:40:55] <Elitre>	 I should probably specify, whatever you do to investigate, I hope that doesn't involve sending that message again :p
[07:41:19] <wikibugs>	 (PS3) Muehlenhoff: Disable backports on stretch for production [puppet] - https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877)
[07:41:19] <_joe_>	 Elitre: it looks like https://phabricator.wikimedia.org/T232379#5556920 is still the issue.
[07:41:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] Disable backports on stretch for production [puppet] - https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877) (owner: Muehlenhoff)
[07:42:31] <wikibugs>	 (CR) jerkins-bot: [V: -1] Disable backports on stretch for production [puppet] - https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877) (owner: Muehlenhoff)
[07:42:49] <wikibugs>	 (CR) Elukey: "\o/" [puppet] - https://gerrit.wikimedia.org/r/615664 (owner: Muehlenhoff)
[07:46:39] <wikibugs>	 (PS4) Muehlenhoff: Disable backports on stretch for production [puppet] - https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877)
[07:48:42] <Elitre>	 _joe_: yes, and IIUC what pchelolo wrote, then we probably should do something about it now!
[07:49:14] <wikibugs>	 (PS5) Muehlenhoff: Disable backports on stretch for production [puppet] - https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877)
[07:54:34] <wikibugs>	 (CR) JMeybohm: [C: -1] ratelimit: add new docker image (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) (owner: Hnowlan)
[07:59:01] <wikibugs>	 (PS1) Volans: mgmt: netbox-generated data for mgmt codfw [dns] - https://gerrit.wikimedia.org/r/615668 (https://phabricator.wikimedia.org/T233183)
[08:08:44] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Also return uid from LDAP [puppet] - https://gerrit.wikimedia.org/r/615664 (owner: Muehlenhoff)
[08:09:20] <icinga-wm>	 RECOVERY - Keyholder SSH agent on netmon2001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[08:15:35] <wikibugs>	 Operations, Traffic, observability, serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (jcrespo) Thanks for your work on this. One clarification, for those of us that are not that familiar with LV...
[08:16:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1107 to move it to m2 T257540', diff saved to https://phabricator.wikimedia.org/P12024 and previous config saved to /var/cache/conftool/dbconfig/20200723-081650-marostegui.json
[08:16:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:57] <stashbot>	 T257540: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540
[08:19:59] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1107 from s1 to m2 [puppet] - https://gerrit.wikimedia.org/r/615669 (https://phabricator.wikimedia.org/T257540)
[08:20:54] <wikibugs>	 (CR) jerkins-bot: [V: -1] mariadb: Move db1107 from s1 to m2 [puppet] - https://gerrit.wikimedia.org/r/615669 (https://phabricator.wikimedia.org/T257540) (owner: Marostegui)
[08:21:51] <wikibugs>	 (CR) Volans: "I've manually verified all of them both programmatically with a diff script and visually (screen-by-screen) comparison. Instructions on ho" [dns] - https://gerrit.wikimedia.org/r/615668 (https://phabricator.wikimedia.org/T233183) (owner: Volans)
[08:21:57] <wikibugs>	 Operations, observability: Change smokeping to have pinging active/active, with alerts active/standby - https://phabricator.wikimedia.org/T258675 (fgiunchedi)
[08:22:13] <wikibugs>	 Operations, observability: Change smokeping to have pinging active/active, with alerts active/standby - https://phabricator.wikimedia.org/T258675 (fgiunchedi)
[08:22:16] <wikibugs>	 Operations, netops, observability, Patch-For-Review, User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (fgiunchedi)
[08:24:05] <wikibugs>	 (PS5) ArielGlenn: rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856)
[08:24:18] <wikibugs>	 (CR) JMeybohm: [C: +2] proton: Update envoy to 1.14.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/615254 (https://phabricator.wikimedia.org/T256843) (owner: JMeybohm)
[08:24:24] <wikibugs>	 (PS3) JMeybohm: proton: Update envoy to 1.14.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/615254 (https://phabricator.wikimedia.org/T256843)
[08:24:30] <wikibugs>	 (CR) jerkins-bot: [V: -1] rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856) (owner: ArielGlenn)
[08:24:38] <wikibugs>	 (CR) Marostegui: [V: +2 C: +2] "misc requires a full refactor that will be done at a later time" [puppet] - https://gerrit.wikimedia.org/r/615669 (https://phabricator.wikimedia.org/T257540) (owner: Marostegui)
[08:26:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove db1107 from s1 T257540', diff saved to https://phabricator.wikimedia.org/P12025 and previous config saved to /var/cache/conftool/dbconfig/20200723-082647-marostegui.json
[08:26:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:53] <stashbot>	 T257540: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540
[08:27:45] <wikibugs>	 (PS2) JMeybohm: termbox: Update envoy to 1.14.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/615255 (https://phabricator.wikimedia.org/T256843)
[08:27:53] <wikibugs>	 (PS2) JMeybohm: wikifeeds: Update envoy to 1.14.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/615256 (https://phabricator.wikimedia.org/T256843)
[08:27:59] <wikibugs>	 (PS2) JMeybohm: zotero: Update envoy to 1.14.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/615257 (https://phabricator.wikimedia.org/T256843)
[08:28:10] <wikibugs>	 (PS2) JMeybohm: _scaffold: Update envoy to 1.14.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/615259 (https://phabricator.wikimedia.org/T256843)
[08:29:13] <logmsgbot>	 !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'proton' for release 'production' .
[08:29:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:31] <wikibugs>	 Operations, observability: Change smokeping to have pinging active/active, with alerts active/standby - https://phabricator.wikimedia.org/T258675 (ayounsi) 👍  Good idea! I'd say send alerts only from one host as it's already quite loud (no easy way to mute alerts). Also T169860 is most likely the future...
[08:34:12] <wikibugs>	 (PS1) Filippo Giunchedi: smokeping: match documentroot with smokeping installation [puppet] - https://gerrit.wikimedia.org/r/615670 (https://phabricator.wikimedia.org/T247967)
[08:35:38] <wikibugs>	 (CR) Ayounsi: [C: +1] "I checked the codfw network records in authdns2001:/srv/git/netbox_dns_snippets and they lgtm." [dns] - https://gerrit.wikimedia.org/r/615668 (https://phabricator.wikimedia.org/T233183) (owner: Volans)
[08:37:34] <wikibugs>	 (Abandoned) Volans: puppetdb microservice: add some filtering [puppet] - https://gerrit.wikimedia.org/r/615232 (https://phabricator.wikimedia.org/T244153) (owner: Volans)
[08:38:18] <wikibugs>	 Operations, Traffic, observability, serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (akosiaris) >>! In T258614#6328624, @jcrespo wrote: > Thanks for your work on this. One clarification, for th...
[08:38:24] <wikibugs>	 (CR) QChris: [C: +1] remove gerrit.wmfusercontent.org [dns] - https://gerrit.wikimedia.org/r/615557 (https://phabricator.wikimedia.org/T191183) (owner: Dzahn)
[08:38:47] <wikibugs>	 (PS17) Alexandros Kosiaris: charts for push-notification service [deployment-charts] - https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: MSantos)
[08:39:47] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[08:39:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:05] <logmsgbot>	 !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'proton' for release 'production' .
[08:40:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:51] <XioNoX>	 !log remove pim-rp IPs from last routers - T257573
[08:40:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:56] <stashbot>	 T257573: Remove multicast - https://phabricator.wikimedia.org/T257573
[08:41:53] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[08:41:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:18] <godog>	 !log test librenms poller from netmon2001
[08:42:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:52] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:43:52] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:45:28] <logmsgbot>	 !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'proton' for release 'production' .
[08:45:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:40] <wikibugs>	 (CR) JMeybohm: [C: +2] termbox: Update envoy to 1.14.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/615255 (https://phabricator.wikimedia.org/T256843) (owner: JMeybohm)
[08:45:48] <wikibugs>	 (PS1) Alexandros Kosiaris: mobileapps: Bump memory limits another 20% [deployment-charts] - https://gerrit.wikimedia.org/r/615671 (https://phabricator.wikimedia.org/T218733)
[08:45:50] <wikibugs>	 (PS1) Alexandros Kosiaris: mobileapps: Lower replicas to 80 from 240 [deployment-charts] - https://gerrit.wikimedia.org/r/615672 (https://phabricator.wikimedia.org/T218733)
[08:45:54] <wikibugs>	 (PS1) Ayounsi: Reclaim PIM-RP IPs from the multicast gods [dns] - https://gerrit.wikimedia.org/r/615673 (https://phabricator.wikimedia.org/T257573)
[08:46:07] <wikibugs>	 (PS6) ArielGlenn: rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856)
[08:46:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856) (owner: ArielGlenn)
[08:46:36] <wikibugs>	 (PS7) Kormat: mariadb: Refactor tendril+zarcillo [puppet] - https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566)
[08:46:55] <wikibugs>	 (CR) Ayounsi: [C: +2] Reclaim PIM-RP IPs from the multicast gods [dns] - https://gerrit.wikimedia.org/r/615673 (https://phabricator.wikimedia.org/T257573) (owner: Ayounsi)
[08:47:01] <wikibugs>	 (Merged) jenkins-bot: termbox: Update envoy to 1.14.4-1 [deployment-charts] - https://gerrit.wikimedia.org/r/615255 (https://phabricator.wikimedia.org/T256843) (owner: JMeybohm)
[08:47:43] <XioNoX>	 godog: be careful with the pollers/discovery as they update the DB as well, not sure how 2 instances would play together
[08:47:52] <wikibugs>	 (PS1) DCausse: Fix bug that causes wrong prefixes in RDF output [extensions/Wikibase] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/615443 (https://phabricator.wikimedia.org/T258507)
[08:48:44] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] mobileapps: Bump memory limits another 20% [deployment-charts] - https://gerrit.wikimedia.org/r/615671 (https://phabricator.wikimedia.org/T218733) (owner: Alexandros Kosiaris)
[08:49:03] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] mobileapps: Lower replicas to 80 from 240 [deployment-charts] - https://gerrit.wikimedia.org/r/615672 (https://phabricator.wikimedia.org/T218733) (owner: Alexandros Kosiaris)
[08:49:06] <godog>	 XioNoX: mmhh good point, thanks! yeah I won't mess with the crons more
[08:49:49] <wikibugs>	 (Merged) jenkins-bot: mobileapps: Bump memory limits another 20% [deployment-charts] - https://gerrit.wikimedia.org/r/615671 (https://phabricator.wikimedia.org/T218733) (owner: Alexandros Kosiaris)
[08:50:08] <wikibugs>	 Operations, MassMessage, MediaWiki-JobQueue, Platform Engineering: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Elitre) I am looking at the GUC log linked above and it appears that first delivery was successful on 145 of the [[ https://meta.wikimedia.o...
[08:50:10] <wikibugs>	 (Merged) jenkins-bot: mobileapps: Lower replicas to 80 from 240 [deployment-charts] - https://gerrit.wikimedia.org/r/615672 (https://phabricator.wikimedia.org/T218733) (owner: Alexandros Kosiaris)
[08:50:17] <wikibugs>	 Operations, Traffic, netops, Patch-For-Review: Remove multicast - https://phabricator.wikimedia.org/T257573 (ayounsi) Open→Resolved Indeed, cleaned up as well:  ` rancid-configs/configs$ ag 208.80.153.194 cr1-codfw.wikimedia.org 1661:                address 208.80.153.194/32;  cr2-codfw.w...
[08:51:49] <wikibugs>	 Operations, serviceops: Recurrent TX bw saturation for mediawiki memcached shards - https://phabricator.wikimedia.org/T258679 (elukey) p:Triage→High
[08:52:49] <wikibugs>	 Operations, MassMessage, MediaWiki-JobQueue, Platform Engineering: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Elitre) For more quirkiness, [[ https://www.mediawiki.org/wiki/User_talk:Krinkle#GUC_Tool_error,_or? |  I had recently brought up]] that in...
[08:57:19] <wikibugs>	 (PS1) Alexandros Kosiaris: Update helm repo index [deployment-charts] - https://gerrit.wikimedia.org/r/615675
[08:58:13] <wikibugs>	 Operations, Traffic, observability, serviceops: Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping - https://phabricator.wikimedia.org/T258614 (jcrespo) I see, thanks.
[08:59:00] <marostegui>	 !log transfer --type=xtrabackup from db1117:3322 to db1107 T257540 
[08:59:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:06] <stashbot>	 T257540: Upgrade m2 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T257540
[09:00:40] <wikibugs>	 (PS7) ArielGlenn: rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856)
[09:02:43] <logmsgbot>	 !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'termbox' for release 'staging' .
[09:02:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:08] <wikibugs>	 (CR) jerkins-bot: [V: -1] Fix bug that causes wrong prefixes in RDF output [extensions/Wikibase] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/615443 (https://phabricator.wikimedia.org/T258507) (owner: DCausse)
[09:08:16] <icinga-wm>	 PROBLEM - librenms.wikimedia.org requires authentication on netmon2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[09:08:41] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Update helm repo index [deployment-charts] - https://gerrit.wikimedia.org/r/615675 (owner: Alexandros Kosiaris)
[09:08:58] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[09:09:08] <wikibugs>	 (PS8) Kormat: mariadb: Refactor tendril+zarcillo [puppet] - https://gerrit.wikimedia.org/r/615479 (https://phabricator.wikimedia.org/T258566)
[09:09:17] <effie>	 kormat: ^ is that you ?
[09:09:26] <wikibugs>	 (CR) DCausse: "recheck" [extensions/Wikibase] (wmf/1.36.0-wmf.1) - https://gerrit.wikimedia.org/r/615443 (https://phabricator.wikimedia.org/T258507) (owner: DCausse)
[09:09:47] <wikibugs>	 (Merged) jenkins-bot: Update helm repo index [deployment-charts] - https://gerrit.wikimedia.org/r/615675 (owner: Alexandros Kosiaris)
[09:09:50] <wikibugs>	 (PS20) Alexandros Kosiaris: Add discovery records for chartmuseum [puppet] - https://gerrit.wikimedia.org/r/609403
[09:10:26] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[09:10:27] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[09:10:40] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[09:10:50] <marostegui>	 ^ me
[09:10:53] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] "I 'll merge this in the interest of pushing this forward. If we encounter issues, we 'll solve them then. PCC across the fleet is OK, so I" [puppet] - https://gerrit.wikimedia.org/r/609403 (owner: Alexandros Kosiaris)
[09:11:15] <effie>	 ah, it is always the DBA™
[09:12:26] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:12:30] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 1 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:14:02] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:14:02] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:14:34] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 2 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[09:15:22] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: ganeti: Fix ganeti-mond typo [puppet] - 10https://gerrit.wikimedia.org/r/615676
[09:15:27] <wikibugs>	 (03PS6) 10Hnowlan: ratelimit: add new docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907)
[09:16:35] <wikibugs>	 (03CR) 10Hnowlan: ratelimit: add new docker image (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) (owner: 10Hnowlan)
[09:19:24] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
[09:19:24] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[09:19:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:56] <akosiaris>	 !log lower replica count back to 80 for mobileapps. T218733
[09:20:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:20:02] <stashbot>	 T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733
[09:20:37] <logmsgbot>	 !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'termbox' for release 'production' .
[09:20:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:20] <wikibugs>	 (03CR) 10Volans: [C: 03+2] "Last PS includes the requested comment, merging." (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615423 (owner: 10Volans)
[09:22:23] <wikibugs>	 10Operations, 10Arc-Lamp, 10Performance-Team: webperf1002 server close to have /srv partition full - https://phabricator.wikimedia.org/T257931 (10jcrespo) BTW:  ` webperf2002 Disk space WARNING 2020-07-23 09:12:15 0d 0h 38m 11s 3/3 DISK WARNING - free space: /srv 20077 MB (6% inode=99%): `
[09:24:27] <wikibugs>	 (03Merged) 10jenkins-bot: GC: add time-based GC for Image objects [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615423 (owner: 10Volans)
[09:24:53] <logmsgbot>	 !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'termbox' for release 'production' .
[09:24:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:22] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[09:25:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:15] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] transfer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/612384 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[09:27:29] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' .
[09:27:29] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'mobileapps' for release 'production' .
[09:27:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:53] <wikibugs>	 (03PS5) 10Jcrespo: Firewall.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/612726 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[09:30:02] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[09:30:04] <wikibugs>	 (03PS3) 10Vidhi-Mody: Selenium: Update to WebdriverIO v5 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614829 (https://phabricator.wikimedia.org/T255471)
[09:31:00] <wikibugs>	 (03Abandoned) 10Vidhi-Mody: Selenium: Update to WebdriverIO v5 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615227 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody)
[09:31:34] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] "This is ok, but if in the future you add more unit tests, please consider splitting the testing into several files. Putting unit tests for" [software/transferpy] - 10https://gerrit.wikimedia.org/r/612726 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[09:32:27] <wikibugs>	 (03Merged) 10jenkins-bot: Firewall.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/612726 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[09:35:01] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] wikifeeds: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615256 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm)
[09:36:05] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615256 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm)
[09:36:38] <icinga-wm>	 RECOVERY - librenms.wikimedia.org requires authentication on netmon2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 661 bytes in 0.151 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[09:38:04] <wikibugs>	 (03CR) 10Jcrespo: "It needs rebase again :-(" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[09:38:22] <logmsgbot>	 !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
[09:38:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:01] <wikibugs>	 (03PS1) 10Volans: Release v0.2.6 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/615679
[09:39:22] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: analytics1041.eqiad.wmnet, contint2001.wikimedia.org, contint1001.wikimedia.org, testreduce1001.eqiad.wmnet, aphlict1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[09:40:05] <logmsgbot>	 !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[09:40:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:33] <wikibugs>	 (03CR) 10Privacybatm: "> Patch Set 6:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[09:45:51] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/615679 (owner: 10Volans)
[09:46:54] <logmsgbot>	 !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[09:46:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:09] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] zotero: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615257 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm)
[09:47:35] <wikibugs>	 (03PS7) 10Privacybatm: Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600)
[09:48:12] <wikibugs>	 (03Merged) 10jenkins-bot: zotero: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615257 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm)
[09:48:15] <wikibugs>	 (03Merged) 10jenkins-bot: _scaffold: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615259 (https://phabricator.wikimedia.org/T256843) (owner: 10JMeybohm)
[09:48:21] <wikibugs>	 (03PS8) 10Privacybatm: Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600)
[09:48:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[09:49:08] <wikibugs>	 (03CR) 10Jcrespo: "2 nitpicks, let me know what you think." (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[09:50:13] <wikibugs>	 (03CR) 10Privacybatm: "> Patch Set 6:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[09:50:57] <wikibugs>	 (03CR) 10Privacybatm: "> Patch Set 8:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[09:51:04] <wikibugs>	 (03PS2) 10Volans: Release v0.2.6 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/615679
[09:51:25] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/weight=10; selector: dc=eqiad,service=mobileapps,name=kubernetes.*
[09:51:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:53] <akosiaris>	 !log prepare for pooling kubernetes mobileapps capacity in eqiad. T218733
[09:51:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:58] <stashbot>	 T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733
[09:55:32] <wikibugs>	 10Operations: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10MoritzMuehlenhoff) p:05Triage→03Medium
[09:56:21] <wikibugs>	 (03CR) 10Jcrespo: "> Patch Set 8:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[09:57:58] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[09:59:53] <wikibugs>	 (03CR) 10Privacybatm: "I am pushing a new patch right away" (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[10:00:04] <jouncebot>	 mvolz: How many deployers does it take to do Services – Citoid /  Zotero deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T1000).
[10:00:46] <wikibugs>	 (03PS4) 10Privacybatm: Transferer.py: Resolve concurrency issue with checksum file names [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450)
[10:01:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Transferer.py: Resolve concurrency issue with checksum file names [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[10:03:20] <wikibugs>	 (03CR) 10Volans: [V: 03+2 C: 03+2] Release v0.2.6 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/615679 (owner: 10Volans)
[10:03:23] <wikibugs>	 10Operations, 10serviceops: Recurrent TX bw saturation for mediawiki memcached shards - https://phabricator.wikimedia.org/T258679 (10elukey) The mc1020 spikes are interesting:  https://grafana.wikimedia.org/d/000000317/memcache-slabs?panelId=60&fullscreen&orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prome...
[10:04:18] <logmsgbot>	 !log jayme@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'zotero' for release 'staging' .
[10:04:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:09] <logmsgbot>	 !log volans@deploy1001 Started deploy [debmonitor/deploy@44aa1ee]: Release v0.2.6
[10:05:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:20] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[10:05:22] <logmsgbot>	 !log volans@deploy1001 Finished deploy [debmonitor/deploy@44aa1ee]: Release v0.2.6 (duration: 00m 14s)
[10:05:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:20] <logmsgbot>	 !log volans@deploy1001 Started deploy [debmonitor/deploy@16d0c45]: Release v0.2.6
[10:06:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:06:56] <logmsgbot>	 !log volans@deploy1001 Finished deploy [debmonitor/deploy@16d0c45]: Release v0.2.6 (duration: 00m 36s)
[10:06:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:14] <wikibugs>	 (03CR) 10Jcrespo: "I like that this still works even if source and target are the same host, kudos." (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[10:08:51] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Switch the base image to buster from stretch. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615683
[10:09:10] <wikibugs>	 (03CR) 10Privacybatm: "> Patch Set 8:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[10:09:26] <wikibugs>	 (03CR) 10Jcrespo: Transferer.py: Resolve concurrency issue with checksum file names (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[10:11:18] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,service=mobileapps,name=kubernetes.*
[10:11:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:43] <akosiaris>	 !log pool kubernetes in mobileapps/eqiad. T218733
[10:11:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:47] <stashbot>	 T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733
[10:14:45] <logmsgbot>	 !log jayme@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'zotero' for release 'production' .
[10:14:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:08] <logmsgbot>	 !log jayme@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'zotero' for release 'production' .
[10:18:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:25] <wikibugs>	 (03PS3) 10JMeybohm: mobileapps: Update envoy to 1.14.4-1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/615253 (https://phabricator.wikimedia.org/T256843)
[10:20:11] <wikibugs>	 (03PS9) 10Privacybatm: Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600)
[10:20:15] <wikibugs>	 (03PS1) 10Privacybatm: transferpy: Change tox development environment to Python3.7 [software/transferpy] - 10https://gerrit.wikimedia.org/r/615688 (https://phabricator.wikimedia.org/T257600)
[10:22:47] <wikibugs>	 (03CR) 10Privacybatm: "> Patch Set 8:" (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[10:24:26] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,service=mobileapps,name=scb1001.*
[10:24:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:25:34] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,service=mobileapps,name=scb1002.*
[10:25:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:17] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/weight=1; selector: dc=eqiad,service=mobileapps,name=scb*
[10:27:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:27] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/weight=1; selector: dc=eqiad,service=mobileapps,name=scb.*
[10:27:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:30] <wikibugs>	 (03CR) 10Privacybatm: Transferer.py: Resolve concurrency issue with checksum file names (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[10:33:30] <wikibugs>	 (03CR) 10Jcrespo: "Careful, you have now a couple of "IndexError: tuple index out of range" on jenkins." [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[10:39:46] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] Limit concurrency for processMediaModeration job [deployment-charts] - 10https://gerrit.wikimedia.org/r/615572 (https://phabricator.wikimedia.org/T258653) (owner: 10Ppchelko)
[10:42:58] <wikibugs>	 (03PS5) 10Privacybatm: Transferer.py: Resolve concurrency issue with checksum file names [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450)
[10:44:11] <wikibugs>	 (03CR) 10Privacybatm: "> Patch Set 4:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[10:44:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:46:32] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:51:53] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] "Testing went ok now." [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[10:52:22] <wikibugs>	 (03Merged) 10jenkins-bot: Transferer.py: Resolve concurrency issue with checksum file names [software/transferpy] - 10https://gerrit.wikimedia.org/r/613633 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[10:53:23] <wikibugs>	 (03PS7) 10Hnowlan: ratelimit: add new docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907)
[10:55:36] <wikibugs>	 (03PS4) 10Privacybatm: Make transferpy configurable using a configuration file [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602)
[10:56:35] <wikibugs>	 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki)
[10:56:53] <wikibugs>	 (03PS1) 10Jbond: librenms: Fix hash definition [puppet] - 10https://gerrit.wikimedia.org/r/615702
[10:57:06] <wikibugs>	 (03PS8) 10Ema: varnish: stop responding to varnishcheck.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/610051 (https://phabricator.wikimedia.org/T255015)
[10:57:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] librenms: Fix hash definition [puppet] - 10https://gerrit.wikimedia.org/r/615702 (owner: 10Jbond)
[10:58:26] <wikibugs>	 (03PS18) 10Effie Mouzeli: charts for push-notification service [deployment-charts] - 10https://gerrit.wikimedia.org/r/602390 (https://phabricator.wikimedia.org/T250493) (owner: 10MSantos)
[10:59:15] <wikibugs>	 (03CR) 10Ema: [C: 03+2] varnish: stop responding to varnishcheck.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/610051 (https://phabricator.wikimedia.org/T255015) (owner: 10Ema)
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T1100).
[11:00:04] <jouncebot>	 dcausse: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:33] <dcausse>	 o/
[11:01:00] <Lucas_WMDE>	 o/
[11:01:06] <Lucas_WMDE>	 dcausse: do you want to deploy the changes yourself?
[11:01:13] <dcausse>	 Lucas_WMDE: sure I can
[11:01:20] <Lucas_WMDE>	 ok!
[11:01:37] <wikibugs>	 (03PS1) 10Effie Mouzeli: Add certificates and API keys for push-notifications [labs/private] - 10https://gerrit.wikimedia.org/r/615704 (https://phabricator.wikimedia.org/T255042)
[11:01:53] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] "DEPLOYING" [extensions/Wikibase] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615443 (https://phabricator.wikimedia.org/T258507) (owner: 10DCausse)
[11:02:06] <wikibugs>	 (03PS2) 10DCausse: [sdoc] fix entity source base URIs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615171 (https://phabricator.wikimedia.org/T258474)
[11:02:17] <wikibugs>	 (03CR) 10Jcrespo: "2 comments below:" (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[11:04:21] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] "DEPLOYING" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615171 (https://phabricator.wikimedia.org/T258474) (owner: 10DCausse)
[11:04:26] <wikibugs>	 (03CR) 10Muehlenhoff: "Let's also change profile::idp::client::httpd::document_root to /usr/share/smokeping/www in the same patch? CAS will only be enabled once " [puppet] - 10https://gerrit.wikimedia.org/r/615670 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[11:05:21] <wikibugs>	 (03Merged) 10jenkins-bot: [sdoc] fix entity source base URIs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615171 (https://phabricator.wikimedia.org/T258474) (owner: 10DCausse)
[11:08:28] <wikibugs>	 (03CR) 10Jcrespo: "Only one question- otherwise looks good, only needs deep testing on my side. Once merged we can merge at the same time as 612162." (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) (owner: 10Privacybatm)
[11:10:18] <wikibugs>	 (03PS1) 10Jbond: librenms: add additional parameters [puppet] - 10https://gerrit.wikimedia.org/r/615706
[11:10:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] librenms: add additional parameters [puppet] - 10https://gerrit.wikimedia.org/r/615706 (owner: 10Jbond)
[11:13:23] <logmsgbot>	 !log dcausse@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T258474: [sdoc] fix entity source base URIs (duration: 01m 07s)
[11:13:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:29] <stashbot>	 T258474: RDF dumps for Structured Data on Commons are broken - https://phabricator.wikimedia.org/T258474
[11:15:33] <wikibugs>	 (03CR) 10Privacybatm: Transferer.py: Add proper cleanup (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[11:17:07] <wikibugs>	 (03PS1) 10Jbond: librenms: correct group map type [puppet] - 10https://gerrit.wikimedia.org/r/615707
[11:17:46] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,service=mobileapps,name=scb.*
[11:17:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:01] <akosiaris>	 !log depool scb in mobileapps/eqiad. T218733
[11:18:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:18:06] <stashbot>	 T218733: Migrate mobileapps to k8s and node 10 - https://phabricator.wikimedia.org/T218733
[11:18:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/24091/netmon2001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/615707 (owner: 10Jbond)
[11:24:00] <wikibugs>	 (03Merged) 10jenkins-bot: Fix bug that causes wrong prefixes in RDF output [extensions/Wikibase] (wmf/1.36.0-wmf.1) - 10https://gerrit.wikimedia.org/r/615443 (https://phabricator.wikimedia.org/T258507) (owner: 10DCausse)
[11:25:05] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615670 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[11:26:28] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615665 (owner: 10Muehlenhoff)
[11:26:55] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Nice!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) (owner: 10Hnowlan)
[11:27:03] <wikibugs>	 10Operations, 10serviceops: Recurrent TX bw saturation for mediawiki memcached shards - https://phabricator.wikimedia.org/T258679 (10elukey) >>! In T258679#6328948, @elukey wrote:  > There is a baseline for slab 136 of constant GET traffic, that should be related to `ITEM WANCache:v:global:SqlBlobStore-blob:en...
[11:28:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lol sorry 😊" [puppet] - 10https://gerrit.wikimedia.org/r/615509 (owner: 10Muehlenhoff)
[11:28:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Also remove priority for Thanos [puppet] - 10https://gerrit.wikimedia.org/r/615509 (owner: 10Muehlenhoff)
[11:28:18] <wikibugs>	 (03CR) 10Jcrespo: Transferer.py: Add proper cleanup (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[11:30:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:31:53] <logmsgbot>	 !log dcausse@deploy1001 Synchronized php-1.36.0-wmf.1/extensions/Wikibase: T258507: Fix bug that causes wrong prefixes in RDF output (duration: 01m 11s)
[11:31:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:59] <stashbot>	 T258507: v: prefix not correctly prefixed in Wikibase when using entitysource config and extra prefixes - https://phabricator.wikimedia.org/T258507
[11:32:38] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:36:06] <dcausse>	 !log European mid-day backport window done
[11:36:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:08] <wikibugs>	 (03PS3) 10Privacybatm: Transferer.py: Add proper cleanup [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450)
[11:42:44] <wikibugs>	 (03PS12) 10Jbond: thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009)
[11:46:00] <wikibugs>	 (03PS1) 10Urbanecm: Log ClosedWikiProvider's start with info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615713 (https://phabricator.wikimedia.org/T258695)
[11:46:03] <wikibugs>	 (03CR) 10Privacybatm: Transferer.py: Add proper cleanup (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[11:46:25] <wikibugs>	 10Operations, 10Puppet, 10User-jbond: require_package should mark packages as manually installed - https://phabricator.wikimedia.org/T195981 (10jbond) >>! In T195981#5635590, @jbond wrote: > I attempted a [[ https://github.com/puppetlabs/puppet/pull/7802 | patch for this upstream ]] although its not quite wo...
[11:47:18] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Log ClosedWikiProvider's start with info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615713 (https://phabricator.wikimedia.org/T258695) (owner: 10Urbanecm)
[11:47:32] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] thanos::frontend: add ssl terminations for thanos.* SNI's [puppet] - 10https://gerrit.wikimedia.org/r/615477 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond)
[11:48:08] <wikibugs>	 (03Merged) 10jenkins-bot: Log ClosedWikiProvider's start with info level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615713 (https://phabricator.wikimedia.org/T258695) (owner: 10Urbanecm)
[11:48:10] <marostegui>	 !log Deploy MCR schema change on db1145:3314
[11:48:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:56] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 745ff20f53e4914cf6e1717c963419e74b68e693: Log ClosedWikiProviders start with info level (T258695) (duration: 01m 05s)
[11:50:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:50:01] <stashbot>	 T258695: Investigate why ClosedWikiProvider doesn't work - https://phabricator.wikimedia.org/T258695
[11:50:14] <wikibugs>	 (03PS1) 10Marostegui: install_server: Do not reimage db1107 [puppet] - 10https://gerrit.wikimedia.org/r/615715
[11:51:12] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1107 [puppet] - 10https://gerrit.wikimedia.org/r/615715 (owner: 10Marostegui)
[11:51:38] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.009 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[11:53:25] <wikibugs>	 (03CR) 10Privacybatm: "Thank you for the review! please see my reply." (031 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/613128 (https://phabricator.wikimedia.org/T257602) (owner: 10Privacybatm)
[11:55:49] <wikibugs>	 10Operations: stop using $::site in description field of service.yaml - https://phabricator.wikimedia.org/T258697 (10CDanis)
[11:56:06] <wikibugs>	 (03CR) 10JMeybohm: ""helmfile template" fails for me with:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto)
[12:00:12] <wikibugs>	 (03PS1) 10Jbond: idp: add thanos service [puppet] - 10https://gerrit.wikimedia.org/r/615718
[12:00:36] <Urbanecm>	 !log Staging at mwdebug1001
[12:00:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:04] <wikibugs>	 (03PS2) 10Privacybatm: Transferer.py: Replace options['port'] usage with `port` local variable [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601)
[12:02:11] <wikibugs>	 (03PS3) 10Jbond: idp: enable ipv6 for IDP role [puppet] - 10https://gerrit.wikimedia.org/r/612824
[12:02:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] idp: add thanos service [puppet] - 10https://gerrit.wikimedia.org/r/615718 (owner: 10Jbond)
[12:02:39] <wikibugs>	 (03PS8) 10ArielGlenn: rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - 10https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856)
[12:02:59] <Urbanecm>	 !log Staging at mwdebug1001 ended, run scap pull to clean changes
[12:03:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] idp: enable ipv6 for IDP role [puppet] - 10https://gerrit.wikimedia.org/r/612824 (owner: 10Jbond)
[12:03:02] <wikibugs>	 (03PS4) 10Jbond: idp: enable ipv6 for IDP role [puppet] - 10https://gerrit.wikimedia.org/r/612824
[12:03:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:04] <wikibugs>	 (03PS2) 10Privacybatm: Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601)
[12:03:06] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] idp: enable ipv6 for IDP role [puppet] - 10https://gerrit.wikimedia.org/r/612824 (owner: 10Jbond)
[12:03:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm)
[12:04:02] <wikibugs>	 (03CR) 10Jcrespo: "Looks good. No comments except final testing on my side to validate it." [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm)
[12:04:17] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+2] rename the dump rsyncer script preparing for new one that rsyncs via secondary [puppet] - 10https://gerrit.wikimedia.org/r/614826 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn)
[12:06:36] <wikibugs>	 (03CR) 10Ayounsi: "I don't know enough to review that, but 2 notes:" [puppet] - 10https://gerrit.wikimedia.org/r/615670 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[12:06:38] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] transferpy: Change tox development environment to Python3.7 [software/transferpy] - 10https://gerrit.wikimedia.org/r/615688 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[12:07:08] <wikibugs>	 (03Merged) 10jenkins-bot: transferpy: Change tox development environment to Python3.7 [software/transferpy] - 10https://gerrit.wikimedia.org/r/615688 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[12:07:33] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[12:08:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[12:08:32] <wikibugs>	 (03CR) 10Jcrespo: "Give me some time to think of a proper alternative and I will answer you soon." [software/transferpy] - 10https://gerrit.wikimedia.org/r/614686 (https://phabricator.wikimedia.org/T256450) (owner: 10Privacybatm)
[12:09:36] <wikibugs>	 (03PS1) 10Jbond: lvs - thanos-query: update to use port 443 instead of port 80 [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009)
[12:12:11] <wikibugs>	 (03CR) 10Jcrespo: "Did I merge things in the wrong order?" [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[12:12:34] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.005 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:14:02] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe2002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.152 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:14:27] <wikibugs>	 (03CR) 10CDanis: "In addition to this stuff you'll also 1) need Pybal restarts and 2) be willing to tolerate some downtime (if you aren't, you'll have to se" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond)
[12:15:55] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: ganeti: Fix ganeti-mond typo [puppet] - 10https://gerrit.wikimedia.org/r/615676
[12:15:56] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe2001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.154 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:17:05] <wikibugs>	 (03PS2) 10ArielGlenn: script for rsyncing dumps via secondary storage server [puppet] - 10https://gerrit.wikimedia.org/r/614839 (https://phabricator.wikimedia.org/T254856)
[12:17:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] script for rsyncing dumps via secondary storage server [puppet] - 10https://gerrit.wikimedia.org/r/614839 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn)
[12:17:35] <Urbanecm>	 !log Staging at mwdebug1001 again
[12:17:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:30] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.009 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:19:33] <wikibugs>	 (03CR) 10Marostegui: "https://puppet-compiler.wmflabs.org/compiler1002/24062/db1080.eqiad.wmnet/index.html -> aren't those used for https://grafana.wikimedia.or" [puppet] - 10https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879) (owner: 10Kormat)
[12:20:14] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on thanos-fe2003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 1.154 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[12:20:30] <icinga-wm>	 RECOVERY - MD RAID on restbase-dev1004 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[12:20:56] <wikibugs>	 10Operations, 10ops-codfw, 10RESTBase: restbase2009 down - https://phabricator.wikimedia.org/T256863 (10akosiaris) >>! In T256863#6320727, @Papaul wrote: > @Eevans @akosiaris  we have 2 spare on site that we can use to replace this server  > wmf6413 and wmf6414 in netbox both servers are: > HP ProLiant DL360...
[12:20:58] <wikibugs>	 (03PS1) 10Jbond: idp - librenms:  only allow librenms-readers and ops groups for librenms [puppet] - 10https://gerrit.wikimedia.org/r/615721
[12:21:11] <wikibugs>	 10Operations, 10MassMessage, 10MediaWiki-JobQueue, 10Platform Engineering: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Ottomata) @Joe I don't have full context of how MassMessageJob and JobQueue work here, but at the very least it seems we are able to save th...
[12:21:21] <Urbanecm>	 !log Staging at mwdebug1001 ended, run scap pull to clean changes
[12:21:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] idp - librenms:  only allow librenms-readers and ops groups for librenms [puppet] - 10https://gerrit.wikimedia.org/r/615721 (owner: 10Jbond)
[12:23:11] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10hnowlan) a:05Cmjohnson→03hnowlan
[12:23:15] <XioNoX>	 !log remove bogus lo0 IPs from cr3-knams
[12:23:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:49] <wikibugs>	 10Operations, 10ops-eqiad: Degraded RAID on restbase-dev1004 - https://phabricator.wikimedia.org/T253607 (10hnowlan) Thanks @Jclark-ctr ! We'll need to rebuild the raid0 that the cassandra storage is located upon.
[12:23:50] <wikibugs>	 (03PS1) 10Urbanecm: ClosedWikiProvider: Use testUserForCreation rather than testForAuthentication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695)
[12:24:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=idp site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:24:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] ClosedWikiProvider: Use testUserForCreation rather than testForAuthentication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) (owner: 10Urbanecm)
[12:24:47] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] librenms: add passive server for rsync server [puppet] - 10https://gerrit.wikimedia.org/r/615474 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[12:24:54] <wikibugs>	 (03PS3) 10Filippo Giunchedi: librenms: add passive server for rsync server [puppet] - 10https://gerrit.wikimedia.org/r/615474 (https://phabricator.wikimedia.org/T247967)
[12:25:13] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/615670 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[12:25:27] <wikibugs>	 (03PS1) 10Muehlenhoff: profile::superset: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/615724
[12:26:00] <wikibugs>	 (03PS2) 10Urbanecm: ClosedWikiProvider: Use testUserForCreation rather than testForAuthentication [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695)
[12:26:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:26:23] <wikibugs>	 (03PS1) 10Marostegui: dbproxy1019: Decrease a bit labsdb1009 weight [puppet] - 10https://gerrit.wikimedia.org/r/615725
[12:26:59] <wikibugs>	 (03CR) 10Urbanecm: "Hello James, I'd appreciate your review here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615723 (https://phabricator.wikimedia.org/T258695) (owner: 10Urbanecm)
[12:27:01] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] dbproxy1019: Decrease a bit labsdb1009 weight [puppet] - 10https://gerrit.wikimedia.org/r/615725 (owner: 10Marostegui)
[12:28:10] <wikibugs>	 (03CR) 10Kormat: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/615476 (https://phabricator.wikimedia.org/T256879) (owner: 10Kormat)
[12:28:39] <wikibugs>	 10Operations, 10Analytics, 10User-MoritzMuehlenhoff: Replace firejail use in superset with native systemd features - https://phabricator.wikimedia.org/T258700 (10MoritzMuehlenhoff)
[12:29:13] <marostegui>	 !log Decrease labsdb1009 weight a bit, as it is lagging again.
[12:29:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:00] <wikibugs>	 (03PS2) 10Jbond: lvs - thanos-query: update to use port 443 instead of port 80 [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009)
[12:30:02] <wikibugs>	 (03PS1) 10Jbond: thanos-query - lvs: update service to monitoring_setup while we update [puppet] - 10https://gerrit.wikimedia.org/r/615726
[12:30:20] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 38741432 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:31:46] <wikibugs>	 (03PS1) 10Jbond: thanos-query - lvs: update service to production state [puppet] - 10https://gerrit.wikimedia.org/r/615727
[12:32:10] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 872 and 84 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:32:44] <wikibugs>	 (03PS3) 10Privacybatm: Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601)
[12:33:22] <wikibugs>	 (03PS1) 10Esanders: Fix VE-RealTime CSP entry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615728
[12:34:38] <wikibugs>	 (03PS3) 10Jbond: lvs - thanos-query: update to use port 443 instead of port 80 [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009)
[12:35:36] <wikibugs>	 (03CR) 10Jbond: "thanks, updated" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond)
[12:35:38] <wikibugs>	 (03CR) 10Privacybatm: "> Patch Set 9:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[12:36:32] <wikibugs>	 (03PS2) 10Jbond: thanos-query - lvs: update service to production state [puppet] - 10https://gerrit.wikimedia.org/r/615727
[12:40:38] <wikibugs>	 (03PS10) 10Privacybatm: Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600)
[12:41:29] <wikibugs>	 (03CR) 10Privacybatm: "> Patch Set 9:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[12:41:37] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] kubernetes: add namespace for api-gateway [puppet] - 10https://gerrit.wikimedia.org/r/615521 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[12:42:46] <wikibugs>	 (03Restored) 10Privacybatm: transferpy: Package transferpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm)
[12:43:15] <wikibugs>	 (03Abandoned) 10Privacybatm: transferpy: Package transferpy [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/598984 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm)
[12:44:37] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "Thanks everyone for the reviews! Going ahead with it now and we can tackle the refactor after the Buster migration" [puppet] - 10https://gerrit.wikimedia.org/r/615670 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[12:44:40] <wikibugs>	 (03Abandoned) 10Privacybatm: [POC1 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614745 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm)
[12:44:43] <wikibugs>	 (03PS2) 10Filippo Giunchedi: smokeping: match documentroot with smokeping installation [puppet] - 10https://gerrit.wikimedia.org/r/615670 (https://phabricator.wikimedia.org/T247967)
[12:45:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] Add discovery and disabled LVS components for API gateway (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615512 (https://phabricator.wikimedia.org/T254908) (owner: 10Hnowlan)
[12:45:17] <wikibugs>	 (03Abandoned) 10Privacybatm: [POC2 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/614744 (https://phabricator.wikimedia.org/T257601) (owner: 10Privacybatm)
[12:48:38] <wikibugs>	 (03PS3) 10ArielGlenn: script for rsyncing dumps via secondary storage server [puppet] - 10https://gerrit.wikimedia.org/r/614839 (https://phabricator.wikimedia.org/T254856)
[12:49:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] script for rsyncing dumps via secondary storage server [puppet] - 10https://gerrit.wikimedia.org/r/614839 (https://phabricator.wikimedia.org/T254856) (owner: 10ArielGlenn)
[12:49:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Quick q before I review. Does the WIP in the commit message still hold, or is this ready for review?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[12:52:58] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[12:54:58] <wikibugs>	 (03PS3) 10Privacybatm: Transferer.py: Replace options['port'] usage with `port` local variable [software/transferpy] - 10https://gerrit.wikimedia.org/r/615173 (https://phabricator.wikimedia.org/T257601)
[12:55:00] <wikibugs>	 (03PS4) 10Privacybatm: Firewall.py: Save the target port after reservation [software/transferpy] - 10https://gerrit.wikimedia.org/r/615174 (https://phabricator.wikimedia.org/T257601)
[12:55:02] <wikibugs>	 (03PS6) 10Privacybatm: [POC3 WIP] transferpy: Multiprocess the transfers [software/transferpy] - 10https://gerrit.wikimedia.org/r/615179 (https://phabricator.wikimedia.org/T257601)
[12:55:30] <cdanis>	 ok who made icinga sad
[12:58:13] <cdanis>	 akosiaris: Jul 23 12:46:56 icinga1001 icinga[75759]: Error: Could not find any host matching 'chartmuseum2001.codfw.wmnet' (config file '/etc/nagios/nagios_service.cfg', starting on line 23985)
[12:58:15] <cdanis>	 Jul 23 12:46:56 icinga1001 icinga[75759]: Error: Could not expand hostgroups and/or hosts specified in service (config file '/etc/nagios/nagios_service.cfg', starting on line 23985)
[12:58:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall! Other clients that will need changing are all in puppet, namely the grafana datasource and modules/profile/manifests/thanos/" [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond)
[12:58:36] <wikibugs>	 (03CR) 10Privacybatm: [C: 03+1] Transferer.py: Add unit tests [software/transferpy] - 10https://gerrit.wikimedia.org/r/613024 (https://phabricator.wikimedia.org/T257600) (owner: 10Privacybatm)
[13:00:04] <jouncebot>	 longma and liw: (Dis)respected human, time to deploy Mediawiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T1300). Please do the needful.
[13:01:22] <wikibugs>	 (03PS1) 10Filippo Giunchedi: librenms: create/update users when using SSO [puppet] - 10https://gerrit.wikimedia.org/r/615730 (https://phabricator.wikimedia.org/T247967)
[13:01:24] <wikibugs>	 (03PS1) 10Filippo Giunchedi: librenms: set bootstrap/cache permissions [puppet] - 10https://gerrit.wikimedia.org/r/615731 (https://phabricator.wikimedia.org/T247967)
[13:01:26] <wikibugs>	 (03PS1) 10Filippo Giunchedi: role: force mpm_prefork for netmon/librenms [puppet] - 10https://gerrit.wikimedia.org/r/615732 (https://phabricator.wikimedia.org/T247967)
[13:02:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] librenms: create/update users when using SSO [puppet] - 10https://gerrit.wikimedia.org/r/615730 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[13:03:34] <wikibugs>	 (03PS1) 10Filippo Giunchedi: profile: move thanos-query clients to https [puppet] - 10https://gerrit.wikimedia.org/r/615733 (https://phabricator.wikimedia.org/T151009)
[13:05:30] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] profile: move thanos-query clients to https [puppet] - 10https://gerrit.wikimedia.org/r/615733 (https://phabricator.wikimedia.org/T151009) (owner: 10Filippo Giunchedi)
[13:05:33] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] lvs - thanos-query: update to use port 443 instead of port 80 [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond)
[13:08:23] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Disable backports on stretch for production [puppet] - 10https://gerrit.wikimedia.org/r/613611 (https://phabricator.wikimedia.org/T256877) (owner: 10Muehlenhoff)
[13:09:50] <wikibugs>	 10Operations, 10netops, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Migrate role::netmon to Buster - https://phabricator.wikimedia.org/T247967 (10fgiunchedi)
[13:11:36] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/615734
[13:18:47] <wikibugs>	 (03PS4) 10ArielGlenn: script for rsyncing dumps via secondary storage server [puppet] - 10https://gerrit.wikimedia.org/r/614839 (https://phabricator.wikimedia.org/T254856)
[13:20:54] <akosiaris>	 cdanis: looking
[13:22:49] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch Superset to CAS (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/615736
[13:24:49] <akosiaris>	 ah, the codfw.wmnet thing. /me fixing
[13:26:20] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: ganeti: Fix ganeti-mond typo [puppet] - 10https://gerrit.wikimedia.org/r/615676
[13:26:22] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: chartmuseum: Skip the domain names in service [puppet] - 10https://gerrit.wikimedia.org/r/615737
[13:26:34] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti: Fix ganeti-mond typo [puppet] - 10https://gerrit.wikimedia.org/r/615676 (owner: 10Alexandros Kosiaris)
[13:26:43] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] chartmuseum: Skip the domain names in service [puppet] - 10https://gerrit.wikimedia.org/r/615737 (owner: 10Alexandros Kosiaris)
[13:29:43] <wikibugs>	 (03CR) 10Hnowlan: "> Patch Set 20:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[13:32:13] <wikibugs>	 (03CR) 10Hnowlan: [V: 03+2 C: 03+2] ratelimit: add new docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615168 (https://phabricator.wikimedia.org/T254907) (owner: 10Hnowlan)
[13:34:59] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: flip netmon2001 back to ldap auth [puppet] - 10https://gerrit.wikimedia.org/r/615738 (https://phabricator.wikimedia.org/T247967)
[13:35:02] <xover>	 I'm seeing multiple onwiki reports of problems related to Commons, that may be unrelated but kinda smell of a common infrastructure problem somewhere.
[13:35:21] <wikibugs>	 10Operations: stop using $::site in description field of service.yaml - https://phabricator.wikimedia.org/T258697 (10Joe) p:05Triage→03Low that description is used in a comment in pybal, where the $::site is evaluated correctly.  The problem here is that it was reused to make the icinga alerts unique AIUI. b...
[13:35:33] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[13:35:44] <akosiaris>	 cdanis: ^
[13:35:46] <akosiaris>	 fixed
[13:36:07] <cdanis>	 :) ty
[13:36:10] <xover>	 Fæ reported trouble getting members of a category from the API: https://commons.wikimedia.org/wiki/Commons:Village_pump/Technical#Please_help_prioritize_the_Commons_API_"error_500"_bug_on_searches_and_category_queries
[13:36:38] <xover>	 Multiple reports that FileImporter fails: https://commons.wikimedia.org/wiki/Commons:Village_pump/Technical#File_importer_is_broken
[13:36:43] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns5001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-helm-charts.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:37:26] <xover>	 (imports tried from enWS, jaWP, and a third by a Chinese user)
[13:37:57] <xover>	 And then there was a report of the ia-upload tool failing an upload.
[13:37:58] <wikibugs>	 (03Abandoned) 10Filippo Giunchedi: librenms: create/update users when using SSO [puppet] - 10https://gerrit.wikimedia.org/r/615730 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[13:38:56] <xover>	 (ia-upload, for those not aware, is a toolforge tool that grabs book scans from the Internet Archive and uploads them to Commons)
[13:38:57] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns5002 is CRITICAL: File not found: /var/lib/gdnsd/discovery-helm-charts.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:40:12] <xover>	 I believe a manual/UploadWizard upload of the same ~300MB PDF file (i.e. a chunked upload) also failed, but I haven't tested that myself.
[13:41:17] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns2001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-helm-charts.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:41:17] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns2002 is CRITICAL: File not found: /var/lib/gdnsd/discovery-helm-charts.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:41:51] <xover>	 Fæ's API problem is apparently a couple of weeks old and ongoing, while the two upload/import problems look like they may have started yesterday.
[13:42:03] <wikibugs>	 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, 10Patch-For-Review: TY pages in a subdomain of wikipedia and set hide banner cookie - https://phabricator.wikimedia.org/T251780 (10Krinkle) 05Resolved→03Open
[13:42:33] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: discovery: Add helm-charts discovery stanzas [puppet] - 10https://gerrit.wikimedia.org/r/615739
[13:42:37] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch Superset to CAS (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/615736
[13:43:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Switch Superset to CAS (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/615736 (owner: 10Muehlenhoff)
[13:44:48] <akosiaris>	 xover: Allowed memory size of 698351616 bytes exhausted (tried to allocate 6128792 bytes) doesn't look very promising. That's a lot of memory
[13:45:04] <cdanis>	 akosiaris: it sounds like that's new behavior on unchanged API calls, though
[13:45:41] <akosiaris>	 could also be new larger uploads. Difficult to say at this point
[13:45:42] <wikibugs>	 (03PS4) 10Vidhi-Mody: Selenium: Update to WebdriverIO v5 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614829 (https://phabricator.wikimedia.org/T255471)
[13:45:57] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns1001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-helm-charts.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:45:59] <icinga-wm>	 PROBLEM - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns4001 is CRITICAL: File not found: /var/lib/gdnsd/discovery-helm-charts.state https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:46:04] <akosiaris>	 we can try subscribing a few people to the task, they might have some more insight 
[13:46:09] <cdanis>	 akosiaris: does this seem like a thing CPT could/should look at?
[13:46:21] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] discovery: Add helm-charts discovery stanzas [puppet] - 10https://gerrit.wikimedia.org/r/615739 (owner: 10Alexandros Kosiaris)
[13:46:23] <wikibugs>	 (03PS3) 10Muehlenhoff: Switch Superset to CAS (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/615736
[13:47:24] <akosiaris>	 cdanis: I think so
[13:47:59] <cdanis>	 excuse me, I guess I mean 'Platform Engineering'?
[13:48:06] <cdanis>	 did they have a separate workboard for their clinic duty?
[13:48:14] <akosiaris>	 multiple ones IIRC
[13:48:18] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns5001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:48:19] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:48:22] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns4001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:48:28] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns5002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:48:34] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:48:34] <icinga-wm>	 RECOVERY - Confd template for /var/lib/gdnsd/discovery-helm-charts.state on dns2002 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring
[13:49:13] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=.*
[13:49:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:58] <xover>	 I know metadata handling for PDF and DjVu files is suboptimal: the OCR text layer, wrapped in an XML structure, is stored in DB fields along with things like dates, dpi, and x / y resolution.
[13:51:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] "Should be good to go now" [dns] - 10https://gerrit.wikimedia.org/r/609165 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm)
[13:52:54] <cdanis>	 xover: hope you don't mind but I quoted you on https://phabricator.wikimedia.org/T255981, and put it on CPT's radar
[13:53:28] <xover>	 cdanis: thanks!
[13:55:36] <wikibugs>	 (03PS1) 10Jbond: puppetboard: increase buffer size [puppet] - 10https://gerrit.wikimedia.org/r/615742
[13:57:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24096/" [puppet] - 10https://gerrit.wikimedia.org/r/615742 (owner: 10Jbond)
[13:58:51] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/615742 (owner: 10Jbond)
[14:00:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/615738 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[14:00:52] <icinga-wm>	 PROBLEM - ganeti-metad running on ganeti3003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-metad https://wikitech.wikimedia.org/wiki/Ganeti
[14:02:36] <wikibugs>	 10Operations, 10ops-eqiad: please connect eqiad's RIPE Atlas anchor to one of the SCSes - https://phabricator.wikimedia.org/T258221 (10RobH)
[14:03:13] <wikibugs>	 (03PS1) 10Ema: Add missing field: uri_query [software/atskafka] - 10https://gerrit.wikimedia.org/r/615744 (https://phabricator.wikimedia.org/T254317)
[14:03:19] <icinga-wm>	 PROBLEM - ganeti-metad running on ganeti2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-metad https://wikitech.wikimedia.org/wiki/Ganeti
[14:04:39] <cdanis>	 akosiaris: do you know what's up with metad?
[14:05:02] <akosiaris>	 hmm
[14:05:50] <akosiaris>	 quite possibly eventual consistency. I had to rename that check
[14:05:56] * akosiaris verifying
[14:06:18] <wikibugs>	 10Operations, 10ops-eqsin, 10netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (10RobH)
[14:06:24] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto)
[14:07:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: flip netmon2001 back to ldap auth [puppet] - 10https://gerrit.wikimedia.org/r/615738 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[14:07:03] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "> Patch Set 2:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto)
[14:07:06] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: flip netmon2001 back to ldap auth [puppet] - 10https://gerrit.wikimedia.org/r/615738 (https://phabricator.wikimedia.org/T247967)
[14:11:36] <wikibugs>	 (03PS2) 10Ema: Add missing field: uri_query [software/atskafka] - 10https://gerrit.wikimedia.org/r/615744 (https://phabricator.wikimedia.org/T254317)
[14:12:51] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:13:33] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Add missing field: uri_query [software/atskafka] - 10https://gerrit.wikimedia.org/r/615744 (https://phabricator.wikimedia.org/T254317) (owner: 10Ema)
[14:13:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:22:33] <wikibugs>	 (03PS4) 10Muehlenhoff: Add CAS support to Superset (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/615736
[14:23:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] role: force mpm_prefork for netmon/librenms [puppet] - 10https://gerrit.wikimedia.org/r/615732 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[14:23:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] librenms: set bootstrap/cache permissions [puppet] - 10https://gerrit.wikimedia.org/r/615731 (https://phabricator.wikimedia.org/T247967) (owner: 10Filippo Giunchedi)
[14:23:37] <wikibugs>	 (03PS2) 10Filippo Giunchedi: librenms: set bootstrap/cache permissions [puppet] - 10https://gerrit.wikimedia.org/r/615731 (https://phabricator.wikimedia.org/T247967)
[14:24:53] <icinga-wm>	 PROBLEM - ganeti-metad running on ganeti5003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-metad https://wikitech.wikimedia.org/wiki/Ganeti
[14:25:02] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/615734 (owner: 10Muehlenhoff)
[14:25:28] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] profile::superset: Remove stretch support [puppet] - 10https://gerrit.wikimedia.org/r/615724 (owner: 10Muehlenhoff)
[14:25:50] <wikibugs>	 (03PS2) 10Filippo Giunchedi: role: force mpm_prefork for netmon/librenms [puppet] - 10https://gerrit.wikimedia.org/r/615732 (https://phabricator.wikimedia.org/T247967)
[14:25:59] <icinga-wm>	 PROBLEM - librenms.wikimedia.org requires authentication on netmon2001 is CRITICAL: connect to address 208.80.153.110 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[14:29:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove unused role [puppet] - 10https://gerrit.wikimedia.org/r/615734 (owner: 10Muehlenhoff)
[14:34:25] <wikibugs>	 (03PS5) 10Muehlenhoff: Add CAS support to Superset [puppet] - 10https://gerrit.wikimedia.org/r/615736
[14:36:56] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable CAS for Superset [puppet] - 10https://gerrit.wikimedia.org/r/615754
[14:38:46] <icinga-wm>	 PROBLEM - ganeti-metad running on ganeti4003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), command name ganeti-metad https://wikitech.wikimedia.org/wiki/Ganeti
[14:41:37] <wikibugs>	 (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/24098/" [puppet] - 10https://gerrit.wikimedia.org/r/615736 (owner: 10Muehlenhoff)
[14:47:43] <wikibugs>	 (03PS2) 10Ayounsi: Routers interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/613641
[14:48:17] <wikibugs>	 (03PS2) 10Ayounsi: Add routers interfaces support to wmf-netbox plugin [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/613642
[14:48:59] <wikibugs>	 (03PS6) 10Muehlenhoff: Add CAS support to Superset [puppet] - 10https://gerrit.wikimedia.org/r/615736
[14:50:57] <wikibugs>	 (03CR) 10Ayounsi: "This change is ready for review." [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/613642 (owner: 10Ayounsi)
[14:51:50] <wikibugs>	 (03CR) 10Ayounsi: "This change is ready for review." [homer/public] - 10https://gerrit.wikimedia.org/r/613641 (owner: 10Ayounsi)
[14:51:52] <wikibugs>	 (03PS1) 10Elukey: druid: allow different package/class prefixes for logging/alarming [puppet] - 10https://gerrit.wikimedia.org/r/615759 (https://phabricator.wikimedia.org/T244482)
[14:52:38] <wikibugs>	 (03PS1) 10Filippo Giunchedi: smokeping: default to active/active [puppet] - 10https://gerrit.wikimedia.org/r/615760 (https://phabricator.wikimedia.org/T258675)
[14:57:08] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] druid: allow different package/class prefixes for logging/alarming [puppet] - 10https://gerrit.wikimedia.org/r/615759 (https://phabricator.wikimedia.org/T244482) (owner: 10Elukey)
[14:57:12] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/24101/" [puppet] - 10https://gerrit.wikimedia.org/r/615759 (https://phabricator.wikimedia.org/T244482) (owner: 10Elukey)
[14:57:16] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 54 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:59:47] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: ganeti: Fix wconfd monitoring typo [puppet] - 10https://gerrit.wikimedia.org/r/615761
[15:01:36] <wikibugs>	 (03PS2) 10Filippo Giunchedi: smokeping: default to active/active [puppet] - 10https://gerrit.wikimedia.org/r/615760 (https://phabricator.wikimedia.org/T258675)
[15:02:56] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 566 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:03:17] <wikibugs>	 (03PS1) 10Elukey: role::druid::test_analytics::worker: set middlemanager java opts [puppet] - 10https://gerrit.wikimedia.org/r/615762 (https://phabricator.wikimedia.org/T244482)
[15:03:39] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti: Fix wconfd monitoring typo [puppet] - 10https://gerrit.wikimedia.org/r/615761 (owner: 10Alexandros Kosiaris)
[15:03:49] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::druid::test_analytics::worker: set middlemanager java opts [puppet] - 10https://gerrit.wikimedia.org/r/615762 (https://phabricator.wikimedia.org/T244482) (owner: 10Elukey)
[15:09:50] <wikibugs>	 (03CR) 10Muehlenhoff: Modernise Apache config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615459 (owner: 10Muehlenhoff)
[15:10:26] <wikibugs>	 (03PS2) 10Muehlenhoff: Modernise Apache config [puppet] - 10https://gerrit.wikimedia.org/r/615459
[15:10:37] <wikibugs>	 (03PS3) 10Filippo Giunchedi: smokeping: default to active/active [puppet] - 10https://gerrit.wikimedia.org/r/615760 (https://phabricator.wikimedia.org/T258675)
[15:14:47] <wikibugs>	 (03PS4) 10Filippo Giunchedi: smokeping: default to active/active [puppet] - 10https://gerrit.wikimedia.org/r/615760 (https://phabricator.wikimedia.org/T258675)
[15:16:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/24105/" [puppet] - 10https://gerrit.wikimedia.org/r/615760 (https://phabricator.wikimedia.org/T258675) (owner: 10Filippo Giunchedi)
[15:17:57] <wikibugs>	 (03PS21) 10Hnowlan: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906)
[15:19:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[15:21:56] <wikibugs>	 (03PS22) 10Hnowlan: api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906)
[15:23:03] <wikibugs>	 (03PS5) 10JMeybohm: Add helm-charts discovery record [dns] - 10https://gerrit.wikimedia.org/r/609165 (https://phabricator.wikimedia.org/T253843)
[15:23:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] api-gateway: Basic envoy chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/609808 (https://phabricator.wikimedia.org/T254906) (owner: 10Hnowlan)
[15:27:28] <wikibugs>	 (03PS2) 10Cmjohnson: Revert "Adding cloudcephosd servers to private vlan" [dns] - 10https://gerrit.wikimedia.org/r/615436
[15:27:31] <wikibugs>	 (03CR) 10Cmjohnson: [V: 03+2 C: 03+2] Revert "Adding cloudcephosd servers to private vlan" [dns] - 10https://gerrit.wikimedia.org/r/615436 (owner: 10Cmjohnson)
[15:32:03] <wikibugs>	 10Operations, 10User-MoritzMuehlenhoff: Stop using mod_access_compat - https://phabricator.wikimedia.org/T258686 (10MoritzMuehlenhoff)
[15:35:14] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[15:36:47] <logmsgbot>	 !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Update T250887 mitigations (duration: 01m 05s)
[15:36:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:37:27] <wikibugs>	 (03PS1) 10Cmjohnson: Addig cloudcephosd to cloud-host vlan [dns] - 10https://gerrit.wikimedia.org/r/615765 (https://phabricator.wikimedia.org/T251619)
[15:40:30] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Addig cloudcephosd to cloud-host vlan [dns] - 10https://gerrit.wikimedia.org/r/615765 (https://phabricator.wikimedia.org/T251619) (owner: 10Cmjohnson)
[15:42:36] <wikibugs>	 (03CR) 10Zfilipin: [C: 03+1] "`npm t` and `npm run selenium` pass! https://phabricator.wikimedia.org/P12029" [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/614829 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody)
[15:42:49] <wikibugs>	 (03CR) 10Volans: "Few comments inline, pure from the python PoV" (035 comments) [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/613642 (owner: 10Ayounsi)
[15:50:15] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ayounsi) Current status on the switches side is that vlans (cloud-hosts + cloud-storage) a...
[15:54:28] <wikibugs>	 (03CR) 10JMeybohm: "> > you need to explicitly indicate the environment:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto)
[15:58:10] <wikibugs>	 (03PS1) 10Volans: GC: fix reported counter [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615788
[15:58:19] <wikibugs>	 (03CR) 10Volans: GC: fix reported counter (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615788 (owner: 10Volans)
[15:58:52] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: ganeti: Remove metad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/615789
[15:59:14] <wikibugs>	 (03PS1) 10Cmjohnson: Adding the mgmt dns entries created by netbox to dns file (not yet automated) [dns] - 10https://gerrit.wikimedia.org/r/615790 (https://phabricator.wikimedia.org/T251619)
[15:59:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] GC: fix reported counter [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615788 (owner: 10Volans)
[16:00:04] <jouncebot>	 godog and _joe_: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T1600).
[16:00:24] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] ganeti: Remove metad from monitoring [puppet] - 10https://gerrit.wikimedia.org/r/615789 (owner: 10Alexandros Kosiaris)
[16:01:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] remove gerrit.wmfusercontent.org [dns] - 10https://gerrit.wikimedia.org/r/615557 (https://phabricator.wikimedia.org/T191183) (owner: 10Dzahn)
[16:01:56] <wikibugs>	 (03PS2) 10Dzahn: remove gerrit.wmfusercontent.org [dns] - 10https://gerrit.wikimedia.org/r/615557 (https://phabricator.wikimedia.org/T191183)
[16:02:51] <wikibugs>	 (03CR) 10Cmjohnson: [C: 03+2] Adding the mgmt dns entries created by netbox to dns file (not yet automated) [dns] - 10https://gerrit.wikimedia.org/r/615790 (https://phabricator.wikimedia.org/T251619) (owner: 10Cmjohnson)
[16:07:24] <wikibugs>	 (03CR) 10Dzahn: [V: 03+2 C: 03+2] remove gerrit.wmfusercontent.org [dns] - 10https://gerrit.wikimedia.org/r/615557 (https://phabricator.wikimedia.org/T191183) (owner: 10Dzahn)
[16:07:29] <wikibugs>	 (03PS3) 10Dzahn: remove gerrit.wmfusercontent.org [dns] - 10https://gerrit.wikimedia.org/r/615557 (https://phabricator.wikimedia.org/T191183)
[16:09:00] <wikibugs>	 (03PS2) 10Volans: GC: fix reported counter [software/debmonitor] - 10https://gerrit.wikimedia.org/r/615788
[16:11:08] <wikibugs>	 (03PS1) 10CDanis: secure.wm.o: tighten redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615792
[16:11:30] <wikibugs>	 (03CR) 10CDanis: "Manually tested with httpbb on mwdebug1002." [puppet] - 10https://gerrit.wikimedia.org/r/615792 (owner: 10CDanis)
[16:11:54] <wikibugs>	 (03PS2) 10CDanis: secure.wm.o: tighten redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615792 (https://phabricator.wikimedia.org/T151977)
[16:13:11] <wikibugs>	 (03PS1) 10Jbond: python3: add tox checks for python3 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/615793
[16:13:12] <wikibugs>	 10Operations, 10netbox: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10ayounsi) No idea if it's useful here but came across https://github.com/jeremyschulman/netbox-plugin-auth-saml2
[16:14:06] <wikibugs>	 (03CR) 10Jbond: "Ready for at least a first pass." [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/615793 (owner: 10Jbond)
[16:14:11] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] secure.wm.o: tighten redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615792 (https://phabricator.wikimedia.org/T151977) (owner: 10CDanis)
[16:15:13] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] secure.wm.o: tighten redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615792 (https://phabricator.wikimedia.org/T151977) (owner: 10CDanis)
[16:25:45] <wikibugs>	 (03PS1) 10CDanis: httpbb: secure.wm.o: test the tightened redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615795 (https://phabricator.wikimedia.org/T151977)
[16:28:48] <wikibugs>	 (03PS1) 10Dzahn: phabricator: set aphlict to disabled in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/615796
[16:30:49] <wikibugs>	 (03PS27) 10Ryan Kemper: [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse)
[16:31:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse)
[16:33:19] <wikibugs>	 (03PS28) 10Ryan Kemper: [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse)
[16:34:24] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] [wdqs] add a new streaming updater profile [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse)
[16:35:42] <wikibugs>	 (03PS1) 10Dzahn: ATS: add backend for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/615797
[16:36:29] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] "pcc looks fine: https://puppet-compiler.wmflabs.org/compiler1003/24106/" [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse)
[16:36:53] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] "Optional: These support comments (# syntax) so you could add one here, either explaining the vulnerability or just listing the task number" [puppet] - 10https://gerrit.wikimedia.org/r/615795 (https://phabricator.wikimedia.org/T151977) (owner: 10CDanis)
[16:37:52] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "This is probably okay to go in now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615263 (owner: 10Lucas Werkmeister (WMDE))
[16:38:06] <wikibugs>	 (03PS2) 10CDanis: httpbb: secure.wm.o: test the tightened redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615795 (https://phabricator.wikimedia.org/T151977)
[16:38:50] <wikibugs>	 (03PS1) 10CDanis: ATS: force cache revalidation on secure.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/615799 (https://phabricator.wikimedia.org/T151977)
[16:39:15] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] httpbb: secure.wm.o: test the tightened redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615795 (https://phabricator.wikimedia.org/T151977) (owner: 10CDanis)
[16:39:37] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] httpbb: secure.wm.o: test the tightened redirect rule [puppet] - 10https://gerrit.wikimedia.org/r/615795 (https://phabricator.wikimedia.org/T151977) (owner: 10CDanis)
[16:40:40] <wikibugs>	 (03PS2) 10Dzahn: visualdiff: update git branch from ruthenium to scandium [puppet] - 10https://gerrit.wikimedia.org/r/613309 (https://phabricator.wikimedia.org/T257906)
[16:42:38] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[16:52:44] <wikibugs>	 10Operations, 10Security-Team, 10Traffic, 10Wikimedia-Apache-configuration, and 3 others: Open Redirect in secure.wikimedia.org - https://phabricator.wikimedia.org/T151977 (10sbassett) 05Stalled→03Resolved a:03CDanis Thanks, @cdanis.  Looks to be fixed.  Resolving and making public.
[16:52:51] <wikibugs>	 10Operations, 10Security-Team, 10Traffic, 10Wikimedia-Apache-configuration, and 3 others: Open Redirect in secure.wikimedia.org - https://phabricator.wikimedia.org/T151977 (10sbassett)
[16:56:02] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload
[16:56:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:57:25] <logmsgbot>	 !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[16:57:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:59:35] <wikibugs>	 (03PS1) 10Vidhi-Mody: Selenium: Update to WebdriverIO v6 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615801 (https://phabricator.wikimedia.org/T255471)
[17:00:04] <jouncebot>	 halfak and accraze: #bothumor I � Unicode. All rise for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T1700).
[17:00:54] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[17:04:06] <wikibugs>	 (03CR) 10BBlack: [C: 03+1] ATS: force cache revalidation on secure.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/615799 (https://phabricator.wikimedia.org/T151977) (owner: 10CDanis)
[17:06:46] <wikibugs>	 (03CR) 10CDanis: [C: 03+2] ATS: force cache revalidation on secure.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/615799 (https://phabricator.wikimedia.org/T151977) (owner: 10CDanis)
[17:11:40] <wikibugs>	 10Operations, 10Security-Team, 10Traffic, 10Wikimedia-Apache-configuration, and 3 others: Open Redirect in secure.wikimedia.org - https://phabricator.wikimedia.org/T151977 (10ArielGlenn) 05Resolved→03Open Almost resolved, heh.
[17:16:44] <wikibugs>	 10Operations, 10Security-Team, 10Traffic, 10Wikimedia-Apache-configuration, and 3 others: Open Redirect in secure.wikimedia.org - https://phabricator.wikimedia.org/T151977 (10CDanis) 05Open→03Resolved Re-validation forced for ATS-BE, and also a Varnish cache ban has been put in place, so we should no l...
[17:22:36] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload
[17:22:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:18] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/P
[17:24:24] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[17:24:26] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/P
[17:24:57] <dcausse>	 ?
[17:25:35] <icinga-wm>	 PROBLEM - LVS wdqs codfw port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:25:53] <cdanis>	 here
[17:25:55] <rzl>	 ryankemper: ^
[17:26:16] <ryankemper>	 looking
[17:27:01] <rzl>	 here if you need anything
[17:27:02] <cdanis>	 Pybal says the `readiness-probe` endpoint is timing out after 5 seconds on all WDQS boxes
[17:27:08] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[17:27:30] <ryankemper>	 I'll probably want to de-pool these now while investigating, trying to figure out if they're all down or just a subset
[17:27:32] <cdanis>	 17:11 was the first occurrence and it got worse from there
[17:27:57] <rzl>	 "marked down but pooled" suggests it's all of them, or at least more than pybal will depool at once
[17:28:03] <ryankemper>	 ack, thanks
[17:28:07] <rzl>	 cdanis: have I got that right? ^
[17:28:13] <volans>	 is it related to the running cookbooks above?
[17:28:15] <cdanis>	 more than pybal will depool at once
[17:28:43] <wikibugs>	 (03CR) 10Ebernhardson: "pcc looks as expected: https://puppet-compiler.wmflabs.org/compiler1003/24107/an-airflow1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/615582 (owner: 10Ebernhardson)
[17:28:58] <cdanis>	 ryankemper: https://phabricator.wikimedia.org/P12032
[17:29:32] <cdanis>	 I believe 'partially up' means it isn't actually passing the readiness probe, but depooling it would lead to Pybal having too many depooled
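The "marked down but pooled" state discussed above can be illustrated with a small model. This is only a sketch of the idea, not PyBal's actual code: the function name `classify`, the threshold value of 0.5, and the order in which servers are considered are all assumptions made for illustration.

```python
from math import ceil


def classify(servers, healthy, depool_threshold):
    """Split failing servers into 'depooled' and 'down but pooled'.

    servers: ordered list of server names in the pool.
    healthy: set of names currently passing the health probe.
    depool_threshold: assumed minimum fraction of servers that must
        stay pooled, even if they are failing their probes.
    """
    min_pooled = ceil(len(servers) * depool_threshold)
    pooled = list(servers)
    depooled = []
    for name in servers:
        # Only depool a failing server while the pool stays above the floor.
        if name not in healthy and len(pooled) > min_pooled:
            pooled.remove(name)
            depooled.append(name)
    down_but_pooled = [name for name in pooled if name not in healthy]
    return depooled, down_but_pooled


if __name__ == "__main__":
    pool = ["wdqs2001", "wdqs2002", "wdqs2003"]
    # Hypothetical situation: every server fails the readiness probe.
    print(classify(pool, healthy=set(), depool_threshold=0.5))
```

In this model, once the floor is reached the remaining failing servers keep receiving traffic, which is why a site-wide failure shows up as servers that are both down and pooled.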
[17:29:43] <ryankemper>	 what's the easiest way to find puppet's last run time?
[17:29:53] <cdanis>	 ryankemper: sudo /etc/update-motd.d/97-last-puppet-run
[17:30:59] <icinga-wm>	 PROBLEM - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:31:14] <cdanis>	 ryankemper: did your change only roll out to codfw, or is it in eqiad too?
[17:31:23] <volans>	 ryankemper: either puppetboard or if you need on a lot of hosts via cumin
[17:31:25] <ryankemper>	 We merged a puppet change for some work we're doing so I was wondering if that somehow broke things, but the associated services shouldn't have restarted
[17:31:40] <cdanis>	 I'm not sure of wdqs provisioning, but it might be prudent to DNS Discovery-depool codfw for wdqs services
[17:32:40] <cdanis>	 trying it by hand, I do see the `/readiness-probe` handler time out on wdqs2003
[17:33:00] <cdanis>	 looks like it gets rewritten by nginx to a trivial sparql:     rewrite ^/readiness-probe$ /sparql?query=%20ASK%7B%20%3Fx%20%3Fy%20%3Fz%20%7D;
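The percent-encoded rewrite target quoted above is easier to read once decoded. A minimal sketch using only the Python standard library; the helper name `decode_probe_query` is hypothetical, and the target string is copied from the nginx rule in the message.

```python
from urllib.parse import parse_qs, urlsplit

# Rewrite target copied from the nginx rule quoted in the log.
REWRITE_TARGET = "/sparql?query=%20ASK%7B%20%3Fx%20%3Fy%20%3Fz%20%7D"


def decode_probe_query(target: str) -> str:
    """Return the SPARQL text carried in the `query` parameter."""
    query_string = urlsplit(target).query
    return parse_qs(query_string)["query"][0]


if __name__ == "__main__":
    # repr() makes the leading space in the decoded query visible.
    print(repr(decode_probe_query(REWRITE_TARGET)))
```

The decoded query is an `ASK` over an unconstrained triple pattern, so the probe exercises the SPARQL endpoint itself instead of just nginx.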
[17:33:01] <ryankemper>	 cdanis: the change we rolled out is for all instances but we only actually restarted services on our canary instance in eqiad
[17:33:56] <ryankemper>	 I'm realizing I don't have a great notion of what handles incoming requests for wdqs, i.e. if we have nginx in front of it or what
[17:34:03] <cdanis>	 we do
[17:34:09] <ryankemper>	 the `DNS Discovery-depool codfw for wdqs services` sounds like a good idea since we're not seeing any problems on eqiad currently
[17:34:27] <volans>	 ryankemper: was this the change?
[17:34:28] <volans>	 https://puppetboard.wikimedia.org/report/wdqs2004.codfw.wmnet/902d0affb63bd8bd9f79db2cadce5e611f1359e4
[17:34:40] <cdanis>	 ack, I don't think codfw is doing useful work right now: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&from=now-1h&to=now
[17:35:14] <ryankemper>	 volans: yes
[17:35:32] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=wdqs.*,name=codfw
[17:35:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:35:40] <ryankemper>	 also just realized I only fed eqiad nodes to the `pcc` command so that change is looking suspect
[17:35:58] <cdanis>	 notably: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=7&fullscreen&orgId=1&refresh=1m&from=now-1h&to=now
[17:36:07] <volans>	 that change was applied to wdqs2004 at 16:45:58.
[17:36:11] <cdanis>	 I'm guessing from this graph that the codfw nodes stopped being able to count how many triples they're serving once puppet ran
[17:36:41] <ryankemper>	 Okay well first things first let's get that change backed out
[17:36:47] <ryankemper>	 Working on a revert patch
[17:37:34] <dcausse>	 ryankemper: did you find the cause?
[17:37:55] * gehel is just back, need any help?
[17:38:18] <ryankemper>	 not specifically but `cdanis` noted that codfw's triple graphite metrics aren't coming through anymore
[17:38:26] <ryankemper>	 and the problem started when the automated puppet run occurred on codfw
[17:38:31] <dcausse>	 blazegraph is stuck on these nodes
[17:38:52] <cdanis>	 codfw will be DNS-Discovery-depooled in another 2 minutes btw (5 minute TTL)
[17:39:20] <ryankemper>	 okay, probably not worth reverting given the 2 mins
[17:39:33] <ryankemper>	 to dcausse's point here's logs for blazegraph on 2003 https://www.irccloud.com/pastebin/QlLehAS8/
[17:40:39] <dcausse>	 curl -d 'query=SELECT * WHERE {?s ?p ?o . } limit 1&format=json' http://localhost:9999/bigdata/sparql is not responding on the few codfw nodes I tried
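A check like the curl command above hangs when the endpoint is stuck, so a scripted version would want an explicit timeout. The following is a rough sketch, assuming the same local endpoint and query as in the message; the function names and the 5-second default are hypothetical. Unlike `curl -d`, it URL-encodes the form body.

```python
import urllib.request
from urllib.parse import urlencode

# Endpoint and query taken from the curl command in the log.
ENDPOINT = "http://localhost:9999/bigdata/sparql"
QUERY = "SELECT * WHERE {?s ?p ?o . } limit 1"


def build_request(endpoint=ENDPOINT, query=QUERY):
    """Build a form-encoded POST equivalent to the curl invocation."""
    body = urlencode({"query": query, "format": "json"}).encode("ascii")
    return urllib.request.Request(endpoint, data=body, method="POST")


def responds(request, timeout_s=5.0):
    """Return True if the endpoint answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers URLError/HTTPError, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    print(responds(build_request()))
```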
[17:41:26] <ryankemper>	 gehel: tldr is all of wdqs codfw is down, eqiad seems totally fine, and it's presumably related to the application of https://puppetboard.wikimedia.org/report/wdqs2004.codfw.wmnet/902d0affb63bd8bd9f79db2cadce5e611f1359e4
[17:41:46] <dcausse>	 how can I see ^ ?
[17:42:15] <gehel>	 dcausse: https://gerrit.wikimedia.org/r/c/operations/puppet/+/615795
[17:43:00] <cdanis>	 dcausse: puppetboard is ops-only because it can have prod auth secrets recorded in it by accident, but, here's the diff that was applied here: https://phabricator.wikimedia.org/P12033
[17:43:38] <gehel>	 ryankemper: are you reverting that change? need any help?
[17:44:02] <cdanis>	 gehel: codfw is no longer serving wdqs requests (dnsdisc-depooled) so I think we're working on a fix instead of a revert
[17:44:11] <ryankemper>	 I can revert, I wasn't sure if we wanted to given that we're not serving requests from codfw
[17:44:16] <ryankemper>	 ^
[17:44:31] <dcausse>	 I'd like to debug a few things first
[17:44:38] <dcausse>	 blazegraph has not been restarted
[17:44:47] <ryankemper>	 Yup, go ahead
[17:44:48] <dcausse>	 only the main endpoint is stuck
[17:45:03] <gehel>	 ryankemper: make sure you have the revert ready, just in case
[17:45:12] <ryankemper>	 good point, will open up a patch
[17:45:40] <gehel>	 ryankemper: you can create a revert directly from the gerrit UI
[17:45:49] <dcausse>	 wdqs2008 is fine
[17:45:53] <gehel>	 I don't see anything obviously wrong in https://phabricator.wikimedia.org/P12033
[17:46:21] <cdanis>	 the diff that was applied in eqiad looks similar
[17:46:30] <dcausse>	 it's not all codfw, I hope it's not T242453 across all the codfw fleet...
[17:46:31] <stashbot>	 T242453: Deadlock in blazegraph blocking all queries and updates - https://phabricator.wikimedia.org/T242453
[17:46:38] <ryankemper>	 Yeah my thinking was given none of blazegraph, categories, updater were restarted, I don't quite understand how things would have broken
[17:46:53] <dcausse>	 taking few stackdumps
[17:47:00] <gehel>	 this looks like a coincidence to me
[17:47:24] <gehel>	 dcausse: should we restart one of the stuck servers, see if it recovers?
[17:47:28] <dcausse>	 yes
[17:47:48] <gehel>	 dcausse: let us know when you're good on the thread dumps and on which server
[17:47:49] <ryankemper>	 Let me know when you have the trace and then I can restart blazegraph on that instance
[17:48:47] <dcausse>	 restarted blazegraph on wdqs2001
[17:49:06] <wikibugs>	 10Operations, 10MassMessage, 10MediaWiki-JobQueue, 10Platform Engineering: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Quiddity) Just noting for the record, I had similar problems on Monday, whilst delivering TechNews. It delivered duplicates to 6 Wiktionary...
[17:49:10] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmne
[17:49:10] <icinga-wm>	 .wmnet, wdqs1004.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[17:49:38] <cdanis>	 dcausse: is blazegraph serving on 2001 yet?  Pybal still reports it failing the readiness probe
[17:49:39] <ryankemper>	 Well this is now seeming awfully coincidental
[17:49:42] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[17:50:01] <wikibugs>	 (03PS1) 10ZPapierski: Migrate wcqs to wcqs-beta.wmfcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/615810
[17:50:01] <cdanis>	 ehm
[17:50:05] <icinga-wm>	 PROBLEM - LVS wdqs-ssl eqiad port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:50:05] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 5.491 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[17:50:06] <cdanis>	 I think we're about to have a full outage
[17:50:20] <icinga-wm>	 PROBLEM - Check the last execution of mediawiki_job_wikidata-updateQueryServiceLag on mwmaint1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki_job_wikidata-updateQueryServiceLag https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:50:21] <ryankemper>	 Agreed
[17:50:26] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1007.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmne
[17:50:26] <icinga-wm>	 .wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[17:50:28] <ryankemper>	 Opening up the revert
[17:50:29] <icinga-wm>	 PROBLEM - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:50:38] <ryankemper>	 Should we just try restarting blazegraphs on eqiad now?
[17:51:09] <jynus>	 I can now confirm user impact, simple queries don't run for me
[17:51:10] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert "[wdqs] add a new streaming updater profile" [puppet] - 10https://gerrit.wikimedia.org/r/615784
[17:51:12] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:51:32] <wikibugs>	 (03CR) 10CDanis: [V: 03+2 C: 03+2] Revert "[wdqs] add a new streaming updater profile" [puppet] - 10https://gerrit.wikimedia.org/r/615784 (owner: 10Ryan Kemper)
[17:51:37] <icinga-wm>	 RECOVERY - LVS wdqs codfw port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:51:52] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2001 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[17:51:55] <ryankemper>	 ^ Does this imply the blazegraph restart fixed 2001 and now codfw is responsive?
[17:52:04] <cdanis>	 I think it does
[17:52:07] <ryankemper>	 Should we revert and restart blazegraph fleetwide or just restart without revert
[17:52:08] <cdanis>	 I'm merging your puppet patch anyway
[17:52:10] <ryankemper>	 Okay
[17:52:13] <ryankemper>	 Sounds good
[17:52:14] <jynus>	 eqiad still down
[17:52:15] <logmsgbot>	 !log cdanis@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=wdqs.*,name=codfw
[17:52:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:52:20] <cdanis>	 and repooling codfw
[17:52:30] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1005.eqiad.wmnet, wdqs1004.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[17:52:35] <ryankemper>	 Should I restart blazegraph on an eqiad instance to get it serving, or just wait for your repool cdanis 
[17:52:43] <jynus>	 I am only reporting user impact, as it is the only thing I know how to do
[17:52:56] <ryankemper>	 thanks jynus, helpful to know :)
[17:53:11] <gehel>	 ryankemper: wait for the patch to be merged, puppet-apply and restart
[17:53:18] <cdanis>	 !log ❌cdanis@cumin1001.eqiad.wmnet ~ 🕑☕ sudo cumin -b10 'wdqs*' "run-puppet-agent --unless-version 1a4ae81"
[17:53:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:15] <cdanis>	 ryankemper: okay, revert is applied on all wdqs* hosts, please start restarting blazegraphs
[17:54:23] <ryankemper>	 Proceeding
[17:56:22] <dcausse>	 internal clusters were fine, it's only the servers receiving queries from outside
[17:58:00] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[17:58:33] <ryankemper>	 Manually restarted blazegraph on `wdqs1003` to get eqiad back up asap, and am now restarting every wdqs instance except `wdqs1003` and `wdqs2001` which we've already restarted:
[17:58:52] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1004 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[17:58:57] <icinga-wm>	 RECOVERY - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 483 bytes in 1.170 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:58:57] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2002 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[17:59:00] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1007 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[17:59:02] <ryankemper>	 !log sudo -E cumin -b 10 'A:wdqs-all and not A:wdqs-test and not P{wdqs1003.eqiad.wmnet} and not P{wdqs2001.codfw.wmnet}' 'sudo systemctl restart wdqs-blazegraph.service'
[17:59:02] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[17:59:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:08] <jynus>	 checking
[17:59:25] <jynus>	 query went through, I think we are back
[17:59:25] <icinga-wm>	 RECOVERY - LVS wdqs-ssl eqiad port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 483 bytes in 1.023 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:59:34] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[17:59:38] <cdanis>	 okay, we have some healthy wdqsen in both clusters now, so, we are out of outage
[17:59:41] <jynus>	 waiting for confirmation at https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=2&fullscreen&orgId=1&from=1595516372253&to=1595527172254&var-cluster_name=wdqs
[17:59:47] <Lucas_WMDE>	 yup, WDQS works again for me too
[17:59:47] <icinga-wm>	 RECOVERY - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:59:52] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:59:54] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[17:59:56] <cdanis>	 (it's important to remember that DNS Discovery does *not* consider backend healthiness in its 'decisions')
[17:59:58] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.686 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[18:00:00] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:00:03] <ryankemper>	 ack re backend readiness
[18:00:05] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: #bothumor I � Unicode. All rise for Morning backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T1800).
[18:00:05] <jouncebot>	 Amir1: A patch you scheduled for Morning backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[18:00:14] <Amir1>	 o/
[18:00:18] <wikibugs>	 (03PS2) 10ZPapierski: Migrate wcqs to wcqs-beta.wmfcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/615810
[18:00:23] <gehel>	 ryankemper: \o/
[18:00:23] <rzl>	 RoanKattouw, Niharika, Urbanecm, Amir1: please hold off on deploying anything
[18:00:24] <jynus>	 rate of queries going up
[18:00:30] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:00:38] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:00:43] <jynus>	 close to previous rate
[18:00:45] <Amir1>	 sure
[18:00:46] <Niharika>	 rzl: ack
[18:00:46] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1004 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:00:49] <Urbanecm>	 rzl: ack
[18:00:50] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2002 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:00:54] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:00:56] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.035 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[18:00:59] <jynus>	 https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=2&fullscreen&orgId=1&from=1595516372000&to=1595527254469&var-cluster_name=wdqs
[18:01:14] <icinga-wm>	 RECOVERY - Check the last execution of mediawiki_job_wikidata-updateQueryServiceLag on mwmaint1002 is OK: OK: Status of the systemd unit mediawiki_job_wikidata-updateQueryServiceLag https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:01:28] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[18:01:34] <ryankemper>	 So now that the dust is settling, sounds like we've got two theories to follow: one that something in the puppet change broke everything and the other being the deadlock issue that dcausse referenced
[18:01:42] <jynus>	 will this have impacted dispatch lag, or that was part of the internal "not impacted" part?
[18:01:51] <jynus>	 not dispatch
[18:01:59] <jynus>	 the api lag, not sure what that is called
[18:02:05] <Lucas_WMDE>	 maxlag?
[18:02:05] <dcausse>	 jynus: yes it will
[18:02:10] <jynus>	 yes, that
[18:02:12] <cdanis>	 ryankemper: that sounds right to me, yeah -- although if it was the former, I would have expected it to follow Puppet runs more closely
[18:02:16] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[18:02:16] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[18:02:21] <jynus>	 ok, checking mw api now
[18:02:27] <jynus>	 for wikidata
[18:02:35] <rzl>	 ryankemper et al: whenever you're satisfied prod is stable again, can you highlight the folks in that jouncebot message again and give them the all clear please
[18:02:38] <Lucas_WMDE>	 maxlag on https://grafana.wikimedia.org/d/000000170/wikidata-edits?refresh=1m&orgId=1&from=1595525551004&to=1595527351005 has gaps
[18:02:38] <dcausse>	 and internal cluster would have been affected too
[18:02:40] <cdanis>	 ryankemper: I didn't actually verify that it did follow puppet runs; that's just what it possibly looked like from the graph (failures staggered across a plausible-enough time interval)
[18:02:52] <ryankemper>	 cdanis: right, understood
[18:03:02] <cdanis>	 icinga alerts for wdqs and for LVS-for-wdqs have cleared
[18:03:08] <ryankemper>	  my mental model on the deadlock is if it were somewhat related to load / cpu usage etc we could have had a domino-type effect
[18:03:16] <ryankemper>	 since if it's temporally independent we would never see the behavior we saw today
[18:03:25] <cdanis>	 ryankemper: yeah, you can also get a similar-looking effect from a "query of death" from a user
[18:03:36] <dcausse>	 deadlock is perhaps related to a bad query
[18:03:45] <cdanis>	 gets LB'd to one server, crashes it --> user retries --> bad query goes to another server --> wash rinse repeat
[18:03:45] <ryankemper>	 that would make a lot of sense
[18:03:52] <dcausse>	 if repeated enough it'll bring all clusters down
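
The "query of death" cascade described above (query is load-balanced to one server, crashes it, the user retries, the query lands on the next server) can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the function name `simulate_query_of_death`, the simple in-order backend selection, and the retry count are hypothetical and do not reflect how PyBal or Blazegraph actually behave. The host names are borrowed from the alerts in this log purely as labels.

```python
from collections import deque


def simulate_query_of_death(servers, max_retries):
    """Toy model: each attempt is routed to the next healthy backend,
    which the poisoned query then takes down. The client retries up to
    max_retries times after the first attempt (assumed behaviour)."""
    healthy = deque(servers)
    crashed = []
    attempts = 0
    while healthy and attempts <= max_retries:
        target = healthy.popleft()  # load balancer picks a healthy backend
        crashed.append(target)      # the query deadlocks or crashes it
        attempts += 1               # client sees an error and retries
    return crashed, list(healthy)


if __name__ == "__main__":
    crashed, remaining = simulate_query_of_death(
        ["wdqs2001", "wdqs2002", "wdqs2003", "wdqs2007"], max_retries=3
    )
    print("crashed:", crashed)
    print("still healthy:", remaining)
```

With as many retries as there are backends minus one, the whole pool goes down; with fewer retries, some backends survive, which would be consistent with "almost all but not literally all" hosts failing.
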
[18:03:58] <ryankemper>	 rzl: which jouncebot message are you referring to? things are stable enough now for me to give them the all-clear
[18:03:59] <jynus>	 not sure I see any impact on mediawiki api for wikibase behaviour
[18:04:13] <rzl>	 ryankemper: ack, thanks!
[18:04:21] <cdanis>	 Niharika: Amir1: you can proceed :)
[18:04:23] <rzl>	 RoanKattouw, Niharika, Urbanecm, Amir1: disregard my last, go ahead at your convenience :) thanks
[18:04:32] <Amir1>	 :)
[18:04:35] <jynus>	 (I know it is unrelated, but thinking about the usual complaints about maxlag)
[18:05:21] <gehel>	 jynus: since the WDQS lag wasn't reported by dead servers, it will probably take a few minutes to propagate back to Wikidata maxlag
[18:05:22] <wikibugs>	 (03PS3) 10ZPapierski: Migrate wcqs to wcqs-beta.wmfcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/615810
[18:05:47] <jynus>	 gehel: I see, so the bug cancelled itself :-D
[18:05:58] <wikibugs>	 (03PS3) 10Dzahn: visualdiff: update git branch from ruthenium to scandium [puppet] - 10https://gerrit.wikimedia.org/r/613309 (https://phabricator.wikimedia.org/T257906)
[18:06:20] <Niharika>	 I have a meeting right now. Urbanecm would you be able to swat?
[18:06:35] <Urbanecm>	 Niharika: I believe Amir1 is able to self-service :)
[18:07:08] <Amir1>	 Sure, I can do it
[18:07:20] <jynus>	 so everything I am looking at looks healthy except, understandably, the lag
[18:07:21] <Amir1>	 it's also risky things, so I need to test lots of things
[18:07:56] <cdanis>	 ryankemper: FWIW, my money is on the query-of-death idea -- codfw hosts broke within a few minutes of each other, but no eqiad hosts until ~10 minutes after my dnsdisc-depool of codfw
[18:08:18] <cdanis>	 and then when they did break, all the eqiad hosts broke ~simultaneously at 17:45
[18:08:54] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "scandium-only: https://puppet-compiler.wmflabs.org/compiler1003/24110/scandium.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/613309 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn)
[18:09:37] <ryankemper>	 yes, that would also explain why *almost* all but not literally all of codfw got into a bad state
[18:09:40] <cdanis>	 uh
[18:09:44] <cdanis>	 codfw hosts are going down again
[18:09:55] <dcausse>	 it's not done
[18:10:12] <ryankemper>	 so, we need to figure out what query is doing it and then possibly figure out the actual user and make them stop?
[18:10:21] <cdanis>	 yeah
[18:10:23] <icinga-wm>	 PROBLEM - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:10:23] <ryankemper>	 meanwhile i'll need to play whackamole and try to restart blazegraph enough to keep service going
[18:10:33] <cdanis>	 we need to find the query at fault, and block it somehow
[18:10:41] <ryankemper>	 any volunteers to try to fix the actual problem while I play whack-a-mole
[18:10:46] <ryankemper>	 we can also do it the other way around
[18:11:04] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled https://wikitec
[18:11:04] <icinga-wm>	 iki/PyBal
[18:11:22] <dcausse>	 the query is unlikely to have a chance to be logged if it's causing the deadlock (at least from the backend)
[18:11:45] <Urbanecm>	 perhaps a stupid idea: could decreasing query timeout help?
[18:11:45] <cdanis>	 dcausse: does blazegraph only log at the end of query execution?  and not the start?
[18:11:50] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[18:12:20] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] "BACC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615263 (owner: 10Lucas Werkmeister (WMDE))
[18:12:27] <icinga-wm>	 PROBLEM - LVS wdqs codfw port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:12:28] <dcausse>	 cdanis: yes, we could log before but we might get the same info from webrequest logs
[18:12:35] <ryankemper>	 this might be the tail end of a/the query? https://www.irccloud.com/pastebin/r44IH5FF/
[18:12:37] <gehel>	 cdanis: yes, blazegraph only logs after query completion
[18:12:50] <cdanis>	 logging only at the end of query execution is one of my distributed systems pet peeves :)
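
The point about logging only at the end of query execution can be shown with a small sketch: if a start line is written before execution, a query that hangs or crashes the process still leaves a trace. This is a generic illustration, not Blazegraph's logging code. The names `run_logged`, `execute`, and the `query` logger are hypothetical.

```python
import logging
import uuid

log = logging.getLogger("query")


def run_logged(query_text, execute):
    """Log before and after running a query, so a query that never
    finishes is still identifiable from its 'start' line."""
    qid = uuid.uuid4().hex[:8]  # short id to correlate start/end lines
    log.info("query start id=%s text=%r", qid, query_text)
    try:
        result = execute(query_text)
    except Exception:
        log.exception("query failed id=%s", qid)
        raise
    log.info("query end id=%s", qid)
    return result


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    run_logged("SELECT 1", lambda q: 42)
```

A "start" line with no matching "end" or "failed" line for the same id is then the candidate to look at after a freeze.
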
[18:13:04] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2002.codfw.wmne
[18:13:04] <icinga-wm>	  but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:13:10] <wikibugs>	 (03Merged) 10jenkins-bot: extension-list: Load WikibaseRepo via JSON [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615263 (owner: 10Lucas Werkmeister (WMDE))
[18:13:15] <ryankemper>	 !log restarted blazegraph on 2001
[18:13:17] <gehel>	 actually not entirely true, it logs a number of operations, but probably nothing that will help us too much here
[18:13:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:13:28] <ryankemper>	 (that log message was a little too vague...oh well)
[18:13:52] <ryankemper>	 gehel: any thoughts on Urbanecm 's timeout idea btw
[18:14:04] <gehel>	 I doubt it will help much
[18:14:09] <dcausse>	 does /var/log/nginx/access.log contain info (can't access it)
[18:14:16] <ryankemper>	 looking
[18:14:27] <gehel>	 but we might try (the timeout thing)
[18:14:54] <ryankemper>	 dcausse: here's a subset of what it looks like
[18:15:02] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[18:15:06] <ryankemper>	 https://www.irccloud.com/pastebin/AdxovFit/%2Fvar%2Flog%2Fnginx%2Faccess.log
[18:15:21] <gehel>	 users with too many requests in error should be banned by throttling eventually, but if the server freezes, that's not actually going to help
[18:15:35] <ryankemper>	 `2001` is back up, going to whack the 3 remaining codfw nodes
[18:15:53] <Amir1>	 https://www.irccloud.com/pastebin/Cx6mXSc7/
[18:16:09] <wikibugs>	 (03CR) 10Ladsgroup: "https://www.irccloud.com/pastebin/Cx6mXSc7/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615263 (owner: 10Lucas Werkmeister (WMDE))
[18:16:24] <wikibugs>	 (03CR) 10Dzahn: "This was a noop on scandium (so far i did not touch anything manually so the branch is still ruthenium there as before and puppet does not" [puppet] - 10https://gerrit.wikimedia.org/r/613309 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn)
[18:16:54] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Load WikibaseClient from extension.json file instead of php one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613235 (https://phabricator.wikimedia.org/T256228) (owner: 10Ladsgroup)
[18:17:31] <jynus>	 this is the one from the paste: https://logstash.wikimedia.org/goto/1d2fabcb65d0c0520c3e58f31f3ca786
[18:17:39] <wikibugs>	 (03Merged) 10jenkins-bot: Load WikibaseClient from extension.json file instead of php one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/613235 (https://phabricator.wikimedia.org/T256228) (owner: 10Ladsgroup)
[18:17:49] <icinga-wm>	 RECOVERY - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 485 bytes in 7.922 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:17:54] <jynus>	 but I don't know if it is the one causing issues, just searched it on logs
[18:18:17] <dcausse>	 ryankemper: we might perhaps have the line in /var/log/nginx/error.log when nginx bails on a gateway error?
[18:18:43] <wikibugs>	 (03CR) 10ZPapierski: [C: 04-1] "This patch behaves correctly, but requires a compatible oauth consumer, so I'm blocking it until one is available." [puppet] - 10https://gerrit.wikimedia.org/r/615810 (owner: 10ZPapierski)
[18:20:17] <cdanis>	 the first error.log entry I see when the 'new' round of errors begin at 18:12:01 is from maps2003
[18:20:22] <cdanis>	 on wdqs2003
[18:21:09] <dcausse>	 cdanis: could you copy /var/log/nginx/error.log somewhere I can read on wdqs2007, please?
[18:21:21] <jynus>	 the error logs are full of stuff, not sure what to search for
[18:21:39] <cdanis>	 dcausse: copy in your homedir
[18:21:44] <dcausse>	 thanks!
[18:21:47] <mutante>	 !log testreduce1001 - rm -rf /srv/testreduce and run puppet to re-clone testreduce to it from the scandium branch (T257906)
[18:21:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:21:52] <stashbot>	 T257906: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906
[18:22:08] <cdanis>	 also, AI for later: file a task to get devs access to those logs
[18:22:26] <icinga-wm>	 PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2003 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:22:48] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[18:23:29] <icinga-wm>	 PROBLEM - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:23:42] <cdanis>	 okay, so, operating on the theory that these are user queries that are being routed to codfw, here's the ones that are doing that and are failing at the Varnish level too: https://logstash.wikimedia.org/goto/bcbe3c97cc8f541aa44db95a523841af
[18:24:20] <icinga-wm>	 RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2003 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[18:24:40] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2003 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[18:25:10] <cdanis>	 it's important to note that this of course includes 'anyone who is sending close-to-significant query traffic to WDQS',
[18:25:13] <icinga-wm>	 RECOVERY - LVS wdqs-ssl codfw port 443/tcp - Wikidata Query Service - HTTPS IPv4 #page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 483 bytes in 1.165 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:25:20] <cdanis>	 looking at high time_firstbytes helps filter some
[18:25:25] <icinga-wm>	 RECOVERY - LVS wdqs codfw port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:25:36] <cdanis>	 I am going to disable notifs for those
[18:26:02] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:26:12] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:26:48] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[18:27:30] <wikibugs>	 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10cloud-services-team (Kanban): Puppet tab in Horizon unusably slow - https://phabricator.wikimedia.org/T149589 (10Andrew)
[18:28:35] <Amir1>	 Deploying this big change
[18:28:47] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service: wdqs admins should have access to nginx logs on wdqs machines - https://phabricator.wikimedia.org/T258739 (10Dzahn)
[18:28:48] <Amir1>	 Keep in mind for performance, errors, etc.
[18:28:58] <mutante>	 cdanis: i made that task, hope it helped 
[18:29:21] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/Wikibase.php: [[gerrit:613235|Load WikibaseClient from extension.json file instead of php one (T257437 T256228 T88258)]] (duration: 01m 05s)
[18:29:26] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[18:29:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:29] <stashbot>	 T88258: Convert WikibaseRepository, WikibaseClient, WikibaseLib and WikibaseView to use extension registration - https://phabricator.wikimedia.org/T88258
[18:29:29] <stashbot>	 T257437: Deploy Client to production using extension registration - https://phabricator.wikimedia.org/T257437
[18:29:29] <stashbot>	 T256228: Convert WikibaseClient to use extension registration - https://phabricator.wikimedia.org/T256228
[18:34:28] <ryankemper>	 mutante: thanks
[18:35:26] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:35:30] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=thanos-sidecar site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:35:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:36:26] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[18:38:02] <ryankemper>	 time for another round of whack a mole
[18:39:04] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([wdqs2003.codfw.wmnet, wdqs2007.codfw.wmnet, wdqs2002.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[18:39:37] <Amir1>	 !log BACC is done
[18:39:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:40:58] <icinga-wm>	 PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query
[18:41:12] <zpapierski_>	 dcausse: I don't know how I can help, but could you give me read rights to that error.log?
[18:41:56] <zpapierski_>	 thx!
[18:42:17] <dcausse>	 done
[18:42:22] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[18:44:08] <mutante>	 zpapierski_: dcausse:  "sudo journalctl -u nginx" should work already, btw
[18:44:14] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.125 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[18:44:19] <mutante>	 because journalctl * is in your sudo privs
[18:44:24] <ryankemper>	 !log Restarted blazegraph on following codfw wdqs nodes: 2007, 2003, and 2002
[18:44:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:38] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:46:46] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:47:20] <cdanis>	 looks like wdqs2001 is presently unhealthy, but not others
[18:48:08] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[18:48:12] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service: wdqs admins should have access to nginx logs on wdqs machines - https://phabricator.wikimedia.org/T258739 (10Dzahn) `sudo journalctl -u nginx` should already work but it does not contain the same information that is in the error.log...
[18:48:28] <icinga-wm>	 PROBLEM - Thanos query has high gRPC client errors on icinga1001 is CRITICAL: job=thanos-query https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query
[18:49:58] <wikibugs>	 10Operations: Grant IRC operator privileges to Urbanecm in #wikimedia-operations - https://phabricator.wikimedia.org/T258741 (10Urbanecm)
[18:50:48] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[18:51:00] <ryankemper>	 !log restarted blazegraph on codfw wdqs2001
[18:51:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:51:55] <ryankemper>	 So quick update, we've got some people spelunking for a potential bad query/actor, we've blocked a suspect IP at the varnish level and restarted blazegraph on affected codfw nodes so we're waiting to see if we get another round of outage
[18:52:36] <ryankemper>	 Also note the suspect ip was apparently entering via `ulsfo` which would hit `codfw` which lines up w/ the behavior we've seen
[18:54:06] <icinga-wm>	 RECOVERY - Thanos query has high gRPC client errors on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query
[18:57:06] <wikibugs>	 (03PS1) 10Dzahn: admins: let wdqs-admins view nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739)
[18:57:26] <dcausse>	 thanks ^
[18:59:02] <wikibugs>	 (03PS1) 10Mholloway: Bump wikifeeds to 2020-07-23-185301-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/615819
[19:00:04] <jouncebot>	 longma and liw: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - American+European Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T1900).
[19:01:56] <longma>	 Hello. The train is blocked currently so we might not have a deployment during the window
[19:02:47] <wikibugs>	 (03CR) 10DCausse: "thanks!" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn)
[19:04:53] <wikibugs>	 (03CR) 10Mholloway: [C: 03+2] Bump wikifeeds to 2020-07-23-185301-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/615819 (owner: 10Mholloway)
[19:05:58] <wikibugs>	 (03Merged) 10jenkins-bot: Bump wikifeeds to 2020-07-23-185301-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/615819 (owner: 10Mholloway)
[19:06:05] <wikibugs>	 (03PS1) 10Dzahn: admins: let wdqs-admins run jstack as root [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739)
[19:07:01] <wikibugs>	 (03CR) 10Dzahn: admins: let wdqs-admins view nginx logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn)
[19:07:22] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] admins: let wdqs-admins view nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn)
[19:07:38] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] admins: let wdqs-admins run jstack as root [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn)
[19:09:01] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' .
[19:09:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:38] <RhinosF1>	 Any chanop around?
[19:11:02] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[19:11:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:12:22] <RhinosF1>	 _joe_: mind a pm about a chanopy thing?
[19:12:52] <_joe_>	 what's up?
[19:13:30] <logmsgbot>	 !log mholloway-shell@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'wikifeeds' for release 'production' .
[19:13:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:11] <wikibugs>	 (03PS1) 10Bstorm: wiki-replicas: Add clouddb naming to regexes [puppet] - 10https://gerrit.wikimedia.org/r/615823 (https://phabricator.wikimedia.org/T257987)
[19:19:36] <icinga-wm>	 PROBLEM - Long running screen/tmux on weblog1001 is CRITICAL: CRIT: Long running tmux process. (user: cdanis PID: 9621, 1733239s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[19:21:09] <mutante>	 ^ it's possible to whitelist hosts where long running screen/tmux should never alert
[19:21:21] <mutante>	 if that is desired for weblog* ..not sure
[19:26:54] <wikibugs>	 (03PS1) 10Ladsgroup: labs: Load Wikibase Repo using extension.json instead of php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615825 (https://phabricator.wikimedia.org/T257436)
[19:27:07] <wikibugs>	 10Operations, 10Traffic, 10Sustainability (Incident Followup): upload.wikimedia.org returns HTTP status code 503 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517 (10Aklapper) >>! In T106517#5666579, @ema wrote: > I cannot reproduce with URLs such as https://upload.wikimedia.org/wikiped...
[19:29:05] <wikibugs>	 (03PS1) 10Dzahn: do not monitor long-running screens on weblog* hosts [puppet] - 10https://gerrit.wikimedia.org/r/615826
[19:29:54] <wikibugs>	 (03CR) 10Dzahn: "19:19 <+icinga-wm> PROBLEM - Long running screen/tmux on weblog1001 is CRITICAL: CRIT: Long running tmux process. (user: cdanis PID: 9621," [puppet] - 10https://gerrit.wikimedia.org/r/615826 (owner: 10Dzahn)
[19:33:51] <wikibugs>	 (03CR) 10Ladsgroup: "noop for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615825 (https://phabricator.wikimedia.org/T257436) (owner: 10Ladsgroup)
[19:35:32] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] labs: Load Wikibase Repo using extension.json instead of php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615825 (https://phabricator.wikimedia.org/T257436) (owner: 10Ladsgroup)
[19:36:15] <wikibugs>	 (03Merged) 10jenkins-bot: labs: Load Wikibase Repo using extension.json instead of php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/615825 (https://phabricator.wikimedia.org/T257436) (owner: 10Ladsgroup)
[19:43:20] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] admins: let wdqs-admins view nginx logs [puppet] - 10https://gerrit.wikimedia.org/r/615818 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn)
[19:43:30] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] admins: let wdqs-admins run jstack as root [puppet] - 10https://gerrit.wikimedia.org/r/615821 (https://phabricator.wikimedia.org/T258739) (owner: 10Dzahn)
[19:48:05] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kubernetes2017.codfw.wmnet - https://phabricator.wikimedia.org/T258745 (10RobH)
[19:48:16] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install kubernetes2017.codfw.wmnet - https://phabricator.wikimedia.org/T258745 (10RobH)
[19:51:29] <wikibugs>	 (03PS1) 10Andrew Bogott: Rename cloudcephosd1004 through 1015. [puppet] - 10https://gerrit.wikimedia.org/r/615828 (https://phabricator.wikimedia.org/T251619)
[19:52:11] <wikibugs>	 (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1002/24111/phab1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/615796 (owner: 10Dzahn)
[19:52:32] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Rename cloudcephosd1004 through 1015. [puppet] - 10https://gerrit.wikimedia.org/r/615828 (https://phabricator.wikimedia.org/T251619) (owner: 10Andrew Bogott)
[19:54:18] <wikibugs>	 (03PS1) 10BryanDavis: dynamicproxy: Only redirect to wmcloud if proxy is registered [puppet] - 10https://gerrit.wikimedia.org/r/615829 (https://phabricator.wikimedia.org/T258730)
[19:58:12] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (10RobH)
[19:58:30] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: (Need By: TBD) rack/setup/install kubernetes1017.eqiad.wmnet - https://phabricator.wikimedia.org/T258747 (10RobH)
[20:00:09] <wikibugs>	 (03PS2) 10BryanDavis: dynamicproxy: Only redirect to wmcloud if proxy is registered [puppet] - 10https://gerrit.wikimedia.org/r/615829 (https://phabricator.wikimedia.org/T258730)
[20:01:44] <wikibugs>	 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10Dzahn) Merging the change above was  a noop on scandium. I did not manually touch it so far, so the git repo at /srv/testreduce is unchang...
[20:07:34] <wikibugs>	 (03PS1) 10Dzahn: parsoid: remove vd_server and vd_client from parsoid::testing role [puppet] - 10https://gerrit.wikimedia.org/r/615831 (https://phabricator.wikimedia.org/T257906)
[20:10:09] <wikibugs>	 (03CR) 10BryanDavis: "100% untested at this point" [puppet] - 10https://gerrit.wikimedia.org/r/615829 (https://phabricator.wikimedia.org/T258730) (owner: 10BryanDavis)
[20:17:40] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-30) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10RobH)
[20:18:48] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-30) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10RobH)
[20:20:26] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (<enter due date here>) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (10RobH)
[20:20:35] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (<enter due date here>) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (10RobH)
[20:21:16] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (2020-09-30) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (10RobH)
[20:22:46] <wikibugs>	 10Operations, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: 2020-09-14) rack/setup/install dbprov2003.codfw.wmnet - https://phabricator.wikimedia.org/T258749 (10RobH)
[20:22:51] <wikibugs>	 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (2020-09-14) rack/setup/install dbprov1003.eqiad.wmnet - https://phabricator.wikimedia.org/T258750 (10RobH)
[20:26:54] <icinga-wm>	 PROBLEM - Disk space on prometheus5001 is CRITICAL: connect to address 10.132.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus5001&var-datasource=eqsin+prometheus/ops
[20:27:10] <icinga-wm>	 PROBLEM - Check systemd state on prometheus5001 is CRITICAL: connect to address 10.132.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:27:10] <icinga-wm>	 PROBLEM - Check size of conntrack table on prometheus5001 is CRITICAL: connect to address 10.132.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[20:33:27] <wikibugs>	 (03PS1) 10Andrew Bogott: WMCS Ceph: add address entries for new OSD nodes [puppet] - 10https://gerrit.wikimedia.org/r/615832 (https://phabricator.wikimedia.org/T251619)
[20:34:36] <icinga-wm>	 PROBLEM - Check systemd state on prometheus5001 is CRITICAL: connect to address 10.132.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:34:36] <icinga-wm>	 PROBLEM - Check size of conntrack table on prometheus5001 is CRITICAL: connect to address 10.132.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[20:36:18] <icinga-wm>	 PROBLEM - configured eth on prometheus5001 is CRITICAL: connect to address 10.132.0.33 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[20:36:26] <herron>	 ^^ that's me reimaging (which is slow to eqsin and downtime expired)
[20:37:41] <wikibugs>	 (03PS2) 10Andrew Bogott: WMCS Ceph: add address entries for new OSD nodes [puppet] - 10https://gerrit.wikimedia.org/r/615832 (https://phabricator.wikimedia.org/T251619)
[20:39:38] <wikibugs>	 10Operations: Grant IRC operator privileges to Urbanecm in #wikimedia-operations - https://phabricator.wikimedia.org/T258741 (10Aklapper) May want to update https://meta.wikimedia.org/wiki/IRC/wikimedia-ops/Operators once done
[20:41:39] <wikibugs>	 (03PS3) 10BryanDavis: dynamicproxy: Only redirect to wmcloud if proxy is registered [puppet] - 10https://gerrit.wikimedia.org/r/615829 (https://phabricator.wikimedia.org/T258730)
[20:44:12] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne...
[20:45:39] <wikibugs>	 (03CR) 10BryanDavis: "Tested via manual application to /etc/nginx/lua/domainproxy.lua on proxy-01.proxy-codfw1dev.codfw1dev.wikimedia.cloud." [puppet] - 10https://gerrit.wikimedia.org/r/615829 (https://phabricator.wikimedia.org/T258730) (owner: 10BryanDavis)
[20:47:18] <wikibugs>	 (03PS1) 10RobH: updating with new power cord skus [software] - 10https://gerrit.wikimedia.org/r/615836
[20:48:21] <wikibugs>	 (03CR) 10RobH: [C: 03+2] updating with new power cord skus [software] - 10https://gerrit.wikimedia.org/r/615836 (owner: 10RobH)
[20:57:30] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[20:58:09] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[20:58:09] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[20:58:09] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[20:58:10] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[20:58:10] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[20:58:12] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:58:12] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:58:13] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:58:13] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[20:58:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:59:11] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[20:59:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:26] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:00:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:39] <mutante>	 andrewbogott: fyi, the host rename breaks icinga config because of the relation between switches and the cloudcephos files. maybe it will just go away after the next puppet run though, i dunno yet
[21:00:56] <andrewbogott>	 :(
[21:01:02] <andrewbogott>	 it'll probably clear on its own after a run or two
[21:02:34] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:02:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:03:11] <mutante>	 Error: 'cloudsw1-c8-eqiad.mgmt.eqiad.wmnet' is not a valid parent 
[21:03:18] <mutante>	 let's check again later
[21:07:39] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1008.eqiad.wmnet', 'c...
[21:10:36] <wikibugs>	 (03PS3) 10Andrew Bogott: WMCS Ceph: add address entries for new OSD nodes [puppet] - 10https://gerrit.wikimedia.org/r/615832 (https://phabricator.wikimedia.org/T251619)
[21:10:38] <wikibugs>	 (03PS1) 10Andrew Bogott: cloudcephosd nodes:  Experiment with using a hw raid for the / volume [puppet] - 10https://gerrit.wikimedia.org/r/615838 (https://phabricator.wikimedia.org/T251619)
[21:11:34] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] cloudcephosd nodes:  Experiment with using a hw raid for the / volume [puppet] - 10https://gerrit.wikimedia.org/r/615838 (https://phabricator.wikimedia.org/T251619) (owner: 10Andrew Bogott)
[21:13:10] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne...
[21:14:55] <wikibugs>	 10Operations, 10SRE-Access-Requests, 10Wikidata, 10Wikidata-Query-Service, 10Patch-For-Review: wdqs admins should have access to nginx logs, jstack on wdqs machines - https://phabricator.wikimedia.org/T258739 (10RKemper)
[21:15:16] <icinga-wm>	 RECOVERY - Check size of conntrack table on prometheus5001 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[21:15:16] <icinga-wm>	 RECOVERY - Check systemd state on prometheus5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:18:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:18:36] <wikibugs>	 (03CR) 10Herron: [C: 03+1] profile: add prometheus instance for external metrics [puppet] - 10https://gerrit.wikimedia.org/r/615288 (https://phabricator.wikimedia.org/T180105) (owner: 10Cwhite)
[21:19:24] <RhinosF1>	 DannyS712: can you remember which script unsets everyone from a user group? I think you asked for it to be run before.
[21:19:57] <Majavah>	 RhinosF1: emptyUserGroup.php iirc
[21:20:12] <RhinosF1>	 Majavah: that would be obvious
[21:20:28] <logmsgbot>	 !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@c99c626]: airflow: centralize installation specific airflow Variables
[21:20:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:43] <Majavah>	 clearly you haven't used createAndPromote.php
[21:21:01] <RhinosF1>	 Majavah: last time I tried, I gave up
[21:21:02] <logmsgbot>	 !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@c99c626]: airflow: centralize installation specific airflow Variables (duration: 00m 34s)
[21:21:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:21:20] <Majavah>	 heh. anyways I'm off to bed
[21:27:07] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[21:27:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:20] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:29:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:28] <icinga-wm>	 RECOVERY - Disk space on prometheus5001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus5001&var-datasource=eqsin+prometheus/ops
[21:30:43] <DannyS712>	 yeah I've asked a few times - eg T250575
[21:30:44] <stashbot>	 T250575: Remove user rights on test2.wikipedia.org for undeployed extension EducationProgram - https://phabricator.wikimedia.org/T250575
[21:31:03] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne...
[21:34:11] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "I'm not super familiar with the inner workings of smokeping, but the approach LGTM as long as it is valid to have alerts unset in the targ" [puppet] - 10https://gerrit.wikimedia.org/r/615760 (https://phabricator.wikimedia.org/T258675) (owner: 10Filippo Giunchedi)
[21:38:04] <icinga-wm>	 RECOVERY - configured eth on prometheus5001 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[21:38:42] <wikibugs>	 (03CR) 10Herron: [C: 03+1] mariadb: Remove puppet mysql grants for m1 misc databases [puppet] - 10https://gerrit.wikimedia.org/r/523702 (https://phabricator.wikimedia.org/T198939) (owner: 10Jcrespo)
[21:39:35] <wikibugs>	 (03PS13) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041)
[21:41:02] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (11) node(s) change every puppet run: cloudcephosd1004.eqiad.wmnet, cloudcephosd1007.eqiad.wmnet, cloudcephosd1010.eqiad.wmnet, cloudcephosd1006.eqiad.wmnet, contint2001.wikimedia.org, cloudcephosd1009.eqiad.wmnet, cloudcephosd1005.eqiad.wmnet, cloudcephosd1008.eqiad.wmnet, contint1001.wikimedia.org, testred
[21:41:02] <icinga-wm>	 et, aphlict1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[21:41:19] <wikibugs>	 (03PS14) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041)
[21:42:37] <wikibugs>	 (03PS15) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041)
[21:45:02] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[21:45:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:44] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[21:48:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:53:02] <wikibugs>	 (03PS2) 10Dzahn: ATS: add backend for aphlict [puppet] - 10https://gerrit.wikimedia.org/r/615797 (https://phabricator.wikimedia.org/T238593)
[21:53:07] <wikibugs>	 (03PS2) 10Dzahn: phabricator: set aphlict to disabled in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/615796 (https://phabricator.wikimedia.org/T238593)
[21:53:14] <wikibugs>	 (03PS16) 10Jeena Huneidi: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041)
[21:53:50] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1004.eqiad.wmnet'] `...
[21:54:13] <wikibugs>	 (03CR) 10Herron: [C: 03+1] lvs - thanos-query: update to use port 443 instead of port 80 [puppet] - 10https://gerrit.wikimedia.org/r/615720 (https://phabricator.wikimedia.org/T151009) (owner: 10Jbond)
[21:54:52] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/615733 (https://phabricator.wikimedia.org/T151009) (owner: 10Filippo Giunchedi)
[21:55:39] <wikibugs>	 (03PS1) 10Dzahn: aphlict: add scap deploy target and missing parameters [puppet] - 10https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593)
[21:56:50] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+2] Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) (owner: 10Jeena Huneidi)
[21:56:54] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] aphlict: add scap deploy target and missing parameters [puppet] - 10https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn)
[21:58:21] <wikibugs>	 (03Merged) 10jenkins-bot: Kask: Use Releng Cassandra Image [deployment-charts] - 10https://gerrit.wikimedia.org/r/609894 (https://phabricator.wikimedia.org/T224041) (owner: 10Jeena Huneidi)
[22:04:56] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne...
[22:14:20] <mutante>	 andrewbogott: it still has 7 errors
[22:15:09] <andrewbogott>	 mutante: I will look shortly.  Currently lost in partman :(
[22:15:34] <wikibugs>	 (03CR) 10Dzahn: "noop on phab1001 https://puppet-compiler.wmflabs.org/compiler1001/24112/" [puppet] - 10https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn)
[22:16:18] <mutante>	 andrewbogott: ack, thanks
[22:18:53] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[22:18:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:04] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[22:21:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:13] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1004.eqiad.wmnet'] `...
[22:35:31] <wikibugs>	 (03PS1) 10BryanDavis: toolforge: Temp handling for tools.wmflabs.org/wpcleaner [puppet] - 10https://gerrit.wikimedia.org/r/615872 (https://phabricator.wikimedia.org/T257495)
[22:36:45] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne...
[22:36:53] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne...
[22:37:10] <andrewbogott>	 mutante: in addition to my ping in the other channel… I don't see which 7 errors you're seeing.  Did they go away on their own?
[22:37:43] <andrewbogott>	 oh maybe it's because everything is downtimed for reimage
[22:41:57] <wikibugs>	 (03PS1) 10CDanis: appserver hiera: nginx is no more, long live envoyproxy [puppet] - 10https://gerrit.wikimedia.org/r/615874
[22:42:24] <wikibugs>	 (03PS2) 10Dzahn: aphlict: add scap deploy target and missing parameters [puppet] - 10https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593)
[22:43:17] <mutante>	 andrewbogott: it's these: sudo icinga -v /etc/icinga/icinga.cfg  | grep Errors
[22:43:27] <mutante>	 Error: 'cloudsw1-c8-eqiad.mgmt.eqiad.wmnet' is not a valid parent for host ...
[22:43:31] <mutante>	 followed by the 7 new hosts
[22:43:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] aphlict: add scap deploy target and missing parameters [puppet] - 10https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn)
[22:44:00] <mutante>	 that means it can't reload the config to add new things
[22:44:24] <andrewbogott>	 Ok, will recheck
[22:44:49] <mutante>	 there is a parent/child relationship between the cloud switches and these hosts
[22:45:05] <mutante>	 and for some reason the parent switches are not valid anymore now
[22:45:35] <mutante>	 this stuff is there so that not every host alerts when one switch is down (afaict)
[22:46:22] <wikibugs>	 (03PS2) 10BryanDavis: toolforge: Temp handling for tools.wmflabs.org/wpcleaner [puppet] - 10https://gerrit.wikimedia.org/r/615872 (https://phabricator.wikimedia.org/T257495)
[22:46:53] <wikibugs>	 (03CR) 10CDanis: "pcc https://puppet-compiler.wmflabs.org/compiler1001/24113/mw2335.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/615874 (owner: 10CDanis)
[22:47:03] <mutante>	 to avoid this you may have to properly decom the old hosts and then add the new hosts.. not sure
[22:48:01] <mutante>	 direct renaming often has (similar) issues
[22:50:51] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[22:50:51] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[22:50:51] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[22:50:51] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[22:50:51] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[22:50:52] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1011.eqiad.wmnet'] `...
[22:50:54] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[22:50:54] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[22:50:54] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[22:50:54] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[22:50:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:50:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:15] <logmsgbot>	 !log jhuneidi@deploy1001 helmfile [STAGING] Ran 'sync' command on namespace 'echostore' for release 'staging' .
[22:51:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:53] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[22:51:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:57] <mutante>	 what's with the duplicate logging today 
[22:52:26] <mutante>	 stashbot is doing everything 4 times?
[22:52:26] <stashbot>	 See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[22:52:55] <mutante>	 !log stashbot quadruple log test
[22:52:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:53:06] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[22:53:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:54:25] <wikibugs>	 (03PS3) 10Dzahn: aphlict: add scap deploy target and missing parameters [puppet] - 10https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593)
[22:55:26] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[22:55:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:48] <bd808>	 mutante: I think it is just stashbot falling behind on processing the rapid fire !log messages from the cookbooks
[22:57:06] <wikibugs>	 10Operations, 10Parsoid, 10serviceops, 10Parsoid-Tests, 10Patch-For-Review: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 (10wkandek) p:05Triage→03Medium
[22:57:48] <mutante>	 bd808: ah, yep. maybe the difference between one host at a time or a regex
[22:58:10] <bd808>	 It does look funny here in the channel, but it seems to have the right number of entries in the actual log
[22:59:55] <mutante>	 ack
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200723T2300).
[23:00:11] <andrewbogott>	 bleh, there isn't info about how to purge a host from icinga in the docs anymore
[23:01:01] <mutante>	 it's part of the decom cookbook 
[23:01:46] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1005.eqiad.wmnet', 'c...
[23:07:25] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] aphlict: add scap deploy target and missing parameters [puppet] - 10https://gerrit.wikimedia.org/r/615842 (https://phabricator.wikimedia.org/T238593) (owner: 10Dzahn)
[23:13:11] <wikibugs>	 (03PS3) 10BryanDavis: toolforge: Temp handling for tools.wmflabs.org/wpcleaner [puppet] - 10https://gerrit.wikimedia.org/r/615872 (https://phabricator.wikimedia.org/T257495)
[23:16:17] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne...
[23:16:26] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmne...
[23:16:42] <wikibugs>	 (03PS1) 10CDanis: WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877
[23:17:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877 (owner: 10CDanis)
[23:18:00] <cdanis>	 yah I know jerkins
[23:18:33] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] "Tested via manual application on tools-legacy-redirector.tools.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/615872 (https://phabricator.wikimedia.org/T257495) (owner: 10BryanDavis)
[23:18:35] <wikibugs>	 (03PS2) 10CDanis: WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877
[23:19:42] <wikibugs>	 (03PS3) 10CDanis: WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877
[23:19:59] <wikibugs>	 (03PS1) 10Dzahn: aphlict: add phab_deploy_finalize and rollback scripts [puppet] - 10https://gerrit.wikimedia.org/r/615879 (https://phabricator.wikimedia.org/T238593)
[23:21:02] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877 (owner: 10CDanis)
[23:29:38] <wikibugs>	 (03PS1) 10Tim Starling: Revert "Remove lilypond for now" [puppet] - 10https://gerrit.wikimedia.org/r/615851
[23:30:21] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[23:30:21] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[23:30:21] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[23:30:22] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[23:30:24] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[23:30:24] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[23:30:25] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[23:30:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:30:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:31:13] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1011.eqiad.wmnet'] `...
[23:32:20] <wikibugs>	 (03PS4) 10CDanis: WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877 (https://phabricator.wikimedia.org/T258648)
[23:32:32] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[23:32:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:46] <wikibugs>	 (03PS1) 10BryanDavis: toolforge: Allow large POST to tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/615881 (https://phabricator.wikimedia.org/T258760)
[23:33:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877 (https://phabricator.wikimedia.org/T258648) (owner: 10CDanis)
[23:34:56] <wikibugs>	 (03CR) 10CDanis: "This is mostly-correct, as far as I can tell from PCC*, but fails CI because I'm not smart enough to edit the fixtures there." [puppet] - 10https://gerrit.wikimedia.org/r/615877 (https://phabricator.wikimedia.org/T258648) (owner: 10CDanis)
[23:37:50] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudcephosd10[04-15].wikimedia.org - https://phabricator.wikimedia.org/T251619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1007.eqiad.wmnet', 'c...
[23:41:17] <wikibugs>	 (03CR) 10BryanDavis: toolforge: Allow large POST to tools.wmflabs.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615881 (https://phabricator.wikimedia.org/T258760) (owner: 10BryanDavis)
[23:42:02] <icinga-wm>	 PROBLEM - Check systemd state on webperf1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:43:00] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+1] toolforge: Temp handling for tools.wmflabs.org/wpcleaner (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/615872 (https://phabricator.wikimedia.org/T257495) (owner: 10BryanDavis)
[23:59:37] <wikibugs>	 (03CR) 10Legoktm: [C: 03+1] Revert "Remove lilypond for now" [puppet] - 10https://gerrit.wikimedia.org/r/615851 (owner: 10Tim Starling)
[23:59:40] <wikibugs>	 (03CR) 10Reedy: [C: 03+1] Revert "Remove lilypond for now" [puppet] - 10https://gerrit.wikimedia.org/r/615851 (owner: 10Tim Starling)